TY - JOUR
T1 - Learning debiased graph representations from the OMOP common data model for synthetic data generation
AU - Schulz, Nicolas Alexander
AU - AI-CARE Working Group
AU - Carus, Jasmin
AU - Wiederhold, Alexander Johannes
AU - Johanns, Ole
AU - Peters, Frederik
AU - Rath, Natalie
AU - Rausch, Katharina
AU - Holleczek, Bernd
AU - Katalinic, Alexander
AU - Gundler, Christopher
N1 - Publisher Copyright: © The Author(s) 2024.
PY - 2024/6/22
Y1 - 2024/6/22
N2 - BACKGROUND: Generating synthetic patient data is crucial for medical research, but common approaches build on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy-preserving and compliant fashion, is interpretable and allows for expert intervention. METHODS: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. RESULTS: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. CONCLUSION: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.
AB - BACKGROUND: Generating synthetic patient data is crucial for medical research, but common approaches build on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy-preserving and compliant fashion, is interpretable and allows for expert intervention. METHODS: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. RESULTS: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. CONCLUSION: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.
UR - http://www.scopus.com/inward/record.url?scp=85196889697&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/8efa349b-be6b-3055-aab3-dd13f2b13328/
U2 - 10.1186/s12874-024-02257-8
DO - 10.1186/s12874-024-02257-8
M3 - Journal articles
C2 - 38909216
SN - 1471-2288
VL - 24
SP - 136
JO - BMC Medical Research Methodology
JF - BMC Medical Research Methodology
IS - 1
M1 - 136
ER -