TY - JOUR
T1 - Efficient Enriching of Synthesized Relational Patient Data with Time Series Data
AU - Schiff, Simon
AU - Gehrke, Marcel
AU - Möller, Ralf
N1 - The 9th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN-2018) / The 8th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH-2018) / Affiliated Workshops
PY - 2018
Y1 - 2018
N2 - Analysing data from electronic healthcare records allows for supporting decision making and thereby can improve healthcare. However, obtaining sufficient healthcare data required for machine learning analysis is challenging due to, e.g., privacy aspects of medical data. For machine learning tasks, carefully prepared synthesized medical records can be as good as real records, as shown in [17]. Existing tools for medical data provision generate either relational records or streams of measurements over time, but not an appropriate combination of both. In this paper, we contribute an approach to enriching synthesized relational data with time series (longitudinal data) of real patients. We use Synthea to synthesize relational data and enrich the records with time series from the anonymized MIMIC III database. In our data integration scenario, we need to find the best match from the relational data to the time series data to obtain a sufficient amount of medical data for machine learning analyses. Our experiments show that we can enrich huge amounts of relational data with real time series data. However, without any processing optimizations, the runtime does not easily scale with the number of synthesized relational records. With several optimizations and using a distributed execution engine, such as Apache Spark SQL, we can efficiently enrich synthesized relational data with time series data.
AB - Analysing data from electronic healthcare records allows for supporting decision making and thereby can improve healthcare. However, obtaining sufficient healthcare data required for machine learning analysis is challenging due to, e.g., privacy aspects of medical data. For machine learning tasks, carefully prepared synthesized medical records can be as good as real records, as shown in [17]. Existing tools for medical data provision generate either relational records or streams of measurements over time, but not an appropriate combination of both. In this paper, we contribute an approach to enriching synthesized relational data with time series (longitudinal data) of real patients. We use Synthea to synthesize relational data and enrich the records with time series from the anonymized MIMIC III database. In our data integration scenario, we need to find the best match from the relational data to the time series data to obtain a sufficient amount of medical data for machine learning analyses. Our experiments show that we can enrich huge amounts of relational data with real time series data. However, without any processing optimizations, the runtime does not easily scale with the number of synthesized relational records. With several optimizations and using a distributed execution engine, such as Apache Spark SQL, we can efficiently enrich synthesized relational data with time series data.
U2 - 10.1016/j.procs.2018.10.130
DO - 10.1016/j.procs.2018.10.130
M3 - Journal articles
SN - 1877-0509
VL - 141
SP - 531
EP - 538
JO - Procedia Computer Science
JF - Procedia Computer Science
ER -