Improving Missing Data Analysis in Distributed Research Networks
Project Final Report (PDF, 321.25 KB)
Applying improved methods to handle missing and misclassified data across databases, without the need to share individual-level data, will improve the data used for population-level research.
Project Details
- Completed
- Grant Number: R01 HS026214
- Funding Mechanism(s)
- AHRQ Funded Amount: $1,198,767
- Principal Investigator(s)
- Organization
- Location: Boston, Massachusetts
- Project Dates: 09/30/2018 - 09/29/2022
- Care Setting
- Population
- Health Care Theme
Data are routinely collected during clinical care and captured in electronic health record (EHR) databases. These data are needed for comparative effectiveness research, patient-centered outcomes research, quality improvement assessments, and public health surveillance. For this type of population-level research, the data must generally be analyzed across multiple databases to improve statistical power for a study or generalizability of findings.
To accomplish this, researchers commonly use distributed research networks (DRNs). DRNs draw on data from multiple databases while allowing data partners to retain physical control of their data. One important challenge to this approach is missing data in any of the individual databases. Missing data can arise in one of two ways: 1) the information was not collected (for example, the provider did not ask about smoking status), or 2) the information was misclassified (for instance, an individual with asthma was incorrectly recorded as not having the condition). Researchers use strategies to correct both types of missing data, but current methods need to be improved. This research successfully refined existing methods and developed new methods for handling missing data, using an approach that does not rely on individually identifying data.
The specific aims of the research were as follows:
- Aim 1: Apply and assess missing data methods developed in single-database settings to handle obvious and well-recognized missing data in DRNs.
- Aim 2: Apply and assess machine learning (ML) and predictive modeling techniques to address less-obvious and under-recognized missing data for select variables in DRNs.
- Aim 3: Apply and assess a comprehensive analytic approach that combines conventional missing data methods and ML techniques to address missing data in DRNs.
The researchers examined conventional and emerging methods, such as multiple imputation and ML techniques, for handling missing data in single databases, then applied and validated those methods in multisite settings. These refined methods, which included the use of ML and predictive modeling, did not require the sharing of individual-level data. The methods were then tested on both simulated datasets and real-world claims and EHR data.
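As background on multiple imputation, the sketch below shows how results from several imputed copies of a dataset are usually combined with Rubin's rules. It is a generic illustration, not the project's own implementation; the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation results with Rubin's rules.

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with matching variances")

    pooled = mean(estimates)       # average of the point estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates from three imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what separates this from a simple average: it widens the pooled variance to reflect uncertainty about the values that were missing.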
The researchers succeeded in refining methods designed to handle missing data in single databases and in applying them to data from multiple databases in DRNs. They also developed privacy-protecting versions of these methods and applied them successfully in multisite settings using only summary-level information. If adopted, these methods are expected to improve the quality of data for population-level research conducted within DRNs, such as comparative effectiveness, patient safety, and patient-centered outcomes research.
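To make the idea of working from summary-level information concrete, the sketch below fits a one-covariate linear regression across sites that share only aggregate sums, never patient rows. It is a simplified illustration under assumed names (`SiteSummary`, `summarize_site`, `fit_from_summaries`) and made-up data, and it is not the method developed in this project.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Aggregates a data partner shares in place of individual records."""
    n: int
    sum_x: float
    sum_y: float
    sum_xx: float
    sum_xy: float


def summarize_site(xs, ys):
    """Runs locally at each site; only the returned sums leave the site."""
    return SiteSummary(
        n=len(xs),
        sum_x=sum(xs),
        sum_y=sum(ys),
        sum_xx=sum(x * x for x in xs),
        sum_xy=sum(x * y for x, y in zip(xs, ys)),
    )


def fit_from_summaries(summaries):
    """Least-squares intercept and slope from the combined site summaries."""
    n = sum(s.n for s in summaries)
    sx = sum(s.sum_x for s in summaries)
    sy = sum(s.sum_y for s in summaries)
    sxx = sum(s.sum_xx for s in summaries)
    sxy = sum(s.sum_xy for s in summaries)

    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("covariate has no variation across the pooled data")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Made-up data held at two sites; the coordinator only sees the summaries.
site_a = summarize_site([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
site_b = summarize_site([3.0, 4.0], [7.0, 9.0])
intercept, slope = fit_from_summaries([site_a, site_b])
```

Because the sums are additive across sites, the result matches what a regression on the pooled individual-level data would give. A real deployment would also need safeguards this sketch omits, such as rules for sites with very small counts.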