Improving Health Data Quality by Assessing and Enhancing Semantic Integrity
Creating tools to automate the assessment and improvement of representational semantic integrity of terminologies in electronic health record databases will lead to improved databases with less redundancy and ambiguity and more robustness for research purposes.
Project Details
- Ongoing
- Grant Number: R01 HS028450
- Funding Mechanism(s)
- AHRQ Funded Amount: $1,559,815
- Principal Investigator(s)
- Organization
- Location: Washington, District of Columbia
- Project Dates: 07/01/2022 - 04/30/2026
- Health Care Theme
Terminologies change over time, creating divergence in what a single code or set of codes represents. These changes include the refinement of existing codes, the addition of new codes, and major version changes such as the transition from ICD-9-CM to ICD-10-CM. Such changes have cascading implications that pose an informatics challenge referred to as ‘representational semantic (RS) integrity.’ With the growing use of large electronic health record (EHR) data sets for research, having multiple codes or combinations of codes that represent the same phenotype makes it difficult to identify desired cohorts from these codes. Such discrepancies likely propagate errors into analyses and findings. Ideally, terminologies should lack redundancy (there should be only one code for each meaning) and lack ambiguity (each code should have only one meaning).
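The two ideal properties above can be made concrete with a minimal Python sketch. It assumes a terminology can be represented as a list of (code, meaning) pairs; the function name `check_integrity`, the codes `C1` to `C4`, and the meanings are hypothetical placeholders, not real ICD or SNOMED content and not the project's method.

```python
from collections import defaultdict


def check_integrity(pairs):
    """Return (redundant, ambiguous) for an iterable of (code, meaning) pairs.

    redundant: meanings represented by more than one code.
    ambiguous: codes that carry more than one meaning.
    """
    codes_by_meaning = defaultdict(set)
    meanings_by_code = defaultdict(set)
    for code, meaning in pairs:
        codes_by_meaning[meaning].add(code)
        meanings_by_code[code].add(meaning)

    redundant = {m: sorted(c) for m, c in codes_by_meaning.items() if len(c) > 1}
    ambiguous = {c: sorted(m) for c, m in meanings_by_code.items() if len(m) > 1}
    return redundant, ambiguous


# Hypothetical toy terminology (assumption: codes and meanings are made up).
toy_pairs = [
    ("C1", "condition A"),
    ("C2", "condition A"),  # second code for the same meaning: redundancy
    ("C3", "condition B"),
    ("C3", "condition C"),  # one code with two meanings: ambiguity
    ("C4", "condition D"),
]

redundant, ambiguous = check_integrity(toy_pairs)
print("Redundant meanings:", redundant)
print("Ambiguous codes:", ambiguous)
```

In real EHR data the meaning of a code is not given as a label and would have to be inferred from how the code is used, which is the harder problem this project addresses.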
Researchers from George Washington University want to understand how to assess and improve RS integrity in longitudinal and heterogeneous EHR data using automated methods. This research will develop novel data-driven methods to analyze the temporal pattern and context of EHR variables using ICD-9-CM, ICD-10-CM, and SNOMED codes as use cases. The researchers will use large, longitudinal data sets, including those from the Veterans Administration's EHR clinical data warehouse; the Cerner Real-World Data™ (RWD), a national, de-identified, person-centric data set; and the EHR data repository from a large medical center at the University of Alabama at Birmingham.
The specific aims of the research are as follows:
- Develop a data-driven approach to assess RS integrity in longitudinal EHR data.
- Develop a data-driven approach to improve RS integrity in longitudinal EHR data.
- Validate the RS integrity assessment and improvement approaches.
To detect discrepancies in codes in EHR data (“aberrant signals”), the researchers will develop statistical and deep learning models that perform multivariate time-series analysis. Contexts of codes will be analyzed over time and across data sources. From this analysis, the researchers will develop a semantic matching tool that generates semantically equivalent clusters for data from different time periods and facilities. The impact of predictive modeling will be assessed, and the assessment will be validated.
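As a rough illustration of what an aberrant signal in code usage might look like, the sketch below flags months in which a single code's count departs sharply from its recent baseline. It is a deliberately simple univariate rolling z-score, used here as a stand-in for the multivariate statistical and deep learning models the project describes; the function name `flag_aberrant`, the window and threshold values, and the monthly counts are all hypothetical.

```python
from statistics import mean, stdev


def flag_aberrant(counts, window=6, threshold=3.0):
    """Return indices whose count deviates from the preceding `window`
    values by more than `threshold` sample standard deviations."""
    flagged = []
    for t in range(window, len(counts)):
        baseline = counts[t - window:t]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all counts as a deviation.
            if counts[t] != mu:
                flagged.append(t)
            continue
        if abs(counts[t] - mu) / sigma > threshold:
            flagged.append(t)
    return flagged


# Hypothetical monthly usage counts for one code (assumption: made-up data).
# Usage collapses at index 8, as might happen when a code is retired
# during a terminology version change.
monthly_counts = [100, 102, 98, 101, 99, 100, 103, 97, 20, 5, 3, 4]

aberrant_months = flag_aberrant(monthly_counts)
print("Aberrant month indices:", aberrant_months)
```

Only the onset of the drop is flagged, because later months are compared against a baseline that already contains the change. A detector like this reports that usage shifted, not why; deciding whether another code has taken over the same meaning is the role of the context analysis and semantic matching described above.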