The Importance of Data Provenance and Context in Clinical Data Registries
Leon Rozenblit, Senior Director, Product & Strategy
Aug 01, 2021

Building clinical data registries that support acquisition, curation, and dissemination of health data poses a number of unique challenges. One of them is data provenance—knowing and preserving the data’s lineage. Knowing how data was collected, by whom, from whom, and under what conditions, is critical for downstream use of research data. A usable, trustworthy data management system incorporates data provenance information.

Data Provenance Overview

Data provenance, sometimes called data lineage, is a record of where a piece of information originally comes from. Data provenance also notes where the information travels and where it is stored over time.

When health researchers know the provenance of medical data, they can trace it back to the original study or trial. This gives them the tools to figure out how relevant a given piece of information is to their health initiative. They can also note how others have applied the data in specific cases.

Traceability and Trustworthiness

Data traceability lets investigators see how information has evolved over time. Computational processes, including routine processing steps such as sorting, aggregation, and validation, change the data itself.

When researchers can trace these processes, they can understand their impact more precisely.
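As a rough illustration of tracing these processes, the sketch below keeps an ordered history next to the data and appends an entry each time a computation runs. The class names, field names, and sample values (`TracedDataset`, `ProcessingStep`, `hypothetical-trial-001`) are assumptions made up for this example, not part of any particular registry product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProcessingStep:
    """One computation applied to a dataset (e.g. sorting, validation)."""
    operation: str
    performed_by: str
    performed_at: datetime


@dataclass
class TracedDataset:
    """Rows plus the ordered history of everything done to them."""
    source: str  # where the data originally came from (hypothetical label)
    rows: list
    history: list = field(default_factory=list)

    def apply(self, operation, func, performed_by):
        """Run func on the rows and record the step in the history."""
        self.rows = func(self.rows)
        self.history.append(
            ProcessingStep(operation, performed_by, datetime.now(timezone.utc))
        )
        return self


def drop_missing(rows):
    """Validation step: keep only rows with a recorded value."""
    return [r for r in rows if r["systolic_bp"] is not None]


def sort_by_value(rows):
    """Sorting step: order rows by the measured value."""
    return sorted(rows, key=lambda r: r["systolic_bp"])


dataset = TracedDataset(
    source="hypothetical-trial-001",
    rows=[
        {"patient": "P1", "systolic_bp": 140},
        {"patient": "P2", "systolic_bp": None},
        {"patient": "P3", "systolic_bp": 120},
    ],
)
dataset.apply("validation", drop_missing, "analyst-a")
dataset.apply("sorting", sort_by_value, "analyst-a")

# The lineage lists every operation in the order it was applied.
lineage = [step.operation for step in dataset.history]
print(dataset.source, lineage)
```

Because every change goes through `apply`, a reader of the final rows can see that one record was dropped by validation before the sort, which is the kind of impact the history is meant to expose.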

Data provenance enables professionals to check the credibility of each given piece of information. Clinical data can be more or less trustworthy depending on its source and attributes.

Clinical trial data is more trustworthy if the trial is large, replicable, randomized, and controlled.

Individual patient data (case studies) are also useful. Linked case studies can form the basis of a hypothesis that researchers can investigate. Plausible hypotheses can then be tested when a randomized controlled trial is funded.

Data Linkage

Data linkage brings together information from different sources about a single patient. The patient’s identity is kept anonymous. Registries use an ID number rather than reveal identifying information.

Data linkage is a critical tool for research. Incomplete linkage can create bias in a system. This can inadvertently lead to incorrect conclusions in research.

There are different recommended procedures for linking files. The ideal clinical data registry will incorporate the optimal linkage processes. It will also use systems to trace data provenance.
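To make the risk of incomplete linkage concrete, here is a minimal sketch of one simple approach: exact matching on a shared registry ID, with unmatched records reported rather than silently dropped. The field names and sample rows are hypothetical, and real registries may use other linkage procedures.

```python
def link_records(clinic_rows, lab_rows):
    """Attach lab rows to clinic rows that share a registry ID."""
    labs_by_id = {}
    for row in lab_rows:
        labs_by_id.setdefault(row["registry_id"], []).append(row)

    linked, unlinked = [], []
    for row in clinic_rows:
        matches = labs_by_id.get(row["registry_id"])
        if matches:
            linked.append({**row, "labs": matches})
        else:
            # Keep unmatched rows visible so the gap can be investigated.
            unlinked.append(row)
    return linked, unlinked


# Hypothetical records; only the ID number identifies the patient.
clinic_rows = [
    {"registry_id": "R-001", "diagnosis": "hypertension"},
    {"registry_id": "R-002", "diagnosis": "asthma"},
    {"registry_id": "R-003", "diagnosis": "hypertension"},
]
lab_rows = [
    {"registry_id": "R-001", "test": "creatinine", "value": 1.1},
    {"registry_id": "R-003", "test": "creatinine", "value": 0.9},
    {"registry_id": "R-003", "test": "potassium", "value": 4.2},
]

linked, unlinked = link_records(clinic_rows, lab_rows)
linkage_rate = len(linked) / len(clinic_rows)
print(f"linked {len(linked)} of {len(clinic_rows)} ({linkage_rate:.0%})")
print("unlinked IDs:", [r["registry_id"] for r in unlinked])
```

Reporting the linkage rate and the unlinked IDs alongside the result lets an analyst check whether the missing records differ systematically from the linked ones before drawing conclusions.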

Clinical Data Registry Design

When you are building a clinical data registry, consider the following design elements:

  • Provenance-tracking mechanisms
  • Structure
  • Granularity
  • Data connections
  • Multi-strategy integration

When you take these aspects into consideration, your registry will solve many of the challenges inherent to health data management.

Provenance Tracking Mechanisms

Provenance tracking mechanisms identify data’s origins and authenticate it. They also trace the evolution of a given piece of data and note the impact of different computations (e.g., validation, sorting, aggregation) on it.

Clinical data registries can use mechanisms such as:

  • Traceable, centralized Public Key Infrastructure (PKI) signatures
  • Decentralized signature architecture (blockchains)
  • System flow mapping

System flow mapping tracks qualitative information, and it notes incomplete data sets efficiently. Often, qualitative information is key if medical researchers want to apply information to research in a useful way.

System flow mapping has been used to improve qualitative data collection. Better collection makes it easier to communicate that data as a factor in diagnostic assessments and treatment choices, where qualitative data is relevant.

The qualitative data, diagnosis, and treatment are part of a patient’s electronic health record (EHR). This is a valuable dataset for medical researchers.

You can use system flow mapping alongside centralized or decentralized authentication systems.
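The sketch below illustrates the tamper-evidence idea that signature-based mechanisms rely on, using a plain hash chain from the Python standard library. It is a deliberately simplified stand-in: it has no key pairs or certificates, so it is not PKI, and it has no distributed consensus, so it is not a blockchain. The entry fields and function names are assumptions chosen for illustration.

```python
import hashlib
import json


def entry_hash(entry, previous_hash):
    """Hash an entry together with the hash of the entry before it."""
    payload = json.dumps(entry, sort_keys=True) + previous_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain, entry):
    """Add a provenance entry whose hash depends on the whole prior chain."""
    previous_hash = chain[-1]["hash"] if chain else ""
    chain.append({"entry": entry, "hash": entry_hash(entry, previous_hash)})


def verify_chain(chain):
    """Recompute every hash; any edited entry breaks the chain."""
    previous_hash = ""
    for link in chain:
        if link["hash"] != entry_hash(link["entry"], previous_hash):
            return False
        previous_hash = link["hash"]
    return True


chain = []
append_entry(chain, {"step": "collected", "by": "site-12"})
append_entry(chain, {"step": "validated", "by": "curator-3"})

intact = verify_chain(chain)

# Simulate someone rewriting history after the fact.
chain[0]["entry"]["by"] = "someone-else"
still_valid_after_edit = verify_chain(chain)

print(intact, still_valid_after_edit)
```

A production system would add digital signatures so that each entry can also be attributed to a specific signer, which a bare hash chain cannot do.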

Structure and Granularity

Data models need structure and granularity to communicate the data’s context effectively. Understanding the context of complex data enables accurate interpretation and useful application.

Appropriate Granularity

Appropriate granularity gives users enough detail without burdening them with irrelevant detail.

The relevance of a detail depends on context. For example, one study in the Journal of Biomedical Informatics shows that time notation is highly granular when recording follow-up appointments: time is marked in days and hours.

Researchers, however, often seek information about longer-term symptom changes. They might find the information they want in those follow-up appointment notes, but they will use less granular measurements of time in their searches.

Typically, they will seek information measured in weeks or months. Thus, clinical data registries need to present data at different degrees of granularity.
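One way to serve both needs is to store observations at the finest granularity and derive coarser views on request. The sketch below rolls hour-level follow-up records up to weeks since a baseline visit. The dates, the symptom scores, and the `weeks_since` helper are hypothetical values invented for this example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weeks_since(baseline, moment):
    """Whole weeks elapsed between the baseline visit and a later moment."""
    return (moment - baseline).days // 7


baseline = datetime(2021, 1, 4, 9, 0)

# Follow-up records stored at fine granularity: (timestamp, symptom score).
visits = [
    (datetime(2021, 1, 4, 9, 0), 7),
    (datetime(2021, 1, 6, 14, 30), 6),
    (datetime(2021, 1, 19, 10, 15), 4),
    (datetime(2021, 2, 2, 16, 45), 2),
]

# Coarser view: group scores by week number, then average within each week.
by_week = defaultdict(list)
for recorded_at, score in visits:
    by_week[weeks_since(baseline, recorded_at)].append(score)

weekly_mean = {week: mean(scores) for week, scores in sorted(by_week.items())}
print(weekly_mean)
```

The hour-level records are never discarded, so a scheduler can still query exact appointment times while a researcher works with the weekly summary.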

Data Model Structures

Engineers have proposed and implemented different clinical data modeling structures. These include the following:

  • Generalized Data Model
  • Metadata models
  • Subject-based data models

Subject-based data models include Star Schema and Snowflake Schema.
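For readers unfamiliar with the term, a star schema places measurements in a central fact table and describes them through surrounding dimension tables. The sketch below builds a tiny one in an in-memory SQLite database, with the data source modeled as its own dimension so that provenance travels with every observation. All table names, column names, and rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe who and where the data came from.
    CREATE TABLE dim_patient (
        patient_key INTEGER PRIMARY KEY,
        registry_id TEXT
    );
    CREATE TABLE dim_source (
        source_key INTEGER PRIMARY KEY,
        origin TEXT,
        collected_by TEXT
    );
    -- The fact table holds one row per observation and points at both.
    CREATE TABLE fact_observation (
        patient_key INTEGER REFERENCES dim_patient(patient_key),
        source_key INTEGER REFERENCES dim_source(source_key),
        measure TEXT,
        value REAL
    );
    INSERT INTO dim_patient VALUES (1, 'R-001'), (2, 'R-002');
    INSERT INTO dim_source VALUES
        (1, 'clinic EHR export', 'site-12'),
        (2, 'patient survey', 'self-reported');
    INSERT INTO fact_observation VALUES
        (1, 1, 'systolic_bp', 140),
        (1, 2, 'pain_score', 6),
        (2, 1, 'systolic_bp', 120);
    """
)

# Join each observation back to its patient and its source.
rows = conn.execute(
    """
    SELECT p.registry_id, f.measure, f.value, s.origin
    FROM fact_observation AS f
    JOIN dim_patient AS p ON p.patient_key = f.patient_key
    JOIN dim_source AS s ON s.source_key = f.source_key
    ORDER BY p.registry_id, f.measure
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

A snowflake schema would go one step further and normalize the dimensions themselves, for example by splitting `collected_by` into a separate table referenced from `dim_source`.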

Data Connections

A clinical data registry must be able to connect source data (with known provenance) to research data. Then, it needs to output both in a unified way. Cultivating these connections moves researchers towards useful applications.

Multi-Strategy Integration

Clinical data registries need to incorporate qualified clinical data reporting from many sources. To support data from varied origins, strategy integration is key. An effective registry will integrate a data-provenance strategy with a schema volatility strategy, one that evolves and adapts to metadata changes over time.

Subject-based data models can be discarded or revised in favor of more precise schemas. Our understanding of subject-based data changes over time, so it is necessary to account for the possibility of schema changes.

Integrating a schema volatility strategy lets your clinical data registry keep necessary information. It will maintain notes on the data’s origins, even when the data is reorganized.
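As a minimal sketch of that idea, the function below migrates a record from one schema version to the next while carrying its provenance forward and noting the migration itself. The record layout, the version numbers, and the `migrate_v1_to_v2` name are hypothetical, chosen only to show the pattern.

```python
def migrate_v1_to_v2(record):
    """Split the v1 'bp' string into separate v2 fields, keeping provenance."""
    systolic, diastolic = record["data"]["bp"].split("/")
    provenance = record["provenance"]
    return {
        "schema_version": 2,
        "data": {
            "systolic_bp": int(systolic),
            "diastolic_bp": int(diastolic),
        },
        "provenance": {
            **provenance,
            # Record the reorganization as part of the lineage.
            "migrations": provenance.get("migrations", []) + ["v1->v2"],
        },
    }


# Hypothetical record captured under the original schema.
record_v1 = {
    "schema_version": 1,
    "data": {"bp": "140/90"},
    "provenance": {"origin": "clinic EHR export", "collected_by": "site-12"},
}

record_v2 = migrate_v1_to_v2(record_v1)
print(record_v2)
```

The original record is left unmodified, so both versions can be kept and compared if a later merge needs to reconcile them.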

Advance Your Healthcare Mission

When data provenance is integrated into clinical data registry systems, you can vet a piece of data’s accuracy. Make sure that your platform captures provenance information as part of the operational workflow and stores it in a way that makes data context available to enable effective data reuse.

The IQVIA Integrated Health Platform is a unified system that preserves provenance and context for all data across research operations workflows and data collection methods. Instrument definitions are easily configurable, supporting the high levels of metadata variety and volatility found in research, and allow a systematic approach to versioning and data merging when versions change. Research staff can configure workflows to match research operation protocols and thereby support the inherent variability in research with an elegant user experience.

Learn more about registry solutions for patient advocacy organizations. Or discover the data connections that support health care improvement initiatives of medical specialty societies.
