What are the limitations for health research with restricted data collection from UK primary care?

Talk Code: 
Helen Strongman
Jennifer Campbell, Daniel Dedman, Arlene M Gallagher, Antonis Kousoulis, Wilhelmine Meeraus, Tarita Murray-Thomas, Jessie Oyinlola, Helen Strongman, Rachael Williams and Janet Valentine
Author institutions: 
Clinical Practice Research Datalink (CPRD), MHRA, 151 Buckingham Palace Road, London, SW1W 9SZ, UK


Electronic health records (EHRs) are a well-established and vital tool in healthcare research worldwide. Research based on primary care EHRs has been used extensively to confirm drug and vaccine safety, to assess the uptake and effectiveness of public health policy and clinical guidance, and to characterise disease and risk factors. A recent government initiative to expand existing EHR databases proposed an initial set of restrictions on the time period and scope of data collection. However, the impact of such restrictions on EHR-based research is not understood. This study systematically reviewed the limitations that would apply to high-impact research studies if they were conducted with an EHR database restricted by disease area, longitudinal period and level of data sensitivity.


One hundred high-impact studies were selected for systematic analysis. High impact was defined as publication in a top five, subject-relevant journal ranked by impact factor, or being referenced in national clinical guidelines. A structured questionnaire was used to evaluate the hypothetical outcome of repeating each analysis in 2016 with the following data restrictions in place: (1) retrospective data collection for specific disease areas only; (2) retrospective data collection restricted to either 6 years (2010 onwards) or 12 years (2004 onwards); (3) prospective data collection restricted to non-sensitive information. Outcomes were categorised as (1) unfeasible, i.e. the study could not be reproduced without the introduction of major bias; (2) feasible with study design modification; or (3) unaffected by restrictions. Studies were considered compromised if they were either unfeasible or required modification.
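The categorisation rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the questionnaire or analysis code used in the study: the names `Outcome`, `is_compromised` and `summarise` are hypothetical, and the four outcomes in the usage example are toy values, not study data.

```python
from enum import Enum


class Outcome(Enum):
    """The three outcome categories described in the text (names are hypothetical)."""
    UNFEASIBLE = "unfeasible"
    FEASIBLE_WITH_MODIFICATION = "feasible with study design modification"
    UNAFFECTED = "unaffected by restrictions"


def is_compromised(outcome):
    """A study counts as compromised if it is unfeasible or requires modification."""
    return outcome in (Outcome.UNFEASIBLE, Outcome.FEASIBLE_WITH_MODIFICATION)


def summarise(outcomes):
    """Return the percentage of studies compromised and unfeasible."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("at least one study outcome is required")
    compromised = sum(is_compromised(o) for o in outcomes)
    unfeasible = sum(o is Outcome.UNFEASIBLE for o in outcomes)
    return {
        "compromised_pct": 100 * compromised / total,
        "unfeasible_pct": 100 * unfeasible / total,
    }


# Toy input for illustration only; these are NOT the study's results.
toy_outcomes = [
    Outcome.UNFEASIBLE,
    Outcome.FEASIBLE_WITH_MODIFICATION,
    Outcome.UNAFFECTED,
    Outcome.UNAFFECTED,
]
print(summarise(toy_outcomes))
```

Under this rule the unfeasible studies are always a subset of the compromised studies, so the unfeasible percentage can never exceed the compromised percentage for the same restriction.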


Of the 100 studies identified for data extraction, 91% would be compromised and 56% would be unfeasible with all restrictions in place. With the retrospective restriction on disease area alone, 74% of studies would be compromised, the majority of which (68.9%) would be unfeasible. Similarly, restricting retrospective data collection to 6 years or 12 years would compromise 67% and 22% of studies, respectively. Studies were largely compromised by limited follow-up time and the consequent introduction of bias. Restricting only the collection of sensitive data (e.g. HIV status, pregnancy and abortion data) had a lesser, but still marked, impact on study quality, with 10% of studies compromised and 8% deemed unfeasible even with modifications.


Minimisation of bias and confounding is crucial to maintain the internal validity of clinical research studies and to produce high-quality observational research. Restricting data collection to high-priority disease areas only, or limiting the retrospective time period captured, imposed the most profound limitations on the feasibility and conduct of EHR-based research. Studies that would be unfeasible under these restrictions include investigations of the risk of venous thromboembolism during pregnancy, evaluations of MMR vaccine safety, and the development of risk prediction models for blindness and amputation associated with diabetes. Future initiatives should consider the profound impact of data restrictions on the ability to generate robust and applicable evidence for health research.

Submitted by: 
Lucy Carty
Funding acknowledgement: 
Funding from NIHR and MHRA. Please note that we have also submitted this abstract to Informatics For Health 2017, where it has been accepted. We are submitting the abstract to SAPC in addition, as we would like to reach as wide an audience as possible. The research has not yet been submitted for journal publication, but this is planned.