How can the Clinical Practice Research Datalink (CPRD) balance the requirements of applied and methodological researchers when providing linked data?

Talk Code: 
Rebecca Ghosh
Helen Strongman, Shivani Padmanabhan
Author institutions: 
The Clinical Practice Research Datalink (CPRD)


The Clinical Practice Research Datalink (CPRD) provides anonymised primary care electronic healthcare records (EHR) from NHS general practices. These longitudinal medical and prescription data are used in epidemiological research and support real-world clinical trials. To enhance these data, CPRD primary care records have been linked to a range of additional datasets, including mortality, cancer and secondary care data. Beyond the usual issues involved in using EHR for research, linked EHR data pose specific challenges, such as identifying the denominator population and minimising errors from missed and false matches. The challenge for CPRD is not only to provide linked datasets to researchers but also to provide clear guidance on their use.


Linkage is performed under appropriate governance conditions on patients from consenting practices via a trusted third party (TTP) organisation (NHS Digital). De-duplicated and cleaned identifiers (NHS number, postcode, date of birth and gender) are submitted to the TTP by practices and external data controllers. The TTP uses a sequential eight-stage deterministic algorithm to match patients based on all or some of the identifiers. Matching steps are applied in order, so a record matched in one step is not available in subsequent steps. Rank one (the strictest criteria) matches on exact NHS number, sex, date of birth and postcode; rank five matches on exact NHS number and postcode; and rank eight (the least strict criteria) matches on exact NHS number only. CPRD has developed a denominator file, metadata and guidance to enable high-quality research.
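
The sequential matching described above can be illustrated with a minimal Python sketch. It is not the TTP's implementation: only ranks one, five and eight are defined in this abstract, so the rule list contains just those three, and the field names (`nhs_number`, `sex`, `dob`, `postcode`, `id`) and the function `link_sequentially` are hypothetical. The sketch also assumes that only the practice-side record is withdrawn once matched, and that every candidate in the external dataset is reported rather than resolved.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Only the three ranks spelled out in the abstract; ranks 2-4, 6 and 7
# are not specified there and are deliberately left out of this sketch.
RULES: List[Tuple[int, Tuple[str, ...]]] = [
    (1, ("nhs_number", "sex", "dob", "postcode")),
    (5, ("nhs_number", "postcode")),
    (8, ("nhs_number",)),
]


def link_sequentially(
    practice: List[Record], external: List[Record]
) -> Dict[str, Tuple[int, List[str]]]:
    """Map each practice record id to (rank, matching external ids).

    Rules are tried from strictest to least strict. Once a practice
    record has matched, it is skipped in all later steps.
    """
    results: Dict[str, Tuple[int, List[str]]] = {}
    for rank, fields in RULES:
        # Index external records on the identifiers this rank requires.
        index: Dict[Tuple[str, ...], List[str]] = {}
        for ext in external:
            key = tuple(ext.get(f) for f in fields)
            if all(key):  # every required identifier must be present
                index.setdefault(key, []).append(ext["id"])
        for rec in practice:
            if rec["id"] in results:
                continue  # already matched at a stricter rank
            key = tuple(rec.get(f) for f in fields)
            if all(key) and key in index:
                results[rec["id"]] = (rank, index[key])
    return results
```

Under these assumptions, a practice record that agrees with an external record on NHS number and postcode but not on date of birth would fail rank one and be recorded at rank five.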


Currently, 405 of 541 active practices, with 10,243,606 patients, are participating in the dataset linkage scheme. CPRD provides a denominator file that includes all patients whose identifiers were processed in the linkage, with flags indicating whether the identifiers submitted by the practice were valid. CPRD’s standard linkage dataset for each external dataset includes only records that were linked to a single patient in the external dataset and were matched on one of the first five ranks. This cut-off was chosen to reduce the likelihood of false matches and, together with the recommendations provided in the documentation, simplifies decision making for researchers. Matches on lower ranks (6 to 8) and multiple matches are not routinely provided, although they are available on request to support methodological research. The documentation describes the matching process and the frequency of matching at each step, and includes guidance for applied research.
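
As an illustration of the standard-dataset rule, the following Python sketch keeps only links that matched on ranks one to five and point to exactly one patient in the external dataset. The input layout (a mapping from a patient id to a rank and a list of candidate external ids) and the names `standard_linkage` and `MAX_STANDARD_RANK` are assumptions made for this example, not CPRD's file format.

```python
from typing import Dict, List, Tuple

# Cut-off stated in the abstract: ranks 1 to 5 go into the standard dataset.
MAX_STANDARD_RANK = 5


def standard_linkage(
    matches: Dict[str, Tuple[int, List[str]]]
) -> Dict[str, str]:
    """Return {patient id: external id} for links meeting the standard criteria.

    A link is kept only if it matched on rank 1-5 and has exactly one
    candidate in the external dataset; everything else is dropped here
    (in the abstract, such links are available on request).
    """
    return {
        patient_id: external_ids[0]
        for patient_id, (rank, external_ids) in matches.items()
        if rank <= MAX_STANDARD_RANK and len(external_ids) == 1
    }
```

A researcher doing methodological work could instead raise the cut-off or retain the multiple-candidate links to study how these choices affect false and missed matches.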


To maximise the research benefit from linked data, studies must account for linkage methodologies and the errors that can result from them. Data providers need to support informed decision making for applied research whilst enabling methodological research that explores linkage validity and related biases. The documentation CPRD provides enables users to make informed decisions about their study based on its context and design.

Submitted by: 
Rebecca Ghosh
Funding acknowledgement: 
The Clinical Practice Research Datalink (CPRD) is a governmental, not-for-profit research service, jointly funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA), a part of the Department of Health.