Deep transformer learning model for the diagnosis of suspected lung cancer in primary care based on sequential coded electronic health record data.

Talk Code: 
3C.3
Presenter: 
Brendan C. Delaney
Co-authors: 
Lan Wang, Younghua Yin, Ben Glampson, Robert Peach, Mauricio Barahona, Erik K Mayer.
Author institutions: 
Imperial College London, Departments of Surgery and Cancer, Mathematics

Problem

Lung cancer is the commonest cause of cancer death in the UK, in large part because it is often diagnosed at a late stage. Only 4% of lung cancer patients present in primary care with ‘red flag’ symptoms such as haemoptysis. To diagnose patients at an earlier stage, predictive models based on multiple symptoms are required. Existing epidemiological risk models do not consider the temporal relations expressed in rich sequential electronic health record data. Machine learning with deep ‘transformer’ models enables us to consider the sequence and timing of events in the data when building predictive models. These models are the foundation of Large Language Models and ‘generative AI’. We aimed to build such a model for lung cancer diagnosis in primary care using GP electronic health record (EHR) data.

Approach

In a nested case-control study within the Whole Systems Integrated Care (WSIC) NW London dataset, lung cancer cases were identified using Read CT v2 terms, and controls were patients with either ‘other’ cancers or respiratory conditions. Sequential EHR data (diagnoses, symptoms, signs, referrals, test results, medication) covering the three years before the date of diagnosis, excluding the most recent three months, were semantically pre-processed by mapping more than 20,000 terms to 185. Analysis was performed using BERT (Bidirectional Encoder Representations from Transformers), a deep learning architecture trained with self-supervision, configured with six layers and 12 attention heads. Fine-tuning of the resulting ‘MedAlbert’ model was conducted with a Logistic Regression Classifier (LRC) head. Clustering of the final hidden vector (CLS) was explored using k-means. We split the data into 70% training and 30% internal validation. A regression model alone was also built on the pre-processed data as a comparator.
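The observation window and term mapping described above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the day-based approximations of "three years" and "three months", the function name `build_sequence`, the raw codes and category labels, and the leading `[CLS]` token (the usual BERT convention) are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumed window boundaries, approximated in days; the abstract states only
# "three years" and "three months".
LOOKBACK = timedelta(days=3 * 365)
EXCLUSION = timedelta(days=90)

# Hypothetical mapping from raw coded terms to coarse categories. The study
# maps more than 20,000 terms to 185; these four entries are placeholders.
CODE_TO_CATEGORY = {
    "raw_cough_1": "COUGH",
    "raw_cough_2": "COUGH",
    "raw_smoker": "SMOKING",
    "raw_cxr": "CHEST_XRAY",
}


def build_sequence(events, index_date):
    """Return time-ordered category tokens inside the observation window.

    events: iterable of (event_date, raw_code) tuples.
    index_date: date of diagnosis (or the equivalent date for a control).
    """
    start = index_date - LOOKBACK   # earliest date kept (inclusive)
    end = index_date - EXCLUSION    # latest date kept (exclusive)
    in_window = [
        (event_date, CODE_TO_CATEGORY[code])
        for event_date, code in events
        if start <= event_date < end and code in CODE_TO_CATEGORY
    ]
    in_window.sort(key=lambda pair: pair[0])  # oldest event first
    # Prepend a classification token, as is conventional for BERT inputs.
    return ["[CLS]"] + [token for _, token in in_window]


example_events = [
    (date(2019, 12, 1), "raw_cough_2"),  # within the excluded final months
    (date(2018, 6, 1), "raw_cough_1"),   # kept
    (date(2017, 3, 1), "raw_smoker"),    # kept
    (date(2015, 1, 1), "raw_cxr"),       # older than the lookback period
]
example_sequence = build_sequence(example_events, date(2020, 1, 1))
print(example_sequence)
```

Excluding the final months before diagnosis keeps events recorded close to the point of diagnosis out of the model's input, so predictions rest on the earlier record.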

Findings

Based on 3,303,992 registered patients from January 1981 to December 2020, there were 11,847 lung cancer cases, of whom 9,629 had died. For training, 5,789 cases and 7,240 controls were used, with a population of 368,906 for validation. Our model achieved an AUROC of 0.965 (0.962-0.969) with a sensitivity of 81%, specificity of 95%, PPV of 7.8% and NPV of 98%, based on the three years of data prior to diagnosis, excluding the three months immediately before it. The comparator regression model achieved a PPV of 6.1% and an AUROC of 0.956 (0.952-0.959). Six clusters were identified in the model, including known risk factors for lung cancer such as smoking history and respiratory conditions. In addition, diabetes, obesity and alcohol intake were features contributing to the risk model.
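Because PPV and NPV depend on how common the disease is in the population being tested, it can help to see how they follow from sensitivity and specificity via Bayes' rule. The sketch below is illustrative only: the function name `predictive_values` and the prevalence values are hypothetical and are not figures reported in this abstract; only the 81% sensitivity and 95% specificity are taken from the findings above.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical prevalences, chosen only to show the trend.
for prevalence in (0.001, 0.005, 0.02):
    ppv, npv = predictive_values(0.81, 0.95, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

At a fixed sensitivity and specificity, PPV rises and NPV falls as prevalence increases, which is why PPVs are best compared between models evaluated on populations with similar disease frequency.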

Consequences

By comparison, the QCancer Lung Model has a PPV of 1.34% at its maximum sensitivity of 77.3%. Capturing the subtle differences in presentation between cancer and non-cancer pathways to diagnosis enables much more accurate models. Future work will focus on validation in external datasets and integration into GP clinical systems for evaluation.

Submitted by: 
Brendan C. Delaney
Funding acknowledgement: 
Cancer Research UK