Developing and Validating a Multivariable Lung Cancer Risk Prediction Model that identifies high-risk smokers in areas with a higher incidence of disease: ECLS and UK Biobank.

Talk Code: 
6A.1
Presenter: 
Lamorna Brown
Co-authors: 
Frank Sullivan, Tom Kelsey, Utkarsh Agrawal
Author institutions: 
University of St Andrews

Problem

In the UK, lung cancer (LC) is a leading cause of cancer-related death, accounting for 21% of all cancer-related mortality. Clinical trials have found that low-dose computed tomography (LDCT) can benefit those at risk, reducing LC mortality by 20%. To aid in implementing LC screening using LDCT, modelling has been carried out to identify an appropriate target population. However, most predictive models examining risk in smokers use trial data containing information that may be challenging and expensive to collect. This study uses both trial and electronic health record data to investigate further possible risk factors and to determine the significance of established risk factors in a lung cancer risk model for smokers.

Approach

Data on current and former smokers from deprived practices in Scotland were obtained from the Early Cancer of the Lung Scotland (ECLS) trial (N = 12,139). These data were linked with the same participants' administrative electronic health records. Because the number of LC cases was small, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. Stepwise logistic regression was used to obtain measures of effect and select predictors. The model was then validated on data from the UK Biobank (N = 75,710). The same techniques were used to balance the Biobank dataset.
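To illustrate the balancing step, the following is a minimal sketch of the core idea behind SMOTE: each synthetic minority case is created by interpolating between a real minority case and one of its nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the two-feature toy rows (age, pack-years) are assumptions made for illustration only; an established library such as imbalanced-learn would normally be used in practice.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples built from the minority-class rows.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority rows (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy rows (age, pack-years); not data from ECLS or UK Biobank.
lc_cases = [[60.0, 30.0], [65.0, 40.0], [70.0, 35.0], [72.0, 50.0]]
extra_cases = smote(lc_cases, n_new=6, k=2, seed=1)
print(len(extra_cases))
```

Because every synthetic row is a weighted average of two real minority rows, it always falls within the range of the observed minority values for each feature.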

Findings

The model performed well in the ECLS dataset, with an average AUC of 0.91, a sensitivity of 81.2% and a specificity of 87.8%. Demographic predictors (e.g. age, smoking status, gender) and clinical predictors (e.g. hospitalised for heart disease, hospitalised for COPD) were included in the final model. On validation in the UK Biobank dataset, the model also performed well, although the AUC decreased to 0.84, with a sensitivity of 87.6% and a specificity of 67.5%.
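For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity and AUC can be computed from true labels and predicted risks. The function names, the 0.5 decision threshold and the toy labels and scores are assumptions for illustration; they are not the study's data, and the abstract does not state which threshold was used.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity (true-positive rate) and specificity (true-negative rate)."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, y_score):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_score):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = lung cancer case, 0 = no lung cancer.
labels = [1, 1, 1, 0, 0, 0, 0]
risks = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
sens, spec = sensitivity_specificity(labels, risks)
print(sens, spec, auc(labels, risks))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarises ranking performance across all thresholds, which is why a model can lose specificity on a new dataset while its AUC falls only moderately.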

Consequences

This study found that variables extracted from health records contributed significantly to risk model performance and accuracy. As such, the study was able to identify new risk factors for inclusion in LC risk models: whether a participant had stayed in hospital over the study period, and whether they had been hospitalised for heart disease or COPD. The risk model presented here could help guide decision-making on current lung cancer screening criteria.

Submitted by: 
Lamorna Brown
Funding acknowledgement: 
This research was funded by The Melville Trust for the Care and Cure of Cancer.