Can machine-learning algorithms enhance cardiovascular risk prediction using routinely collected primary care data?
Current standard cardiovascular disease (CVD) risk assessment tools used in clinical practice are derived from prospective cohort or primary care database studies. These models assume that each risk factor is related to CVD in a linear, additive way, which over-simplifies complex interactions. Machine-learning offers advantages, including the ability to detect non-linear relationships, to identify possible interactions between risk factors, and to draw on multiple training algorithms. The aim of this study was to evaluate whether machine-learning can improve prediction of CVD using a large primary care database, the Clinical Practice Research Datalink (CPRD).
The study population consisted of patients aged 30 to 84 years with complete data for eight core variables (gender, age, smoking status, systolic blood pressure, blood pressure treatment, total cholesterol, HDL cholesterol, and diabetes) at baseline (1 Jan 2005). An additional 22 variables potentially associated with CVD were included in the analysis. The outcome was the first CVD event during the 10-year follow-up period. The study population was split into a ‘training’ cohort (70% random sample) and a ‘validation’ cohort (30% random sample). Four machine-learning algorithms were trained: logistic regression, random forest, gradient boosting machines, and neural networks. Algorithm performance was assessed in the validation cohort by the area under the receiver operating characteristic curve (AUC). Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were reported at the 7.5% risk threshold specified by the ACC/AHA guidelines.
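The evaluation measures named above can be illustrated with a minimal Python sketch. It is not the study's code: the function names `auc` and `threshold_metrics`, the toy inputs, and the choice to count a predicted risk equal to the threshold as high risk are all assumptions made for illustration. AUC is computed here in its rank-based (Mann-Whitney) form, which is one standard way to obtain it.

```python
from typing import Dict, Sequence


def auc(y_true: Sequence[int], risk: Sequence[float]) -> float:
    """Rank-based AUC: the probability that a randomly chosen case has a
    higher predicted risk than a randomly chosen non-case (ties count 0.5)."""
    cases = [r for y, r in zip(y_true, risk) if y == 1]
    non_cases = [r for y, r in zip(y_true, risk) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


def threshold_metrics(
    y_true: Sequence[int], risk: Sequence[float], threshold: float = 0.075
) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV when everyone with predicted
    risk >= threshold is classified as high risk (assumed convention)."""
    tp = fp = tn = fn = 0
    for y, r in zip(y_true, risk):
        predicted_high = r >= threshold
        if y == 1 and predicted_high:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_high:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Toy example (made-up values, not study data): 2 cases, 3 non-cases.
outcomes = [1, 1, 0, 0, 0]
predicted_risk = [0.20, 0.05, 0.10, 0.02, 0.01]
toy_auc = auc(outcomes, predicted_risk)
toy_metrics = threshold_metrics(outcomes, predicted_risk)
```

The pairwise loop in `auc` is quadratic in cohort size, so for a cohort of the size used here a sorted-rank implementation or a library routine would be the practical choice; the sketch favours readability.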
From a total cohort of 378,256 patients who were free from CVD at baseline, there were 24,970 incident cases (6.6%) of CVD during the 10-year follow-up period. Machine-learning algorithms resulted in significant improvements in discrimination compared to the baseline ACC/AHA model (AUC 0.728, 95% CI 0.723 – 0.735): random forest model +1.7% (AUC 0.745, 95% CI 0.739 – 0.750), logistic regression +3.3% (AUC 0.761, 95% CI 0.755 – 0.766), gradient boosting machines +3.2% (AUC 0.760, 95% CI 0.755 – 0.766), neural networks +3.6% (AUC 0.764, 95% CI 0.759 – 0.769). The ACC/AHA model correctly predicted 4,643 of the 7,404 cases in the validation cohort (sensitivity 62.7%, PPV 17.1%). Random forest resulted in a net increase of 191 cases correctly predicted (sensitivity 65.3%, PPV 17.8%). Logistic regression resulted in a net increase of 324 cases (sensitivity 67.1%, PPV 18.3%). Gradient boosting machines and neural networks resulted in a net increase of 355 (sensitivity 67.5%, PPV 18.4%) and 354 (sensitivity 67.5%, PPV 18.4%) cases correctly predicted, respectively. The ACC/AHA model correctly predicted 53,106 of the 75,585 non-cases (specificity 70.3%, NPV 95.1%). The net increase in non-cases correctly predicted ranged from 191 non-cases for the random forest algorithm (specificity 70.5%, NPV 95.4%) to 355 non-cases for gradient boosting machines (specificity 70.7%, NPV 95.7%).
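The percentages reported for the ACC/AHA model, and the sensitivities of the four algorithms, follow directly from the stated counts. The short Python sketch below reproduces that arithmetic; the variable names are illustrative, and the false-negative and false-positive counts are derived by subtraction rather than taken from the text.

```python
# Counts stated in the results for the validation cohort (ACC/AHA model).
cases, non_cases = 7_404, 75_585
true_pos, true_neg = 4_643, 53_106

# Derived by subtraction (not stated explicitly in the text).
false_neg = cases - true_pos       # cases missed at the 7.5% threshold
false_pos = non_cases - true_neg   # non-cases flagged as high risk


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * numerator / denominator, 1)


acc_aha = {
    "sensitivity": pct(true_pos, cases),
    "specificity": pct(true_neg, non_cases),
    "ppv": pct(true_pos, true_pos + false_pos),
    "npv": pct(true_neg, true_neg + false_neg),
}

# Net increases in correctly predicted cases, as reported for each algorithm.
net_increase = {
    "random forest": 191,
    "logistic regression": 324,
    "gradient boosting machines": 355,
    "neural networks": 354,
}
sensitivity = {
    name: pct(true_pos + extra, cases) for name, extra in net_increase.items()
}

# Incidence in the full cohort: 24,970 cases among 378,256 patients.
incidence = pct(24_970, 378_256)
```

Running this gives 62.7%, 70.3%, 17.1%, and 95.1% for the ACC/AHA model and sensitivities of 65.3%, 67.1%, 67.5%, and 67.5% for the four algorithms, matching the figures above.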
In the era of ‘big data’, we have demonstrated that machine-learning algorithms can make fuller use of the information collected in routine primary care practice. Improving the prediction accuracy of standard CVD risk prediction tools has clear clinical implications: more individuals who would benefit from preventive treatment are correctly identified, while fewer individuals are given treatment unnecessarily.