Impact of measurement error and sample size on the performance of multivariable risk prediction models at external validation
Risk prediction models, developed to estimate the probability of an individual developing a particular outcome, are frequently published. Few are adequately validated, with the result that a large number of prediction models are not used in practice. External validation entails an assessment of the performance of the model in an independent dataset, and usually assumes that the model was developed free of measurement error. The impact of random or systematic measurement error on model performance at external validation is unknown, as is the sample size at which such error could become negligible. This simulation study investigates the impact of measurement error, and its relationship to sample size, on calibration (i.e. how close observed and predicted probabilities are, quantified by the calibration slope and Brier score), discrimination (i.e. how well the model differentiates between individuals with and without the outcome, quantified by the c-index and D statistic), and explained variation (e.g. R²).
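As an illustration of two of the performance measures named above, the following minimal sketch computes the Brier score and the c-index. It assumes a binary outcome for simplicity (QRISK predicts a time-to-event outcome, for which a censoring-aware c-index would be needed), and the function names and toy data are hypothetical rather than taken from the study.

```python
from itertools import product


def brier_score(outcomes, risks):
    """Mean squared difference between observed outcomes (0/1) and predicted risks."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, risks)) / len(outcomes)


def c_index(outcomes, risks):
    """Proportion of event/non-event pairs in which the individual with the
    event has the higher predicted risk; tied predictions count as 0.5.
    Binary-outcome version only (an assumption made for this sketch)."""
    events = [p for y, p in zip(outcomes, risks) if y == 1]
    non_events = [p for y, p in zip(outcomes, risks) if y == 0]
    concordant = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            concordant += 1.0
        elif p_event == p_non_event:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Toy data (hypothetical): four individuals, two of whom had the outcome.
toy_outcomes = [0, 0, 1, 1]
toy_risks = [0.1, 0.4, 0.35, 0.8]

toy_brier = brier_score(toy_outcomes, toy_risks)
toy_c = c_index(toy_outcomes, toy_risks)
```

A lower Brier score indicates predictions closer to the observed outcomes, while a c-index of 0.5 corresponds to no discrimination and 1 to perfect discrimination.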
QRISK is a pair of sex-specific prediction models for calculating the 10-year risk of developing cardiovascular disease, and it has been independently externally validated. QRISK includes body mass index (BMI) as a predictor, which may contain measurement error because its two components, height and weight, could be measured inaccurately. This simulation study used the 2011 version of QRISK and one of the multiply imputed datasets used for its external validation. Combinations of varying sample size and amount of random noise (used to induce measurement error) were pre-specified. The following process was then repeated 1000 times: 1) for each sample size, a subsample was selected from the entire dataset such that the proportion of events was the same as in the original dataset; 2) normally distributed random noise was added to BMI, while ensuring that values remained between 20 and 40 kg/m² (required to calculate QRISK). The following summary statistics were calculated over the 1000 simulations: bias, percentage bias, standardised bias, mean squared error, coverage, and average width of the 95% confidence interval.
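One replicate of the process described above, together with the summary statistics, might be sketched as follows. This is an illustration under stated assumptions, not the study's code: the record layout, function names and synthetic data are hypothetical; BMI is kept within 20 to 40 kg/m² by clipping, although the study may have used another method; the reference value passed as `truth` is assumed to be the performance measure obtained without added noise; and standardised bias is taken to be bias as a percentage of the empirical standard deviation of the estimates, which is one common definition.

```python
import random
import statistics


def stratified_subsample(records, n, rng):
    """Draw n records, keeping the event proportion of the full data (to rounding)."""
    events = [r for r in records if r["event"] == 1]
    non_events = [r for r in records if r["event"] == 0]
    n_events = round(n * len(events) / len(records))
    return rng.sample(events, n_events) + rng.sample(non_events, n - n_events)


def add_bmi_noise(records, sd, rng, low=20.0, high=40.0):
    """Add N(0, sd^2) noise to BMI, then clip to [low, high].
    Clipping is an assumption; the input records are left unmodified."""
    return [
        dict(r, bmi=min(high, max(low, r["bmi"] + rng.gauss(0.0, sd))))
        for r in records
    ]


def summarise(estimates, std_errors, truth):
    """Summary statistics over simulation replicates of one performance measure."""
    bias = statistics.fmean(estimates) - truth
    empirical_sd = statistics.stdev(estimates)
    covered = [
        1.0 if e - 1.96 * s <= truth <= e + 1.96 * s else 0.0
        for e, s in zip(estimates, std_errors)
    ]
    return {
        "bias": bias,
        "percentage_bias": 100.0 * bias / truth,
        "standardised_bias": 100.0 * bias / empirical_sd,
        "mse": statistics.fmean([(e - truth) ** 2 for e in estimates]),
        "coverage": statistics.fmean(covered),
        "mean_ci_width": statistics.fmean([2 * 1.96 * s for s in std_errors]),
    }


# Synthetic data (hypothetical): 2000 individuals with a 5% event proportion.
rng = random.Random(2011)
population = [
    {"bmi": rng.uniform(18.0, 45.0), "event": 1 if i % 20 == 0 else 0}
    for i in range(2000)
]

subsample = stratified_subsample(population, 200, rng)
noisy = add_bmi_noise(subsample, sd=2.0, rng=rng)

# Toy estimates and standard errors from three replicates (made-up numbers).
summary = summarise([0.9, 1.0, 1.1], [0.1, 0.1, 0.1], truth=1.0)
```

In a full simulation, the two sampling steps would be repeated for each pre-specified combination of sample size and noise level, with the performance measures recalculated on each noisy subsample before summarising.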
Data on 1,066,127 women and 1,018,318 men registered with a UK general practice between 1 January 1993 and 20 June 2008 were used in the analysis. Median (IQR) age of participants was 47 (37-60) years for women and 45 (36-57) years for men. There were 42,224 and 51,340 incident cardiovascular cases in women and men respectively. We observed no clear impact of the amount of noise for any of the model performance measures examined, regardless of sample size.
Further investigation is required. It is possible that bias will increase as the strength of the predictor affected by error increases. The impact of error at the development stage on model performance must also be studied.