Impact of measurement error and sample size on the performance of multivariable risk prediction models: a simulation study
Problem
Risk prediction models, developed to estimate the probability that an individual will develop a particular outcome, are frequently published, yet few are adequately validated, and as a result many prediction models are never used in practice. Data are often measured with some degree of error, and this error can influence the performance of a prediction model. It is not known how random or systematic error in a particular covariate affects model performance, how this impact depends on the covariate's strength, or at what sample size the measurement error becomes negligible. This simulation study investigates the impact of measurement error, and its relationship with sample size and covariate strength, on calibration (i.e. how close observed and predicted probabilities are, quantified by the calibration slope and Brier score), discrimination (i.e. how well the model differentiates between individuals with and without the outcome, quantified by the c-index and D statistic), and explained variation (e.g. R2).
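For concreteness, the sketch below shows how the binary-outcome versions of these performance measures could be computed. It is an illustrative Python implementation rather than the study code, and it assumes arrays y (observed 0/1 outcomes) and p (predicted probabilities); the D statistic and the survival-specific analogues of these measures are omitted for brevity.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def performance_measures(y, p, eps=1e-12):
    """Illustrative calibration and discrimination measures for a binary outcome."""
    p = np.clip(p, eps, 1 - eps)
    lp = np.log(p / (1 - p))  # linear predictor (logit of the predicted probability)

    # Calibration slope: slope from regressing the outcome on the linear predictor;
    # a well-calibrated model has a slope close to 1.
    slope = sm.Logit(y, sm.add_constant(lp)).fit(disp=0).params[1]

    # Brier score: mean squared difference between outcome and predicted probability.
    brier = np.mean((y - p) ** 2)

    # c-index: for a binary outcome this equals the area under the ROC curve.
    c_index = roc_auc_score(y, p)

    return {"calibration_slope": slope, "brier": brier, "c_index": c_index}
```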
Approach
We performed a case study evaluating the performance of a well-validated and widely used survival-based prediction model at the external validation stage, and a general simulation evaluating the performance of logistic prediction models in a factorial experiment. QKidney® is a pair of sex-specific prediction scores estimating the 5-year risk of moderate-to-severe chronic kidney disease (CKD) and the 5-year risk of developing end-stage kidney failure (ESKF). Serum creatinine is prone to measurement error. Focusing on this covariate, combinations of sample size and amount of random noise were pre-specified. The following process was repeated 1000 times: 1) for each sample size, a subsample was randomly selected from the entire dataset such that the proportion of events matched the original dataset; 2) normal random noise was added to creatinine; 3) model performance measures were calculated for each subsample. Simulation performance was evaluated using bias, percentage bias, standardised bias, mean squared error, coverage, and the average width of the 95% confidence interval. The factorial experiment explores a wider range of scenarios: its parameters define the type of development and external validation datasets (e.g. type of covariates, events per variable, prevalence of events) and the joint measurement error distribution of the two datasets. The simulation process and outputs are similar to those described above.
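As a schematic of this repeated-subsampling procedure, the Python sketch below assumes a pandas DataFrame df with an event indicator and a creatinine column; predict_risk is a hypothetical stand-in for applying the published model, and performance_measures is the helper sketched above. The summarise function mirrors the listed simulation performance criteria, assuming each replicate yields an estimate with a 95% confidence interval.

```python
import numpy as np

rng = np.random.default_rng(2024)

def stratified_subsample(df, n, outcome="event"):
    """Draw a subsample of size n that preserves the original event proportion."""
    events, non_events = df[df[outcome] == 1], df[df[outcome] == 0]
    n_events = round(n * len(events) / len(df))
    idx = np.concatenate([
        rng.choice(events.index, n_events, replace=False),
        rng.choice(non_events.index, n - n_events, replace=False),
    ])
    return df.loc[idx]

def run_scenario(df, n, noise_sd, n_reps=1000):
    """Repeat subsampling, noise addition, and performance calculation n_reps times."""
    results = []
    for _ in range(n_reps):
        sub = stratified_subsample(df, n).copy()            # step 1: subsample
        sub["creatinine"] += rng.normal(0.0, noise_sd, len(sub))  # step 2: add noise
        # step 3: recompute model performance on the perturbed subsample
        results.append(performance_measures(sub["event"].to_numpy(), predict_risk(sub)))
    return results

def summarise(estimates, ci_lower, ci_upper, truth):
    """Monte Carlo summaries: bias, percentage/standardised bias, MSE, coverage, CI width."""
    est = np.asarray(estimates)
    lo, hi = np.asarray(ci_lower), np.asarray(ci_upper)
    bias = est.mean() - truth
    return {
        "bias": bias,
        "pct_bias": 100 * bias / truth,
        "std_bias": bias / est.std(ddof=1),
        "mse": np.mean((est - truth) ** 2),
        "coverage": np.mean((lo <= truth) & (truth <= hi)),
        "avg_ci_width": np.mean(hi - lo),
    }
```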
Findings
In the case study, no clear impact of the amount of noise was observed for any of the model performance measures examined, regardless of sample size. We expect the general simulation to show that this impact increases with the strength of the covariate measured with error.
Consequences
Further investigation considering different data structures, types of measurement error, and types of models is required.