The good, the bad and the ugly – what do we really do when we identify the best and the worst general practices?
Problem
A number of platforms and dashboards exist for examining quality indicators in general practice. Indicators are used for a number of purposes, from pay-for-performance to quality monitoring and improvement, and reflective practice. Sometimes practices are classified according to simple ranks. However, the role of chance is often recognised, and methodologies such as funnel plots or z-scores are used to identify the best and worst practices. Such approaches acknowledge that small practices inherently display more variability than larger ones. A further refinement, accounting for overdispersion, is sometimes applied (for example in the methodology employed by the CQC), and such approaches are generally thought to be the gold standard technique for identifying the best and worst performers. Here we examine the performance of the overdispersed z-score methodology at different levels of reliability.
Approach
To understand the influence of chance on a quality indicator, we have to know the performance of a general practice in the absence of chance. Since chance is ever present, a general practice’s underlying performance is unknowable, and so we used a simulation approach based on 100,000 units (representing general practices). We assumed a normally distributed underlying unit performance (i.e. the performance that would be measured without noise, or with a very large unit sample size). Nine simulations were performed in which noise was added to each unit, representing different sample sizes, such that the reliability of practice scores varied between 0.1 and 0.9 in steps of 0.1. We examined the distribution of underlying scores among units flagged as outliers using an overdispersed z-score methodology.
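The simulation design described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code, and it rests on several assumptions the abstract does not state: reliability is taken to be the ratio of true-score variance to total variance, every unit shares the same noise level within a simulation, overdispersion is handled with an additive between-unit variance term, and units are flagged when the absolute z-score exceeds 1.96. The function name `simulate` and its parameters are hypothetical.

```python
import random


def simulate(reliability, n_units=100_000, z_threshold=1.96, seed=0):
    """Simulate one reliability level and summarise the true scores of flagged units."""
    rng = random.Random(seed)

    # Underlying (noise-free) performance: standard normal, so values are in SD units
    # and the overall mean of the underlying scores is taken to be 0.
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_units)]

    # Assumed definition: reliability = var_true / (var_true + var_noise), var_true = 1.
    noise_var = (1.0 - reliability) / reliability
    noise_sd = noise_var ** 0.5
    observed = [t + rng.gauss(0.0, noise_sd) for t in true_scores]

    mean_obs = sum(observed) / n_units
    var_obs = sum((y - mean_obs) ** 2 for y in observed) / n_units

    # Assumed additive overdispersion: the between-unit variance is the observed
    # variance in excess of the sampling (noise) variance, floored at zero.
    # With a common noise level this reduces to standardising by the total SD.
    tau2 = max(0.0, var_obs - noise_var)
    denom = (noise_var + tau2) ** 0.5

    # Keep the true scores of units whose overdispersed z-score is extreme.
    flagged_true = [
        t for t, y in zip(true_scores, observed)
        if abs(y - mean_obs) / denom > z_threshold
    ]
    n_flagged = len(flagged_true)
    within_1sd = sum(abs(t) < 1.0 for t in flagged_true)
    beyond_196sd = sum(abs(t) > 1.96 for t in flagged_true)
    return {
        "reliability": reliability,
        "n_flagged": n_flagged,
        "share_within_1sd": within_1sd / n_flagged if n_flagged else float("nan"),
        "share_beyond_1_96sd": beyond_196sd / n_flagged if n_flagged else float("nan"),
    }


if __name__ == "__main__":
    # Nine simulations: reliability from 0.1 to 0.9 in steps of 0.1.
    for step in range(1, 10):
        result = simulate(step / 10)
        print(
            f"reliability={result['reliability']:.1f}  "
            f"flagged={result['n_flagged']}  "
            f"within 1 SD={result['share_within_1sd']:.2f}  "
            f"beyond 1.96 SD={result['share_beyond_1_96sd']:.2f}"
        )
```

Because the threshold, the overdispersion estimator and the noise model here are all assumptions, the shares this sketch prints need not match the figures reported in the abstract; it is intended only to show the structure of the comparison between flagged units and their underlying scores.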
Findings
When reliability was low, most practices flagged as outliers had an underlying performance in the core of the distribution (e.g. for reliability = 0.2, 62% of flagged units were within 1 SD of the overall mean). As reliability increased, more of the practices flagged as outliers were genuinely extreme. However, reliability had to be at least 0.7 for the majority of practices flagged as outliers to have an underlying performance more than 1.96 SD from the overall mean.
Consequences
Quality indicators are often used with little regard to reliability. It is frequently assumed that use of a funnel plot or z-score methodology adequately accounts for chance variation. Our simulations show that this is not the case when reliability is low: while overdispersed z-scores provide a good description of the data, they do not prevent false detection. These findings support the view that the reliability of quality indicators should be profiled as a matter of course. Without this information, interpreting a general practice’s performance on a quality indicator is challenging.