Article title: Quantifying performance of a diagnostic test as the expected information for discrimination: Relation to the C-statistic
A key problem in medical research is to diagnose disease or predict outcome. With the development of platforms that can assay hundreds of biomarkers, a new era of “precision medicine” is promised, in which prevention and treatment of disease will be based on tests that can stratify people by risk and predict which therapy will work best in each individual. As research on diagnostic tests has grown, it has become clearer that widely-used statistical methods for evaluating the performance of these tests have serious limitations.
What’s wrong with the C-statistic?
To quantify the performance of a diagnostic test, researchers since the 1980s have used the C-statistic or AUROC (Area Under the Receiver Operating Characteristic curve), defined as the probability that, for a randomly chosen pair of individuals with and without disease, the test assigns the higher score to the individual with disease. This measures classifier performance on a scale from 0.5 (no better than chance) to 1 (perfect discrimination).
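The pairwise definition can be made concrete with a minimal sketch. The function name `c_statistic`, the example scores, and the convention of counting ties as half a pair are assumptions made for illustration, not taken from the paper.

```python
from itertools import product

def c_statistic(case_scores, control_scores):
    """Fraction of case/control pairs in which the case has the higher score.

    Ties are counted as half a correctly ranked pair (an assumed convention).
    """
    pairs = list(product(case_scores, control_scores))
    wins = sum(
        1.0 if case > control else 0.5 if case == control else 0.0
        for case, control in pairs
    )
    return wins / len(pairs)

# Hypothetical test scores, for illustration only
cases = [0.9, 0.7, 0.6]
controls = [0.8, 0.4, 0.3, 0.2]
c = c_statistic(cases, controls)
print(f"C-statistic: {c:.3f}")
```

Here 10 of the 12 possible case/control pairs are ranked correctly, so the C-statistic is 10/12.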
One problem with the C-statistic is that it does not tell us how the diagnostic test would perform when used to stratify people by risk. For instance, to evaluate a test for bowel cancer, we might want to estimate what proportion of cancers would be missed, and what proportion of people without cancer would undergo colonoscopy, if the policy were that everyone with risk above some given threshold should be referred for colonoscopy.
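The two quantities in the bowel cancer example can be computed directly once predicted risks and true disease status are available. The following is a rough sketch; the function name `referral_summary`, the risks, and the 5% threshold are all hypothetical.

```python
def referral_summary(risks, has_disease, threshold):
    """Return (proportion of cases missed, proportion of non-cases referred).

    A person is referred when their predicted risk exceeds the threshold.
    """
    case_risks = [r for r, d in zip(risks, has_disease) if d]
    noncase_risks = [r for r, d in zip(risks, has_disease) if not d]
    missed = sum(r <= threshold for r in case_risks) / len(case_risks)
    referred = sum(r > threshold for r in noncase_risks) / len(noncase_risks)
    return missed, referred

# Hypothetical predicted risks and disease status, for illustration only
risks = [0.02, 0.10, 0.30, 0.01, 0.03, 0.08, 0.12, 0.04]
has_disease = [False, True, True, False, True, False, False, False]

missed, referred = referral_summary(risks, has_disease, threshold=0.05)
print(f"Cases missed: {missed:.2f}; non-cases referred: {referred:.2f}")
```

In this toy dataset one of three cases falls below the threshold and two of five non-cases are referred. Two tests with the same C-statistic can give different answers here, which is why the C-statistic alone does not settle the question.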
A more subtle problem is that the C-statistic does not quantify the incremental contribution of a biomarker to prediction: the increment in C-statistic when a biomarker is added to a predictive model based on clinical variables depends on whether these clinical variables differ between people with and without disease. The authors of the widely-quoted TRIPOD (Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) guidelines noted in 2015 that
“Identifying suitable measures for quantifying the incremental value of adding a predictor to an existing prediction model remains an active research area”
Proposed alternative: the expected information for discrimination (average weight of evidence)
In this paper published in Statistical Methods in Medical Research, Paul McKeigue of the University of Edinburgh proposes an alternative approach based on estimating how the weight of evidence favoring presence over absence of disease is distributed in samples of people with and without disease. If we know how the weight of evidence is distributed, we can fully characterize how the predictor will perform for classification and risk stratification.
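For readers unfamiliar with the term, the weight of evidence favoring disease is the logarithm of the ratio of posterior odds to prior odds (equivalently, the log likelihood ratio). A minimal sketch in bits follows; the function name and the example probabilities are assumptions chosen for illustration.

```python
import math

def weight_of_evidence_bits(posterior, prior):
    """Log (base 2) of the ratio of posterior odds to prior odds of disease."""
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = prior / (1.0 - prior)
    return math.log2(posterior_odds / prior_odds)

# Hypothetical example: a test result moves the probability of disease
# from a prior of 0.2 to a posterior of 0.5.
w = weight_of_evidence_bits(posterior=0.5, prior=0.2)
print(f"Weight of evidence: {w:.2f} bits")
```

The odds move from 1:4 to 1:1, a fourfold increase, which corresponds to 2 bits. A result that lowers the probability of disease gives a negative weight of evidence.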
The background to this work lies in unpublished studies of Alan Turing, who in 1941 at Bletchley Park investigated the distribution of weights of evidence to decide the best strategy for breaking the naval ENIGMA code. Turing discovered some key properties of this distribution, which were extended in 1968 by his former assistant Jack Good, by then one of the most influential Bayesian statisticians of the 20th century.
McKeigue shows that Turing’s results can be applied to quantify predictive performance and to estimate how a diagnostic test will behave as a risk stratifier. For a single summary measure of predictive performance, McKeigue proposes that the C-statistic should be replaced by the average weight of evidence (Greek letter Lambda), alternatively known as the expected information for discrimination, measured in bits. A key advantage of measuring predictive performance on this scale is that the contributions of independent variables are additive: if clinical variables contribute 2 bits and a biomarker independent of the clinical variables contributes 1 bit, the combined model will have an expected information for discrimination of 3 bits. In this framework, it is clear why the C-statistic is difficult to interpret: it maps the expected weight of evidence, which takes values from 0 to infinity, on to a scale from 0.5 to 1.
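The mapping from Λ to the C-statistic can be sketched numerically under one stated assumption: that the weights of evidence are Gaussian with mean +Λ in cases, mean −Λ in non-cases, and variance 2Λ in both groups (in natural log units). Under that assumption C = Φ(√Λ), where Φ is the standard normal distribution function. The function name below is hypothetical, and the formula applies only when the Gaussian assumption holds.

```python
import math

def c_from_lambda_bits(lambda_bits):
    """Map Λ (in bits) to the C-statistic, assuming Gaussian weights of evidence."""
    lambda_nats = lambda_bits * math.log(2)  # convert bits to natural log units
    # Standard normal CDF evaluated at sqrt(Λ), written with the error function
    return 0.5 * (1.0 + math.erf(math.sqrt(lambda_nats) / math.sqrt(2)))

# Contributions of independent predictors add on the Λ scale (values from the text)
clinical_bits = 2.0
biomarker_bits = 1.0
combined_bits = clinical_bits + biomarker_bits

for bits in (0.0, 1.0, clinical_bits, combined_bits):
    print(f"Lambda = {bits:.0f} bits -> C = {c_from_lambda_bits(bits):.3f}")
```

On the Λ scale the biomarker always contributes 1 bit. On the C scale, the same 1 bit produces a large increment when added to an uninformative model and a much smaller one when added to a model that already carries 2 bits, because the mapping compresses an unbounded scale into the interval from 0.5 to 1.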
McKeigue shows how to estimate the distributions of weights of evidence so that they are mathematically consistent, and how to use these distributions to calculate how the predictor can be used to stratify people by risk.
McKeigue proposes that the distribution of weights of evidence, and specifically the average weight of evidence, should be used by researchers to report their results on diagnostic tests or clinical prediction, and by regulatory agencies that have to decide whether to license these tests for use in the health service, either on their own or as a companion diagnostic to a drug. As the C-statistic has been the most widely accepted method for evaluating diagnostic tests, a change in practice is likely to take some time.
Although the C-statistic is widely used for evaluating the performance of diagnostic tests, its limitations for evaluating the predictive performance of biomarker panels have been widely discussed. The increment in C obtained by adding a new biomarker to a predictive model has no direct interpretation, and the relevance of the C-statistic to risk stratification is not obvious.
This paper proposes that the C-statistic should be replaced by the expected information for discriminating between cases and non-cases (expected weight of evidence, denoted as Λ), and that the strength of evidence favouring one model over another should be evaluated by cross-validation as the difference in test log-likelihoods. Contributions of independent variables to predictive performance are additive on the scale of Λ. Where the effective number of independent predictors is large, the value of Λ is sufficient to characterize fully how the predictor will stratify risk in a population with given prior probability of disease, and the C-statistic can be interpreted as a mapping of Λ to the interval from 0.5 to 1. Even where this asymptotic relationship does not hold, there is a one-to-one mapping between the distributions in cases and non-cases of the weight of evidence favouring case over non-case status, and the quantiles of these distributions can be used to calculate how the predictor will stratify risk. This proposed approach to reporting predictive performance is demonstrated by analysis of a dataset on the contribution of microbiome profile to diagnosis of colorectal cancer.
First Published 6 Jul 2018.
Statistical Methods in Medical Research