**Article title: Quantifying performance of a diagnostic test as the expected information for discrimination: Relation to the C-statistic**

**From Statistical Methods in Medical Research**

We are entering a new era of “precision medicine” where prevention and treatment are based on tests that can stratify people by risk and predict which therapy will work best for each individual. However, as research on diagnostic tests has grown, it has become clearer that the current, widely used statistical methods for evaluating the performance of these tests have serious limitations.

Since the 1980s the C-statistic, or AUROC (area under the receiver operating characteristic curve), has been used to quantify the performance of a diagnostic test. It is defined as the probability that a randomly chosen pair of individuals, one with and one without disease, will be correctly classified. However, the C-statistic does not tell us how the diagnostic test would perform when used to stratify people by risk. For instance, to evaluate a test for bowel cancer we might want to estimate what proportion of cancers would be missed and what proportion of people without cancer would undergo colonoscopy if the policy was that everyone with risk above some given threshold should be referred for colonoscopy. Another difficulty is that the C-statistic does not quantify the incremental contribution of a biomarker to an existing diagnostic test.
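To make the definition concrete, the following Python sketch computes the C-statistic by comparing every case with every non-case. It also reports the two quantities that the C-statistic alone does not give: the proportion of cases missed and the proportion of non-cases referred at a chosen threshold. The function names, scores and threshold are hypothetical illustrations and are not taken from the paper.

```python
def c_statistic(case_scores, control_scores):
    """Fraction of case/non-case pairs in which the case has the higher score.

    Tied pairs count as half, the usual convention for the AUROC.
    """
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))


def referral_rates(case_scores, control_scores, threshold):
    """Proportion of cases missed and of non-cases referred at a risk threshold."""
    missed = sum(s <= threshold for s in case_scores) / len(case_scores)
    referred = sum(s > threshold for s in control_scores) / len(control_scores)
    return missed, referred


# Made-up risk scores, for illustration only.
cases = [0.9, 0.3]
controls = [0.4, 0.2]
print(c_statistic(cases, controls))                     # 0.75
print(referral_rates(cases, controls, threshold=0.35))  # (0.5, 0.5)
```

Two tests with the same C-statistic can give quite different results from `referral_rates` at a given threshold, which is the limitation described above.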

In a paper published in *Statistical Methods in Medical Research*, I propose an alternative approach based on estimating the weight of evidence. The background to this work lies in unpublished studies of Alan Turing, who in 1941 at Bletchley Park investigated the distribution of weights of evidence to decide the best strategy for breaking the Enigma code. Turing discovered some key properties of this distribution, which were extended in 1968 by his former assistant Jack Good, by then one of the most influential Bayesian statisticians of the 20th century.

Through **my article**, I demonstrate how Turing’s results can be applied to quantify predictive performance and to estimate how a diagnostic test will behave as a risk identifier. I propose that the distribution of weights of evidence and specifically the average weight of evidence should be used both by researchers to report their results on diagnostic tests or clinical prediction and by regulatory agencies that have to decide whether to license these tests in the health service, either on their own or as a companion diagnostic to a drug.

By Paul McKeigue, Professor of Genetic Epidemiology and Statistical Genetics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh Medical School

## Abstract

Although the C-statistic is widely used for evaluating the performance of diagnostic tests, its limitations for evaluating the predictive performance of biomarker panels have been widely discussed. The increment in C obtained by adding a new biomarker to a predictive model has no direct interpretation, and the relevance of the C-statistic to risk stratification is not obvious. This paper proposes that the C-statistic should be replaced by the expected information for discriminating between cases and non-cases (expected weight of evidence, denoted as Λ), and that the strength of evidence favouring one model over another should be evaluated by cross-validation as the difference in test log-likelihoods. Contributions of independent variables to predictive performance are additive on the scale of Λ. Where the effective number of independent predictors is large, the value of Λ is sufficient to characterize fully how the predictor will stratify risk in a population with given prior probability of disease, and the C-statistic can be interpreted as a mapping of Λ to the interval from 0.5 to 1. Even where this asymptotic relationship does not hold, there is a one-to-one mapping between the distributions in cases and non-cases of the weight of evidence favouring case over non-case status, and the quantiles of these distributions can be used to calculate how the predictor will stratify risk. This proposed approach to reporting predictive performance is demonstrated by analysis of a dataset on the contribution of microbiome profile to diagnosis of colorectal cancer.
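The asymptotic relationship can be illustrated with a minimal Python sketch. It assumes, as an illustration rather than as a result stated in the abstract, that the weight of evidence W is normally distributed with mean Λ in cases and mean −Λ in non-cases, each with variance 2Λ, and that Λ is measured in natural log units. Under that assumption the C-statistic is Φ(√Λ), which runs from 0.5 to 1 as Λ grows. The function names and example numbers are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the cumulative distribution Phi


def c_from_lambda(lam_nats):
    """C-statistic implied by expected weight of evidence (nats), Gaussian assumption."""
    return _STD_NORMAL.cdf(sqrt(lam_nats))


def logit(p):
    return log(p / (1 - p))


def stratification(lam_nats, prior, risk_threshold):
    """Sensitivity and false positive rate when everyone whose posterior risk
    exceeds risk_threshold is referred (requires lam_nats > 0).

    Posterior log odds = prior log odds + W, so the risk threshold
    corresponds to a threshold on W.
    """
    w_threshold = logit(risk_threshold) - logit(prior)
    sd = sqrt(2 * lam_nats)  # assumed: variance of W is 2 * Lambda in both groups
    sensitivity = 1 - NormalDist(lam_nats, sd).cdf(w_threshold)
    false_positive_rate = 1 - NormalDist(-lam_nats, sd).cdf(w_threshold)
    return sensitivity, false_positive_rate


# Two independent predictors: contributions add on the scale of Lambda.
lam_total = 0.5 + 0.5
print(c_from_lambda(lam_total))

# Hypothetical policy: refer everyone with posterior risk above 10%
# in a population where the prior probability of disease is 5%.
print(stratification(lam_total, prior=0.05, risk_threshold=0.10))
```

The sketch shows why a single number can be enough in the asymptotic case: once Λ and the prior probability are fixed, both the C-statistic and the referral proportions at any risk threshold follow from it.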

**Read this article in full here**

**Article details**

Quantifying performance of a diagnostic test as the expected information for discrimination: Relation to the C-statistic

Paul McKeigue

First Published July 6, 2018

DOI: 10.1177/0962280218776989
