Picture this: you’re the head of research for a major biotechnology firm, and your team has just developed what you believe to be a superior blood test to detect early-stage ovarian cancer. I’m an FDA official, and before I allow your test to go on the market, I need evidence that it’s effective.
Me: “So, how well does your new blood test work to detect ovarian cancer?”
You: “What do you mean, how well does it work? It’s awesome! We’ve used it in clinical trials to detect hundreds of cases of ovarian cancer in patients!”
Me: “Very well, but that still doesn’t tell me how effective the test is for cancer screening. What if, for every case you detect, you miss three others? Let’s put it this way: if I’m a patient and I do have ovarian cancer and I take your blood test, what’s the probability that the test is positive?”
You: “Oh, so you’re talking about the test’s sensitivity….”
As a savvy entrepreneur but reluctant statistician, you turn to your colleague Dr. Ruth Etzioni of Fred Hutch’s Public Health Sciences Division for advice. Dr. Etzioni is a biostatistician whose research group strives to ‘get the numbers right’ when it comes to cancer control and prevention. A recent study from her group, led by former lab member Dr. Jane Lange and current graduate student Yibai Zhao, critically evaluates one common measure of screening test sensitivity—good thing you brought her along!
“The mathematics behind calculating the so-called true sensitivity of a screening test couldn’t be simpler,” Etzioni begins. “You simply divide the number of people who got a positive result by the total number of people with cancer in the screening population—a numerator over a denominator. The issue arises when you realize that, practically speaking, that second number is incredibly difficult to calculate.” Indeed, knowing the number of people with cancer in a population—the denominator of that fraction—is not just difficult but nearly impossible. Think about it this way: if we knew exactly who in the population had cancer, a screening program would be unnecessary! So, the true sensitivity of your ovarian cancer blood test is unknowable, forcing you to rely on a so-called proxy measure of sensitivity—an estimate of true sensitivity that is nonetheless measurable. One common proxy measure of true sensitivity is what Etzioni and her team term empirical sensitivity.
Empirical sensitivity has the same numerator as true sensitivity—the number of screen-detected cancers. To get the empirical sensitivity, divide this number by the number of screen-detected cancers plus the number of cancers detected in between screening episodes (for example, via a biopsy from a patient who reports cancer symptoms), which we’ll call interval cancers. To drive home the point that empirical sensitivity is only an estimate of true sensitivity, consider the following scenario:
Your company trials the new ovarian cancer blood test on a cohort of 1,000 patients, of whom 50 truly have ovarian cancer. Let’s say that screening this population yields 25 positive results that are confirmed by biopsy—however, in the 2-year interval between screening episodes, 15 more patients present with ovarian cancer symptoms and undergo biopsies that confirm ovarian cancer.
Knowledge check! In the previous scenario, what’s the 1.) true sensitivity, and 2.) empirical sensitivity?
…
If you guessed 1.) 25/50 = 50% and 2.) 25/40 = 62.5%, you’re correct! And if you’re really paying attention, you’ll likely notice that in this case, the empirical sensitivity overestimates the true sensitivity of your blood test—good for you when making your case to the FDA, but not so good for a patient deciding whether to use your blood test over others. So, now that we know empirical sensitivity is an estimate of the true sensitivity, how do we know how good an estimate it is? The scenario above suggests that the relationship between empirical and true sensitivity ought to depend both on characteristics of the screening strategy—for example, how often you choose to screen the population—and on characteristics of the cancer itself—for instance, how long a tumor remains asymptomatic yet detectable via screening (a duration sometimes called the sojourn time).
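If you’d rather let a computer do the arithmetic, here is a minimal sketch of both calculations in Python, using the made-up numbers from the scenario above (the variable names are ours, purely for illustration):

```python
# Hypothetical numbers from the scenario above (not real trial data)
screen_detected = 25    # cancers detected by the blood test and confirmed by biopsy
interval_cancers = 15   # cancers surfacing between screens (symptoms, then biopsy)
total_with_cancer = 50  # everyone in the cohort who truly has ovarian cancer

# True sensitivity: screen-detected cancers / all cancers in the screened population.
# In practice the denominator is unknowable, which is the whole problem.
true_sensitivity = screen_detected / total_with_cancer

# Empirical sensitivity: screen-detected cancers / (screen-detected + interval cancers).
# Every quantity here is observable, which is why it works as a proxy.
empirical_sensitivity = screen_detected / (screen_detected + interval_cancers)

print(f"True sensitivity:      {true_sensitivity:.1%}")       # 50.0%
print(f"Empirical sensitivity: {empirical_sensitivity:.1%}")  # 62.5%
```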
This is where Zhao and colleagues begin their recent study. “In essence,” notes Dr. Etzioni, “we sought to take the intuitive link between the accuracy of empirical sensitivity and variables like screening interval and sojourn time and quantify it using a mathematical model of cancer progression and screening.” By statistically modeling different sojourn times and screening intervals, the team found that empirical sensitivity is generally an overestimate when the sojourn time is long relative to the screening interval, and generally an underestimate when the opposite is true. Interestingly, the gap between empirical and true sensitivity appears largest when the true sensitivity is relatively low. They also use their model to calculate the optimal screening interval in relation to the sojourn time and test sensitivity, where, again, they find more variability in situations where the true sensitivity is low. Finally, they apply their model to estimate the true sensitivity of digital mammography in screening for breast cancer, using published data from the Breast Cancer Surveillance Consortium (BCSC). Using a classical estimate of breast cancer sojourn time, they find that the BCSC-reported empirical sensitivity of 0.86 is only a slight overestimate of a true sensitivity of 0.82; however, the team also notes that more contemporary sojourn time estimates correspond to much lower true sensitivities (as low as 0.58!).
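To get a feel for why sojourn time matters, without reproducing the team’s actual model, here is a rough back-of-the-envelope simulation in Python. The exponential sojourn times, the parameter values, and the function name are illustrative assumptions on our part, not details from the paper:

```python
import random

def simulate_empirical_sensitivity(true_sens, mean_sojourn, screen_interval,
                                   n_cancers=100_000, program_years=40, seed=0):
    """Toy model: each cancer has a preclinical (screen-detectable) phase of random
    length (exponential sojourn time). Screens occur on a fixed schedule, and each
    screen falling in the preclinical phase detects the cancer with probability
    `true_sens`. Cancers not detected before symptoms appear become interval cancers."""
    rng = random.Random(seed)
    screen_times = [k * screen_interval
                    for k in range(int(program_years / screen_interval) + 1)]
    screen_detected = interval = 0
    for _ in range(n_cancers):
        onset = rng.uniform(0, program_years * 0.75)   # start of preclinical phase
        sojourn = rng.expovariate(1 / mean_sojourn)    # length of preclinical phase
        clinical = onset + sojourn                     # symptoms appear here if never detected
        if clinical > program_years:                   # censored by end of program: skip
            continue
        detected = False
        for t in screen_times:
            if onset < t < clinical and rng.random() < true_sens:
                detected = True
                break
        if detected:
            screen_detected += 1
        else:
            interval += 1
    return screen_detected / (screen_detected + interval)

# ASSUMED illustrative parameters (not the paper's): per-screen sensitivity 0.6,
# annual screening, and either a long (4-year) or short (6-month) mean sojourn time.
for mean_sojourn in (4.0, 0.5):
    emp = simulate_empirical_sensitivity(true_sens=0.6, mean_sojourn=mean_sojourn,
                                         screen_interval=1.0)
    print(f"mean sojourn {mean_sojourn} yr: empirical ~ {emp:.2f} vs true 0.60")
```

With these made-up numbers, annual screening for a slow-growing cancer (4-year mean sojourn) pushes empirical sensitivity well above the per-screen sensitivity of 0.6, because a missed tumor gets several more chances to be caught at later screens. A fast-growing cancer (6-month mean sojourn) drags it well below, because many tumors surface as interval cancers that no screen ever had a chance to catch. That is the qualitative pattern the team describes.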
In all, this study from the Etzioni group digs into the mathematical nuance behind a test performance metric that many take for granted. What should we take away from their work? “Two things,” notes Etzioni. “First, that what we call these sensitivity metrics matters. What we are calling empirical sensitivity is only one of several different proxy sensitivity metrics—pushing for an agreed-upon nomenclature is important both for the people developing the tests and for those deciding to use them. And second, that it’s not enough to assume that our proxies are good estimates—mathematical due diligence is required to verify that our proxies are robust and to identify the variables that influence their accuracy.” For vulnerable patient populations deciding which screening tests to use, overworked regulatory agencies vetting new tests, profit-motivated biotech companies developing and rating their tests, and cautious insurers deciding which tests to cover, the numbers really do matter.
The spotlighted research was funded by the National Science Foundation.
Fred Hutch/University of Washington/Seattle Children’s Cancer Consortium member Dr. Ruth Etzioni contributed to this study.
Lange, J., Zhao, Y., Gogebakan, K. C., Olivas-Martinez, A., Ryser, M. D., Gard, C. C., & Etzioni, R. (2023). Test sensitivity in a prospective cancer screening program: A critique of a common proxy measure. Statistical Methods in Medical Research, 32(6), 1053–1063.