What's a cancer registry?

An illustration of people spread out in clumps atop a grid intersected with lines, as if connecting. — Illustration by Getty Images

When you’re married or have a baby, your name goes onto a gift registry. When you’re diagnosed with cancer — pretty much the opposite of a gift or a joyous event — it goes into a cancer registry.

A cancer registry transcribes the nuts and bolts of your diagnosis and treatment into regional, state and/or national databases to help researchers track cancer incidence in the U.S.

Although they are an invaluable tool for scientists, cancer registries are lists nobody wants to be on, not even the people who manage them.

“When I was first diagnosed, I realized I needed to tell my staff, so no one would be surprised,” said Dr. Stephen Schwartz, principal investigator of the Cancer Surveillance System, or CSS, a regional registry housed at Fred Hutchinson Cancer Research Center.

Dr. Steve Schwartz (left) and Dr. Christopher Li are co-principal investigators of the Cancer Surveillance System, a regional registry with data on all cancer patients diagnosed within 13 Puget Sound counties. The data provide valuable insights into trends, health disparities, and much more.

The CSS tracks cancer incidence in 13 Washington counties. Once a year, the names, birthdates and tumor details of diagnosed individuals get folded into the National Cancer Institute’s SEER program, which currently includes 16 catchment areas across the nation. Though something of a “seer” when it comes to spotting cancer trends, SEER is actually short for Surveillance, Epidemiology and End Results. SEER analyzes and shares key findings, then makes the data available to researchers.

Schwartz, a longtime Hutch investigator who was diagnosed with a type of brain cancer in 2016, said patient data are always de-identified before being released for research. But some people do see the names of newly diagnosed cancer patients as they come through: the certified CSS staff who collect and consolidate the pathology reports from local labs, and the administrators who validate the data with hospital tumor registrars and double-check the names and dates on electronic abstracts to make sure patients are matched correctly before they’re assigned an identifier. All of them sign confidentiality agreements.

“I remember seeing friends in there before I was diagnosed,” Schwartz said. “I told my colleagues [about my cancer] so they’d know. It felt good to not wonder whether someone would come across my record on their computer screen and feel awkward.”

A cancer registry holds sorrow and suffering, but these cloistered records also hold valuable secrets: It’s through them we spot cancer trends, like the recent uptick of colorectal cancer cases in people under age 50 as well as disparities in cancer treatment and outcomes.

Now, the science of surveillance is changing as cancer recordkeepers are linking to new sources of information and scientists at the Hutch and elsewhere are turning to artificial intelligence to delve deeper into the data to glean even more hard-won wisdom from the millions of cancer patients found within these registries.

A screen capture of a Zoom chat with Dr. Steve Schwartz and the Advocates for Collaborative Education, a cancer patient advocacy group. — Dr. Steve Schwartz met with two patient advocacy organizations via Zoom and answered questions on the Cancer Surveillance System. Here, he talks with members from Advocates for Collaborative Education, or ACE, a global cancer patient advocacy organization. He and Dr. Ruth Etzioni also spoke with patient advocates from GRASP, Guiding Researchers and Advocates to Scientific Partnerships. Screen capture courtesy of Stacey Tinianov

Many answers, but not all … yet

There are different types of registries, but ones like SEER, the country’s first cancer registry, are population-based. They hold data on all cancer patients living in a given area when diagnosed: name, demographics, place of birth; biological data on the tumor’s location, cell type(s), stage, and a few predictive biomarkers; treatment type and, finally, the patient’s outcome.

SEER records the person’s vital status (living, deceased or unknown) by tapping into death records from the National Center for Health Statistics. But the infrastructure does not capture metastatic progression or recurrence (when the cancer spreads beyond its original site and becomes incurable), a point of contention among patient advocates and researchers alike, all of whom recognize the value of this information.

“Cancer patients want a registry that’s comprehensive, reflective and representative of us,” said Stacey Tinianov, a 47-year-old Santa Clara, Calif., breast cancer patient advocate and co-founder of Advocates for Collaborative Education, or ACE. “Whether that’s from a racial or age or ZIP code standpoint, we need to see ourselves in that registry.”

Without metastatic recurrence data, the registry is neither complete nor reflective of the patient population, she said.

“Breast cancer is viewed as a huge success from a societal perspective,” she said. “But we still lose 116 people per day from metastatic breast cancer in this country alone. We lose many more globally. We’ve got a lot more progress to make if we can’t answer the simple question of how many people are living with metastatic disease. Knowledge is power and when we know what we’re up against, we can better direct resources.”

The NCI recognizes the value of those data, too, and has launched efforts to fill the gaps, among them the recently announced Breast Cancer Recurrence Project and the ongoing RECAPSE project, led by longtime Hutch biostatistician Dr. Ruth Etzioni. In the meantime, Etzioni has used modeling and survival data from registries to glean some intel about recurrence rates, determining that 20% of patients with early-stage, ER+ breast cancer develop metastatic disease within 20 years.

Dr. Lynne Penberthy, the NCI’s associate director for the Surveillance Research Program, which oversees SEER, is well aware of the data gaps, but stressed that SEER is evolving, much as cancer research is.

“As the treatment and diagnosis of cancer has evolved, registries also need to evolve in terms of relevant data,” she said. “That’s been a big focus over the last six or seven years since I’ve been at NCI.”

Syncing and linking

Capturing more detailed information by linking to other data sources is a key focus of SEER at the moment, Penberthy said.

“We need more information about the diagnostic characterization of the tumor,” she said. “We capture treatment, but not what the agents are. We need genomic data, genomic markers. We need to understand treatment, not just the initial course of therapy, but subsequent courses for those who get recurrence. Chemo — yes or no? — is not sufficient to meet patients’ and researchers’ needs. We need to capture information on biomarkers as well as treatment to help us understand why there are differences in outcomes in patients with the same types of tumors.”

By pursuing linkages with other large databases — such as pharmacy and insurance claims databases — Penberthy hopes to fill in some of these information gaps.
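As a rough illustration of what such a linkage might look like, here is a minimal Python sketch that attaches drug names from pharmacy claims to registry records holding only a yes/no treatment flag. It is a sketch under stated assumptions, not SEER's actual method: the field names (`patient_id`, `chemo`, `drug`), the shared identifier and the sample rows are all hypothetical.

```python
from collections import defaultdict


def link_agents(registry_rows, claim_rows):
    """Attach the drug names found in claims to each registry record.

    Assumes (hypothetically) that both sources already share a
    de-identified `patient_id`; real linkages must first establish
    that match, which is the hard part.
    """
    agents_by_patient = defaultdict(list)
    for claim in claim_rows:
        agents_by_patient[claim["patient_id"]].append(claim["drug"])

    linked = []
    for row in registry_rows:
        # .get() avoids creating empty entries for patients with no claims.
        agents = sorted(set(agents_by_patient.get(row["patient_id"], [])))
        linked.append({**row, "agents": agents})
    return linked


# Invented sample data, for illustration only.
registry = [
    {"patient_id": "p1", "chemo": True},
    {"patient_id": "p2", "chemo": False},
]
claims = [
    {"patient_id": "p1", "drug": "erlotinib"},
    {"patient_id": "p1", "drug": "erlotinib"},   # repeat fill, deduplicated
    {"patient_id": "p3", "drug": "crizotinib"},  # not in the registry, ignored
]
linked_records = link_agents(registry, claims)
```

Each registry record is kept whether or not any claim matches it, so the linkage can only add detail and never drops a patient from the registry.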

“Metastatic recurrence can be diagnosed through many different mechanisms,” she said. “Through imaging such as CT scans, through pathology report when patients are biopsied, as well as other modalities. You could be diagnosed by your primary care physician and they may not even be aware that they need to report to a registry. Our approach is to combine these different data sources and methods of diagnosis to get a more comprehensive picture of what’s happening with recurrence. That’s a central focus of our work.”

Hutch epidemiologist Dr. Chris Li is co-principal investigator of the CSS, which in 2018 was extended with a 10-year contract. He said gathering recurrence data from the millions of patients currently in the SEER registry would be a very “resource-intensive effort.”

“We desperately want this information. It’s critically important to answer key questions about risk of recurrence and survival, but from a practical perspective, it’s very difficult to get this through manual data collection,” he said.

It’s also cost-prohibitive, which is why he, Schwartz and the NCI are working on alternative strategies, like novel linkages and APIs, or application programming interfaces, to link different data sources.

“We think we can extract some of this information from unstructured text within clinical notes,” Penberthy said. “This is all part of a large collaborative project we have with the Department of Energy — which has amazing computational scientists and computers — to improve and automate cancer surveillance and provide an opportunity to have less delay in reporting cancer trends.”

Fred Hutch biostatistician Dr. Ruth Etzioni. — At the NW Metastatic Breast Cancer Conference in 2018, Dr. Ruth Etzioni, holder of the Rosalie and Harold Rea Brown Endowed Chair, talks about cancer registries and how they are used in research. Fred Hutch file photo

‘Biggest broadest brush’

Established in 1973, SEER is only one of the country’s national registries. In the early ’90s, the U.S. government established a sister program, the National Program of Cancer Registries, sponsored by then-freshman Rep. Bernie Sanders of Vermont. The NPCR, operated through the Centers for Disease Control and Prevention, covers about 97% of the country: 45 states, the District of Columbia, Puerto Rico, and the U.S. Pacific Island jurisdictions. This registry is also trying to enhance its state databases.

SEER serves more as a microcosm of the U.S. population, with around 20 carefully selected state and regional registries representing our blended nation. SEER and the NPCR are both part of a larger umbrella organization, the North American Association of Central Cancer Registries, or NAACCR, which promotes uniform standards, provides education and offers interactive online tools for viewing cancer statistics.

Etzioni openly “sings the praises” of cancer registries but also realizes they’re not perfect. Registries like SEER are still extremely valuable, she said, tracking progress against cancer; identifying changes that require attention; identifying new cancer causes; generating ideas to reduce the risk of cancer and illuminating disparities between various groups.

“Cancer registry data are the biggest broad brush we have to understand cancer in the population,” she said. “We’re trying to get a broad unbiased snapshot of the population and it’s incredibly useful in understanding the state of cancer in the nation.”

One recent study using SEER data found breast cancer deaths in women under 40 have stopped declining, which researchers believe is related to the rapidly rising distant-stage breast cancer rates in the same age group.

“SEER is the most well-respected cancer registry program in the world and is the gold standard of cancer registries out there,” Li said. “And the CSS is the gold standard within that. We have an outstanding staff with a wealth of experience in cancer registration and they take their work very seriously.”

Digging into the data using AI

One reason CSS data are so valuable is the working agreements, or linkages, it struck early on with all of the major pathology providers serving its 13-county catchment area. These linkages allow for “very rapid identification and ascertainment of data” on patients when they’re diagnosed, Li said.

“The linkages provide us with complete data much more rapidly,” he said. “That was our former Director of Information Services Mary Potts’ vision, to take advantage of electronic data. That’s made CSS a model for other registries.”

The quality of CSS data has also made it an attractive target for scientists eager to mine the registry for additional secrets.

In a 2018 study, then-Hutch physician-scientist Dr. Bernardo Goulart and colleagues used natural language processing, or NLP, to delve into the CSS and capture information on two mutations often found in non-small cell lung cancer, or NSCLC. Targeted oral therapies known as tyrosine kinase inhibitors or TKIs have been life-changers for many patients who carry common mutations in the genes ALK and EGFR, and these drugs are currently offered as a first-line treatment for patients diagnosed with stage 4 NSCLC.

But not all patients were being tested — or getting accurate tests — before their treatment. Goulart was able to tease this information out from the CSS. Another of his studies used NLP to scour the CSS for patients who’d been prescribed TKIs for their ALK- and EGFR-driven cancers, pairing it with insurance claims data to determine the financial impact of this treatment on cancer patients.
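To give a feel for what pulling biomarker results out of free text involves, here is a minimal Python sketch that scans a note for EGFR or ALK test results. It is a deliberately crude keyword rule, not the NLP method used in the studies above. The pattern, the function name `find_biomarker_results` and the sample notes are hypothetical, and the rule would miss many real-world phrasings, for example a single result shared by two genes.

```python
import re

# Gene name, then the nearest result word within the same sentence.
# "not detected" is listed before "detected" so the negated form wins.
RESULT_PATTERN = re.compile(
    r"\b(?P<gene>EGFR|ALK)\b[^.]*?"
    r"\b(?P<result>positive|negative|not detected|detected)\b",
    re.IGNORECASE,
)

NEGATIVE_WORDS = {"negative", "not detected"}


def find_biomarker_results(note):
    """Return {gene: 'positive' | 'negative'} for results found in the note."""
    findings = {}
    for match in RESULT_PATTERN.finditer(note):
        gene = match.group("gene").upper()
        result = match.group("result").lower()
        findings[gene] = "negative" if result in NEGATIVE_WORDS else "positive"
    return findings


# Invented sample notes, for illustration only.
tested = find_biomarker_results(
    "EGFR mutation not detected. ALK rearrangement: positive by FISH."
)
untested = find_biomarker_results(
    "Adenocarcinoma, stage IV. No molecular testing performed."
)
```

An empty result, as in the second note, is the kind of signal that could flag a patient whose record shows no evidence of testing, although a production system would need far more robust handling of negation, abbreviations and report layouts.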

He and colleagues from HICOR, the Hutchinson Institute for Cancer Outcomes Research, found that the higher the TKI out-of-pocket costs, the more patients cut back or quit taking the medications, with Medicare patients faring worse by a significant margin.

Moving forward, Goulart said AI methods could be used to look for other actionable mutations or interventions.

“These studies could serve as a prototype to look at molecular mutations in other tumors,” he said. “We did it in lung cancer, but there’s no reason to believe you can’t do this in other tumor settings.”

New linkages, new opportunities

Schwartz, who has worked at the Hutch for over 30 years and partnered with Goulart on one of his studies, said the CSS is also an extremely valuable source for disparities research.

“We use it internally within the [Fred Hutch/University of Washington] Cancer Consortium to help identify parts of our catchment area where there are populations that might have a high burden of a particular cancer,” he said.

It can also pinpoint problems with access to care or structural bias. A recent HICOR report found that where a person lived in Washington state often determined whether they survived a cancer diagnosis.

“There’s been a major effort to expand the different ways the SEER data is being used,” Li said. “Historical linkages — like the ones between SEER and Medicare data — have been used in many, many studies. More recently, though, there have been linkages with different commercial pharmacies to try and get information on prescription medications relating to cancer or other disease. And there’s interest in trying to expand the geospatial data — linking information on addresses with neighborhood characteristics, like exposures to pollutants and other environmental exposures. That’s another opportunity.”

Li is currently collaborating with Microsoft to interrogate electronic medical record data to capture information about metastatic recurrence. He and others are also testing the accuracy of these new linkages and data extraction methods.

“We’ve found that the accuracy depends on your source of data,” he said. “With a pathology report you get so far. With radiology, you get more. A lot of recurrences are identified by imaging and there may never be a biopsy or pathology report. It does seem like it would be a substantial improvement to have pathology and radiology reports linked to a registry.”

Li said he and his collaborators are currently writing up preliminary results, but the process might provide a model for filling in all recurrence data gaps.

“We started with breast cancer because we had gold standard data from thousands of patients showing who recurred and who did not,” he said. “This could definitely be used as a model moving forward.”

Etzioni is also using workarounds to collect recurrence data. In a paper published last year, she and colleagues found that data mining of medical claims “holds promise for the streamlining of cancer registry operations” to collect metastatic recurrence data. The project also explored using patient self-report about recurrence histories, which they found to be a highly accurate source of information (published results are forthcoming).

Schwartz, both cancer survivor and scientist, said cancer registries will always be an incredibly valuable resource for patients, clinicians, and researchers, even if they haven’t yet revealed all their secrets.

“You’d be hard pressed to find any kind of regular data collection effort that was trying to maintain consistency with the past and stay relevant to the current situation that was also, with constricted funding, able to collect everything you want,” he said.

As for cancer registries’ potential in future research?

“In some senses, it’s only limited by people’s creativity,” he said.

Still curious about cancer registries? Fred Hutch biostatistician Dr. Ruth Etzioni breaks down the basics of the national SEER program and discusses her efforts to gather data on metastatic breast cancer recurrence in this Komen Puget Sound Facebook video from the 2018 Northwest Metastatic Breast Cancer Conference. Her talk starts about 27 minutes in.