Learning from the past to improve SARS-CoV-2 variant discovery

From Dr. Lue Ping Zhao and collaborating labs, Public Health Sciences, Clinical Research, and Vaccine and Infectious Disease Divisions

Since early in the SARS-CoV-2 pandemic, many researchers have taken the initiative to collect sequence data of circulating SARS-CoV-2 from patient samples. This collection of data is a critical resource for surveilling viral evolution, or how the virus changes over time, and the kinetics of global spread for emerging variants. Dr. Lue Ping Zhao and his collaborators at the Fred Hutchinson Cancer Center, Drs. Peter Gilbert, Margaret Madeleine, Dan Geraghty, Keith Jerome, and Larry Corey, took a retrospective approach, applying a new strategy to the collected sequence data to determine how early the SARS-CoV-2 Omicron variant could have been identified as a variant of concern (VOC). In this study, they analyzed global sequence data submitted to the Global Initiative on Sharing Avian Influenza Data (GISAID). Their findings suggest that, in the best-case scenario, the new method developed by the collaborative team could have classified Omicron as a VOC 22 days earlier than the declaration made on November 30th, 2021. Their work was published recently in JAMA Network Open.

The SARS-CoV-2 Omicron variant was deemed a VOC based on phylogenetic approaches that categorize virus sequences by their nucleotide substitutions and insertion or deletion mutations, organizing each patient sample into a SARS-CoV-2 “family tree”. One challenge with this approach is that lineage designations cannot be rapidly updated or revised to distinguish new variants. The researchers postulated that this delay could be reduced by incorporating a statistical learning strategy (SLS) into the sequence analysis pipeline. This approach would identify patient SARS-CoV-2 samples with 10 or more mutations and quantify the frequency of these variants within the globally sampled population over time to determine whether they are expanding within the local population.
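The two steps described above, flagging samples that carry 10 or more mutations and tracking how common they are over time, can be illustrated with a short Python sketch. This is not the authors' SLS implementation; the function names, the data layout, and the placeholder mutation labels ("m0", "m1", ...) are assumptions made only for illustration.

```python
from collections import defaultdict

# Threshold taken from the article's description: 10 or more mutations.
MIN_MUTATIONS = 10

def is_candidate(mutations, min_mutations=MIN_MUTATIONS):
    """Return True if a sample carries at least `min_mutations` distinct mutations."""
    return len(set(mutations)) >= min_mutations

def candidate_frequency_by_period(samples):
    """Compute the fraction of flagged samples in each time period.

    `samples` is an iterable of (period, mutations) pairs, where `period`
    is any sortable label (a hypothetical week label here) and `mutations`
    is a list of mutation names found in that sample.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for period, mutations in samples:
        totals[period] += 1
        if is_candidate(mutations):
            hits[period] += 1
    return {period: hits[period] / totals[period] for period in sorted(totals)}

# Invented toy data: placeholder mutation labels, not real spike mutations.
toy_samples = [
    ("week-1", [f"m{i}" for i in range(12)]),  # 12 mutations: flagged
    ("week-1", ["m1", "m2"]),                  # 2 mutations: not flagged
    ("week-2", [f"m{i}" for i in range(28)]),  # 28 mutations: flagged
    ("week-2", [f"m{i}" for i in range(28)]),  # 28 mutations: flagged
]
toy_frequency = candidate_frequency_by_period(toy_samples)
```

A rising fraction from one period to the next is the kind of signal the article describes as expansion within a population.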

To explore this approach, the researchers applied SLS retrospectively to SARS-CoV-2 genomic sequence data collected from people over a 2-year span (Jan. 2020 – 2022). In total, 63,000 patient samples from Africa and 530,000 from the US were included in the analyses. From the sequence data, the researchers identified an Omicron predecessor that fit their criteria: a variant featuring 12 of the 28 Omicron core mutations within the virus spike gene, from a patient sample collected in South Africa with a submission date of December 31st, 2020. The next cases of Omicron within South Africa occurred in September 2021 and included variants with all 28 core Omicron mutations. Furthermore, they estimated that the Omicron caseload percentage (OCP), or frequency of the Omicron variant, reached 10% in South Africa by November 4th, 2021. This threshold was met about 300 days after the occurrence of the Omicron predecessor. In the following days, the Omicron variant quickly became dominant in South Africa (75% of SARS-CoV-2 sequences on Nov. 18th, 2021) and other African countries. Therefore, the authors suggested that when a virus variant surpasses an OCP threshold of 10%, it could be termed a VOC, as such a frequency is suggestive of rapid local expansion trending toward becoming the dominant variant circulating in the local population, with potential to spread.
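The 10% threshold rule and the interval quoted above can be sketched in a few lines of Python. This is a simplified illustration rather than the published method: the function name, the input layout, the choice of "at or above" as the comparison, and the toy counts are all assumptions. Only the two calendar dates come from the article.

```python
from datetime import date

# Threshold from the article: a caseload percentage of 10%.
OCP_THRESHOLD = 0.10

def first_threshold_date(daily_counts, threshold=OCP_THRESHOLD):
    """Return the earliest date on which the variant fraction meets the threshold.

    `daily_counts` maps a date to (variant_sequences, total_sequences).
    Returns None if the threshold is never met. Treating "reached 10%"
    as fraction >= threshold is an assumption of this sketch.
    """
    for day in sorted(daily_counts):
        variant, total = daily_counts[day]
        if total > 0 and variant / total >= threshold:
            return day
    return None

# Invented toy counts on arbitrary dates, for illustration only.
toy_counts = {
    date(2000, 1, 1): (2, 100),   # 2%
    date(2000, 1, 2): (5, 100),   # 5%
    date(2000, 1, 3): (12, 100),  # 12%: first day at or above 10%
}
alert_day = first_threshold_date(toy_counts)

# Interval between the two dates quoted in the article.
predecessor_submitted = date(2020, 12, 31)
threshold_reached = date(2021, 11, 4)
days_elapsed = (threshold_reached - predecessor_submitted).days  # 308
```

The 308-day result is consistent with the "about 300 days" stated above.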

When Omicron was declared a VOC on November 30th, 2021, there were no reported cases of the variant in the US. However, when the researchers analyzed sequence data from November 21st to 25th, 2021, they discovered 8 cases of the Omicron variant in several US states. Every sample from the US contained all 28 core Omicron mutations. Together, these findings and those from the African sequence data may suggest that the Omicron variant originated in South Africa and was seeded one or more times in the US by cross-continental spread. Intriguingly, using the SLS approach, the researchers estimate that Omicron could have been declared a VOC on November 4th, 2021, three weeks prior to the actual date of the alert and, as determined retrospectively, prior to the first Omicron sequence detected in the US. These three weeks may have had a critical impact on mitigating cross-continental spread of the virus. The researchers also calculated Omicron local expansion in US states from the SARS-CoV-2 sequence data. Here, they observed rapid local expansion following the first Omicron sequence detected in each state, similar to the case expansion kinetics that followed detection of the 28-mutation-containing variant in several African countries.

SARS-CoV-2 Omicron progressive expansion in the US during December 2021. States excluded from this figure had no sequenced cases of Omicron that month. Image modified from published article.

This retrospective analysis was based on mutations that occurred in the virus spike gene. The researchers propose that for future implementation of this strategy “it may be necessary to monitor all [polymorphisms] PM in all viral genes in addition to the spike protein and also to consider synonymous nucleotide substitutions in or outside of genes” as these “may increase what is known as viral fitness, leading to viral variants with clinical significance.” Limitations of this approach for identifying emerging VOCs include delayed or incomplete uploads of sequence data to the GISAID database. Submission of sequencing data is voluntary, and delays would limit the feasibility of detecting emerging variants quickly in real time.

Dr. Zhao concluded, “In this project, we have shown that a newly emerged variant, like Omicron, in one continent could be transmitted to another continent within months. Hence, global monitoring variants of SARS-COV-2 is essential, since this pathogen is likely to evolve continuously in years to come. Through this retrospective analysis, we have shown that applications of rigorous analytics on globally collected pathogen databases could expedite timely detections of new variants. Early detections could help biomedical researchers to investigate new variants and public health agencies to develop effective prevention strategies.”


The spotlighted research was funded by the National Institutes of Health National Institute of Allergy and Infectious Diseases.

Fred Hutch/University of Washington/Seattle Children's Cancer Consortium members Lue Ping Zhao, Peter Gilbert, Margaret Madeleine, Dan Geraghty, Keith Jerome, and Larry Corey contributed to this work.

Zhao LP, Lybrand TP, Gilbert P, Madeleine M, Payne TH, Cohen S, Geraghty DE, Jerome KR, Corey L. 2022. Application of Statistical Learning to Identify Omicron Mutations in SARS-CoV-2 Viral Genome Sequence Data From Populations in Africa and the United States. JAMA Netw Open. 5(9):e2230293.