Deleted SARS-CoV-2 sequences from early in Wuhan outbreak offer clues

An illustration of a SARS-CoV-2 viral particle and its genetic material. Dr. Jesse Bloom of Fred Hutch said that scientists should focus on identifying and analyzing sequence data from early in the pandemic. Getty Images stock illustration

In a preprint first posted on the server bioRxiv on June 22, Fred Hutchinson Cancer Research Center evolutionary biologist Dr. Jesse Bloom described uncovering SARS-CoV-2 sequences from early in the Wuhan outbreak that had been deleted from a National Institutes of Health database.

In a complementary Twitter thread, he explained how he had recovered raw sequencing data from 34 samples from early Wuhan SARS-CoV-2 cases, used the data in the files to reconstruct partial sequences of 13 of those cases, and described what he learned from his analysis of them.

He wrote that the sequences support other lines of evidence that SARS-CoV-2 was circulating in Wuhan before the December 2019 outbreak in a seafood market. They do not provide evidence for or against either a natural animal origin for the virus or an accidental lab leak, nor do they reveal the first person infected by the virus.

More early sequences are probably out there, he wrote, and scientists should focus on identifying and analyzing all available data.

“Scientists need to stay focused on data-driven study of SARSCoV2 origins / early spread. After spending the last 4 months studying this closely, I am cautiously optimistic that additional relevant data are still likely to come to light,” he tweeted June 22.

He continued, “We should therefore avoid dogmatic arguments about SARSCoV2 origin / early spread, and instead focus on following two questions: (1) How can we get more data? (2) How can we better analyze the data we have?”

Bloom’s research drew attention from media and sparked discussion among scientists and public officials. Based on scientific critiques, he posted a revised preprint on June 29. The preprint has not yet been peer-reviewed.

Here are some highlights:

How Bloom found early SARS-CoV-2 sequences

Last summer, Bloom noticed that there hadn’t been much scientific progress in understanding how and when SARS-CoV-2 began. The two plausible theories were a natural zoonosis — in which the virus evolved from bat coronaviruses into one that can infect humans — or a lab accident. With little scientific evidence to support or rule out either theory, he wanted to see more data about the earliest cases.

Dr. Jesse Bloom studies evolution using viruses and viral proteins as models, aiming to understand how mutations in viral genes shape a pathogen's ability to infect and spread. Photo by Robert Hood / Fred Hutch News Service

Coronaviruses collect changes in their genetic sequences as they replicate, and studying these changes in sequences collected over time helps scientists trace the virus’s history.
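To make the idea concrete, here is a minimal sketch of how differences between aligned sequences can be counted. It is an illustration only, not the method used in Bloom's preprint: the function name `count_differences` and the toy sequences are invented for this example, and real analyses rely on full genome alignments and phylogenetic tools.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ.

    Positions where either sequence has an unknown base ("N") are skipped,
    since a missing base is not evidence of a mutation.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a != "N" and b != "N"
    )


# Toy sequences, invented for illustration only (not real viral data).
reference = "ACGTACGTAC"
samples = {
    "sample_early": "ACGTACGTAC",
    "sample_mid": "ACGTACCTAC",
    "sample_late": "ACTTACCTAN",
}

# Number of differences from the reference for each toy sample.
differences = {
    name: count_differences(reference, seq) for name, seq in samples.items()
}
print(differences)
```

In this toy case, samples that carry more changes relative to the reference would be read as further along the chain of replication, which is the intuition behind tracing a virus's history from its sequences.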

So Bloom looked at reports of genomic sequences of the virus from people infected early on to reveal patterns of how it had evolved. He didn’t find much at first.

Then he found a paper referring to a sequence dataset he hadn’t seen mentioned anywhere else. When he looked for these sequences in the most likely online data archive, however, he did not find them.

He knew that researchers can request the removal of sequences they’ve uploaded into the archive. Realizing that the data may be backed up online, he inferred the corresponding URLs and found files related to the sequences that were still present on the Google Cloud.

“I was able to determine deleted data corresponded to a study that partially sequenced 45 nasopharyngeal samples from [Wuhan] outpatients with suspected COVID-19 early in the epidemic,” he tweeted.

Combined with other clues, he eventually found 241 data files that had been uploaded to the database and later deleted from it. Pieced together, those files represented portions of 34 early SARS-CoV-2 samples that had not previously been widely known. But each file included just a portion of each sample’s full sequencing information.

Ultimately, Bloom reconstructed enough data to examine the partial sequences of 13 early SARS-CoV-2 cases.
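The general idea of assembling a partial sequence from fragments can be sketched as follows. This is a simplified illustration under stated assumptions, not Bloom's actual pipeline: the function `build_consensus`, the `(start, bases)` read format and the toy reads are all hypothetical, and the fragments are assumed to be already placed on reference coordinates.

```python
from collections import Counter


def build_consensus(length, reads, min_depth=1):
    """Build a consensus sequence from reads placed on reference coordinates.

    `reads` is a list of (start, bases) pairs. Positions covered by fewer
    than `min_depth` reads are reported as "N" (unknown).
    """
    counts = [Counter() for _ in range(length)]
    for start, bases in reads:
        for offset, base in enumerate(bases):
            pos = start + offset
            if 0 <= pos < length:
                counts[pos][base] += 1

    consensus = []
    for position_counts in counts:
        if sum(position_counts.values()) >= min_depth:
            # Take the most frequently observed base at this position.
            consensus.append(position_counts.most_common(1)[0][0])
        else:
            consensus.append("N")
    return "".join(consensus)


# Toy reads, invented for illustration only (not real viral data).
reads = [(0, "ACG"), (1, "CGT"), (6, "TTA")]
partial = build_consensus(10, reads)
print(partial)
```

Positions that no fragment covers stay marked as unknown, which is why a sample whose files hold only part of its sequencing information yields a partial sequence rather than a full genome.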

What the sequences show about the early Wuhan outbreak

The 13 reconstructed sequences don’t transform what is known about the early stages of the Wuhan outbreak, and there is missing information about when and where the samples were collected. Still, they help fill in some details that edge us closer to identifying the original spillover event.

First, the data add to other evidence that the seafood market in Wuhan was not where the virus jumped from animals to humans.

Nature News wrote, “The earliest viral sequences from Wuhan are from individuals linked to the city’s Huanan Seafood Market in December 2019, which was initially thought to be where the coronavirus first jumped from animals to people. But the seafood-market sequences are more distantly related to SARS-CoV-2’s closest relatives in bats — the most likely ultimate origin of the virus — than are later sequences, including one collected in the United States.”

Dr. W. Ian Lipkin, a Columbia University epidemiologist, said by email to the Washington Post that Bloom’s paper offers “evidence of what many of us speculated — that the virus was circulating before the market outbreak. The retraction of sequence data is unprecedented and must be addressed.”

Lipkin told USA Today that “this line of inquiry may help us determine the origin of the virus and reconstruct how it spread in the earliest days of the pandemic.”

Dr. Sudhir Kumar, an evolutionary geneticist at Temple University, told Nature News: “To me it seemed like Wuhan market was one of the first super-spreading events.”

Kumar added that the sequences “suggest that SARS-CoV-2 developed extensive diversity in the early stages of the pandemic in China — including in Wuhan.”

Scientists need to find more of those missing pieces from the early outbreak to draw conclusions about the virus’s origins.

“Maybe our picture of what was present early in Wuhan from what has been sequenced might be somewhat biased,” Bloom told the New York Times.

No direct evidence for either origin theory

Bloom is among 17 experts who wrote a letter published May 13 in Science calling for an investigation of how the pandemic began, with a more balanced view in considering all possibilities, including transmission from animals to humans — which occurs in many new infectious diseases — as well as a lab accident.

These new data do not tip the scales toward one theory or the other, he said.

“These data provide no direct evidence to favor either a lab accident or a natural zoonosis,” said Bloom by email, with more explanation in a Twitter thread. “However, they do indicate the importance of continuing to seek new data about the origins and early spread of SARS-CoV-2.”

He told Science that it’s vital for scientists to set aside biases about the virus’s origins and study this issue with transparency:

“So many people have agendas and preconceived notions on this topic that if you open your mouth on the topic, someone’s going to take what you’ve said to support or reject some particular narrative. So the choices are either not to say anything at all, which I don’t think is useful or productive, or just to try to draw the conclusions you can and make it as transparent as possible. No matter how much people like [my new study] or don’t like it, or agree with the interpretation or disagree with the interpretation, they can at least go download it and repeat it themselves.”

Reasons for deletion

In a statement to media, the National Institutes of Health — which operates the archive that had once housed the sequence data — explained the process of deleting the sequences upon the request of the scientist who submitted them.

"The requestor indicated the sequence information had been updated, was being submitted to another database, and wanted the data removed from SRA (the Sequence Read Archive) to avoid version control issues," the NIH said in its statement, reported by USA Today. "Submitting investigators hold the rights to their data and can request withdrawal of the data."

Those reasons were cited in an email NIH sent to Bloom, which he included in his updated preprint. However, Bloom noted that he has been unable to find any indication that the sequences were in fact uploaded to any other database, as the authors claimed.

Newly discovered, but not new

Some have pointed out that the sequences aren’t new and weren’t really hidden, since they were available in a paper published in the journal Small.

Nature News reported:

“Stephen Goldstein, a virologist at the University of Utah in Salt Lake City, points out that the sequences Bloom recovered were not hidden: They are described in detail, with enough sequence information to know their evolutionary relationship to other early SARS-CoV-2 sequences, in the Small paper. ‘I don't think this preprint tells us a whole lot new, but it does bring to the forefront sequence data that has been publicly available, though under the radar,’ Goldstein says.”

Bloom asserts that it doesn’t matter that the data aren’t new; rather, the point is that people who are analyzing other SARS-CoV-2 sequences couldn’t find them.

“In the revised manuscript, I … make clear that I can't determine the authors' motives. However, I do note that I could not find any websites with updated data, & that practical consequence of deletion was that no one was aware the data existed,” Bloom tweeted.

What’s next

Bloom told BBC Science in Action, “My hope is that if this paper contributes to this discussion at all it reminds scientists that we operate best if we are looking for data and trying to analyze data, and we operate less well if we’re sort of yelling at each other about different positions with very little evidence.”

To that end, he is looking for other sequences from early in the pandemic, and he hopes other scientists join the effort.

The Wall Street Journal reported that other scientists share his interest.

“It makes us wonder if there are other sequences like these that have been purged,” said Dr. Vaughn S. Cooper, a University of Pittsburgh evolutionary biologist.

Bloom posted the data he’s found online to encourage others to do their own analyses.

“We really need to look hard and see if there is other early information about sequences that hasn’t been found,” he told the Wall Street Journal. “I intend to go through every early preprint I can find about SARS-CoV-2 and see if it describes any data that isn’t in the databases.”

He’s keeping an open mind about how what is found may change our understanding of the pandemic, as are other scientists.

“We should be prepared however, to revise these ideas and hypotheses further if and when more early sequence data emerge,” Dr. Sergei Pond, a biology professor at Temple University, tweeted, calling Bloom’s preprint an “important bit of forensic bioinformatics.” He added, “I would not be surprised if these revisions are very significant (e.g., the timing of introduction).”

Bloom maintains that although understanding the origins and early spread is a scientific question, policymakers also need to help by enabling better investigations and greater transparency in finding and analyzing all available data.

“We need to figure out how SARS-CoV-2 began, because the answer will have implications on mitigating pandemics in the future,” Bloom said. “The issue of COVID-19’s origins is not going away. It is important for scientists to get in front of the issue to ensure we have done everything we can to explore this.”
