Coronaviruses collect changes in their genetic sequences as they replicate, and studying these changes in sequences collected over time helps scientists trace the virus’s history.
So Bloom looked at reports of genomic sequences of the virus from people infected early on to reveal patterns of how it had evolved. He didn’t find much at first.
Then he found a paper referring to a sequence dataset he hadn’t seen mentioned anywhere else. When he looked for these sequences in the most likely online data archive, however, he did not find them.
He knew that researchers can request the removal of sequences they’ve uploaded to the archive. Reasoning that the data might still be backed up online, he inferred the corresponding URLs and found files related to the sequences that were still present on Google Cloud.
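The general idea of inferring file locations and checking whether anything is still there can be sketched in a few lines of Python. This is only an illustration, not Bloom’s actual procedure: the URL pattern, the example.org host, and the run identifiers below are all hypothetical placeholders, since the article does not give the real ones.

```python
import urllib.request

# Hypothetical URL pattern, for illustration only; not a real archive layout.
URL_TEMPLATE = "https://storage.example.org/runs/{accession}/{accession}.dat"


def candidate_urls(accessions, template=URL_TEMPLATE):
    """Build one guessed URL per run identifier."""
    return {acc: template.format(accession=acc) for acc in accessions}


def url_exists(url, timeout=10):
    """Send a HEAD request and report whether the server answers with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers HTTP errors such as 404, unreachable hosts, and timeouts.
        return False


def surviving_files(accessions, exists=url_exists, template=URL_TEMPLATE):
    """Return the guessed URLs for which the existence check succeeds."""
    urls = candidate_urls(accessions, template)
    return {acc: url for acc, url in urls.items() if exists(url)}
```

Passing the existence check in as a parameter makes it possible to try the logic with a stand-in function before sending any real requests.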
“I was able to determine deleted data corresponded to a study that partially sequenced 45 nasopharyngeal samples from [Wuhan] outpatients with suspected COVID-19 early in the epidemic,” he tweeted.
Combining this with other clues, he eventually found 241 data files that had been uploaded to the database and later deleted. Pieced together, those files represented portions of 34 early SARS-CoV-2 samples that had not previously been widely known. But each file included just a portion of its sample’s full sequencing information.
Ultimately, Bloom reconstructed enough data to examine the partial sequences of 13 early SARS-CoV-2 cases.
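To give a feel for what piecing together partial files involves, here is a minimal Python sketch. It assumes each fragment is already placed at a known position along the genome, and it uses toy data and an arbitrary coverage cutoff; none of these details come from Bloom’s preprint.

```python
from collections import defaultdict


def merge_fragments(fragments, genome_length):
    """Combine (sample, start, bases) fragments into one sequence per sample.

    Positions no fragment covers stay 'N' (unknown). If fragments overlap,
    the later one in the list overwrites the earlier one.
    """
    merged = defaultdict(lambda: ["N"] * genome_length)
    for sample, start, bases in fragments:
        sequence = merged[sample]
        for offset, base in enumerate(bases):
            position = start + offset
            if 0 <= position < genome_length:
                sequence[position] = base
    return {sample: "".join(seq) for sample, seq in merged.items()}


def coverage(sequence):
    """Fraction of positions with a known base."""
    return sum(base != "N" for base in sequence) / len(sequence)


def well_covered(merged, min_coverage):
    """Names of samples whose coverage meets the (assumed) cutoff."""
    return sorted(name for name, seq in merged.items()
                  if coverage(seq) >= min_coverage)


# Toy data: two hypothetical samples on an 8-base "genome".
toy_fragments = [("sample_x", 0, "ACGT"), ("sample_x", 6, "TT"), ("sample_y", 2, "GG")]
toy_merged = merge_fragments(toy_fragments, genome_length=8)
usable = well_covered(toy_merged, min_coverage=0.5)
```

In this toy case only sample_x ends up with at least half of its positions known, which mirrors how a larger set of partial samples can shrink to a smaller set that is usable for analysis.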
What the sequences show about the early Wuhan outbreak
The 13 reconstructed sequences don’t transform what is known about the early stages of the Wuhan outbreak, and there is missing information about when and where the samples were collected. Still, they help fill in some details that edge us closer to identifying the original spillover event.
First, the data add to other evidence that the seafood market in Wuhan was not where the virus jumped from animals to humans.
Nature News wrote, “The earliest viral sequences from Wuhan are from individuals linked to the city’s Huanan Seafood Market in December 2019, which was initially thought to be where the coronavirus first jumped from animals to people. But the seafood-market sequences are more distantly related to SARS-CoV-2’s closest relatives in bats — the most likely ultimate origin of the virus — than are later sequences, including one collected in the United States.”
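The comparison described here, asking which sequences sit closer to a related virus, comes down to counting differences between aligned sequences. The Python sketch below shows that counting on made-up toy sequences; the sample names and bases are hypothetical, and real analyses use full alignments and phylogenetic methods rather than a raw mismatch count.

```python
def count_differences(sequence, reference):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has 'N' (unknown base) are skipped.
    """
    if len(sequence) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(sequence, reference)
               if a != "N" and b != "N")


def rank_by_distance(samples, reference):
    """Sort sample names from fewest to most differences (ties broken by name)."""
    return sorted(samples,
                  key=lambda name: (count_differences(samples[name], reference), name))


# Toy, made-up sequences; not real viral data.
toy_relative = "ACGTACGT"
toy_samples = {"sample_x": "ACTTACGA", "sample_y": "ACGTACNA"}
ranking = rank_by_distance(toy_samples, toy_relative)
```

Here sample_y differs from the toy relative at fewer known positions than sample_x, so it ranks first.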
Dr. W. Ian Lipkin, a Columbia University epidemiologist, said by email to the Washington Post that Bloom’s paper offers “evidence of what many of us speculated — that the virus was circulating before the market outbreak. The retraction of sequence data is unprecedented and must be addressed.”
Lipkin told USA Today that “this line of inquiry may help us determine the origin of the virus and reconstruct how it spread in the earliest days of the pandemic.”
Dr. Sudhir Kumar, an evolutionary geneticist at Temple University, told Nature News: “To me it seemed like Wuhan market was one of the first super-spreading events.”
Kumar added that the sequences “suggest that SARS-CoV-2 developed extensive diversity in the early stages of the pandemic in China — including in Wuhan.”
Scientists need to find more of those missing pieces from the early outbreak to draw conclusions about the virus’s origins.
“Maybe our picture of what was present early in Wuhan from what has been sequenced might be somewhat biased,” Bloom told the New York Times.
No direct evidence for either origin theory
Bloom is among 17 experts who wrote a letter, published May 13 in Science, calling for an investigation of how the pandemic began that weighs all possibilities in a balanced way, including transmission from animals to humans, which is how many new infectious diseases arise, as well as a lab accident.
These new data do not tip the scales toward one theory or the other, he said.
“These data provide no direct evidence to favor either a lab accident or a natural zoonosis,” Bloom said by email, adding further explanation in a Twitter thread. “However, they do indicate the importance of continuing to seek new data about the origins and early spread of SARS-CoV-2.”
He told Science that it’s vital for scientists to set aside biases about the virus’s origins and study this issue with transparency:
“So many people have agendas and preconceived notions on this topic that if you open your mouth on the topic, someone’s going to take what you’ve said to support or reject some particular narrative. So the choices are either not to say anything at all, which I don’t think is useful or productive, or just to try to draw the conclusions you can and make it as transparent as possible. No matter how much people like [my new study] or don’t like it, or agree with the interpretation or disagree with the interpretation, they can at least go download it and repeat it themselves.”
Reasons for deletion
In a statement to media, the National Institutes of Health — which operates the archive that had once housed the sequence data — explained the process of deleting the sequences upon the request of the scientist who submitted them.
“The requestor indicated the sequence information had been updated, was being submitted to another database, and wanted the data removed from SRA (the Sequence Read Archive) to avoid version control issues,” the NIH said in its statement, reported by USA Today. “Submitting investigators hold the rights to their data and can request withdrawal of the data.”
Those reasons were cited in an email NIH sent to Bloom, which he included in his updated preprint. However, Bloom noted that he has been unable to find any indication that the sequences were in fact uploaded to any other database, as the authors claimed.
Newly discovered, but not new
Some have pointed out that the sequences aren’t new and weren’t really hidden, since they were available in a paper published in the journal Small.
Nature News reported:
“Stephen Goldstein, a virologist at the University of Utah in Salt Lake City, points out that the sequences Bloom recovered were not hidden: They are described in detail, with enough sequence information to know their evolutionary relationship to other early SARS-CoV-2 sequences, in the Small paper. ‘I don't think this preprint tells us a whole lot new, but it does bring to the forefront sequence data that has been publicly available, though under the radar,’ Goldstein says.”
Bloom asserts that it doesn’t matter that the data aren’t new; rather, the point is that people who are analyzing other SARS-CoV-2 sequences couldn’t find them.
“In the revised manuscript, I … make clear that I can't determine the authors' motives. However, I do note that I could not find any websites with updated data, & that practical consequence of deletion was that no one was aware the data existed,” Bloom tweeted.
What’s next
Bloom told BBC Science in Action, “My hope is that if this paper contributes to this discussion at all it reminds scientists that we operate best if we are looking for data and trying to analyze data, and we operate less well if we’re sort of yelling at each other about different positions with very little evidence.”
To that end, he is looking for other sequences from early in the pandemic, and he hopes other scientists join the effort.
The Wall Street Journal reported that other scientists share his interest.
“It makes us wonder if there are other sequences like these that have been purged,” said Dr. Vaughn S. Cooper, a University of Pittsburgh evolutionary biologist.
Bloom posted the data he’s found online to encourage others to do their own analyses.
“We really need to look hard and see if there is other early information about sequences that hasn’t been found,” he told the Wall Street Journal. “I intend to go through every early preprint I can find about SARS-CoV-2 and see if it describes any data that isn’t in the databases.”
He’s keeping an open mind about how new findings may change our understanding of the pandemic, as are other scientists.
“We should be prepared however, to revise these ideas and hypotheses further if and when more early sequence data emerge,” Dr. Sergei Pond, a biology professor at Temple University, tweeted, calling Bloom’s preprint an “important bit of forensic bioinformatics.” He added, “I would not be surprised if these revisions are very significant (e.g., the timing of introduction).”
Bloom maintains that although understanding the origins and early spread of the virus is a scientific question, policymakers also need to help by enabling better investigations and greater transparency in finding and analyzing all available data.
“We need to figure out how SARS-CoV-2 began, because the answer will have implications on mitigating pandemics in the future,” Bloom said. “The issue of COVID-19’s origins is not going away. It is important for scientists to get in front of the issue to ensure we have done everything we can to explore this.”