Reporter, terrier, retire, retro, port, trio, toe.
What do these seemingly random words have in common? They can all be formed from bits and pieces of the word repertoire (hats off to any Scrabble® fans in the audience). This concept—of adding, deleting, and rearranging the letters of a fixed character set to form a diverse collection of words—is readily apparent in our everyday use of language, but also turns out to be fundamental to the way our immune systems work. You see, in order for our B and T cells to recognize pathogens, they need B- and T-cell receptors which are specific to the dazzling variety of pathogens which we may encounter over our lifetimes, some of which our body has never encountered. How do they manage to create such a diverse repertoire of receptors? The same way that I created the words above, except instead of using letters of a word, they randomly shuffle around three types of genes (termed V-, D-, and J-genes) to form receptors in an intricate molecular ballet known as V(D)J recombination.
Each unique B- and T-cell receptor (BCR or TCR) has a variable binding region consisting of a single V-, D-, and J-gene chosen from a collection (or ‘word bank’, in keeping with the Scrabble® theme) present in the human genome. Receptor diversity results from a combination of shuffling, addition, and deletion of the nucleotides comprising these V/D/J genes. In particular, deletion of small DNA sequences at the ends of these genes is catalyzed by a protein called Artemis. While scientists have a decent grasp of the mechanism behind V/D/J shuffling, the mechanism of nucleotide deletion by Artemis is much less well understood, despite being crucial for generating a diverse immune repertoire. Maggie Russell, a graduate student in the Matsen Group at Fred Hutch’s Public Health Sciences Division, is out to change that. Her recent publication in eLife, undertaken with support from Dr. Noah Simon of the UW School of Public Health and Dr. Phil Bradley of the Fred Hutch Public Health Sciences Division, takes a crack at Artemis’ mechanism of action using a statistical approach.
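For readers who think in code, the shuffle, delete, and add steps described above can be sketched as a toy simulation. This is only an illustration, not the biology itself or anything from the study: the gene segments below are short made-up strings, the function names are hypothetical, and the number of nucleotides trimmed is drawn uniformly at random, which is exactly the simplification that Russell's work suggests is not true of Artemis.

```python
import random

# Hypothetical, made-up gene segments for illustration only. Real V-, D-, and
# J-genes are much longer and are read from the genome.
V_GENES = ["TGTGCCAGCAGC", "TGTGCCAGCAGT"]
D_GENES = ["GGGACAGGGGGC", "GGGACTAGCGGG"]
J_GENES = ["CTGAACACTGAA", "AACTATGGCTAC"]


def trim(seq, n, end):
    """Delete n nucleotides from the 'right' or 'left' end of seq."""
    if n == 0:
        return seq
    return seq[:-n] if end == "right" else seq[n:]


def random_insert(rng, max_len=3):
    """Return 0 to max_len random nucleotides (stand-in for nucleotide addition)."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_len)))


def recombine(rng, max_trim=4):
    """Pick one V, D, and J segment, trim the joining ends, and stitch them together.

    Assumption: trimming amounts are uniform on 0..max_trim. Real trimming
    depends on the sequence, which is what the study set out to model.
    """
    v = trim(rng.choice(V_GENES), rng.randint(0, max_trim), "right")
    d = rng.choice(D_GENES)
    d = trim(d, rng.randint(0, max_trim), "left")
    d = trim(d, rng.randint(0, max_trim), "right")
    j = trim(rng.choice(J_GENES), rng.randint(0, max_trim), "left")
    return v + random_insert(rng) + d + random_insert(rng) + j


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(recombine(rng))
```

Even with only two choices per segment, repeated calls to `recombine` give many distinct sequences, which is the point of the Scrabble® analogy: a small word bank plus trimming and insertion yields a large repertoire.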
“Artemis first caught our eye as a result of a previous genome-wide association study (GWAS) which we undertook to identify genetic variants in individuals with affected TCR repertoires,” Russell noted. “It was previously known that Artemis was important for nucleotide trimming in this context, but as it turns out, the exact mechanism by which Artemis does its cutting is still not really understood.” Russell, a proud computational biologist, set out to fill this knowledge gap using statistical inference. Fancy mathematics aside, the concept which Russell and colleagues employed is relatively straightforward: using a previously generated dataset comprising TCR sequences from nearly 700 individuals, the team created a probabilistic model whose ultimate goal is to predict the trimming probability of an input sequence. Russell then trained the model on a subset of the sequence data, instructing it to pay attention to certain interpretable features of the sequences (including DNA shape, GC content, and sequence length). By changing the features and examining which ones affected (or didn’t affect) the model’s prediction accuracy, Russell was able to identify relevant sequence features which provide clues towards Artemis’ function in vivo.
Creating a collection of models utilizing different combinations of features, Russell and colleagues found that considering three key features (nucleotide identity in a 3-base-pair window around the cut, the length of sequence before and after the cut, and local GC content) led to the highest model performance. Interestingly, their model containing these features predicted cutting better than previous models, which only took nucleotide identity into account, despite having fewer overall parameters. “One of the most difficult aspects of the study was verifying that changes in model accuracy we saw were real, feature-driven effects as opposed to statistical artifacts,” explained Russell. Nevertheless, the team was able to show that their 3-feature model performed similarly with different training datasets and was able to accurately predict cutting of J-genes and other T-cell receptor subtypes. Perhaps most importantly, training the model on a different TCR sequencing dataset from individuals with single nucleotide polymorphisms (SNPs) in Artemis, which Russell had previously shown to affect trimming, resulted in specific changes to the expected model coefficients. “This really signaled to us that our model is rooted in the biology of Artemis, and provided further evidence that the features we chose are mechanistically relevant to the trimming process,” Russell said.
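To make the three features a little more concrete, here is a minimal sketch of how they might be computed for each candidate cut site and turned into trimming probabilities. It is not the authors' implementation: the function names, the exact placement of the 3-nucleotide window, the size of the GC flank, the weights, and the softmax form are all assumptions chosen for illustration, and the example gene is a made-up string.

```python
import math


def site_features(gene, n_trimmed, flank=5):
    """Features for a cut that removes n_trimmed nucleotides from the end of gene.

    Assumptions: the window is 1 nucleotide before the cut plus 2 after it,
    and "local" GC content is measured within `flank` nucleotides of the cut.
    """
    cut = len(gene) - n_trimmed                 # index of first deleted nucleotide
    window = gene[max(cut - 1, 0):cut + 2]      # nucleotide identity around the cut
    local = gene[max(cut - flank, 0):cut + flank]
    gc_local = sum(base in "GC" for base in local) / len(local) if local else 0.0
    return {
        "window": window,
        "len_kept": cut,                        # sequence length before the cut
        "len_deleted": n_trimmed,               # sequence length after the cut
        "gc_local": gc_local,
    }


def trimming_probabilities(gene, weights, max_trim=6):
    """Score every trimming amount 0..max_trim and normalise with a softmax.

    `weights` is a hypothetical dict: a per-window lookup table plus one
    coefficient each for deleted length and local GC content.
    """
    scores = []
    for n in range(max_trim + 1):
        f = site_features(gene, n)
        scores.append(
            weights["window"].get(f["window"], 0.0)
            + weights["len_deleted"] * f["len_deleted"]
            + weights["gc_local"] * f["gc_local"]
        )
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    toy_weights = {"window": {"AGC": 0.8}, "len_deleted": -0.3, "gc_local": 1.0}
    for n, p in enumerate(trimming_probabilities("TGTGCCAGCAGC", toy_weights)):
        print(n, round(p, 3))
```

In a real analysis the weights would be estimated from sequencing data rather than typed in by hand, and comparing fitted models with and without a given feature is what indicates whether that feature matters.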
So, why bother mathematically modeling Artemis’ process instead of studying the physical protein using experimental techniques? As Russell notes, “A lot of pioneering work on Artemis has been done in model systems; however, we really care about how Artemis works in humans, a model system not amenable to a lot of the classical experimental techniques biologists rely on. Our approach lets us quantify Artemis’ function in its native context, at high throughput, and identify variables which may be relevant for V(D)J trimming by Artemis.” Importantly, it doesn’t have to be a zero-sum game; the insights which Russell and her team generate using their model can guide future experimental work, which then continues to produce data to further refine the models in a cross-fertilizing, mutually beneficial relationship. Although Artemis has been slow to give up its secrets, continuing work from Russell and others means it won’t be able to stay mysterious for much longer.
The spotlighted research was funded by the National Institutes of Health and the Howard Hughes Medical Institute.
Fred Hutch/University of Washington/Seattle Children’s Cancer Consortium members Drs. Erick Matsen and Philip Bradley contributed to this study.
Russell, M. L., Simon, N., Bradley, P., & Matsen, F. A. (2023). Statistical inference reveals the role of length, GC content, and local sequence in V(D)J nucleotide trimming. eLife, 12, e85145.