Solving puzzles experimentally is enjoyable for many researchers. However, some puzzles force even the best puzzlers to choose modeling over experimental approaches to help decipher how the pieces fit together. Uncovering which proteins interact with which short peptide or protein fragment is an extremely complex puzzle, yet one worth solving. Specifically, defining “binders” and “non-binders” can shed light on signaling within or between neighboring cells that occurs under normal conditions, in disease states, and in response to pathogens as part of the immune response. The Bradley Lab in the Public Health Sciences Division at Fred Hutchinson Cancer Center developed a deep neural network that incorporates not only sequence data but also protein structural data to model and predict protein-peptide “binders” and “non-binders”. The researchers trained this model on datasets of major histocompatibility complex (MHC)-peptide recognition in adaptive immunity and tested it both in this same context and in others, including peptide-protein domain interactions. They found that the model predicted binders and non-binders accurately for the immune peptide recognition data and again performed very well on the protein domain interaction datasets. They also noted that the structural input gave the model additional flexibility to handle longer peptides, compared with another modeling approach that used only sequence data. Their findings therefore support the use of both structural and sequence data in modeling networks of protein-peptide interactions. This work was recently published in Proceedings of the National Academy of Sciences.
Deep neural networks that incorporate multi-layered data into a trainable model have enhanced the predictive accuracy of numerous modeled systems. For example, “the deep neural network AlphaFold was trained to predict protein structures and does an amazing job,” stated Dr. Phil Bradley. Building on this advance in protein structure prediction, the Bradley lab sought to extend the AlphaFold model to predict protein-peptide interactions, classifying each pair as a “binder” or a “non-binder”. To train the model, the researchers used available sequence and structure datasets of peptide complexes with MHC proteins for known “binder” and “non-binder” pairs. “Extending the network in this way allows us to leverage the millions of known peptide-MHC binding interactions, in addition to the much smaller number of peptide-MHC crystal structures, to achieve generalizable models for peptide:MHC prediction,” stated Dr. Bradley. The training set included ~10,000 peptide-MHC pairs, half of which were “binders” and half “non-binders”. Experimental structures were known for 203 of these pairs and were included in the training of the model. The researchers then evaluated the model on a separate dataset of ~5,000 peptide-MHC pairs not included in the training set. On this validation set, the model matched the excellent performance of NetMHCpan, the current gold standard and a sequence-based model, in distinguishing binders from non-binders. The validation peptides were 8, 9, and 10 residues long, whereas the training set included only 9-residue peptides. Excitingly, the researchers found that their model did better than the sequence-only NetMHCpan model at predicting protein-peptide interactions for the longer peptides, suggesting that structural data help predict how the peptide aligns in the binding pocket of the interacting protein.
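To make the binder versus non-binder classification concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' network: it fits a simple logistic classifier to a balanced toy set, and the two feature values per peptide-protein pair (for example, confidence-like scores from a structure model), the data, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binding_probability(x, weights, bias):
    """Predicted probability that a pair with features x is a binder."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_binding_head(features, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = binding_probability(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_binder(x, weights, bias, threshold=0.5):
    """Call a pair a binder when its probability reaches the threshold."""
    return binding_probability(x, weights, bias) >= threshold

# Hypothetical, made-up features for a balanced toy training set:
# three "binders" (label 1) and three "non-binders" (label 0).
toy_features = [
    [0.90, 0.80], [0.85, 0.70], [0.80, 0.90],  # binders
    [0.20, 0.30], [0.30, 0.10], [0.10, 0.20],  # non-binders
]
toy_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_binding_head(toy_features, toy_labels)
print(predict_binder([0.90, 0.80], weights, bias))
print(predict_binder([0.20, 0.20], weights, bias))
```

In the actual study the classification is made by a fine-tuned structure prediction network; the sketch only shows the general shape of training a binary classifier on equal numbers of binders and non-binders and then applying a decision threshold.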
The importance of determining protein-peptide interactions extends to many areas of cellular signaling. To determine whether the trained model could provide insights into protein-peptide affinities for other kinds of interaction pairs, the researchers investigated its predictive accuracy on peptide affinity for SH3 and PDZ protein domains. They compared the model trained on peptide-MHC pairs to the original AlphaFold model run with its default parameters and found that the trained model indeed made more accurate predictions. Significantly, “even though we train on peptide-MHC binding, we show that the same network can also make improved predictions of peptide recognition by SH3 and PDZ domains, structurally distinct binding domains,” added Dr. Bradley.
This work demonstrates how structural data can improve modeling of complex puzzles like protein-peptide interactions. “Here we show that by adding a new ‘binding-prediction’ layer to the network, we can further train (‘fine tune’) the network to predict whether or not a peptide will bind to an MHC molecule.” This modeling strategy was also applicable to predicting protein domain-peptide interactions and, with further refinement, may be able to predict relative binding affinities of protein-peptide pairs. Furthermore, Dr. Bradley commented, “We are excited to see whether the same approach could be applied to the much more challenging question of how T cell receptors (TCRs) recognize peptide-MHC epitopes. TCR sequences and structures are highly diverse, as are their binding complexes with peptide-MHC, and there are far fewer examples of validated interactions than for peptide-MHC. All that makes the problem harder, but the lack of experimental data, in particular, means that a structural approach may have advantages over the more data-intensive sequence-based machine learning predictors that other groups are developing.”
The spotlighted research was funded by AWS, Microsoft, the Audacious Project at the Institute for Protein Design, HHMI, the National Institutes of Health, and the Jane Coffin Childs Memorial Fund for Medical Research.
Fred Hutch/University of Washington/Seattle Children's Cancer Consortium members David Baker and Phil Bradley contributed to this work.
Motmaen A, Dauparas J, Baek M, Abedi MH, Baker D, Bradley P. 2023. Peptide-binding specificity prediction using fine-tuned protein structure prediction networks. Proc Natl Acad Sci USA. Feb 28;120(9):e2216697120.