ADpred: a deep learning model for accurately predicting transcription activation domains

From the Hahn lab (Basic Sciences Division), the Noble lab (University of Washington), and the Söding lab (Max Planck Institute for Biophysical Chemistry)

The central dogma of molecular biology dictates that information flows from DNA into RNA into protein in living cells. Transcription, a major tenet of this dogma, is the process by which information stored in DNA is transferred to RNA. This process is catalyzed by RNA polymerases, which were first mapped in bacteria in the 1960s. It was not until the turn of the millennium that the structure of the active eukaryotic RNA polymerase was determined by Roger Kornberg. He was duly awarded the 2006 Nobel Prize in Chemistry for his studies of the molecular basis of eukaryotic transcription in yeast.

Understanding the molecular mechanisms underlying eukaryotic transcription is of central importance for the life sciences and remains a formidable challenge despite the major advances made over the last half century. The Hahn lab (Basic Sciences Division) studies the mechanisms and regulation of transcription in eukaryotes using a combination of biochemical and computational approaches. Deciphering these regulatory mechanisms can lead to understanding the molecular mechanisms underlying many types of human disease.

Transcription factors are composed of DNA-binding domains and activation domains (ADs). Together, these domains create a bridge between the target gene and coactivators, which are molecules that can regulate transcription and/or modify chromatin. Transcription factor ADs are encoded by a wide range of seemingly unrelated amino acid sequences and are structurally disordered, leaving open the question of how these unusual properties translate into a molecular mechanism for function. In other words, how do transcription factors recruit the right coactivator to the right place at the right time given their unusual properties?

In previous work, Dr. Steve Hahn and colleagues sought to answer this question by focusing on the yeast transcription factor Gcn4 and its coactivator Med15, a component of the larger Mediator coactivator complex. Their work revealed that Gcn4 and Med15 achieve a strong and functional interaction through tandem weak, dynamic activator-binding domains. To gain a broader understanding of how acidic transcription activators function, Dr. Ariel Erijman, a former postdoc in the Hahn lab, conducted a high-throughput screen for transcription activation domains in collaboration with the lab of computational biologist Johannes Söding (Max Planck Institute for Biophysical Chemistry, Göttingen, Germany). Together, they developed a deep learning predictor for acidic AD function, which they termed ADpred (https://adpred.fredhutch.org). Their work was published in a recent issue of Molecular Cell.

Dr. Erijman and colleagues screened a large set of random synthetic 30-mer peptides for AD function in yeast and used the resulting data to train a deep neural network. This work was possible thanks to major developments in the deep learning field and the availability of a large, high-quality data set of AD-positive and AD-negative sequences. Dr. Hahn reflected on the importance of their study: "Our work not only answers a longstanding question about the nature and molecular mechanism of how acidic transcription activators function, it also provides an important tool for identifying acidic ADs and pinpoints the functionally important residues." He added: "The study also opens up a new way to find other classes of ADs that may work by different mechanisms."

Diagram of the convolutional neural network. For ADpred analysis, each residue of the input 30-mer peptide contains 23 features (the 20 possible amino acids at each position, and 3 possible values for predicted secondary structure). The output is a probability of acidic AD function. Courtesy of Dr. Steve Hahn
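
The input layout described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the ADpred code: the alphabet ordering, the three secondary-structure labels (`H`, `E`, `-`), the function name `encode_peptide`, and the example peptide are all assumptions made for the example.

```python
# Hypothetical sketch of a 30 x 23 input encoding; names and label
# choices below are assumptions, not taken from the ADpred software.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering assumed)
SS_STATES = "HE-"                     # helix, strand, coil labels (assumed)
PEPTIDE_LENGTH = 30


def encode_peptide(sequence, secondary_structure):
    """Return a 30 x 23 matrix: one-hot residue (20) plus one-hot structure (3)."""
    if len(sequence) != PEPTIDE_LENGTH or len(secondary_structure) != PEPTIDE_LENGTH:
        raise ValueError("expected a 30-residue peptide and a 30-character structure string")
    matrix = []
    for residue, state in zip(sequence, secondary_structure):
        if residue not in AMINO_ACIDS or state not in SS_STATES:
            raise ValueError(f"unsupported symbol: {residue!r} / {state!r}")
        row = [0] * (len(AMINO_ACIDS) + len(SS_STATES))
        row[AMINO_ACIDS.index(residue)] = 1                   # residue identity
        row[len(AMINO_ACIDS) + SS_STATES.index(state)] = 1    # predicted structure
        matrix.append(row)
    return matrix


# Made-up peptide and an all-coil structure string, for illustration only.
example_peptide = "DDLFDEWLSDFDDLEEMFSTDDLWDEFLSD"
example_structure = "-" * PEPTIDE_LENGTH
features = encode_peptide(example_peptide, example_structure)
print(len(features), len(features[0]))
```

Each row holds exactly two non-zero entries, one for the amino acid and one for the structure state; a matrix of this shape is the kind of input a convolutional network could slide its filters along.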

Dr. Hahn emphasized the collaborative spirit that drove this project: "We had too many to count 7:30 am (Seattle time) Skype meetings with our German collaborators to discuss the project, approaches and results." Dr. Hahn added that Dr. Bill Noble and his grad student Jacob Schreiber from UW were also invaluable collaborators. "They are deep learning experts and worked with Ariel on implementation and validation of the deep learning approach," said Dr. Hahn. This collaborative effort was spearheaded by Dr. Erijman, who, according to Dr. Hahn, "did outstanding and creative work on the project, from the high throughput screen, to learning and implementing both machine learning approaches, expert data analysis, and developing the ADpred website and publicly available software."

Machine learning analysis enabled the development of an accurate predictor of AD function, and software tools developed in the past few years allowed the team to identify the pattern of amino acids that encodes AD function. Dr. Hahn explained the significance of this new approach: "Until recently deep learning was like a black box – it was hard to tell what features led to accurate predictions. New methods, originally developed for image analysis, allowed us to visualize which amino acid side chains are most important for function and led to identification of a simple repeating sequence feature in strong ADs that explains why amino acid composition is highly related to function."

Proteome analysis with ADpred suggests that acidic ADs account for less than half of metazoan ADs. While this study answers many questions about the function of acidic ADs, little is known about non-acidic ADs and how they function. "Our results and prior work imply that there are significant numbers of non-acidic transcription activators," said Dr. Hahn. "What is the nature of these activators, what are their targets, and how do their mechanisms compare with the acidic activators?" he wonders. The Hahn lab is employing multiple parallel approaches to answer these questions.

ADpred locates the transcription activation domain of MyoD, a transcription factor that programs muscle cell fate. MyoD was discovered by Hal Weintraub and colleagues. Courtesy of Dr. Steve Hahn
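
The article does not describe how a predictor trained on 30-mers is applied to a full-length protein such as MyoD. One common approach, assumed here purely for illustration, is to score every overlapping 30-residue window and report the highest-scoring region. In the sketch below, `toy_score` is a crude placeholder and not the ADpred model, and the function names and the toy protein are hypothetical.

```python
# Hypothetical sliding-window scan; toy_score is a stand-in, NOT ADpred.
WINDOW = 30


def toy_score(window):
    """Placeholder scorer: fraction of acidic or bulky hydrophobic residues."""
    return sum(residue in "DEFWLY" for residue in window) / len(window)


def scan_protein(sequence, score_fn=toy_score, window=WINDOW):
    """Score every overlapping window; return a list of (start_index, score)."""
    if len(sequence) < window:
        return []
    return [
        (start, score_fn(sequence[start:start + window]))
        for start in range(len(sequence) - window + 1)
    ]


def best_window(sequence, score_fn=toy_score, window=WINDOW):
    """Return the (start_index, score) pair with the highest score, or None."""
    scores = scan_protein(sequence, score_fn, window)
    return max(scores, key=lambda pair: pair[1]) if scores else None


# Made-up protein: an acidic, hydrophobic stretch flanked by neutral filler.
toy_protein = "K" * 40 + "DDLFDEWLSDFDDLEEMFSTDDLWDEFLSD" + "S" * 40
start, score = best_window(toy_protein)
print(f"highest-scoring window starts at residue {start + 1}")
```

Passing a real model's prediction function as `score_fn` would keep the scanning logic unchanged, which is why the scorer is a parameter rather than hard-coded.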

This work was funded by the National Institutes of Health, an IN for the Hutch award, and the Deutsche Forschungsgemeinschaft focus program.

UW/Fred Hutch Cancer Consortium member Steven Hahn contributed to this work.

Erijman A, Kozlowski L, Sohrabi-Jahromi S, Fishburn J, Warfield L, Schreiber J, Noble W, Söding J, and Hahn S. 2020. A High-Throughput Screen for Transcription Activation Domains Reveals Their Sequence Features and Permits Prediction by Deep Learning. Molecular Cell. doi:10.1016/j.molcel.2020.04.020