Google researchers use machine learning approach to annotate protein domains


Proteins play an important role in the construction and functioning of all living organisms. Each protein is made up of a chain of amino acid building blocks. Just as an image can contain multiple objects, a protein can contain multiple components, called protein domains.

Researchers have extensively studied the difficult task of understanding the relationship between a protein’s amino acid sequence and its structure or function.

Many people are familiar with DeepMind’s AlphaFold, which uses computational methods to predict protein structure from amino acid sequences. While existing methods have successfully predicted the function of hundreds of millions of proteins, many more still have no known function. Reliably predicting the function of highly divergent sequences is becoming a more pressing problem as the volume and diversity of protein sequences in public databases rapidly increase.

The Google AI team introduces an ML technique to reliably predict protein function. Using it, the team added about 6.8 million entries to Pfam, the widely used protein family database that contains detailed computational annotations describing the function of protein domains. They are also releasing the ProtENN model, along with an interactive tool that lets users enter a sequence and receive a predicted protein function in real time in the browser, with no setup required.

The researchers began by developing a protein domain classification model as a first step toward classifying complete protein sequences. Given the amino acid sequence of a protein domain, they frame the problem as a multi-class classification task in which they predict a single label out of 17,929 classes (all the classes in the Pfam database).
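To make the multi-class framing concrete, here is a minimal sketch of how a domain sequence might be encoded and how a single label is chosen from a vector of class scores. The one-hot encoding, the residue ordering, the example sequence, and the helper names `one_hot_encode` and `predict_label` are illustrative assumptions, not details taken from the team's implementation.

```python
# The 20 standard amino acids in an arbitrary fixed order (an assumption of
# this sketch; a real model may also handle rare or ambiguous residues).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
NUM_CLASSES = 17929  # number of Pfam classes cited in the article


def one_hot_encode(sequence):
    """Return one row per residue, each row holding a single 1.0."""
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1.0
        rows.append(row)
    return rows


def predict_label(class_scores):
    """Multi-class prediction: return the index of the highest score."""
    return max(range(len(class_scores)), key=class_scores.__getitem__)


# Hypothetical 8-residue input, used only to show the shapes involved.
x = one_hot_encode("MKTAYIAK")

# Placeholder scores standing in for the output of a trained model.
scores = [0.0] * NUM_CLASSES
scores[42] = 1.0
label = predict_label(scores)
```

Here `x` has one row per residue and 20 columns, and `label` is a single class index, which is what "predict a single label" amounts to in practice.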

The major drawback of current state-of-the-art methods is that they are based on linear sequence alignment and do not take into account interactions between amino acids in different sections of protein sequences. Proteins, on the other hand, don’t just stay as a line of amino acids. Rather, they fold in on themselves, causing strong interactions between non-adjacent amino acids.

A fundamental step in current state-of-the-art approaches is to align a new query sequence to one or more sequences with known functions. Because of this reliance on sequences with known functions, it is difficult to predict the function of a new sequence that is very different from any annotated sequence. Additionally, alignment-based approaches are computationally intensive, which makes them prohibitively expensive to apply to large datasets like the MGnify metagenomic database, which contains over a billion protein sequences.

The team suggests that dilated convolutional neural networks (CNNs) are well suited to modeling non-local pairwise amino acid interactions. Additionally, they can be run on modern ML hardware such as GPUs. They train ProtCNN (a one-dimensional CNN) and ProtENN (an ensemble of independently trained ProtCNN models) to classify protein sequences.
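A dilated convolution spaces its kernel taps several positions apart, so stacked layers can relate residues that are far apart in the chain without needing many more parameters. The sketch below is a plain-Python illustration of that idea, not the ProtCNN architecture; the function names, kernel, and dilation rates are assumptions chosen for the example.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid 1D cross-correlation whose kernel taps are `dilation` apart."""
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Input positions that can influence one output after stacking layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [float(i) for i in range(10)]

# With dilation 2, each output sums positions i, i + 2 and i + 4.
out = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)

# Four layers of kernel size 3: exponentially growing dilation sees far
# more of the sequence than the same stack without dilation.
wide = receptive_field(3, [1, 2, 4, 8])
narrow = receptive_field(3, [1, 1, 1, 1])
```

In this toy setup the dilated stack covers 31 input positions against 9 for the undilated one, which is the property that makes such networks attractive for non-local interactions.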

Because proteins have evolved from common ancestors, they often share much of their amino acid sequence. Unless care is taken, the test set can end up dominated by samples that are very similar to the training data. This rewards models that merely “memorize” the training data rather than learning to generalize more broadly.

It is therefore essential to evaluate model performance under several different setups. For each evaluation, they stratify model accuracy by the similarity between each held-out test sequence and the nearest sequence in the training set.

The first evaluation assesses how well the model generalizes to out-of-distribution data. For this, they use a clustered split of training and test sets, in which protein sequences are grouped by sequence similarity. Because entire clusters are assigned to either the training set or the test set, every test example differs by at least 75% from every training example.
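The key property of such a split is that a cluster is never divided between training and test data. A minimal sketch of that assignment step follows, assuming the clustering has already been done; the `cluster_split` helper, the sequence names, the cluster ids, and the test fraction are all hypothetical.

```python
import random


def cluster_split(cluster_of, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so no cluster is shared.

    `cluster_of` maps a sequence id to the id of its similarity cluster.
    """
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)
    rng.shuffle(clusters)

    # Hold out at least one cluster for testing.
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])

    train = sorted(s for s, c in cluster_of.items() if c not in test_clusters)
    test = sorted(s for s, c in cluster_of.items() if c in test_clusters)
    return train, test


# Hypothetical sequences already grouped into five similarity clusters.
cluster_of = {
    "seqA": 0, "seqB": 0, "seqC": 1, "seqD": 2,
    "seqE": 2, "seqF": 3, "seqG": 4,
}
train_ids, test_ids = cluster_split(cluster_of)
```

A random per-sequence split would instead scatter near-identical relatives across both sets, which is the leakage this scheme is meant to avoid.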

For the second evaluation, they use a randomly split training and test set and stratify the examples by how difficult they are to classify. Two measures of difficulty are used: the similarity between a test example and its closest training example, and the number of training examples belonging to the true class.
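Stratifying by similarity means reporting accuracy separately for test examples that are close to the training data and for those that are far from it. The sketch below shows one way this could be computed; the `accuracy_by_similarity` helper, the bin edges, and the records are made-up toy values for illustration, not results from the study.

```python
from collections import defaultdict


def accuracy_by_similarity(records, bin_edges):
    """Group (similarity_percent, is_correct) pairs into bins and score each.

    A record falls into bin (low, high) when low <= similarity < high.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for similarity, is_correct in records:
        for low, high in zip(bin_edges, bin_edges[1:]):
            if low <= similarity < high:
                totals[(low, high)] += 1
                hits[(low, high)] += int(is_correct)
                break
    return {b: hits[b] / totals[b] for b in totals}


# Toy records: similarity to the nearest training sequence (percent) and
# whether the prediction was correct. These numbers are invented.
records = [
    (15, False), (22, True), (12, True),
    (48, True), (55, True), (91, True),
]

# The last edge is 101 so that a similarity of exactly 100 is included.
per_bin = accuracy_by_similarity(records, [0, 25, 50, 101])
```

Reporting one number per bin makes it visible when a model does well only on test sequences that closely resemble its training data.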

They evaluate the performance of the most widely used baseline models under these evaluation setups, focusing on:

  • BLAST, a nearest-neighbor method that uses sequence alignment to quantify distance and infer function
  • Profile hidden Markov models (TPHMM and phmmer)

The team collaborated with the Pfam team at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) to see whether their approach could be used to annotate real-world sequences. The results show that ProtENN learns information complementary to alignment-based methods, so they combined the two approaches to annotate more sequences than either could alone. The result, Pfam-N, a set of 6.8 million new protein sequence annotations, has been made publicly available.

After seeing the success of these methods on classification tasks, they examined the networks to determine whether the learned embeddings were generally useful. For this, they created an interactive manuscript that lets users explore the relationship between model predictions, embeddings, and input sequences. They found that similar sequences were clustered together in embedding space.

Moreover, because they used a dilated CNN as their network architecture, they could apply previously developed interpretability methods such as class activation mapping (CAM) and sufficient input subsets (SIS) to identify the subsequences that matter most for the network's predictions. With these methods, they find that their network predicts the function of a sequence by focusing on its relevant elements.



Sherry J. Basler