Meta AI presents a machine learning-based model that predicts protein structure up to 60 times faster than the state of the art

Proteins are complex biological molecules that play a vital role in many essential and diverse life processes. They perform a variety of biological tasks in organisms, ranging from human vision to the complex molecular machinery that transforms solar energy into chemical energy in plants. Proteins are chains built from 20 different types of amino acids, and they fold into complex 3D structures. A protein's structure largely determines how it functions, so knowing the structure helps scientists understand how the protein works and develop strategies to mimic, modify, or inhibit its behavior.

However, the amino acid sequence alone does not let researchers immediately determine the final structure. The structure can be determined through simulations or experiments, but the procedure is very time consuming. Recent advances in artificial intelligence could lead to a new understanding of protein structure at an evolutionary scale. Predicting structures for the roughly 200 million cataloged proteins became possible only recently. Large-scale gene sequencing has revealed billions of protein sequences, and characterizing their structures would require a breakthrough in the speed of structure prediction.

Meta AI recently announced a step in this direction: it used large language models to accelerate protein structure prediction and build the first comprehensive database of predicted structures at the scale of hundreds of millions of proteins. The database is the largest among existing protein structure databases and contains more than 600 million predicted structures. Compared to current state-of-the-art protein structure prediction methods, language models can speed up the prediction of three-dimensional structure at the atomic level by up to 60 times.

The team has made three things public: the model, which is based on the 15-billion-parameter ESM-2 Transformer; the ESM Metagenomic Atlas (a database of predicted protein structures); and an API that allows researchers to use the model. According to the researchers, this breakthrough makes it possible for the first time to understand the structures of the billions of proteins being cataloged by gene sequencing technology. Using the protein structures in this database, many of which scientists have never seen before, researchers can learn about the diversity of the natural world and make discoveries that could help treat disease, clean up the environment, and produce renewable energy.

Protein sequences can be compared to the text of an essay. Like written language, they can be expressed as strings of letters, where each character represents one of 20 amino acids. Each protein sequence folds into a three-dimensional shape, which largely determines the biological activity of the protein. There are, however, important and fundamental distinctions between protein sequences and natural language. One is that protein sequences contain statistical patterns that reveal details about the folded structure of the protein.
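To make the "strings of letters" idea concrete, here is a minimal sketch of how a protein sequence might be turned into integer tokens before it is fed to a language model. The vocabulary ordering, the function names, and the example peptide are assumptions made for illustration; they are not the tokenizer used by ESM-2.

```python
from typing import List

# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary for this sketch: a token id is the letter's index above.
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str) -> List[int]:
    """Map a protein sequence to integer token ids, rejecting unknown letters."""
    seq = sequence.upper()
    unknown = sorted(set(seq) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"non-standard residues: {unknown}")
    return [TOKEN_ID[aa] for aa in seq]


def decode(token_ids: List[int]) -> str:
    """Map token ids back to the one-letter protein sequence."""
    return "".join(AMINO_ACIDS[i] for i in token_ids)


example = "MKTAYIAKQR"  # arbitrary illustrative peptide, not taken from the article
ids = encode(example)
print(ids)
print(decode(ids))
```

Real protein language models typically add special tokens (for example for padding or masking) on top of such a vocabulary; those are left out here to keep the sketch short.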

Evolutionary Scale Modeling (ESM) uses AI to learn to read these patterns. In 2019, a language model was trained on the sequences of millions of natural proteins using masked language modeling, a self-supervised learning method in which parts of the input are hidden and the model learns to predict them from the surrounding context. This training helped the model capture details about the structure and function of proteins. The next-generation protein language model, ESM-2, builds on this approach. The team observed that, as the model is scaled from 8 million to 15 billion parameters, information emerges in its internal representations that allows 3D structure prediction at the atomic level.
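The masked language modeling objective can be sketched in a few lines. The example below only prepares a training example (a masked sequence plus the hidden residues the model would have to recover) and scores a set of predictions; it does not include a model. The mask symbol, the 15% mask rate, and the function names are assumptions for illustration, not details reported for ESM-2.

```python
import random
from typing import Dict, Tuple

MASK = "#"  # hypothetical mask symbol chosen for this sketch


def mask_sequence(
    sequence: str, mask_rate: float = 0.15, seed: int = 0
) -> Tuple[str, Dict[int, str]]:
    """Hide a random subset of residues and return (masked sequence, hidden targets)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    rng = random.Random(seed)
    # Always mask at least one position so there is something to predict.
    n_masked = max(1, round(len(sequence) * mask_rate))
    positions = rng.sample(range(len(sequence)), n_masked)
    targets = {pos: sequence[pos] for pos in positions}
    masked = "".join(MASK if i in targets else aa for i, aa in enumerate(sequence))
    return masked, targets


def masked_accuracy(predictions: Dict[int, str], targets: Dict[int, str]) -> float:
    """Fraction of masked positions where the predicted residue matches the original."""
    correct = sum(predictions.get(pos) == aa for pos, aa in targets.items())
    return correct / len(targets)


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary illustrative sequence
masked, targets = mask_sequence(sequence, mask_rate=0.15, seed=1)
print(masked)   # same length as the input, with some residues replaced by '#'
print(targets)  # position -> original residue the model must predict
```

During training, a model would see only the masked string and be rewarded for assigning high probability to the residues in `targets`. Because the labels come from the sequence itself, no manual annotation is needed, which is what makes the method self-supervised.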

Even with the resources of a large research organization, it can take years to predict structures for that many protein sequences using today's state-of-the-art computational tools. A breakthrough in prediction speed is essential for making predictions at metagenomic scale. The researchers found that using a language model of protein sequences could increase the speed of structure prediction by up to 60 times. This is fast enough to predict structures for an entire metagenomic database in weeks, and it scales to databases considerably larger than Meta's ESM Metagenomic Atlas.
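The jump from "years" to "weeks" follows directly from the speedup factor. The short calculation below illustrates the arithmetic; the three-year baseline is a purely hypothetical figure chosen for the example and is not a number reported by Meta.

```python
def accelerated_days(baseline_days: float, speedup: float) -> float:
    """Time needed after a uniform speedup, assuming the workload is unchanged."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_days / speedup


baseline_years = 3.0  # hypothetical baseline, for illustration only
speedup = 60.0        # the "up to 60 times" figure quoted in the article

days = accelerated_days(baseline_years * 365, speedup)
print(f"{days:.2f} days (about {days / 7:.1f} weeks)")  # 18.25 days (about 2.6 weeks)
```

The calculation assumes the full 60x factor applies to every sequence, which is the best case: the article describes the speedup as "up to" 60 times.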

Current state-of-the-art structure prediction methods have to search large protein databases to find related sequences. These methods take a set of evolutionarily related sequences as input so that they can extract the patterns associated with structure. A language model instead picks up these evolutionary patterns while it is trained on protein sequences, enabling high-resolution three-dimensional structure prediction directly from the protein sequence.

AI can give people a new perspective on biology and help them understand the wide range of natural variation. The language of proteins is beyond human comprehension, and even the most sophisticated computational tools have been unable to fully understand it. AI has the potential to help us understand this language. ESMFold, the team's structure prediction model, demonstrates how AI can provide new tools for understanding the natural world and reveals connections between different domains. For example, large language models, which are driving advances in machine translation, natural language understanding, speech recognition, and image generation, can also learn deep information about biology.

According to Meta, because work on metagenomics spans several fields, including biology, chemistry, and artificial intelligence, it is crucial to collaborate, share findings, and build on the ideas of others. The team anticipates that ESM-2 and the ESM Metagenomic Atlas will support researchers working to understand the evolutionary history of diseases and the effects of climate change. Meta AI is also working on extending language models so they can be used to design new proteins and help solve problems related to health, disease, and the environment.

This article is written as a research summary by Marktechpost Staff based on the research paper 'Evolutionary-scale prediction of atomic level protein structure with a language model'. All credit for this research goes to the researchers on this project.


Khushboo Gupta is an intern consultant at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing and web development. She likes to learn more about the technical field by participating in several challenges.


Sherry J. Basler