How Machine Learning Can Accelerate Solutions to Protein Design Challenges
Over the past two years, machine learning has revolutionized protein structure prediction. Now three articles in Science describe a similar revolution in protein design.
In the new papers, biologists at the University of Washington School of Medicine show that machine learning can be used to create protein molecules much more precisely and quickly than before. Scientists hope this breakthrough will lead to many new vaccines, treatments, carbon capture tools and sustainable biomaterials.
“Proteins are fundamental in all of biology, but we know that all the proteins in every plant, animal and microbe represent far less than one percent of what is possible. With these new software tools, researchers should be able to find solutions to long-standing challenges in medicine, energy and technology,” said senior author David Baker, professor of biochemistry at the University of Washington School of Medicine and recipient of a 2021 Breakthrough Prize in Life Sciences.
Proteins are often referred to as the “building blocks of life” because they are essential to the structure and function of all living things. They are involved in virtually every process that takes place inside cells, including growth, division, and repair. Proteins are made up of long chains of chemicals called amino acids. The sequence of amino acids in a protein determines its three-dimensional shape. This complex shape is crucial for the functioning of the protein.
Recently, powerful machine learning algorithms including AlphaFold and RoseTTAFold have been trained to predict the detailed shapes of natural proteins based solely on their amino acid sequences. Machine learning is a type of artificial intelligence that allows computers to learn from data without being explicitly programmed. Machine learning can be used to model complex scientific problems that are too difficult for humans to understand.
To go beyond proteins found in nature, members of Baker’s team divided the challenge of protein design into three parts and used a new software solution for each.
First, a new protein shape must be generated. In an article published July 21 in the journal Science, the team showed that artificial intelligence can generate new protein shapes in two ways. The first, dubbed “hallucination”, is akin to DALL-E or other generative AI tools that produce output based on simple prompts. The second, called “inpainting”, is analogous to the autocomplete feature found in modern search bars.
Second, to speed up the process, the team designed a new algorithm to generate amino acid sequences. Featured in the September 15 issue of Science, this software tool, called ProteinMPNN, runs in about a second. It’s over 200 times faster than the previous best software. Its results are superior to previous tools and the software requires no expert customization to work.
“Neural networks are easy to train if you have a ton of data, but with proteins we don’t have as many examples as we would like. We had to go in and identify the characteristics of these molecules that are the most important. It was a bit of trial and error,” said project scientist Justas Dauparas, a postdoctoral researcher at the Institute for Protein Design.
Third, the team used AlphaFold, a tool developed by Alphabet’s DeepMind, to independently assess whether the amino acid sequences they found were likely to fold into the expected shapes.
“Protein structure prediction software is part of the solution, but on its own it cannot come up with anything new,” Dauparas explained.
“ProteinMPNN is to protein design what AlphaFold was to protein structure prediction,” Baker added.
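The three-part workflow described above (generate a shape, design a sequence for it, then independently check the fold) can be pictured as a simple filtering loop. The Python sketch below is illustrative only and is not the team's software: every function in it is a hypothetical placeholder that returns random values, standing in for hallucination or inpainting, for a sequence-design tool such as ProteinMPNN, and for a structure predictor such as AlphaFold. The agreement score and its threshold are likewise assumptions made for the example.

```python
import random
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids


@dataclass
class Design:
    backbone: list   # toy stand-in for 3D backbone coordinates
    sequence: str    # amino acid sequence proposed for that backbone


def generate_backbone(length, rng):
    """Step 1 (placeholder): stands in for hallucination or inpainting."""
    return [rng.random() for _ in range(length)]


def design_sequence(backbone, rng):
    """Step 2 (placeholder): stands in for a sequence-design tool."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in backbone)


def predicted_agreement(backbone, sequence, rng):
    """Step 3 (placeholder): stands in for an independent structure predictor.

    Returns a made-up score in [0, 1); a real check would compare the
    predicted structure with the intended backbone.
    """
    return rng.random()


def run_pipeline(n_candidates, length, threshold, seed=0):
    """Generate candidates and keep only those that pass the fold check."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_candidates):
        backbone = generate_backbone(length, rng)
        sequence = design_sequence(backbone, rng)
        if predicted_agreement(backbone, sequence, rng) >= threshold:
            accepted.append(Design(backbone, sequence))
    return accepted


if __name__ == "__main__":
    designs = run_pipeline(n_candidates=20, length=30, threshold=0.5)
    print(f"{len(designs)} of 20 candidates passed the placeholder check")
```

The point of the structure is that the third step acts as an independent filter: candidates that the predictor does not expect to fold into the intended shape are discarded before any laboratory work.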
In another article published in Science on September 15, a team from the Baker lab confirmed that the combination of new machine learning tools could reliably generate new proteins that worked in the lab.
“We found that proteins made using ProteinMPNN were much more likely to fold as expected, and we could create very complex protein assemblies using these methods,” said project scientist Basile Wicky, a postdoctoral fellow at the Institute for Protein Design.
Among the new proteins made were nanoscale rings that the researchers believe could become parts for custom nanomachines. Electron microscopes were used to observe the rings, which have diameters about a billion times smaller than a poppy seed.
“This is the very beginning of machine learning in protein design. In the coming months, we will work to improve these tools to create even more dynamic and functional proteins,” Baker said.
Computing resources for this work were donated by Microsoft and Amazon Web Services.
Funding was provided by the Audacious Project at the Institute for Protein Design; Microsoft; Eric and Wendy Schmidt on the recommendation of Schmidt Futures; the DARPA Synergistic Discovery and Design project (contract HR001117S0003 FA8750-17-C-0219); the DARPA Harnessing Enzymatic Activity for Lifesaving Remedies project (contract HR001120S0052 HR0011-21-2-0012); Washington Research Foundation; Open Philanthropy Project Improving Protein Design Fund; Amgen; Alfred P. Sloan Foundation Matter-to-Life Program Grant (G-2021-16899); Donald and Jo Anne Petersen Endowment for Accelerating Advances in Alzheimer’s Disease Research; Human Frontier Science Program Interdisciplinary Fellowship (LT000395/2020-C); European Molecular Biology Organization (ALTF 139-2018), including an unpaid EMBO fellowship (ALTF 1047-2019) and a long-term EMBO fellowship (ALTF 191-2021); “La Caixa” Foundation; Howard Hughes Medical Institute, including a Hanna Gray Fellowship (GT11817); National Science Foundation (MCB 2032259, CHE-1629214, DBI 1937533, DGE-2140004); National Institutes of Health (DP5OD026389); the National Institute of Allergy and Infectious Diseases (HHSN272201700059C); National Institute on Aging (5U19AG065156); National Institute of General Medical Sciences (P30 GM124169-01, P41 GM 103533-24); National Cancer Institute (R01CA240339); Swiss National Science Foundation; Swiss National Competence Center for Molecular Systems Engineering; Swiss National Center of Competence in Chemical Biology; and the European Research Council (716058).