How we use machine learning to understand proteins

When most people think of protein, their minds usually go to protein-rich foods such as steak or tofu. But proteins are so much more. They are essential for the functioning and development of living things, and studying them can help improve life. For example, life-changing insulin treatments for people with diabetes are based on years of protein study.

There is still a world of information to be discovered when it comes to proteins, from helping people get the health care they need to finding ways to protect plant species. Google teams are focused on studying proteins so we can achieve Google Health’s mission to help billions of people live healthier lives.

In March, we published an article about a model we developed at Google that predicts protein function and a tool that allows scientists to use this model. Since then, the protein function team has done more work in this space. We spoke with software engineer Max Bileschi to learn more about the study of proteins and the work Google is doing.

Can you give us a quick lesson on proteins?

Proteins dictate much of what happens in and around us, such as how we and other organisms function.

Two things determine what a protein does: its chemical formula and its environment. For example, we know that human hemoglobin, a protein found in your blood, carries oxygen to your organs. We also know that if there are particular tiny changes in the chemical formula of hemoglobin in your body, it can trigger sickle cell anemia. Additionally, we know that blood behaves differently at different temperatures because proteins behave differently at higher temperatures.

So why did a team from Google start studying proteins?

We have the opportunity to examine how machine learning can help various fields of science. Proteins are an obvious choice because of the amazing range of functions they have in our bodies and in the world. There is a tremendous amount of public data, and while individual researchers have done great work studying specific proteins, we know that we have only scratched the surface of fully understanding the universe of proteins. This work is well aligned with Google’s mission to organize information and make it accessible and useful.

Sounds exciting! Tell us more about using machine learning to identify what proteins do and how it improves the status quo.

Only about 1% of proteins have been studied in the laboratory. We want to see how machine learning can help us learn more about the other 99%.

This is difficult work. There are at least a billion proteins in the world, and they have evolved throughout history, shaped by the same forces of natural selection that we normally think of as acting on DNA. It is useful to understand these evolutionary relationships between proteins. The presence of a similar protein in two or more distant organisms (say, humans and zebrafish) may indicate that it is useful for survival. Proteins that are closely related can perform similar functions but with small differences, such as encouraging the same chemical reaction but doing so at different temperatures. Sometimes it’s easy to tell that two proteins are closely related, but other times it’s hard. This was the first protein function annotation problem we tackled with machine learning.

Machine learning helps most when it assists, rather than replaces, current techniques. For example, we have demonstrated that approximately 300 previously uncharacterized proteins are related to “phage capsid” proteins. These capsid proteins can help us deliver drugs to the cells that really need them. We worked with a trusted protein database, Pfam, to confirm our hypothesis, and now these proteins are listed as related to phage capsid proteins for everyone to see, including researchers.

Let’s step back a little. Can you explain what the Pfam Protein Family Database is? How has your team contributed to this database?

Over decades, a community of scientists has built a number of tools and databases to help classify what each protein does. Pfam is one of the most widely used of these databases, and it classifies proteins into approximately 20,000 protein families.

This protein classification work requires both computer models and human experts, called curators, who validate and improve those models.

Sherry J. Basler