Could machine learning fuel a reproducibility crisis in science?

A CT scan of a tumor in human lungs. Researchers are experimenting with AI algorithms that can detect early signs of the disease. Credit: KH Fung/SPL

From biomedicine to political science, researchers are increasingly using machine learning as a tool to make predictions based on patterns in their data. But the claims of many such studies are likely to be overstated, according to two researchers at Princeton University in New Jersey. They want to sound the alarm about what they call a brewing “reproducibility crisis” in machine learning-based science.

Machine learning is being sold as a tool that researchers can learn in a few hours and use on their own — and many are heeding that advice, says Princeton machine-learning researcher Sayash Kapoor. “But you wouldn’t expect a chemist to learn how to run a lab using an online course,” he says. And few scientists realize that the problems they encounter when applying artificial intelligence (AI) algorithms are common to other fields, says Kapoor, co-author of a preprint on the “crisis”1. Peer reviewers don’t have time to scrutinize these models, so academia currently lacks mechanisms to weed out irreproducible papers, he says. Kapoor and his co-author Arvind Narayanan have created guidelines for scientists to avoid such pitfalls, including an explicit checklist to submit with each paper.

What is reproducibility?

Kapoor and Narayanan’s definition of reproducibility is broad. They say that other teams should be able to reproduce the results of a model, given the full details of the data, code and conditions – often referred to as computational reproducibility, which is already a concern for machine-learning scientists. The pair also define a model as irreproducible when researchers make errors in analyzing the data, which means the model is not as predictive as claimed.

Judging such errors is subjective and often requires in-depth knowledge of the field in which the machine learning is applied. Some researchers whose work has been criticized by the team disagree that their papers are flawed, or say that Kapoor’s claims are too strong. In the social sciences, for example, researchers have developed machine-learning models that aim to predict when a country is likely to descend into civil war. Kapoor and Narayanan claim that, once errors are corrected, these models perform no better than standard statistical techniques. But David Muchlinski, a political scientist at the Georgia Institute of Technology in Atlanta, whose paper2 was examined by the pair, says the field of conflict prediction has been unfairly maligned and that follow-up studies support his work.

Still, the team’s rallying cry struck a chord. More than 1,200 people signed up for what was initially a small online replicability workshop on July 28, hosted by Kapoor and his colleagues, designed to find and spread solutions. “Unless we do something like this, every domain will continue to experience these issues over and over again,” he says.

Overoptimism about the powers of machine-learning models could prove detrimental when algorithms are applied in areas such as health and justice, says Momin Malik, a data scientist at the Mayo Clinic in Rochester, Minnesota, who will speak at the workshop. Unless the crisis is resolved, the reputation of machine learning could take a hit, he says. “I’m somewhat surprised that there hasn’t already been a collapse in the legitimacy of machine learning. But I think it could happen very soon.”

Machine learning problems

Kapoor and Narayanan say similar pitfalls occur when machine learning is applied across multiple sciences. The pair analyzed 20 reviews covering 17 research fields and counted 329 research papers whose results could not be fully reproduced because of problems in how machine learning was applied1.

Narayanan himself is not immune: a 2015 paper on computer security that he co-authored3 is among the 329. “It’s really a problem that needs to be solved collectively by this whole community,” Kapoor says.

The failures are not the fault of any individual researcher, he adds. Instead, a combination of AI hype and inadequate checks and balances is to blame. The most prominent problem highlighted by Kapoor and Narayanan is “data leakage”, when information from the data set on which a model learns includes data on which it is later evaluated. If these are not entirely separate, the model has effectively seen the answers in advance, and its predictions look much better than they actually are. The team identified eight major types of data leakage that researchers can be vigilant against.
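To see what one common form of leakage looks like in practice, consider the minimal Python sketch below. It is not drawn from any of the studies discussed here: the data are synthetic and the choice of scikit-learn, feature selection and logistic regression is purely illustrative. The point is that preprocessing fitted on the full data set, before splitting into training and test portions, lets information about the test data leak into the model and inflates the reported accuracy.

```python
# A minimal, illustrative sketch of data leakage (synthetic data, not from the
# papers under discussion): selecting features on the WHOLE data set before
# cross-validation lets the test folds influence the model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Few samples, many irrelevant features: a setting where leakage flatters results.
X, y = make_classification(n_samples=200, n_features=500, n_informative=5,
                           random_state=0)

# Leaky: feature selection sees all the data, including future test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(max_iter=1000),
                              X_leaky, y, cv=5).mean()

# Clean: the selector is refitted inside each training fold only.
clean_pipe = make_pipeline(SelectKBest(f_classif, k=20),
                           LogisticRegression(max_iter=1000))
clean_score = cross_val_score(clean_pipe, X, y, cv=5).mean()

print(f"leaky estimate:  {leaky_score:.2f}")   # typically optimistic
print(f"honest estimate: {clean_score:.2f}")   # usually noticeably lower
```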

Some data leakage is subtle. For example, temporal leakage occurs when the training data include points from later in time than the test data, which is a problem because the future depends on the past. As an example, Malik points to a 2011 paper4 that claimed a model analyzing the mood of Twitter users could predict the stock market’s closing value with 87.6% accuracy. But because the team had tested the model’s predictive power using data from a time period earlier than some of its training data, the algorithm had effectively been allowed to see into the future, he says.
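The sketch below illustrates the same idea on an invented time series; it is not a reconstruction of the Twitter study, and the model, drifting random-walk data and split sizes are all assumptions made for the example. Shuffling the data before splitting puts later points in the training set and earlier points in the test set, so the model effectively peeks at the future; a chronological split keeps training strictly before testing.

```python
# A minimal, hypothetical sketch of temporal leakage on synthetic data:
# shuffled splits mix "future" points into training; chronological splits don't.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(loc=0.2, size=1000))  # drifting random walk ("closing values")

lags = 3
# Features: the previous three values; target: the next value.
X = np.column_stack([prices[i:len(prices) - lags + i] for i in range(lags)])
y = prices[lags:]

def evaluate(shuffle):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, shuffle=shuffle, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

print("shuffled (leaky) R^2:      ", round(evaluate(shuffle=True), 3))   # looks impressive
print("chronological (honest) R^2:", round(evaluate(shuffle=False), 3))  # usually far weaker
```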

Broader problems include training models on data sets that are narrower than the population they are ultimately meant to reflect, Malik says. For example, an AI that spots pneumonia in chest X-rays but was trained only on older people might be less accurate on younger people. Another problem is that algorithms often end up relying on shortcuts that don’t always hold, says Jessica Hullman, a computer scientist at Northwestern University in Evanston, Illinois, who will speak at the workshop. For example, a computer-vision algorithm could learn to recognize a cow by the grassy background in most cow images, so it would fail when it encounters an image of the animal on a mountain or a beach.
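One simple way to catch the narrow-training-data problem is to report performance per subgroup rather than only in aggregate. The sketch below is a purely synthetic illustration (it has nothing to do with the pneumonia or cow examples): the label depends on a different feature in each group, the training data are dominated by one group, and the headline accuracy hides how badly the model does on the under-represented group.

```python
# A minimal synthetic sketch: overall accuracy can mask poor performance on a
# group that is rare in the training data, so evaluate per subgroup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_group(n, informative_col):
    # The label is driven by a different feature in each group.
    X = rng.normal(size=(n, 5))
    y = (X[:, informative_col] > 0).astype(int)
    return X, y

# Training set dominated by group A; group B is under-represented.
Xa, ya = make_group(950, informative_col=0)
Xb, yb = make_group(50, informative_col=1)
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Test on both groups equally.
Xa_te, ya_te = make_group(500, informative_col=0)
Xb_te, yb_te = make_group(500, informative_col=1)

overall = accuracy_score(np.concatenate([ya_te, yb_te]),
                         model.predict(np.vstack([Xa_te, Xb_te])))
per_group = {g: round(accuracy_score(y, model.predict(X)), 2)
             for g, (X, y) in {"A": (Xa_te, ya_te), "B": (Xb_te, yb_te)}.items()}

print(f"overall accuracy:   {overall:.2f}")  # can look respectable
print(f"per-group accuracy: {per_group}")    # group B is typically much worse
```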

The high accuracy of predictions in tests often fools people into thinking the models are capturing the “true structure of the problem” in a human-like way, she says. The situation is similar to the replication crisis in psychology, in which people place too much faith in statistical methods, she adds.

The hype around the capabilities of machine learning has played a part in researchers accepting their own findings too readily, Kapoor says. The word “prediction” itself is problematic, says Malik, because most predictions are actually tested in hindsight and have nothing to do with predicting the future.

How to fix data leakage

Kapoor and Narayanan’s solution to combating data leakage is for researchers to include evidence in their manuscripts that their models don’t have each of the eight types of leakage. The authors suggest a template for such documentation, which they call “model info” sheets.

Over the past three years, biomedicine has come a long way with a similar approach, says Xiao Liu, a clinical ophthalmologist at the University of Birmingham, UK, who helped to create reporting guidelines for studies involving AI, for example in screening or diagnosis. In 2019, Liu and her colleagues found that only 5% of more than 20,000 papers using AI for medical imaging were described in enough detail to determine whether they would work in a clinical setting5. The guidelines don’t directly improve anyone’s models, but they “make it really obvious who the people are who did it well, and maybe the people who didn’t do it well”, she says, which is a resource that regulators can tap into.

Collaboration can also help, says Malik. He suggests that studies involve both specialists in the relevant discipline and researchers in machine learning, statistics and survey sampling.

Areas where machine learning finds leads — like drug discovery — stand to benefit hugely from the technology, Kapoor says. But other areas will need more work to show that it will be useful, he adds. Although machine learning is still relatively new in many fields, researchers need to avoid the kind of crisis of confidence that followed the replication crisis in psychology a decade ago, he says. “The longer we delay it, the bigger the problem will be.”

Sherry J. Basler