How well do methods of explaining machine learning models work? | MIT News

Imagine a team of doctors using a neural network to detect cancer in mammography images. Even though this machine learning model seems to work well, it could focus on image features that accidentally correlate with tumors, such as a watermark or timestamp, rather than actual signs of tumors.

To probe these models, researchers use “feature attribution methods,” techniques that are meant to tell them which parts of the image are most important to the neural network’s prediction. But what if the attribution method misses features that are important to the model? Since researchers don’t know which features are important to begin with, they have no way of knowing whether their evaluation method is effective.

To help solve this problem, the MIT researchers devised a process to modify the original data so they know which features are actually important to the model. Then they use this modified dataset to evaluate whether feature attribution methods can correctly identify those important features.

They found that even the most popular methods often miss important features in an image, and some methods barely manage to perform as well as a random baseline. This could have major implications, especially if neural networks are applied in high-stakes situations like medical diagnosis. If the network isn’t working properly, and attempts to detect such anomalies aren’t working properly either, human experts may have no idea they are being misled by the faulty model, says lead author Yilun Zhou, a graduate student in electrical engineering and computer science at the Computer Science and Artificial Intelligence Laboratory (CSAIL).

“All of these methods are very widely used, especially in some very high-stakes scenarios, such as detecting cancer from X-rays or CT scans. But these feature attribution methods could be wrong in the first place. They can highlight something that doesn’t match the true feature the model is using to make a prediction, which we found to often be the case. If you want to use these feature attribution methods to justify that a model is working correctly, you had better make sure the feature attribution method itself works correctly in the first place,” he says.

Zhou wrote the paper with EECS graduate student Serena Booth, Microsoft Research researcher Marco Tulio Ribeiro, and senior author Julie Shah, an MIT professor of aeronautics and astronautics and director of the Interactive Robotics Group at CSAIL.

Focus on features

In image classification, each pixel in an image is a feature the neural network can use to make predictions, so there are literally millions of possible features it could focus on. If researchers want to design an algorithm to help aspiring photographers improve, for example, they could train a model to distinguish photos taken by professional photographers from those taken by casual tourists. This model could be used to assess how closely amateur photos resemble professional ones, and even provide specific feedback for improvement. Researchers would want this model to focus on identifying artistic elements in professional photos during training, such as color space, composition, and post-processing. But it turns out that a photo taken by a professional probably contains a watermark with the photographer’s name, while few tourist photos do, so the model might just take the shortcut of finding the watermark.

“Obviously we don’t want to tell aspiring photographers that a watermark is all you need for a successful career, so we want to make sure our model focuses on the artistic characteristics rather than the presence of the watermark. It’s tempting to use feature attribution methods to analyze our model, but ultimately there’s no guarantee that they’ll work correctly, as the model might use artistic features, the watermark, or any other feature,” Zhou explains.

“We don’t know what these spurious correlations are in the dataset. There could be so many different things that could be completely unnoticeable to a person, like the resolution of an image,” Booth adds. “Even if it’s not noticeable to us, a neural network can probably extract these features and use them to classify. That’s the underlying problem. We don’t understand our datasets very well, but it’s also impossible to fully understand our datasets.”

The researchers modified the dataset to weaken all correlations between the original image features and the data labels, which ensures that none of the original features remains informative to the model.

Then they add a new feature to the images that is so obvious the neural network has to focus on it to make its prediction, such as bright rectangles of different colors for different image classes.
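The sketch below illustrates this two-step modification in Python with placeholder NumPy arrays; the image sizes, class colors, and rectangle position are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the dataset modification described above (not the authors' code):
# labels are shuffled so no original pixel feature predicts the class, then a
# class-specific colored rectangle is painted in as the only usable feature.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 1,000 RGB images, 64x64 pixels, two classes.
images = rng.random((1000, 64, 64, 3)).astype(np.float32)
labels = rng.integers(0, 2, size=1000)

# Step 1: shuffle the labels so original image content no longer correlates with them.
shuffled_labels = rng.permutation(labels)

# Step 2: paint a class-specific colored rectangle the model must rely on.
class_colors = {0: (1.0, 0.0, 0.0), 1: (0.0, 1.0, 0.0)}  # red vs. green (illustrative)
top, left, size = 4, 4, 8                                 # rectangle placement (illustrative)
modified = images.copy()
for i, label in enumerate(shuffled_labels):
    modified[i, top:top + size, left:left + size, :] = class_colors[int(label)]
```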

“We can confidently say that any model that achieves very high confidence has to focus on the colored rectangle that we’ve put in place. Then we can see whether all these feature attribution methods rush to highlight that location rather than anything else,” Zhou says.

“Particularly alarming” results

They applied this technique to a number of different feature attribution methods. For image classification, these methods produce what is called a saliency map, which shows how the important features are distributed across the entire image. For example, if the neural network classifies images of birds, the saliency map might show that 80 percent of the important features are concentrated around the bird’s beak.
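As a concrete, simplified example of how such a map can be produced, the sketch below computes a vanilla gradient saliency map with PyTorch; the model, input, and method chosen here are assumptions for illustration, not the specific attribution methods evaluated in the paper.

```python
# A minimal sketch of one common attribution method (vanilla gradient saliency),
# assuming a recent PyTorch/torchvision installation.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()     # any image classifier would do here
image = torch.rand(1, 3, 224, 224, requires_grad=True)

score = model(image)[0].max()                    # score of the most likely class
score.backward()                                 # gradients of the score w.r.t. input pixels

# Per-pixel importance: magnitude of the input gradient, collapsed over color channels.
saliency = image.grad.abs().max(dim=1).values    # shape (1, 224, 224)
```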

After removing all correlations in the image data, they manipulated the photos in several ways, such as blurring parts of the image, adjusting the brightness, or adding a watermark. If a feature attribution method works well, nearly 100 percent of the important features should be located in the area the researchers manipulated.
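A minimal sketch of that success criterion, assuming a saliency map and a mask over the manipulated region are available as arrays (both placeholders here):

```python
# Sketch of the evaluation idea: what fraction of the attribution mass falls
# inside the region the researchers manipulated?
import numpy as np

def fraction_in_region(saliency_map: np.ndarray, region_mask: np.ndarray) -> float:
    """Share of total attribution that lands inside the manipulated region."""
    weights = np.abs(saliency_map)
    return float(weights[region_mask].sum() / weights.sum())

# Illustrative example: a 64x64 saliency map and a mask marking an 8x8 painted rectangle.
saliency_map = np.random.default_rng(0).random((64, 64))
region_mask = np.zeros((64, 64), dtype=bool)
region_mask[4:12, 4:12] = True

print(fraction_in_region(saliency_map, region_mask))
```

Under this illustrative metric, an attribution method that works as intended would score close to 1.0, since the manipulated region is the only informative part of the image.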

The results were not encouraging. None of the feature attribution methods came close to the 100 percent target, most barely reached the random baseline of 50 percent, and some even performed below that baseline in some cases. So even when the new feature is the only one the model can use to make a prediction, feature attribution methods sometimes fail to detect it.

“None of these methods seem to be very reliable across all the different types of spurious correlations. This is particularly alarming because, in natural datasets, we don’t know which of these spurious correlations might apply,” says Zhou. “It could be all kinds of factors. We thought we could trust these methods to tell us, but in our experience, it seems really hard to trust them.”

All of the feature attribution methods they studied were better at detecting an anomaly than at detecting the absence of one. In other words, these methods could find a watermark more easily than they could identify that an image does not contain one. So in this case, it would be harder for humans to trust a model that gives a negative prediction.

The team’s work shows that it is essential to test feature attribution methods before applying them to a real model, especially in high-stakes situations.

“Researchers and practitioners can use explanation techniques such as feature attribution methods to elicit a person’s confidence in a model, but that confidence is not grounded unless the explanation technique is first rigorously evaluated,” says Shah. “An explanation technique can be used to help calibrate a person’s confidence in a model, but it is equally important to calibrate a person’s confidence in the model’s explanations.”

In the future, the researchers want to use their evaluation procedure to study more subtle or realistic features that could lead to spurious correlations. Another area of work they want to explore is helping humans understand saliency maps so they can make better decisions based on a neural network’s predictions.

This research was supported, in part, by the National Science Foundation.
