Can Machine Learning Models Overcome Biased Datasets?

A model’s ability to generalize is influenced by both the diversity of the data and how the model is trained, the researchers report.

AI systems may be able to complete tasks quickly, but that doesn’t mean they always do so fairly. If the datasets used to train machine learning models contain biased data, the system is likely to exhibit that same bias when it makes decisions in practice.

For example, if a dataset contains mostly images of white males, a facial recognition model trained with that data may be less accurate for females or people with different skin tones.

A group of researchers from MIT, in collaboration with researchers from Harvard University and Fujitsu Ltd., sought to understand when and how a machine learning model is able to overcome this type of dataset bias. They used an approach from neuroscience to study how training data affects the ability of an artificial neural network to learn to recognize objects it has never seen before. A neural network is a machine learning model that mimics the human brain in the way it contains layers of interconnected nodes, or “neurons,” that process data.

If researchers are training a model to classify cars in pictures, they want the model to learn what different cars look like. But if every Ford Thunderbird in the training dataset is displayed head-on, when the trained model receives an image of a Ford Thunderbird taken from the side, it may misclassify it, even though it was trained on millions of car photos. Credit: Image courtesy of the researchers

The new results show that training data diversity has a major influence on a neural network’s ability to overcome bias, but at the same time dataset diversity can degrade the network’s performance. They also show that how a neural network is trained, and the specific types of neurons that emerge during training, can play a major role in its ability to overcome a biased dataset.

“A neural network can overcome dataset bias, which is encouraging. But the main takeaway here is that we need to consider the diversity of the data. We need to stop thinking that if you just collect a ton of raw data, it will get you somewhere. We have to be very careful about how we design datasets in the first place,” says Xavier Boix, a researcher in the Department of Brain and Cognitive Sciences (BCS) and the Center for Brains, Minds, and Machines (CBMM), and senior author of the paper.

Co-authors include former MIT graduate students Timothy Henry, Jamell Dozier, Helen Ho, Nishchal Bhandari, and Spandan Madan, a corresponding author currently pursuing a PhD at Harvard; Tomotake Sasaki, a former visiting researcher who is now a senior researcher at Fujitsu Research; Frédo Durand, professor of electrical engineering and computer science at MIT and a member of the Computer Science and Artificial Intelligence Laboratory; and Hanspeter Pfister, the An Wang Professor of Computer Science at the Harvard School of Engineering and Applied Sciences. The research appears today in Nature Machine Intelligence.

Think like a neuroscientist

Boix and his colleagues approached the problem of dataset bias by thinking like neuroscientists. In neuroscience, Boix explains, it’s common to use controlled datasets in experiments—that is, datasets in which the researchers know as much as possible about the information they contain.

The team constructed datasets containing images of different objects in varying poses, and carefully controlled the combinations so that some datasets had more diversity than others. In this case, a dataset has less diversity if more of its images show objects from only a single viewpoint, and more diversity if its images show objects from multiple viewpoints. Each dataset contained the same number of images.
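To make the setup concrete, here is a minimal sketch of how such controlled combinations could be generated. It is illustrative only, not the researchers’ actual code, and the names and counts (NUM_CATEGORIES, NUM_VIEWPOINTS, IMAGES_PER_DATASET) are assumptions.

```python
# Illustrative sketch (not the researchers' code): build training splits whose
# only difference is viewpoint diversity. All counts below are assumptions.
import itertools
import random

NUM_CATEGORIES = 10        # hypothetical number of object classes
NUM_VIEWPOINTS = 8         # hypothetical number of camera viewpoints
IMAGES_PER_DATASET = 4000  # held constant so only diversity differs

def make_split(views_per_category, seed=0):
    """Choose which (category, viewpoint) pairs appear in training.

    A small `views_per_category` gives a less diverse dataset (each class is
    seen from few viewpoints); a large value gives a more diverse one.
    """
    rng = random.Random(seed)
    train_pairs = set()
    for c in range(NUM_CATEGORIES):
        for v in rng.sample(range(NUM_VIEWPOINTS), views_per_category):
            train_pairs.add((c, v))

    all_pairs = set(itertools.product(range(NUM_CATEGORIES), range(NUM_VIEWPOINTS)))
    unseen_pairs = all_pairs - train_pairs          # combinations held out for testing
    images_per_pair = IMAGES_PER_DATASET // len(train_pairs)
    return train_pairs, unseen_pairs, images_per_pair

low_diversity = make_split(views_per_category=2)    # fewer viewpoints per class
high_diversity = make_split(views_per_category=6)   # more viewpoints per class
```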

The researchers used these carefully constructed datasets to train a neural network for image classification, then investigated how well it was able to identify objects from viewpoints the network did not see during training (known as out-of-distribution combinations).

For example, if researchers are training a model to classify cars in pictures, they want the model to learn what different cars look like. But if every Ford Thunderbird in the training dataset is displayed head-on, when the trained model receives an image of a Ford Thunderbird taken from the side, it may misclassify it, even though it was trained on millions of car photos.
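That evaluation boils down to measuring accuracy only on category-viewpoint combinations that never appeared in training. The snippet below is a minimal sketch of that measurement in PyTorch, not the paper’s code; `model` and `ood_loader` are assumed to come from the reader’s own training setup.

```python
# Illustrative sketch (not the paper's code): accuracy measured only on images
# whose (category, viewpoint) combination was absent from training.
# `model` and `ood_loader` are assumed to come from your own training setup.
import torch

@torch.no_grad()
def ood_accuracy(model, ood_loader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    for images, labels in ood_loader:              # batches of held-out combinations
        images, labels = images.to(device), labels.to(device)
        predictions = model(images).argmax(dim=1)  # predicted object category
        correct += (predictions == labels).sum().item()
        total += labels.numel()
    return correct / total
```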

The researchers found that if the dataset is more diverse — if more images show objects from different viewpoints — the network is better able to generalize to new images or viewpoints. Diversity of data is key to overcoming bias, Boix says.

“But it’s not like more data diversity is always better; there is a tension here. When the neural network becomes better at recognizing new things it hasn’t seen, then it will become harder for it to recognize things it has seen before,” he says.

Testing training methods

The researchers also investigated methods of training the neural network.

In machine learning, it is common to train a network to perform multiple tasks at the same time. The idea is that if there is a relationship between the tasks, the network will learn to perform each one better if it learns them together.

But the researchers found that the opposite was true: a model trained separately for each task was able to overcome bias much better than a model trained for both tasks together.
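As a rough illustration of the two setups being compared, the sketch below defines a shared network with two heads (trained on both tasks jointly) and a single-task network (one copy per task). The architecture is an assumed toy example, not the one used in the paper.

```python
# Illustrative toy architectures (assumptions, not the paper's models) for the
# two training regimes: one shared network with two heads versus one network
# per task.
import torch.nn as nn

def tiny_backbone():
    # minimal convolutional feature extractor, a stand-in for any image backbone
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class MultiTaskNet(nn.Module):
    """One network trained on both tasks at once: category and viewpoint."""
    def __init__(self, num_categories, num_viewpoints):
        super().__init__()
        self.features = tiny_backbone()
        self.category_head = nn.Linear(16, num_categories)
        self.viewpoint_head = nn.Linear(16, num_viewpoints)

    def forward(self, x):
        h = self.features(x)
        return self.category_head(h), self.viewpoint_head(h)

class SingleTaskNet(nn.Module):
    """One network per task: instantiate separately for categories and viewpoints."""
    def __init__(self, num_outputs):
        super().__init__()
        self.features = tiny_backbone()
        self.head = nn.Linear(16, num_outputs)

    def forward(self, x):
        return self.head(self.features(x))

# Joint training would simply add the two losses, e.g.
#   loss = cross_entropy(cat_logits, cat_labels) + cross_entropy(view_logits, view_labels)
```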

“The results were really striking. In fact, the first time we did this experiment, we thought it was a bug. It took us several weeks to realize it was a real result because it was so unexpected,” he says.

They dove deeper into neural networks to understand why this happens.

They found that the specialization of neurons seems to play a major role. When the neural network is trained to recognize objects in images, it appears that two types of neurons emerge – one specialized in object category recognition and the other specialized in viewpoint recognition.

When the network is trained to perform the tasks separately, these specialized neurons are more prominent, Boix says. But if a network is trained to perform both tasks simultaneously, some neurons become diluted and do not specialize for one task. These unspecialized neurons are more likely to get confused, he says.
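One simple way to probe for that kind of specialization, shown below as an illustrative sketch rather than the paper’s analysis, is to score each hidden unit by whether its average activation varies more across object categories or across viewpoints.

```python
# Illustrative sketch (an assumed analysis, not the paper's method): score each
# hidden unit by whether its mean activation varies more across object
# categories or across viewpoints.
import numpy as np

def selectivity(activations, categories, viewpoints):
    """activations: (num_images, num_units); categories, viewpoints: (num_images,)"""
    cat_means = np.stack([activations[categories == c].mean(axis=0)
                          for c in np.unique(categories)])
    view_means = np.stack([activations[viewpoints == v].mean(axis=0)
                           for v in np.unique(viewpoints)])
    cat_var = cat_means.var(axis=0)    # per-unit spread across categories
    view_var = view_means.var(axis=0)  # per-unit spread across viewpoints
    # near +1: category-selective unit; near -1: viewpoint-selective unit;
    # near 0: an unspecialized ("diluted") unit
    return (cat_var - view_var) / (cat_var + view_var + 1e-8)
```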

“But the next question now is, how did these neurons get there? You train the neural network and they emerge from the learning process. Nobody told the network to include these types of neurons in its architecture. That’s what’s fascinating,” he says.

This is an area the researchers hope to explore in future work. They want to see whether they can force a neural network to develop neurons with this specialization. They also want to apply their approach to more complex tasks, such as objects with complicated textures or varying illumination.

Boix is encouraged that a neural network can learn to overcome bias, and he hopes their work can inspire others to think more about the datasets they use in AI applications.

This work was supported, in part, by the National Science Foundation, a Google Faculty Research Award, the Toyota Research Institute, the Center for Brains, Minds, and Machines, Fujitsu Research, and the MIT-SenseTime Alliance on Artificial Intelligence.

Sherry J. Basler