Amazon AI researchers propose a new machine learning framework called “GRAVL-BERT”: BERT-based graphical visual and linguistic representations for multimodal coreference resolution

Source: https://assets.amazon.science/dd/09/ad8ef9424a1aa3d3069bccbdee2a/gravl-bert-graphical-visual-linguistic-representations-for-multimodal-coference-resolution.pdf

The use of multimodal data for AI training has grown in popularity in recent years. Voice-activated screen devices like the Amazon Echo Show are becoming more popular because they enable richer multimodal interactions: customers can refer to products on screen using spoken language, making it easier for them to express their goals. Multimodal coreference resolution (MCR) is the process of identifying the object on the screen that a natural-language reference points to. To build the next generation of conversational bots, references must be resolved across multiple modalities, such as text and visuals.

Multimodal models have previously produced impressive results on visuo-linguistic tasks such as retrieving images that match a verbal description. Coreference resolution, however, remains especially difficult, in part because there are so many different ways to refer to an object on the screen. Most recent work in this area uses single-turn utterances and focuses on simple coreference resolution. By combining graph neural networks with VL-BERT, a research team from Amazon and UCLA developed GRAVL-BERT, a unified MCR framework that integrates visual relationships between objects, backgrounds, dialogs, and metadata. The model uses scene images to resolve coreferences in multi-turn conversations.

This Amazon-UCLA model finished first in the multimodal coreference resolution task of the tenth Dialog State Tracking Challenge (DSTC10), and the team’s paper was presented at the International Conference on Computational Linguistics (COLING). Visual-Linguistic BERT (VL-BERT), a model trained on pairs of text and images, is the foundation of GRAVL-BERT. It extends the standard masked language model training of BERT, in which parts of the input are masked and the model must learn to predict them. As a result, the model learns to predict image regions from text input and vice versa.
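To make the masked-training idea concrete, here is a minimal Python sketch of how a VL-BERT-style joint input might be assembled and masked. The token names, the 15% mask rate, and the use of string placeholders for image-region features are illustrative assumptions, not the paper’s exact setup.

```python
import random

# Illustrative sketch: text tokens and image-region "tokens" share one sequence,
# and a fraction of each is masked so the model learns to predict text from
# vision and vice versa. All names and the mask rate are assumptions.

TEXT_TOKENS = ["[CLS]", "show", "me", "the", "black", "lamp", "[SEP]"]
IMAGE_REGIONS = ["region_0", "region_1", "region_2", "[END]"]  # stand-ins for ROI features

def mask_sequence(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with a mask symbol; return masked tokens and targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in ("[CLS]", "[SEP]", "[END]") and random.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model must recover this token during training
        else:
            masked.append(tok)
    return masked, targets

joint_input = TEXT_TOKENS + IMAGE_REGIONS
masked_input, prediction_targets = mask_sequence(joint_input)
print(masked_input)
print(prediction_targets)
```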

The team focused on three main adjustments to this strategy. First, graph neural networks are used to encode the relationships between elements in an image, with those relationships represented as graphs. Second, object metadata is included as an additional knowledge source, enabling coreference based on non-visual attributes such as brand or price. Third, the model actively samples an object’s vicinity to augment information about its surroundings and to generate captions describing nearby items. For each object in the current scene, the model makes a binary decision about whether it is the object mentioned in the current dialog turn.
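As a rough illustration of that per-object binary framing, the sketch below scores each candidate object against the latest dialog turn and keeps those above a threshold. The scoring function is a toy word-overlap placeholder standing in for the actual multimodal model, and the 0.5 threshold is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    object_id: int
    metadata: dict  # e.g. {"type": "lamp", "color": "black", "brand": "..."}

def score_object(last_turn: str, obj: SceneObject) -> float:
    """Placeholder scorer: in the real system this would be the multimodal
    model's output probability; here it is a toy word-overlap heuristic."""
    turn_words = set(last_turn.lower().split())
    meta_words = {str(v).lower() for v in obj.metadata.values()}
    return len(turn_words & meta_words) / max(len(meta_words), 1)

def resolve_coreferences(last_turn, scene_objects, threshold=0.5):
    """Binary decision per object: keep those the scorer says are referenced."""
    return [o.object_id for o in scene_objects if score_object(last_turn, o) >= threshold]

objects = [
    SceneObject(0, {"type": "lamp", "color": "black"}),
    SceneObject(1, {"type": "sofa", "color": "grey"}),
]
print(resolve_coreferences("show me the black lamp", objects))  # -> [0]
```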

GRAVL-BERT builds a graph from the relative positions of objects in the scene, where nodes are objects and edges are the connections between them. This graph is passed to a graph convolution network, which produces an embedding for each node that encodes information about the node’s immediate neighborhood in the graph. The coreference resolution model then uses these embeddings as inputs.
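The following sketch shows one way such a graph step could look, under simplifying assumptions: objects whose centers lie close together are connected, and a single mean-aggregation graph convolution layer produces a neighborhood-aware embedding per object. The radius criterion, layer count, and random weights are illustrative, not the model’s actual configuration.

```python
import numpy as np

def build_adjacency(centers: np.ndarray, radius: float) -> np.ndarray:
    """centers: (N, 2) object centers -> (N, N) 0/1 adjacency (self-loops included)."""
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return (dists <= radius).astype(float)

def graph_conv(node_feats: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One mean-aggregation graph convolution: average neighbors, project, ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    aggregated = (adj @ node_feats) / deg
    return np.maximum(aggregated @ weight, 0.0)

rng = np.random.default_rng(0)
centers = rng.uniform(0, 10, size=(5, 2))   # 5 detected objects
node_feats = rng.normal(size=(5, 16))       # per-object visual features
weight = rng.normal(size=(16, 8))           # toy projection matrix

adj = build_adjacency(centers, radius=4.0)
embeddings = graph_conv(node_feats, adj, weight)
print(embeddings.shape)  # (5, 8): one neighborhood-aware embedding per object
```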

Additionally, the object recognizer may not always identify every component of a visual scene, yet customers may still refer to those components when describing objects. The researchers addressed this by exploiting knowledge of an object’s immediate environment, which they captured in two ways. The first technique draws eight boxes in eight different directions around the object; the visual features of the image regions inside these boxes are then added to the visual input stream of the coreference resolution model. In the second technique, the team used an image captioning model to describe other items close to the object of interest, such as nearby bookshelves, which helped the model identify objects based on descriptions of their surroundings.
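Here is a small sketch of the first technique, the eight surrounding boxes. The choice to give each neighbor box the same width and height as the object’s own bounding box is an assumption for illustration rather than the paper’s exact recipe.

```python
def neighbor_boxes(x1, y1, x2, y2):
    """Return the eight boxes surrounding (x1, y1, x2, y2), keyed by direction."""
    w, h = x2 - x1, y2 - y1
    offsets = {
        "N":  (0, -h), "S":  (0, h), "E":  (w, 0), "W":  (-w, 0),
        "NE": (w, -h), "NW": (-w, -h), "SE": (w, h), "SW": (-w, h),
    }
    return {
        name: (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        for name, (dx, dy) in offsets.items()
    }

# The image crops inside these boxes would then feed the visual input stream.
print(neighbor_boxes(100, 100, 160, 180)["NE"])  # box above and to the right
```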

According to the team, combining these changes with a measure of dialogue-turn distance enabled the model to take first place in the DSTC10 challenge. Entries were evaluated with the F1 score, which accounts for both false positives and false negatives. Amazon expects the work to benefit Alexa users by making it easier for people using Alexa-enabled devices with screens to express their intentions.
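For reference, the F1 score is the harmonic mean of precision and recall; a minimal helper that computes it from raw counts might look like this (the example counts are made up for illustration).

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=10, fn=20))  # ~0.842
```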


Check out the paper, code, and reference article. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.


Khushboo Gupta is an intern consultant at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing and web development. She likes to learn more about the technical field by participating in several challenges.

