Google Docs now automatically generates short summaries using machine learning

Many of us struggle to keep up with the daily flood of documents in our inboxes: reports, reviews, policies, and more. Readers want a concise summary of a document's main points to help them prioritize their work effectively. However, writing a summary manually from scratch is a time-consuming task.

To help document writers, Google announced a new feature that lets Google Docs automatically generate suggested summaries when they are available. The team uses a machine learning (ML) model to understand the document's text and produce a one- to two-sentence natural-language description of its content. The document writer retains full control: they can accept the suggestion as is, edit it to better capture the document, or ignore it altogether. This summary, combined with the outline, can help readers understand and navigate a document at a high level. While anyone can add summaries, auto-generated suggestions are currently available only to Google Workspace business customers.

The automatic generation of summaries has become possible thanks to the promising results that machine learning models have achieved on natural language understanding (NLU) and natural language generation (NLG) tasks.

Abstractive text summarization has long been a challenging problem in NLU and NLG research, because it combines two independently difficult tasks: understanding a long document and generating fluent language. A popular way to combine NLU and NLG is to train an ML model with sequence-to-sequence learning, where the inputs are the document's words and the outputs are the tokens of the summary.
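The sequence-to-sequence interface can be sketched with toy stand-ins: an "encoder" consumes the document's tokens and a "decoder" emits summary tokens one at a time. The components below are purely illustrative assumptions (a frequency-based bag of words in place of a learned model); a real system learns both ends jointly from (document, summary) pairs.

```python
from collections import Counter

def encode(tokens):
    # Stand-in encoder state: a bag of words. A trained encoder would
    # instead produce contextual vector representations of the input.
    return Counter(tokens)

def decode(state, max_len=10):
    # Stand-in decoder: emits the most frequent content words up to
    # max_len. A trained decoder would generate fluent sentences.
    stopwords = {"the", "a", "of", "and", "to", "is"}
    ranked = [w for w, _ in state.most_common() if w not in stopwords]
    return ranked[:max_len]

doc = "the model reads the document and the model writes a short summary".split()
print(decode(encode(doc), max_len=3))  # most salient tokens first
```

The point of the sketch is only the shape of the mapping: a variable-length token sequence in, a much shorter token sequence out.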

Earlier work applied recurrent neural networks (RNNs) to sequence-to-sequence tasks. Transformers have since become a promising alternative to RNNs because they use self-attention to better model long-range dependencies in the input and output, which is crucial for document summarization. However, these models require large amounts of manually labeled data to train.
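The self-attention mechanism mentioned above can be shown in a minimal sketch: single-head scaled dot-product attention with no learned projections, which is an assumption for clarity rather than a full Transformer layer. Each output position is a weighted mixture of every input position, regardless of distance, which is what lets the model relate far-apart parts of a long document.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def self_attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # one distribution over all positions
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three 2-d token vectors attending to themselves.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x, x, x)
```

Unlike an RNN, nothing here depends on sequential distance: position 0 attends to position 2 as directly as to position 1.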

Combining Transformers with self-supervised pre-training produced a major breakthrough on several NLU tasks with limited labeled data. In self-supervised pre-training, a model learns generic language understanding and generation skills by consuming large amounts of unlabeled text. It then learns to apply those skills to a specific task in a subsequent fine-tuning stage.

In the Pegasus work, the researchers extended this approach with a pre-training objective tailored to abstractive summarization. In Pegasus pre-training, also called Gap Sentence Prediction (GSP), whole sentences from unlabeled articles and web documents are masked from the input, and the model is required to reconstruct them from the remaining unmasked sentences. GSP uses a variety of heuristics to mask the sentences considered most content-critical, the idea being to make pre-training resemble the summarization task as closely as possible. Pegasus achieved state-of-the-art results on a variety of summarization datasets. However, several hurdles remained before this research breakthrough could become a product.
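The construction of a GSP training example can be sketched as follows. The importance heuristic here (sentence length) and the mask ratio are illustrative assumptions; Pegasus uses several more sophisticated heuristics to decide which sentences are content-critical.

```python
def make_gsp_example(sentences, score, mask_ratio=0.3, mask_token="<mask>"):
    """Hide the highest-scoring sentences; the model must reconstruct them.

    `score` is a stand-in importance heuristic supplied by the caller.
    """
    n_mask = max(1, int(len(sentences) * mask_ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    masked = set(ranked[:n_mask])
    inputs = [mask_token if i in masked else s for i, s in enumerate(sentences)]
    targets = [s for i, s in enumerate(sentences) if i in masked]
    return " ".join(inputs), " ".join(targets)

doc = ["Pegasus pre-trains on web text.",
       "Whole sentences are masked from the input.",
       "The model must reconstruct them from the rest.",
       "This mirrors the downstream summarization task."]
# Toy heuristic: treat longer sentences as more important.
inp, tgt = make_gsp_example(doc, score=len)
```

The (input, target) pair has exactly the shape of a summarization example, which is why this objective transfers so well to the downstream task.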

Self-supervised pre-training produces an ML model with generic language understanding and generation abilities. However, a fine-tuning stage is needed to adapt the model to the application domain.

The team fine-tuned early versions of the model on a corpus of documents with manually created summaries that matched typical use cases. However, this corpus was inconsistent and highly varied, because it contained many types of documents and many ways of writing a summary. For example, academic abstracts are usually long and detailed, while executive summaries are short and to the point. Trained on this mix of documents and summaries, the model struggled to learn the differences between them.

The results suggest that an effective pre-training phase reduces the amount of supervised data needed in the fine-tuning step. In multiple summarization benchmarks, Pegasus matched the performance of Transformer baselines trained on more than 10,000 supervised examples while using as few as 1,000 fine-tuning examples. This implied that quality could be prioritized over quantity.

The fine-tuning data was therefore rigorously cleaned and filtered to keep only training examples that were consistent with one another and represented a coherent definition of a summary. Despite being trained on less data, the resulting model was of higher quality, suggesting that a smaller, high-quality dataset is preferable to a larger, high-variance one.
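A cleaning step of this kind can be sketched as a filter over (document, summary) pairs. The criteria and thresholds below (a word cap on the summary and a minimum compression ratio) are illustrative assumptions, not the rules the team actually used.

```python
def is_consistent(doc, summary, max_summary_words=40, min_ratio=3.0):
    """Keep only pairs matching one consistent definition of a summary:
    short, and genuinely compressing the source document."""
    d, s = len(doc.split()), len(summary.split())
    if s == 0 or s > max_summary_words:
        return False
    return d / s >= min_ratio  # summary must be much shorter than the doc

pairs = [
    ("a long report " * 30, "a two sentence digest of the report"),
    ("short note", "a summary longer than its own source text entirely"),
]
clean = [(d, s) for d, s in pairs if is_consistent(d, s)]
```

Filtering this way trades dataset size for homogeneity, which is exactly the trade the passage above describes.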

The most popular way to train models for sequence-to-sequence tasks such as abstractive summarization is the Transformer version of the encoder-decoder architecture. However, the full Transformer is observed to be inefficient and impractical to serve in real-world applications, because self-attention over all previously generated tokens makes decoding expensive. An RNN is a more efficient decoding architecture precisely because it has no self-attention over previous tokens.

To obtain a hybrid architecture of a Transformer encoder and an RNN decoder, the team applied knowledge distillation to the Pegasus model. Knowledge distillation transfers knowledge from a large model to a smaller, more efficient one. They also reduced the number of RNN decoder layers to further increase efficiency. The resulting model had significantly lower latency and memory footprint while maintaining the quality of the original. To further improve latency and user experience, they serve the summarization model on TPUs, which provide significant speedups and allow a single machine to handle more requests.
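The core of knowledge distillation is training the student to match the teacher's softened output distribution rather than only the hard labels. A minimal sketch of that objective, with an assumed temperature value and toy logits:

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    m = max(logits)
    e = [math.exp((x - m) / T) for x in logits]
    s = sum(e)
    return [v / s for v in e]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's: the quantity minimized during distillation training."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, 0.1]  # toy per-token logits from the large model
```

A student that matches the teacher's logits incurs the minimum possible loss; any mismatch increases it, which pushes the small decoder toward the large model's behavior.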

Due to the enormous variety of documents, it is difficult to build a training corpus that covers them all. The current model therefore only offers a summary for the documents in which it is most confident, and the researchers plan to extend this coverage to more types of documents and summaries. Many distinct summaries can be considered correct for a given document, and different readers may prefer different ones. This makes it difficult to evaluate summaries with automatic metrics alone; user feedback and usage statistics will be crucial for understanding and improving quality.
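A serving-time confidence gate of the kind described above could be sketched as follows. The scoring rule (geometric-mean token probability) and the threshold are illustrative assumptions, not Google's actual criteria.

```python
import math

def confidence(token_logprobs):
    """Geometric-mean probability of the generated tokens: one simple
    proxy for how confident the model is in its own output."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def maybe_suggest(summary, token_logprobs, threshold=0.6):
    """Surface the suggestion only when confidence clears the threshold."""
    return summary if confidence(token_logprobs) >= threshold else None

# High per-token probabilities -> the suggestion is shown.
print(maybe_suggest("A concise summary.", [-0.1, -0.2, -0.1]))
# Low per-token probabilities -> no suggestion is offered.
print(maybe_suggest("An unsure summary.", [-1.5, -2.0]))
```

Gating on the model's own confidence lets the feature stay silent on document types it handles poorly instead of offering bad summaries.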

Long documents are the hardest for the model to summarize, because it is more difficult to capture every key point in a single short summary, and long inputs significantly increase memory usage during training and serving. Yet these are also the documents where auto-summarization helps most, giving writers a head start on a time-consuming task. The team hopes further research will address this challenge.

Sherry J. Basler