Researchers at the University of Hamburg propose a machine learning model, called “LipSound2”, which directly predicts speech representations from raw pixels


The aim of the work presented in this article is to reconstruct speech solely from sequences of images of people speaking. Speech generation from silent video has many applications: for example, silent visual input methods for privacy protection in public environments, or speech understanding in surveillance videos.

The main challenge in reconstructing speech from visual information is that human speech is produced not only by observable movements of the lips, mouth, and face, but also by articulators that are not visible, such as the tongue and the vocal cords. Additionally, it is difficult to distinguish phonemes like “v” and “f” from mouth and facial movements alone.

This article exploits the natural co-occurrence of audio and video streams to pre-train a video-to-audio speech reconstruction model through self-supervision.

As shown in Figure 1a, the LipSound2 model is first pre-trained with self-supervised learning on an audio-visual dataset to map silent videos (i.e. face image sequences) to mel spectrograms, without any human annotation. This pre-trained model is then fine-tuned on other, task-specific datasets. Finally, the existing WaveGlow network is used to generate the waveforms corresponding to the mel spectrograms. In the second part of the framework (Figure 1b), video-to-text conversion is performed by fine-tuning Jasper, a pre-trained acoustic model, on the generated audio (i.e. the waveforms).
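The data flow of the two-stage framework can be sketched as follows. All component and function names here are hypothetical placeholders standing in for LipSound2, WaveGlow, and Jasper, not the paper's actual API; the dummy lambdas only illustrate how data moves through the pipeline.

```python
# High-level sketch of the two-stage framework (all components are
# stand-in placeholders; names are hypothetical, not the paper's API).

def lipsound2_pipeline(silent_video, crop_faces, lipsound2, waveglow, jasper):
    """Silent video -> faces -> mel spectrogram -> waveform -> text."""
    faces = crop_faces(silent_video)   # pre-processing: cropped face sequences
    mel = lipsound2(faces)             # stage 1: video-to-spectrogram model
    waveform = waveglow(mel)           # vocoder: mel spectrogram -> waveform
    return jasper(waveform)            # stage 2: acoustic model -> text

# Wiring with dummy components just to show the data flow:
text = lipsound2_pipeline("video.mp4",
                          crop_faces=lambda v: f"faces({v})",
                          lipsound2=lambda f: f"mel({f})",
                          waveglow=lambda m: f"wav({m})",
                          jasper=lambda w: f"text({w})")
print(text)   # text(wav(mel(faces(video.mp4))))
```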

Before looking at the architecture of LipSound2 in detail, let's talk about self-supervised learning. This approach relies on unlabeled data to learn meaningful features, using labels generated directly from the data itself. A pretext task must be chosen to determine what these self-generated labels are. For example, in computer vision, the pretext task might be to predict the rotation angle of rotated images or to colorize grayscale images. In particular, this article relies on cross-modal self-supervised learning since, as we will see later, audio signals are used as the supervision signal for the video inputs.
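To make the idea of self-generated labels concrete, here is a minimal sketch of the rotation-prediction pretext task mentioned above. This is an illustrative example of self-supervision in general, not the paper's method (LipSound2 instead uses the audio track as the supervision signal).

```python
import numpy as np

# Minimal sketch of a self-supervised pretext task: predict the rotation
# applied to an image. The labels come from the data itself -- no human
# annotation is needed. (Illustrative example, not the paper's method.)

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))        # a batch of unlabeled images

angles = rng.integers(0, 4, size=8)     # self-generated labels: k * 90 degrees
rotated = np.stack([np.rot90(img, k) for img, k in zip(images, angles)])

# A model would now be trained to predict `angles` from `rotated`;
# the features it learns transfer to downstream tasks.
print(rotated.shape, angles.shape)      # (8, 32, 32) (8,)
```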

Figure 2 shows a detailed version of the architecture of the LipSound2 model. Video clips in the dataset are split into an audio stream and a visual stream. Images from the visual stream are pre-processed to provide cropped face sequences as input to the model. These sequences pass through the encoder, which is composed of 3D CNN blocks, i.e. blocks comprising a 3D convolutional layer, a batch normalization layer, a ReLU activation function, a max pooling layer and a final dropout layer. The encoder output is then produced by two bidirectional LSTM layers, which capture long-distance dependencies.
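A hedged PyTorch sketch of such an encoder is shown below. The layer counts and sizes (channels, hidden units, pooling strides) are illustrative assumptions, not the paper's exact hyperparameters; the point is the block structure (Conv3d → BatchNorm → ReLU → MaxPool → Dropout) followed by bidirectional LSTMs.

```python
import torch
import torch.nn as nn

# Sketch of the encoder described above. Sizes are illustrative
# assumptions, not the paper's exact hyperparameters.
class Conv3DBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool spatially, keep time
            nn.Dropout(0.1),
        )

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(Conv3DBlock(3, 32), Conv3DBlock(32, 64))
        self.lstm = nn.LSTM(64 * 8 * 8, 128, num_layers=2,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                 # x: (batch, channels, time, H, W)
        h = self.convs(x)                 # (batch, 64, time, 8, 8) for 32x32 input
        h = h.permute(0, 2, 1, 3, 4).flatten(2)   # (batch, time, features)
        out, _ = self.lstm(h)
        return out                        # (batch, time, 256): 2 directions x 128

enc = Encoder()
out = enc(torch.randn(2, 3, 10, 32, 32))  # 2 clips of 10 RGB frames, 32x32
print(out.shape)                          # torch.Size([2, 10, 256])
```

Note that the max pooling stride is 1 along the time axis, so the encoder preserves the frame rate while downsampling each frame spatially.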

The decoder involves a location-aware attention mechanism that uses the attention weights of previous decoder time steps as additional features. A single LSTM layer receives as input the attention context vector, which is generated by weighting the encoder outputs with the location attention weights. The output of the LSTM is passed to a linear projection layer to produce the target spectrogram frame by frame. To guide the learning process in a self-supervised way, the original audio corresponding to the input face sequence is used as the training target. Additionally, during training, the ground-truth spectrogram frames are fed back into the decoder to speed up training (teacher forcing), while during inference the model uses its own outputs from the previous steps.
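The teacher-forcing idea can be sketched with a minimal autoregressive decoder loop. This sketch omits the attention mechanism for brevity, and the dimensions (80 mel bins, 128 hidden units) are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of teacher forcing in an autoregressive decoder
# (attention omitted; sizes are illustrative, not the paper's).
decoder_cell = nn.LSTMCell(input_size=80, hidden_size=128)
proj = nn.Linear(128, 80)                 # hidden state -> one mel frame

target = torch.randn(4, 20, 80)           # ground-truth mel frames (batch, T, mels)
h = torch.zeros(4, 128)
c = torch.zeros(4, 128)
prev = torch.zeros(4, 80)                 # initial "go" frame

outputs = []
for t in range(20):
    h, c = decoder_cell(prev, (h, c))
    frame = proj(h)
    outputs.append(frame)
    prev = target[:, t]                   # training: feed the ground-truth frame
    # at inference: prev = frame          # feed the model's own prediction

mel = torch.stack(outputs, dim=1)
print(mel.shape)                          # torch.Size([4, 20, 80])
```

Feeding ground-truth frames keeps early training stable, at the cost of a train/inference mismatch: at inference time the model must consume its own, possibly imperfect, predictions.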

Since at each time step the decoder only receives past information, five convolutional layers (the Postnet) are added after the decoder to further improve the model. The idea of the Postnet is to smooth the transitions between adjacent frames and to exploit future information that is not available during decoding. The loss function used to train LipSound2 is the sum of two mean squared errors (MSE): the MSE between the decoder output and the target mel spectrogram, and the MSE between the Postnet output and the target mel spectrogram.
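The two-term objective described above amounts to the following (the tensors are random stand-ins; in practice they would come from the decoder, the Postnet, and the ground-truth audio):

```python
import torch
import torch.nn.functional as F

# The training objective described above: a sum of two MSE terms, one
# before and one after the Postnet. Shapes are illustrative stand-ins.
target = torch.randn(4, 20, 80)           # target mel spectrogram
decoder_out = torch.randn(4, 20, 80)      # prediction before the Postnet
postnet_out = decoder_out + 0.1 * torch.randn(4, 20, 80)  # residual refinement

loss = F.mse_loss(decoder_out, target) + F.mse_loss(postnet_out, target)
print(loss.item() >= 0)                   # True: a sum of MSEs is nonnegative
```

Supervising the pre-Postnet output as well keeps the decoder itself accurate, so the Postnet only has to learn a small residual correction.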

Finally, each mel spectrogram is fed to WaveGlow to generate the corresponding waveform, which is then used by the Jasper speech recognition system to produce text from the speech signal. The advantage of using mel spectrograms during training is that they reduce computational complexity and make long-range dependencies easier to learn.

Paper: LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading


Sherry J. Basler