Recent Progress
Due to the effective use of pre-trained models in NLP and CV tasks, researchers have found that the scale of pre-training datasets is essential for representation learning. Therefore, some works attempt to utilize the large-scale, weakly labeled cross-modal data available on the internet, such as image-caption pairs from forums or video-caption data from video platforms. Hence, more researchers are investigating cross-modal tasks such as Vision-Language and Video-Language. Vision-Language tasks focus on image and text modalities, such as language-based image retrieval and image captioning, whereas Video-Language tasks emphasize video and text modalities, adding a temporal dimension to the visual modality.
Contrastive Language-Image Pre-training (CLIP) is one of the most important works in vision-language pre-training. Benefiting from the large-scale image-text pairs collected from the internet, CLIP has shown a notable ability to align the two modalities in a shared embedding space, and it achieves strong performance on downstream zero-shot visual recognition tasks.
Fig. 2 CLIP
As shown in Fig. 2, CLIP utilizes a contrastive loss to learn multi-modal representations from weakly supervised data crawled from the internet. It constructs a very large-scale dataset containing 400 million image-text pairs. The model learned from this dataset obtains superior performance on zero-shot visual recognition tasks, such as image classification. Specifically, as shown in Fig. 3, CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset. Researchers then use this behavior to turn CLIP into a zero-shot classifier: they convert all of a dataset's classes into captions such as "a photo of a dog" and predict the class whose caption CLIP estimates best pairs with a given image. Without any fine-tuning on ImageNet, CLIP achieves strong performance on ImageNet, competitive with ResNet-101.
Fig. 3 Zero-shot Test of CLIP
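To make the zero-shot recipe above concrete, the following is a minimal sketch of prompt-based zero-shot classification in the CLIP style. The `image_encoder`, `text_encoder`, and `tokenizer` are hypothetical placeholders standing in for the pre-trained CLIP components, not a specific library API; only the prompt template follows the paper's example.

```python
# Minimal sketch of CLIP-style zero-shot classification (PyTorch).
# `image_encoder`, `text_encoder`, and `tokenizer` are assumed placeholders.
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenizer):
    # Build one caption per class, e.g. "a photo of a dog".
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        # Encode and L2-normalize both modalities into the shared embedding space.
        text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)  # (C, D)
        img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, D)
    # Cosine similarity between the image and every class caption.
    logits = img_emb @ text_emb.t()                                       # (1, C)
    # Predict the class whose caption best pairs with the image.
    return class_names[logits.argmax(dim=-1).item()]
```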
Moreover, video naturally contains multi-modal data, e.g., titles, audio, and narrations. Some works apply self-supervised learning methods to obtain pre-trained backbone models from raw video data, and some large-scale video datasets have also been proposed, such as HowTo100M, which contains 136 million video clips paired with narration text. As shown in Fig. 4, these datasets boost the development of video-language pre-training and open a new area for video understanding tasks.
In addition to the increasing availability of datasets, the Transformer [3] has recently come to dominate CV and NLP due to its strong performance. The Transformer was first proposed in the field of Natural Language Processing (NLP) for machine translation and later showed great performance in computer vision as well. As shown in Fig. 5, the Transformer consists of several encoder blocks and decoder blocks that process the input data. Each encoder block contains a self-attention layer and a feed-forward layer, while each decoder block contains an encoder-decoder attention layer in addition to the self-attention and feed-forward layers. The self-attention layer dynamically calculates the similarity between elements and aggregates long-range dependencies across them.
Fig. 5 Transformer
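The following is a minimal sketch of the scaled dot-product self-attention described above, for a single head without masking. The projection matrices `w_q`, `w_k`, and `w_v` are assumed inputs here for brevity; in a real Transformer block they are learned linear layers.

```python
# Minimal single-head self-attention sketch (PyTorch), following the
# scaled dot-product formulation of the Transformer.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise similarity between all elements, scaled by sqrt(d_k).
    scores = q @ k.t() / (k.shape[-1] ** 0.5)   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    # Each output position aggregates information from every other position,
    # which is how long-range dependencies are captured.
    return weights @ v                          # (seq_len, d_k)
```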
Compared with traditional convolutional networks, the Transformer carries fewer inductive biases, and its concise, stackable architecture enables training on larger datasets, which promotes the pre-training and fine-tuning paradigm of self-supervised learning.
Application
Generally, video-language pre-training aims to transfer knowledge learned from large-scale data to downstream tasks. The downstream tasks should take both text and video as input. For better transferability, the model structure also needs to be considered, and one of the most important factors is the compatibility between the pre-training and downstream tasks. Common downstream tasks in video-language pre-training include generative and classification tasks. The following subsections describe the requirements of each task and how knowledge is transferred from pre-training to it.
Video-Text Retrieval
Video-text retrieval aims to search for a video of interest according to a natural language sentence, or to find the matching text according to a given video, as shown in Fig. 6. This task requires fine-grained multi-modal understanding of both videos and texts. Text data and video data are projected into a common embedding space for distance calculation. Pre-training can provide a robust common embedding space learned from large-scale data with the Video-Language Matching (VLM) proxy task, so it is natural to adapt the VLM proxy task to the downstream video-text retrieval task.
Moreover, some pre-training methods even achieve competitive results under zero-shot evaluation on video-text retrieval test sets, which validates the effectiveness of video-language pre-training.
Fig. 6 Video-Text Retrieval
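As a concrete illustration of retrieval in the common embedding space, the following is a minimal sketch of text-to-video ranking by cosine similarity. The `text_emb` and `video_embs` tensors are assumed to come from pre-trained text and video encoders (hypothetical here); the ranking criterion mirrors the matching objective of the VLM proxy task.

```python
# Minimal sketch of text-to-video retrieval in a shared embedding space (PyTorch).
# Embeddings are assumed to be produced by pre-trained encoders (not shown).
import torch
import torch.nn.functional as F

def rank_videos(text_emb, video_embs, top_k=5):
    # text_emb: (D,) query embedding; video_embs: (N, D) gallery of N candidate videos.
    text_emb = F.normalize(text_emb, dim=-1)
    video_embs = F.normalize(video_embs, dim=-1)
    sims = video_embs @ text_emb                 # cosine similarities, shape (N,)
    return torch.topk(sims, k=top_k).indices     # indices of the best-matching videos
```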
Action Recognition
Action recognition aims to classify a video segment into semantic action categories, as shown in Fig. 7. As a representative video understanding task, action recognition has been viewed as the basic benchmark for testing video feature learning. To this end, some works attempt to transfer pre-trained video-language knowledge to action recognition, either by fine-tuning the backbone network or by performing linear probing on top of the backbone.
Fig. 7 Action Recognition
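The linear-probing option mentioned above can be sketched as follows: the pre-trained backbone is frozen and only a linear classifier is trained on its features. The `backbone` is a hypothetical pre-trained video encoder returning (B, D) features; the optimizer and learning rate are illustrative choices, not prescribed by any particular method.

```python
# Minimal sketch of linear probing for action recognition (PyTorch).
# `backbone` is an assumed pre-trained video encoder producing (B, D) features.
import torch
import torch.nn as nn

def build_linear_probe(backbone, feat_dim, num_classes):
    for p in backbone.parameters():
        p.requires_grad = False                  # freeze the pre-trained weights
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    def train_step(videos, labels):
        with torch.no_grad():
            feats = backbone(videos)             # frozen features
        loss = criterion(classifier(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    return classifier, train_step
```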
Video Question Answering
Video Question Answering (VideoQA) aims to answer natural language questions according to given videos, as shown in Fig. 8: given a video and a question in natural language, the model produces an accurate answer based on the content of the video. VideoQA can be viewed as a classification task and is divided into multiple-choice and fill-in-the-blank settings. Some works on multiple-choice VideoQA append candidate answers to the question sentence to build QA-aware global representations, then feed these global representations into an MLP-based classifier to obtain a matching score for each candidate. The final decision is made by selecting the candidate with the highest score. ActBERT presents a similar method for fill-in-the-blank VideoQA, which adds a linear classifier on top of the cross-modal feature without feeding candidate text.
Fig. 8 VideoQA
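The multiple-choice scoring scheme described above can be sketched as follows: each candidate answer is appended to the question, a cross-modal model produces a QA-aware global representation, and an MLP head scores every candidate. The `cross_modal_model` is a hypothetical pre-trained video-language model returning a (D,) global feature; it is not tied to any specific architecture.

```python
# Minimal sketch of multiple-choice VideoQA scoring (PyTorch).
# `cross_modal_model` is an assumed pre-trained video-language model.
import torch
import torch.nn as nn

class AnswerScorer(nn.Module):
    def __init__(self, dim, hidden=512):
        super().__init__()
        # MLP head that maps a QA-aware global feature to a scalar score.
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, video, question, candidates, cross_modal_model):
        scores = []
        for answer in candidates:
            # QA-aware global representation for "question + candidate answer".
            feat = cross_modal_model(video, question + " " + answer)  # (D,)
            scores.append(self.mlp(feat))
        scores = torch.cat(scores)               # (num_candidates,)
        return scores.argmax().item()            # pick the highest-scoring candidate
```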
Video Captioning
Video captioning is the task of automatically describing a video by understanding the actions and events in it, which can help retrieve videos efficiently through text, as shown in Fig. 9. It is one of the most prevalent tasks in multi-modal analysis, and almost all research on video-language pre-training tests pre-trained models on this task. Because captioning aims to produce a complete natural language sentence, it is a generative task, unlike classification tasks such as VideoQA. The simplest way to apply a video-language pre-training model to video captioning is to fine-tune the video and text feature extractors. Moreover, the Language Reconstruction (LR) proxy task can teach the model to generate text from visual input, which is helpful for video captioning, as sketched below.
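The following is a minimal sketch of caption generation with a pre-trained video encoder and a text decoder, using simple greedy decoding. The `video_encoder`, `text_decoder`, and the special token ids are hypothetical placeholders; real systems typically use beam search and a trained tokenizer.

```python
# Minimal sketch of greedy caption decoding (PyTorch).
# `video_encoder` and `text_decoder` are assumed pre-trained components.
import torch

def greedy_caption(video, video_encoder, text_decoder, bos_id, eos_id, max_len=30):
    video_feats = video_encoder(video)           # (T, D) frame/clip features
    tokens = [bos_id]
    for _ in range(max_len):
        # The decoder predicts the next word conditioned on the video features
        # and the caption prefix generated so far.
        logits = text_decoder(video_feats, torch.tensor(tokens))  # (vocab_size,)
        next_id = int(logits.argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                            # generated caption token ids
```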

We have presented the downstream tasks of video-language pre-training and their relation to the proxy tasks. Next, we will introduce the datasets related to video-language pre-training, which play a vital role in this area.
(To Be Continued)