Better foundation models for video representation

Recent foundation models, such as large language models, have achieved state-of-the-art performance by learning to reconstruct randomly masked text or images. Without any human supervision, these models can learn powerful representations from large corpora of unlabeled data simply by “filling in the gaps”.
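To give a rough sense of this “fill in the gaps” objective, here is a minimal NumPy sketch of random patch masking; the function name and parameters are our own illustration, not code from any particular model:

```python
import numpy as np

def random_patch_mask(num_patches: int, mask_ratio: float, rng=None):
    """Return a boolean mask over patches; True means the patch is hidden
    and the model must reconstruct it from the visible patches."""
    rng = rng or np.random.default_rng()
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

# An image split into a 14 x 14 grid of patches, 75% of which are hidden.
mask = random_patch_mask(num_patches=14 * 14, mask_ratio=0.75)
print(mask.sum(), "of", mask.size, "patches masked")
```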


However, generalizing this approach to video data is not straightforward. If the masking is random, the model can look at frames adjacent to the current one to fill in the gaps. If, on the other hand, a fixed region is masked across successive frames, the model may learn background rather than people and objects, because of camera movement. These shortcuts can reduce the quality of the learned representations and thus the performance on downstream tasks such as video action recognition.

At this year’s International Conference on Computer Vision (ICCV), Prime Video presented a new masking algorithm for masked video modeling called motion-guided masking (MGM). MGM produces masks that track motion across successive frames of video, preserving the semantic coherence of the masked regions and increasing the difficulty of the reconstruction task.

Crucially, our approach uses motion vectors that are already part of modern video compression algorithms, rather than optical flow, which is expensive to compute on the fly. This enables highly scalable self-supervised training of large video models.

This figure compares prior random masking algorithms with our proof-of-concept algorithm and our final masked-video-modeling algorithm. Random masking applies independently sampled random masks to each frame; simulated motion masking (SMM, our proof-of-concept algorithm) initializes a random mask that is propagated in a random but spatially continuous way from frame to frame; and the final motion-guided masking (MGM) algorithm uses motion vectors from the video codec to precisely guide the position of the mask over time, tracking the highest-motion regions from frame to frame.
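To make the distinction concrete, the following NumPy sketch contrasts per-frame random masking with an SMM-style mask that is initialized randomly and then shifted by a small random offset each frame. This is a simplification of our own for illustration, not the paper’s implementation:

```python
import numpy as np

def random_masks(T, H, W, mask_h, mask_w, rng):
    """Independent random rectangular mask per frame (no temporal continuity)."""
    masks = np.zeros((T, H, W), dtype=bool)
    for t in range(T):
        y = rng.integers(0, H - mask_h + 1)
        x = rng.integers(0, W - mask_w + 1)
        masks[t, y:y + mask_h, x:x + mask_w] = True
    return masks

def smm_masks(T, H, W, mask_h, mask_w, max_step, rng):
    """SMM-style: one random initial mask, propagated by small random shifts."""
    masks = np.zeros((T, H, W), dtype=bool)
    y = rng.integers(0, H - mask_h + 1)
    x = rng.integers(0, W - mask_w + 1)
    for t in range(T):
        masks[t, y:y + mask_h, x:x + mask_w] = True
        # A random walk keeps the mask spatially continuous from frame to frame.
        y = np.clip(y + rng.integers(-max_step, max_step + 1), 0, H - mask_h)
        x = np.clip(x + rng.integers(-max_step, max_step + 1), 0, W - mask_w)
    return masks

rng = np.random.default_rng(0)
rand = random_masks(T=16, H=14, W=14, mask_h=7, mask_w=7, rng=rng)
smm = smm_masks(T=16, H=14, W=14, mask_h=7, mask_w=7, max_step=1, rng=rng)
```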

In experiments, we found that MGM could achieve state-of-the-art video representations using only a third as much training data as the best prior model. We also tested the representations produced by our model on several downstream tasks and found that they improved performance by as much as 5% compared to previous methods.

Semantic representations

Foundation models learn to map input data to vectors in a representational space where geometric relationships between vectors correspond to semantic relationships between data items. Masked training allows the models to learn those semantic relationships directly from the data, without the need for human annotation. The goal is to produce general-purpose representations that are useful for a wide range of downstream tasks.



The most meaningful elements of a video sequence are usually people and objects. A mask that does not track these semantic entities over time can ignore useful information and lead to noisy learned representations. The goal of our work is thus to produce a “motion-guided” mask that tracks these semantic entities over time.

A naive way to achieve this would be to run an object detector on every frame, select an object at random, and mask out the bounding box that surrounds that object in each frame. Computationally, however, this would be extremely expensive.

Fortunately, modern video compression schemes already contain information that can be used to estimate motion from frame to frame, and our method uses this information directly, which dramatically reduces the computational burden.

Motion vectors

Digital video generally plays back at rates between 24 and 30 frames per second. Instead of storing the color value of every pixel in every frame, modern video codecs compress video by exploiting the fact that most of a video generally changes gradually from frame to frame.

The encoded version of a video consists of intra-coded frames (I-frames), which are conventional digital images; motion vectors, which define how 8-by-8 (or 16-by-16) blocks of pixel values in I-frames move from frame to frame; and residuals, which update individual pixel values that cannot be recovered from the relatively coarse motion vectors. Because the motion vectors are sparsely assigned to 8-by-8-pixel blocks, they require only 1/64 as much memory as conventional images. This sparsity means that encoded video files can be stored far more efficiently than fully decoded RGB frames can.
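The toy sketch below illustrates the idea of block-based motion compensation (it is not an actual codec): each 8-by-8 block of the predicted frame is copied from the previous frame at a location offset by that block’s motion vector, and only a residual would then be stored to correct the remaining error.

```python
import numpy as np

def motion_compensate(prev_frame, motion_vectors, block=8):
    """Predict the next frame by shifting 8x8 blocks of the previous frame
    according to per-block (dy, dx) motion vectors."""
    H, W = prev_frame.shape
    pred = np.zeros_like(prev_frame)
    for by in range(0, H, block):
        for bx in range(0, W, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # Source block location in the previous frame (clamped to the image).
            sy = np.clip(by + dy, 0, H - block)
            sx = np.clip(bx + dx, 0, W - block)
            pred[by:by + block, bx:bx + block] = prev_frame[sy:sy + block, sx:sx + block]
    return pred

H, W, block = 64, 64, 8
prev_frame = np.random.default_rng(0).integers(0, 256, (H, W), dtype=np.uint8)
mv = np.zeros((H // block, W // block, 2), dtype=int)   # one (dy, dx) per block
mv[...] = (1, -2)                                       # e.g., a global camera pan
pred = motion_compensate(prev_frame, mv)

# One vector per 8x8 block: 64x fewer entries than storing every pixel.
print(prev_frame.size, "pixels vs.", mv.shape[0] * mv.shape[1], "motion vectors")
```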



We exploit the design of modern video codecs to obtain motion information efficiently. Motion vectors encode displacements of pixel blocks in two dimensions. In our paper, we analyzed the average motion of the foreground and background in popular Internet video datasets and found that, on average, motion was higher in the foreground.

We thus use motion vectors as a proxy for determining the regions of interest to mask. Our MGM algorithm masks a rectangular region around the highest-motion area in each frame, and the model is asked to reconstruct this 3-D volume of masked video.
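As an illustration of this idea, the simplified sketch below (our own, not the exact MGM implementation) computes each block’s motion magnitude from its motion vector, finds the highest-motion block in every frame, and masks a rectangular region centered on it, yielding a spatiotemporal mask volume that follows the motion.

```python
import numpy as np

def motion_guided_masks(motion_vectors, mask_blocks=4):
    """motion_vectors: array of shape (T, Hb, Wb, 2) with per-block (dy, dx)
    displacements. Returns a boolean block-level mask volume of shape (T, Hb, Wb)."""
    T, Hb, Wb, _ = motion_vectors.shape
    masks = np.zeros((T, Hb, Wb), dtype=bool)
    half = mask_blocks // 2
    for t in range(T):
        # Motion magnitude per block; the largest value marks the region of interest.
        mag = np.linalg.norm(motion_vectors[t], axis=-1)
        cy, cx = np.unravel_index(np.argmax(mag), mag.shape)
        y0 = np.clip(cy - half, 0, Hb - mask_blocks)
        x0 = np.clip(cx - half, 0, Wb - mask_blocks)
        masks[t, y0:y0 + mask_blocks, x0:x0 + mask_blocks] = True
    return masks

# Toy example: 16 frames of an 8 x 8 grid of blocks with random motion.
mv = np.random.default_rng(0).normal(size=(16, 8, 8, 2))
masks = motion_guided_masks(mv, mask_blocks=4)
```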

In our experiments, we compared MGM to six previous masked-video approaches. All of these approaches use random masking, which is not spatiotemporally continuous. In our ablation studies, we also tested other masking schemes with different degrees of spatiotemporal continuity and motion guidance, to capture the extent to which motion guidance helps improve video representation learning.

We evaluated MGM on two different datasets, using the trained model to predict masked image features in an evaluation set, and found that it outperformed the prior masked-video approaches across the board. It can also match the performance of the previously best-performing method after training on only one-third as much data.

A comparison of our approach to six previous masked-video approaches, plotting accuracy against the number of training epochs.

We then compared representations generated using our approach to the random-masking baseline on three other tasks, achieving relative improvements of up to 5%. This suggests that motion-guided masking is better at capturing semantic information about video content than other video-masking techniques.

In summary, we present MGM, a motion-aware masking algorithm for video that exploits efficient motion guidance already present in popular video formats to improve video representation learning. For more details, see our ICCV 2023 paper.
