Motivations
For self-supervised representation learning, denoising/masked autoencoding methods such as BERT and MAE have proven effective at learning good intermediate representations from raw data. By relying on little domain knowledge and few inductive biases, masked autoencoding forces the model to learn useful structure almost purely from the data, which improves the scalability and transferability of the pretrained model.
Note that the masking ratio in masked autoencoding methods is tied to the information redundancy of the data. BERT uses a masking ratio of 15% for language while MAE uses 75% for images, suggesting that images are far more redundant than language. Video adds a temporal dimension with strong frame-to-frame correlation, so applying an MAE-style pretraining scheme to video requires choosing a suitable masking strategy and ratio.
VideoMAE
VideoMAE is a simple extension of MAE to spacetime data; its main components are described below.
As in MAE, VideoMAE's encoder is a vanilla ViT with no factorization or hierarchy, and it operates only on the visible set of embedded patches. The PatchEmbedding module is also taken from the vanilla ViT, with no extra problem-specific tokenizer, which keeps the injected domain knowledge to a minimum.
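The key computational point is that the encoder only ever sees the small visible subset of tokens. Below is a minimal PyTorch sketch of that masking-and-encoding step, assuming spacetime-agnostic random masking; the class name, the linear patch embedding, and the generic `encoder` argument are illustrative stand-ins rather than the paper's exact implementation (positional embeddings and the decoder are omitted).

```python
import torch
import torch.nn as nn


class VisibleTokenEncoder(nn.Module):
    """Embeds spacetime patches and encodes only the randomly kept (visible) subset."""

    def __init__(self, patch_dim, embed_dim, encoder, mask_ratio=0.9):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)  # vanilla ViT-style linear projection
        self.encoder = encoder                              # plain ViT blocks, no factorization or hierarchy
        self.mask_ratio = mask_ratio

    def forward(self, patches):
        # patches: (B, N, patch_dim), where N = T' * H' * W' spacetime patches
        B, N, _ = patches.shape
        tokens = self.patch_embed(patches)

        # Spacetime-agnostic sampling: keep a random subset of tokens,
        # with no structure imposed across frames or spatial positions.
        n_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=tokens.device)
        keep_ids = noise.argsort(dim=1)[:, :n_keep]         # indices of visible tokens
        visible = torch.gather(
            tokens, 1, keep_ids.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        )

        # Only the visible ~10% of tokens pass through the encoder,
        # which is where most of the pretraining speedup comes from.
        return self.encoder(visible), keep_ids


# Example usage with a plain Transformer encoder standing in for the ViT blocks.
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12
)
model = VisibleTokenEncoder(patch_dim=2 * 16 * 16 * 3, embed_dim=768, encoder=blocks)
latent, keep_ids = model(torch.randn(2, 8 * 14 * 14, 2 * 16 * 16 * 3))
```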
VideoMAE compares different masking strategies and other component designs; the table below shows some noteworthy results. (a) shows the influence of the masking ratio jointly with the pre-training length: spacetime-agnostic random sampling can be more effective than structure-aware sampling strategies. (b) shows the impact of different reconstruction targets: using per-patch normalized pixels as the target, which adds no extra inductive bias, performs better (a sketch of this normalization follows this paragraph). (c) shows that strong data augmentation is unnecessary, because random temporal sampling already creates a large number of views. (d), (e), and (f) show that lighter decoder and training designs reach acceptable performance with efficient computation. More ablation studies and comparative experiments can be found in the original paper.
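For reference, the per-patch normalized pixel target mentioned in (b) can be sketched as below; the function names and the epsilon value are illustrative, and the loss is assumed to be a mean squared error computed on masked patches only, as in MAE.

```python
import torch


def per_patch_normalized_target(patches, eps=1e-6):
    # patches: (B, N_masked, patch_dim) raw pixel values of the masked patches.
    # Each patch is normalized by its own mean and variance.
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps).sqrt()


def reconstruction_loss(pred, raw_target):
    # MSE against per-patch normalized pixels, computed on masked patches only.
    return ((pred - per_patch_normalized_target(raw_target)) ** 2).mean()
```

Normalizing each patch by its own statistics acts as a simple local contrast normalization, which the MAE line of work reports to improve representation quality without introducing any handcrafted target design.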