Metaformer
Metaformer can be viewed as a special implementation of Synthesizer for computer vision, which is illustrated below. In vision tasks we generally need a token mixer and a channel mixer to learn token-wise and channel-wise information. However, Synthesizer shows that a specific, carefully designed mixer is not necessary: a randomly initialized matrix can already capture token-token interaction. It is worth mentioning that AlterNet also thoroughly investigates how Vision Transformers work, finding that the attention in ViT acts as a trainable spatial smoothing that ensembles feature maps and flattens the loss landscape; the improved performance and robustness come from this data-specific aggregation rather than from long-range dependency (notably, MLP-Mixer underperforms ViTs). See the original paper for more details.
[figure: the Metaformer structure]
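To make the "randomly initialized matrix can capture token-token interaction" point above concrete, here is a minimal sketch of a Random-Synthesizer-style token mixer. It is not the official implementation; the names (RandomTokenMixer, num_tokens, dim) are my own, and the value projection is one plausible choice.

```python
# Sketch: token mixing via a trainable, randomly initialized matrix
# (Random Synthesizer idea) instead of query-key attention.
import torch
import torch.nn as nn

class RandomTokenMixer(nn.Module):
    def __init__(self, num_tokens, dim):
        super().__init__()
        # Trainable token-token mixing matrix, initialized randomly;
        # it does not depend on the input content.
        self.mixing = nn.Parameter(torch.randn(num_tokens, num_tokens) / num_tokens ** 0.5)
        self.proj = nn.Linear(dim, dim)  # value projection over channels

    def forward(self, x):                      # x: (batch, num_tokens, dim)
        attn = self.mixing.softmax(dim=-1)     # normalize rows into mixing weights
        return attn @ self.proj(x)             # mix information across tokens

x = torch.randn(2, 196, 384)                   # e.g. 14x14 patch tokens, 384 channels
print(RandomTokenMixer(196, 384)(x).shape)     # torch.Size([2, 196, 384])
```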
Metaformer leverages the core idea of Synthesizer and proposes a general structure for vision tasks. In their experiments, a simple pooling operation is used as the token mixer and achieves good performance on several vision tasks. The figure below shows the ablation of Poolformer on the ImageNet-1K classification benchmark. More details can be found in the original paper. I recommend reading Synthesizer first to understand the intrinsic mechanism of Metaformer.
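For reference, here is a minimal sketch of a Poolformer-style block under the Metaformer view: average pooling as the token mixer and a two-layer MLP as the channel mixer. This is only an illustration, not the authors' code; the names (PoolFormerBlock, pool_size, mlp_ratio) are mine, and subtracting the input from the pooled output is one common way to keep the residual branch from double-counting the identity.

```python
# Sketch: Metaformer block with pooling as the token mixer.
import torch
import torch.nn as nn

class PoolFormerBlock(nn.Module):
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)         # channel-wise norm for (B, C, H, W)
        # Average pooling as the token mixer; same spatial size is preserved.
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(                  # channel mixer: per-token MLP via 1x1 convs
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):                          # x: (batch, dim, H, W)
        y = self.norm1(x)
        x = x + (self.pool(y) - y)                 # token mixing + residual
        x = x + self.mlp(self.norm2(x))            # channel mixing + residual
        return x

x = torch.randn(2, 64, 56, 56)
print(PoolFormerBlock(64)(x).shape)                # torch.Size([2, 64, 56, 56])
```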