Motivations
Modern deep vision architectures consist of layers that mix features (i) at a given spatial location, (ii) between different spatial locations, or both at once. In CNNs, (ii) is implemented with $N\times N$ convolutions (for $N>1$) and pooling, and neurons in deeper layers have a larger receptive field due to the stacking of such layers and downsampling. In particular, $1\times 1$ convolutions perform only (i), while larger kernels perform both (i) and (ii). In Vision Transformers and other attention-based architectures, self-attention layers perform both (i) and (ii), while the feed-forward networks (FFNs) perform only (i).
We can summarize the two types of feature mixing above as per-location operations (channel-mixing) and cross-location operations (token-mixing). The idea behind the MLP-Mixer architecture is to clearly separate these two operations and to implement both of them with MLPs, one applied across channels and one applied across tokens, as illustrated in the sketch below.
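To make the distinction concrete, the following sketch applies the two mixing operations to a tensor of shape (batch, tokens, channels). The dimensions (196 patches, 512 channels, hidden widths 2048 and 256) are illustrative choices, not values prescribed by the paper.

```python
import torch
import torch.nn as nn

# A toy batch of token embeddings: 2 images, 196 patches (tokens), 512 channels.
x = torch.randn(2, 196, 512)  # (batch, tokens, channels)

# Channel-mixing: an MLP applied to each token independently.
# nn.Linear acts on the last dimension, so every token's channel vector
# is transformed with the same shared weights.
channel_mix = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
y = channel_mix(x)  # (2, 196, 512)

# Token-mixing: the same idea applied across the token dimension.
# Transpose so that tokens form the last dimension, mix, then transpose back.
token_mix = nn.Sequential(nn.Linear(196, 256), nn.GELU(), nn.Linear(256, 196))
z = token_mix(x.transpose(1, 2)).transpose(1, 2)  # (2, 196, 512)

print(y.shape, z.shape)
```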
MLP-Mixer
The figure below illustrates the architecture of MLP-Mixer. MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). The computational complexity of MLP-Mixer is linear in the number of input patches, unlike ViT, whose complexity is quadratic. Also unlike ViT, Mixer does not use position embeddings, because the token-mixing MLPs are sensitive to the order of the input tokens. A runnable PyTorch implementation is sketched below.
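The sketch follows the structure described in the paper: a per-patch linear embedding, a stack of Mixer blocks (token-mixing MLP followed by channel-mixing MLP, each with LayerNorm and a skip connection), global average pooling, and a linear classifier head. The class and argument names (`MixerBlock`, `MLPMixer`, `token_dim`, `channel_dim`, etc.) and the default hyperparameters are my own illustrative choices, not an official reference implementation.

```python
import torch
import torch.nn as nn


class MlpBlock(nn.Module):
    """Two-layer MLP with GELU, applied along the last dimension."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class MixerBlock(nn.Module):
    """One Mixer layer: token-mixing MLP then channel-mixing MLP,
    each wrapped in LayerNorm and a residual connection."""
    def __init__(self, num_patches, dim, token_dim, channel_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = MlpBlock(num_patches, token_dim)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = MlpBlock(dim, channel_dim)

    def forward(self, x):                        # x: (batch, patches, dim)
        # Token-mixing: transpose so the MLP acts across patches.
        y = self.norm1(x).transpose(1, 2)        # (batch, dim, patches)
        x = x + self.token_mlp(y).transpose(1, 2)
        # Channel-mixing: the MLP acts across the channels of each patch.
        x = x + self.channel_mlp(self.norm2(x))
        return x


class MLPMixer(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 dim=512, depth=8, token_dim=256, channel_dim=2048,
                 num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Per-patch linear embedding, implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(in_channels, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.blocks = nn.Sequential(*[
            MixerBlock(num_patches, dim, token_dim, channel_dim)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (batch, 3, H, W)
        x = self.patch_embed(x)                  # (batch, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (batch, patches, dim)
        x = self.blocks(x)
        x = self.norm(x).mean(dim=1)             # global average pooling over patches
        return self.head(x)


# Quick shape check on a random image batch.
model = MLPMixer()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1000])
```

Note the single transpose trick in `MixerBlock`: both mixing steps reuse the same `MlpBlock`, and only the dimension the MLP sees changes, which is what keeps the cost linear in the number of patches.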
More experiments and results can be found in the original paper. As an alternative to attention-based architectures, MLP-Mixer offers a simpler and more efficient structure. Attention may not be all you need.