FNet
The vanilla attention mechanism connects each token in the input to every other token through a relevance-weighted basis, which requires substantial computation and memory. Synthesizer and other related work have challenged the necessity of the dot-product attention sublayer. FNet likewise proposes an alternative to attention: it directly applies a parameter-free Fourier Transform to capture token-wise interactions. The original paper provides an illustration and reference code for FNet.
[Figure and code from the FNet paper: token mixing via the Fourier Transform]
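To make the idea concrete, here is a minimal NumPy sketch of an FNet-style encoder block. The mixing sublayer simply applies a 2D FFT over the sequence and hidden dimensions and keeps the real part, with no learnable parameters; the residual/layer-norm wiring and the two-layer feed-forward sublayer follow the standard Transformer layout described in the paper. Function names, shapes, and the ReLU activation are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def fourier_mixing(x):
    # x: (seq_len, hidden_dim) token representations.
    # FNet's mixing sublayer: 2D FFT over both axes, keep the real part.
    # No parameters are learned here.
    return np.fft.fft2(x).real

def layer_norm(x, eps=1e-6):
    # Normalize each token vector over the hidden dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fnet_encoder_block(x, w1, b1, w2, b2):
    # Fourier mixing sublayer, then a feed-forward sublayer,
    # each wrapped in a residual connection and layer norm.
    x = layer_norm(x + fourier_mixing(x))
    ff = np.maximum(x @ w1 + b1, 0.0) @ w2 + b2  # two-layer MLP with ReLU
    return layer_norm(x + ff)
```

Because `fourier_mixing` has no weights, the only trainable parameters in the block belong to the feed-forward sublayer, which is the source of FNet's speed and memory savings over self-attention.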
Notably, modified variants of FNet (replacing the FFT with a DCT, or adding extra learnable parameters) degraded accuracy and reduced training stability. Existing token-mixing approaches warrant more experiments and a more precise explanation. Further experiments and results for FNet can be found in the original paper.