Motivations
Long sequence time series forecasting demands a high prediction capacity from the model, i.e. the ability to efficiently capture precise long-range dependencies between input and output. Transformer-based models capture these long-range dependencies within sequences through the attention mechanism, and they can predict a long sequence in a single forward pass (seq2seq) rather than step by step like an RNN. However, the vanilla Transformer has three significant limitations when solving the long time series forecasting problem:
- The quadratic computation of self-attention
- The memory bottleneck in stacking layers for long inputs
- The speed plunge in predicting long outputs
Informer aims to reduce the computation of self-attention so that it can handle longer inputs and outputs. Treating sequence prediction as a special seq2seq task, we illustrate the pipeline and some of the details of how Informer works.
Informer
The Informer model overview is as follows; it consists of three components: `DataEmbedding`, `Encoder`, and `Decoder`. Notice that the input `seq_x` and output `seq_y` overlap, and the length of the predicted sequence is denoted as `pred_len`.
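To make the overlap concrete, below is a minimal sketch of how these tensors are typically laid out, using Informer's usual hyperparameter names (`seq_len`, `label_len`, `pred_len`); the exact slicing is illustrative, not the official implementation.

```python
import torch

# Hypothetical sketch of how Informer-style inputs overlap.
# seq_len, label_len and pred_len follow Informer's usual naming;
# the slicing here is an illustrative assumption, not the official code.
seq_len, label_len, pred_len = 96, 48, 24
batch, n_features = 32, 7

full_series = torch.randn(batch, seq_len + pred_len, n_features)

# Encoder input: the first seq_len steps.
seq_x = full_series[:, :seq_len, :]

# Decoder target window: the last label_len known steps plus pred_len future
# steps, so seq_x and seq_y overlap on the label_len segment.
seq_y = full_series[:, seq_len - label_len:, :]

# Decoder input: known label segment followed by zero placeholders that the
# model fills in with its predictions in one forward pass.
dec_inp = torch.cat(
    [seq_y[:, :label_len, :], torch.zeros(batch, pred_len, n_features)],
    dim=1,
)
print(seq_x.shape, seq_y.shape, dec_inp.shape)
# torch.Size([32, 96, 7]) torch.Size([32, 72, 7]) torch.Size([32, 72, 7])
```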
`DataEmbedding` consists of three embedding approaches: 1) value embedding captures the features of the sequence with a `Conv1d`; 2) position embedding is the same as in the vanilla Transformer; 3) temporal embedding generates data-agnostic temporal features from the given timestamp (e.g. 2022-05-15 14:43).
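To make the three pathways concrete, here is a simplified PyTorch sketch of a `DataEmbedding`-style module; the layer choices (kernel size, number of temporal features) are assumptions for illustration, not the official Informer code.

```python
import math
import torch
import torch.nn as nn

class DataEmbeddingSketch(nn.Module):
    """Simplified sketch of an Informer-style DataEmbedding (not the official code)."""

    def __init__(self, c_in, d_model, n_time_feats=4):
        super().__init__()
        # 1) Value embedding: Conv1d over the time axis lifts raw features to d_model.
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        # 2) Position embedding: fixed sinusoidal table, as in the vanilla Transformer.
        pe = torch.zeros(5000, d_model)
        pos = torch.arange(0, 5000).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # 3) Temporal embedding: linear map of date features (e.g. month, day, weekday, hour).
        self.temporal_emb = nn.Linear(n_time_feats, d_model)

    def forward(self, x, x_mark):
        # x: [batch, seq_len, c_in], x_mark: [batch, seq_len, n_time_feats]
        val = self.value_emb(x.permute(0, 2, 1)).permute(0, 2, 1)
        pos = self.pe[: x.size(1)].unsqueeze(0)
        tmp = self.temporal_emb(x_mark)
        return val + pos + tmp

emb = DataEmbeddingSketch(c_in=7, d_model=512)
out = emb(torch.randn(2, 96, 7), torch.randn(2, 96, 4))
print(out.shape)  # torch.Size([2, 96, 512])
```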
The Encoder and Decoder of Informer are similar to those of the vanilla Transformer, except that Informer uses `ProbAttention` rather than `FullAttention` to exploit the sparsity of the attention matrix. `ProbAttention` randomly samples a fixed number of keys for each query to estimate its attention score distribution. If a query's sampled scores are close to a uniform distribution, that query contributes little to the output. `ProbAttention` uses a KL-divergence-based measure of this distribution and keeps only the top-k queries to obtain the final attention matrix. To reduce the redundancy of learned features, Informer also adds a `ConvLayer` to distill the knowledge. The structure of a single encoder is shown below.
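To illustrate the query-selection idea, here is a simplified PyTorch sketch of the sparsity measurement; the max-minus-mean criterion approximates the KL divergence from a uniform distribution, and the tensor shapes and sampling details are assumptions rather than the official `ProbAttention` implementation.

```python
import torch

def prob_sparse_topk_queries(Q, K, n_sample, n_top):
    """Simplified sketch of ProbAttention's query selection (not the official code).

    For each query, attention scores against a random subset of keys are used to
    compute a sparsity measure M = max(score) - mean(score); queries whose score
    distribution is close to uniform get a small M and are dropped.
    """
    B, H, L_q, D = Q.shape
    L_k = K.shape[2]
    # Randomly sample n_sample keys for every query.
    idx = torch.randint(L_k, (L_q, n_sample))
    K_sample = K[:, :, idx, :]                      # [B, H, L_q, n_sample, D]
    scores = torch.einsum("bhqd,bhqsd->bhqs", Q, K_sample) / (D ** 0.5)
    # Sparsity measure: max minus mean approximates the KL divergence
    # between the score distribution and a uniform one.
    M = scores.max(-1).values - scores.mean(-1)     # [B, H, L_q]
    # Keep only the top-u "active" queries for the full attention computation.
    top_idx = M.topk(n_top, dim=-1).indices         # [B, H, n_top]
    return top_idx

Q = torch.randn(2, 8, 96, 64)
K = torch.randn(2, 8, 96, 64)
print(prob_sparse_topk_queries(Q, K, n_sample=25, n_top=25).shape)  # torch.Size([2, 8, 25])
```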
The ablation study proves the effect of each component of Informer; Tables 1 and 2 report the performance of Informer on five cases.
However, Informer does not consider the impact of different data embedding schemes. We conduct extra experiments on the ETTh1 dataset to measure the precise contribution of these embedding methods. Although the results suggest that all three embeddings are indispensable, we have to stress that some data-agnostic embeddings may distort the information in the original sequences (e.g. position embedding introduces a lot of local fluctuation in the prediction). We should consider whether there is a better data embedding method.
| Embedding | MSE | MAE |
|---|---|---|
| Value + Position + Temporal | 0.529 | 0.521 |
| Value + Temporal | 0.601 | 0.574 |
| Value + Position | 0.520 | 0.529 |
| Value | 1.056 | 0.817 |