Motivations
Long sequence time series forecasting demands a high prediction capacity from the model, i.e. the ability to efficiently capture precise long-range dependencies between input and output. Transformer-based models capture these long-range dependencies within sequences through the attention mechanism, and they can predict a long sequence in a single forward pass (seq2seq) rather than step by step like an RNN. However, the vanilla Transformer has three significant limitations when solving the long time series forecasting problem:
- The quadratic computation of self-attention
- The memory bottleneck in stacking layers for long inputs
- The speed plunge in predicting long outputs
Informer aims to reduce the computation of self-attention so that it can handle longer inputs and outputs. Treating sequence prediction as a special seq2seq task, we illustrate the pipeline and some of the details of how Informer works.
Informer
The Informer model overview is as follows; it consists of three components: `DataEmbedding`, `Encoder`, and `Decoder`. Notice that the input `seq_x` and output `seq_y` overlap, and the length of the predicted sequence is denoted as `pred_len`.
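To make the overlap concrete, below is a minimal sketch of how these tensors are typically laid out, using Informer's usual hyperparameter names (`seq_len`, `label_len`, `pred_len`); the exact slicing is illustrative, not the official implementation.

```python
import torch

# Hypothetical sketch of how Informer-style inputs overlap.
# seq_len, label_len and pred_len follow Informer's usual naming;
# the slicing here is an illustrative assumption, not the official code.
seq_len, label_len, pred_len = 96, 48, 24
batch, n_features = 32, 7

full_series = torch.randn(batch, seq_len + pred_len, n_features)

# Encoder input: the first seq_len steps.
seq_x = full_series[:, :seq_len, :]

# Decoder target window: the last label_len known steps plus pred_len future
# steps, so seq_x and seq_y overlap on the label_len segment.
seq_y = full_series[:, seq_len - label_len:, :]

# Decoder input: known label segment followed by zero placeholders that the
# model fills in with its predictions in one forward pass.
dec_inp = torch.cat(
    [seq_y[:, :label_len, :], torch.zeros(batch, pred_len, n_features)],
    dim=1,
)
print(seq_x.shape, seq_y.shape, dec_inp.shape)
# torch.Size([32, 96, 7]) torch.Size([32, 72, 7]) torch.Size([32, 72, 7])
```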
`DataEmbedding` consists of three embedding approaches: 1) value embedding captures the features of the sequence with a `Conv1d`; 2) position embedding is the same as in the vanilla Transformer; 3) temporal embedding generates data-agnostic temporal features from the given timestamp (e.g. 2022-05-15 14:43).
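To make the three pathways concrete, here is a simplified PyTorch sketch of a `DataEmbedding`-style module; the layer choices (kernel size, number of temporal features) are assumptions for illustration, not the official Informer code.

```python
import math
import torch
import torch.nn as nn

class DataEmbeddingSketch(nn.Module):
    """Simplified sketch of an Informer-style DataEmbedding (not the official code)."""

    def __init__(self, c_in, d_model, n_time_feats=4):
        super().__init__()
        # 1) Value embedding: Conv1d over the time axis lifts raw features to d_model.
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        # 2) Position embedding: fixed sinusoidal table, as in the vanilla Transformer.
        pe = torch.zeros(5000, d_model)
        pos = torch.arange(0, 5000).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # 3) Temporal embedding: linear map of date features (e.g. month, day, weekday, hour).
        self.temporal_emb = nn.Linear(n_time_feats, d_model)

    def forward(self, x, x_mark):
        # x: [batch, seq_len, c_in], x_mark: [batch, seq_len, n_time_feats]
        val = self.value_emb(x.permute(0, 2, 1)).permute(0, 2, 1)
        pos = self.pe[: x.size(1)].unsqueeze(0)
        tmp = self.temporal_emb(x_mark)
        return val + pos + tmp

emb = DataEmbeddingSketch(c_in=7, d_model=512)
out = emb(torch.randn(2, 96, 7), torch.randn(2, 96, 4))
print(out.shape)  # torch.Size([2, 96, 512])
```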
The Encoder and Decoder of Informer are similar to those of the vanilla Transformer, except that Informer uses `ProbAttention` rather than `FullAttention` to exploit the sparsity of the attention matrix. `ProbAttention` randomly samples a fixed number of keys for each query to estimate its attention score distribution. If a query's sampled scores are close to a uniform distribution, that query contributes little to the output. `ProbAttention` uses a KL-divergence-based measure of this distribution and keeps only the top-k queries to obtain the final attention matrix. To reduce the redundancy of learned features, Informer also adds a `ConvLayer` to distill the knowledge. The structure of a single encoder is shown below.
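To illustrate the query-selection idea, here is a simplified PyTorch sketch of the sparsity measurement; the max-minus-mean criterion approximates the KL divergence from a uniform distribution, and the tensor shapes and sampling details are assumptions rather than the official `ProbAttention` implementation.

```python
import torch

def prob_sparse_topk_queries(Q, K, n_sample, n_top):
    """Simplified sketch of ProbAttention's query selection (not the official code).

    For each query, attention scores against a random subset of keys are used to
    compute a sparsity measure M = max(score) - mean(score); queries whose score
    distribution is close to uniform get a small M and are dropped.
    """
    B, H, L_q, D = Q.shape
    L_k = K.shape[2]
    # Randomly sample n_sample keys for every query.
    idx = torch.randint(L_k, (L_q, n_sample))
    K_sample = K[:, :, idx, :]                      # [B, H, L_q, n_sample, D]
    scores = torch.einsum("bhqd,bhqsd->bhqs", Q, K_sample) / (D ** 0.5)
    # Sparsity measure: max minus mean approximates the KL divergence
    # between the score distribution and a uniform one.
    M = scores.max(-1).values - scores.mean(-1)     # [B, H, L_q]
    # Keep only the top-u "active" queries for the full attention computation.
    top_idx = M.topk(n_top, dim=-1).indices         # [B, H, n_top]
    return top_idx

Q = torch.randn(2, 8, 96, 64)
K = torch.randn(2, 8, 96, 64)
print(prob_sparse_topk_queries(Q, K, n_sample=25, n_top=25).shape)  # torch.Size([2, 8, 25])
```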
The ablation study proves the effect of each component of Informer; Tables 1 and 2 report the performance of Informer on five cases.
However, Informer does not consider the impact of different data embedding schemes. We conduct extra experiments on the ETTh1 dataset to measure the precise contribution of these embedding methods. Although the results suggest that all three embeddings are indispensable, we have to stress that some data-agnostic embeddings may distort the information in the original sequences (e.g. position embedding introduces a lot of local fluctuation in the prediction). We should consider whether there is a better data embedding method.
| Embedding | MSE | MAE |
|---|---|---|
| Value + Position + Temporal | 0.529 | 0.521 |
| Value + Temporal | 0.601 | 0.574 |
| Value + Position | 0.520 | 0.529 |
| Value | 1.056 | 0.817 |