CoST

PAPER: CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting

Motivations

  Previous work on time series representation learning mainly focuses on three aspects:

  • Reducing the complexity of Transformers (e.g., sparse attention) for long-sequence modeling
  • Sampling proper contrastive pairs
  • Finding better data augmentation paradigms

  In fact, none of the approaches above truly leverages the structural information of time series. By contrast, recent work on video representation learning introduces structural priors into training, e.g., by decoupling a video into context and motion. An intuitive idea is therefore to identify structural information or invariant features of time series and take them as inputs.

  Compared to video, images, or even natural language, time series data are more redundant. Jointly learning features end-to-end from observed data may lead the model to over-fit and capture spurious correlations from the unpredictable noise. Some existing methods therefore formulate a time series as a sum of trend, seasonal, and error variables, as follows (a toy simulation of this decomposition is sketched after the variable definitions below):

\[X = X^* + E, \qquad X^* = T + S\]

  • $X$: observed data
  • $E$: unpredictable error/noise of $X$
  • $X^*$: error-free latent variable
  • $T,S$: trend and seasonal variable
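
  As a toy illustration (not the paper's code), this generative assumption can be simulated with a synthetic series; the trend slope, seasonal period, and noise scale below are made-up values:

```python
import numpy as np

# Toy simulation of the assumed generative structure (all values made up):
# observed X = error-free latent X* + noise E, where X* = trend T + seasonal S.
rng = np.random.default_rng(0)
t = np.arange(200)

T = 0.05 * t                              # trend variable T (slow linear drift)
S = 2.0 * np.sin(2 * np.pi * t / 24)      # seasonal variable S (period 24)
E = rng.normal(scale=0.5, size=t.shape)   # unpredictable error/noise E

X_star = T + S                            # error-free latent variable X*
X = X_star + E                            # observed data X
```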

  CoST assumes that the seasonal and trend mechanisms do not influence or inform each other: even if one mechanism changes due to a distribution shift, the other remains unchanged. In other words, CoST supposes that trend and seasonality are more invariant and robust features, from which better representations of the raw data can be learned.

CoST

CoST.png

  CoST takes a causal TCN as its backbone encoder. Trend representations are learned in the time domain, while seasonal representations are learned in the frequency domain.

  • TFD (Trend Feature Disentangler): a mixture of CausalConv experts with look-back windows of different sizes, followed by a pooling layer
  • SFD (Seasonal Feature Disentangler): a learnable Fourier layer (a simplified sketch of both disentanglers follows)
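
  Below is a simplified PyTorch-style sketch of the two disentanglers; the kernel sizes, channel layout, and initialization are placeholders rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1D convolution that only looks at past time steps (left padding)."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))


class TFD(nn.Module):
    """Trend disentangler: average over causal-conv experts with different look-back windows."""
    def __init__(self, channels, kernel_sizes=(1, 2, 4, 8, 16)):
        super().__init__()
        self.experts = nn.ModuleList(CausalConv1d(channels, k) for k in kernel_sizes)

    def forward(self, x):  # x: (batch, channels, time)
        return torch.stack([expert(x) for expert in self.experts]).mean(dim=0)


class SFD(nn.Module):
    """Seasonal disentangler: learnable per-frequency linear map in the Fourier domain."""
    def __init__(self, channels, seq_len):
        super().__init__()
        freqs = seq_len // 2 + 1
        self.weight = nn.Parameter(
            0.02 * torch.randn(freqs, channels, channels, dtype=torch.cfloat))

    def forward(self, x):  # x: (batch, channels, time)
        xf = torch.fft.rfft(x, dim=-1)                      # (batch, channels, freqs)
        yf = torch.einsum('bcf,fcd->bdf', xf, self.weight)  # per-frequency linear transform
        return torch.fft.irfft(yf, n=x.size(-1), dim=-1)    # back to the time domain


# Toy usage on backbone features h: (batch, channels, time).
h = torch.randn(4, 64, 96)
trend, seasonal = TFD(64)(h), SFD(64, 96)(h)
```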

  Both TFD and SFD are optimized with a contrastive loss. Different views are produced by scaling, shifting, and jittering (a sketch of these augmentations follows the loss below).

  • Positive samples: different augmented views of the same series
  • Negative samples: the same view of other series in the same mini-batch
\[L=L_{\text{time}}+\frac{\alpha}{2}(L_{\text{amp}}+L_{\text{phase}})\]
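
  Below is a minimal sketch of the three augmentations used to build the two views; the transform magnitudes and shapes are illustrative placeholders, not the paper's tuned settings:

```python
import torch

def augment(x, sigma=0.5):
    """One randomly transformed view of a batch of series x: (batch, time, channels)."""
    scale = 1.0 + sigma * torch.randn(x.size(0), 1, x.size(-1))   # per-series scaling
    shift = sigma * torch.randn(x.size(0), 1, x.size(-1))         # per-series offset
    jitter = sigma * torch.randn_like(x)                          # pointwise noise
    return x * scale + shift + jitter

x = torch.randn(8, 96, 7)                 # toy batch: 8 series, 96 steps, 7 variables
view_q, view_k = augment(x), augment(x)   # two views of the same series -> positive pair
```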

  Note that CoST employs a MoCo-style dynamic dictionary (a queue of encoded keys) to store negative samples. To avoid dealing with complex numbers in the SFD, CoST computes the frequency-domain losses directly on the amplitude and phase representations.
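
  A rough sketch of how the frequency-domain losses can avoid complex arithmetic, using a simplified batch-wise InfoNCE over amplitude and phase (the function name and shapes here are illustrative, not the official implementation):

```python
import torch
import torch.nn.functional as F

def freq_domain_losses(z_q, z_k):
    """Amplitude/phase contrastive losses on seasonal representations z_q, z_k
    of two views, shape (batch, time, dim). Positives: the same series in the
    other view; negatives: the other series in the batch (simplified InfoNCE)."""
    fq = torch.fft.rfft(z_q, dim=1)         # complex spectrum of view q
    fk = torch.fft.rfft(z_k, dim=1)         # complex spectrum of view k

    def nce(a, b):                          # a, b: real tensors (batch, freq, dim)
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = torch.einsum('bfd,cfd->fbc', a, b)              # per-frequency similarities
        labels = torch.arange(a.size(0)).expand(logits.size(0), -1)
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

    l_amp = nce(fq.abs(), fk.abs())         # amplitude loss, no complex arithmetic
    l_phase = nce(fq.angle(), fk.angle())   # phase loss
    return l_amp, l_phase                   # total: L = L_time + alpha / 2 * (l_amp + l_phase)
```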

Experiments

CoST_alpha.png

  The sensitivity analysis shows that the seasonal component is less important than the trend component in most cases. In the other experiments, CoST sets $\alpha$ to 5e-4.

CoST_lab.png

  The ablation study shows the effect of each component in CoST; there, TFD denotes using just a single AR expert instead of the mixture. So far, Transformer-based models are not better than existing methods on classification and forecasting tasks.

CoST_ETT.png

Compared to TS2Vec

  TS2Vec and CoST are both contrastive representation learning approaches, but they adopt different data augmentation methods and hierarchical tricks:

|                    | TS2Vec                           | CoST                           |
|--------------------|----------------------------------|--------------------------------|
| Main task          | classification                   | forecasting                    |
| Data augmentation  | clip                             | scale, shift and jitter        |
| Hierarchical trick | pooling features in optimization | multi-scale extractor like FPN |

CoST_3.png

  Apparently, CoST outperforms TS2Vec thanks to the new structural priors it introduces in the frequency domain. However, more experiments are still needed to compare the effects of the different data augmentations and the different hierarchical tricks (e.g., DilatedConv).