
Title: TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare

URL Source: https://arxiv.org/html/2312.00817

Published Time: Tue, 10 Sep 2024 00:52:34 GMT

Ziyang Song, Qincheng Lu, Hao Xu, He Zhu, David Buckeridge (School of Population and Global Health, McGill University, 1140 Pine Avenue West, Montreal, Quebec, Canada H3A 1A3) and Yue Li (School of Computer Science, McGill University; Mila Quebec AI Institute, Montreal, QC, Canada H3A 2A7)

(2024)

Abstract.

Motivation: Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success in Natural Language Processing and Computer Vision domains. However, the development of PTMs on healthcare time-series data is lagging behind. This underscores the limitations of the existing transformer-based architectures, particularly their scalability to handle large-scale time series and ability to capture long-term temporal dependencies.

Methods: In this study, we present Timely Generative Pre-trained Transformer (TimelyGPT). TimelyGPT employs an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations. It also integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies.

Materials: We evaluated TimelyGPT on two large-scale healthcare time series datasets corresponding to continuous biosignals and irregularly-sampled time series, respectively: (1) the Sleep EDF dataset, consisting of over 1.2 billion timesteps collected from 197 whole-night polysomnographic sleep recordings containing EEG, EOG, EMG, and event marker channels; (2) the longitudinal healthcare administrative database PopHR, comprising 489,000 patients randomly sampled from the Montreal population.

Results: Our experiments show that during pre-training, TimelyGPT excels in learning time-series representations from continuously monitored biosignals and irregularly-sampled time series data commonly observed in longitudinal electronic health records (EHRs), which can aid in healthcare time-series forecasting tasks. In forecasting continuous biosignals, TimelyGPT achieves accurate extrapolation up to 6,000 timesteps of body temperature during the sleep stage transition, given a short look-up window (i.e., prompt) containing only 2,000 timesteps. For irregularly-sampled time series, TimelyGPT with a proposed time-specific inference demonstrates high top recall scores in predicting future diagnoses using early diagnostic records, effectively handling irregular intervals between clinical records. Together, we envision TimelyGPT to be useful in a broad spectrum of health domains, including long-term patient health state forecasting and patient risk trajectory prediction.

Time-series forecasting, Time-series pre-training, transfer learning, irregularly-sampled time series, biosignals, clinical diagnosis

Copyright: ACM licensed. Journal year: 2024. DOI: XXXXXXX.XXXXXXX. Conference: The 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics; Nov. 22-25, 2024; Shenzhen, Guangdong, PR China. ISBN: 978-1-4503-XXXX-X/18/06. CCS: Applied computing → Bioinformatics; Computing methodologies → Transfer learning.

1. Introduction

Time-series forecasting holds significant importance in healthcare, given its potential to trace patient health trajectories and predict medical diagnoses (Ma et al., 2023; Eldele et al., 2021). Healthcare time-series data fall into two primary categories: continuously monitored and irregularly-sampled. Continuous time series, such as biosignals, have been extensively studied in various applications, including health monitoring (Stirling et al., 2020), disease classification (Phan et al., 2021), and physical activity prediction (Reiss et al., 2019). Irregularly-sampled time series are common in clinical records, which are updated spontaneously at outpatient hospital visits or during inpatient hospital stays (Zhang et al., 2022). The key challenge is to extract meaningful contextualized representations from these time series to make accurate long-term forecasts. A promising approach is transfer learning (Ma et al., 2023): a model is first pre-trained on large-scale datasets to learn contextualized temporal representations, and this pre-trained model (PTM) is then fine-tuned to forecast target sequences.


Figure 1. TimelyGPT overview. a. TimelyGPT architecture. TimelyGPT consists of a convolution-subsampling tokenizer followed by L decoder layers, with a detailed workflow provided in Appendix B.3. b. Generative decoder with xPos embedding. Each decoder layer is coupled with extrapolatable position embedding (Section 3.1) that encodes trend and periodic patterns into representations, facilitating forecasting with extrapolation ability. c. Chunk-wise Retention. This module consists of parallel intra-chunk Retention and recurrent inter-chunk Retention, effectively handling long sequences in continuously monitored biosignals (Appendix B.2). d. Temporal Convolution (Section 3.3), which captures nuanced local interactions from time-series representations.

The recent impressive achievements of Transformer PTMs in Natural Language Processing (NLP) and Computer Vision (CV) domains have inspired growing interest in time-series Transformer-based PTMs. Time-Series Transformer (TST) uses a mask-and-reconstruction pre-training strategy to extract contextualized representations from time series (Zerveas et al., 2020). Cross-Reconstruction Transformer (CRT) learns temporal representations by dropping and reconstructing certain segments from time series (Zhang et al., 2023). Additionally, Transformer PTMs have been applied to traffic (Zhao et al., 2022), tabular (Padhi et al., 2021), and speech time-series (Liu et al., 2020, 2021).

Transfer learning, i.e., pre-training on large time-series data followed by fine-tuning for long-term time-series forecasting (LTSF) tasks, is a promising avenue. However, existing studies primarily train from scratch on limited data for LTSF tasks (Ma et al., 2023), often introducing tailored architectures and attention modules to extract complex temporal dependencies (Zhou et al., 2021; Wu et al., 2022; Zhou et al., 2022). Yet the scalability of these transformers on large datasets for LTSF tasks remains an open question (Kaplan et al., 2020). A recent study argues that the permutation-invariant nature of self-attention causes the loss of temporal information (Zeng et al., 2022). As a result, transformers often underperform convolution-based models, potentially because they struggle with local and multi-scale features (Yue et al., 2022; Tang et al., 2021). Overall, existing research on time-series transformers often lacks rigorous evaluation on large datasets and does not consistently outperform conventional approaches on small data.

In this study, we provide an in-depth analysis of existing time-series Transformer models, covering key aspects such as the attention mechanism and position embedding. We argue that the seeming inadequacy of current transformer-based models for time-series data is due to their inability to model large-scale time series. Once these challenges are resolved, we would observe the typical scaling law found in NLP and CV domains (Kaplan et al., 2020; Zhai et al., 2022). Motivated by this insight, we present a novel framework called Timely Generative Pre-trained Transformer (TimelyGPT) (Fig. 1) that utilizes an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations (Sun et al., 2022). TimelyGPT integrates recurrent attention (also known as Retention) and convolution modules for effectively capturing both global temporal dependencies and nuanced local interactions (Sun et al., 2023; Gulati et al., 2020).

The key contributions of our research are threefold:

  1. We employ extrapolatable xPos embedding (Fig. 1b) to encode both trend and periodic patterns into time-series representations, facilitating long-term forecasting.
  2. We extend recurrent attention (Fig. 1c) to handle both continuous and irregularly-sampled time series data.
  3. We introduce a convolution-subsampling tokenizer (Fig. 1a) to extract features from raw time series and a temporal convolution module (Fig. 1d) to sift local features among the timesteps.

Overall, our experimental results reveal that TimelyGPT effectively extrapolates temporal representations for long-term forecasting. This leads to highly effective pre-training on large-scale time-series biosignals and longitudinal EHR data, and ultimately superior task-specific fine-tuning performance compared to the existing methods.

2. Related work

2.1. Self-attention in Transformer

Transformer employs an encoder-decoder architecture composed of L layers of Transformer blocks (Vaswani et al., 2023). Each block consists of a self-attention layer followed by a feed-forward layer. For an input embedding X ∈ ℝ^{N×d}, where N is the number of tokens and d is the hidden size, the self-attention mechanism is defined as:

(1)  Attention(X) = Softmax(Q K^⊤ / √d) V

where Q, K, V = X W_Q, X W_K, X W_V ∈ ℝ^{N×d} are the Query, Key, and Value matrices, respectively. The attention mechanism allows Transformer to model long-term dependencies effectively, making it extensively utilized in NLP and CV domains.
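
Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the random inputs and toy sizes (N = 8 tokens, d = 4) are arbitrary choices.

```python
# Minimal sketch of Eq. (1): scaled dot-product self-attention.
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Attention(X) = Softmax(Q K^T / sqrt(d)) V for X of shape (N, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = X.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d) contextualized tokens

rng = np.random.default_rng(0)
N, d = 8, 4
X = rng.standard_normal((N, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
assert out.shape == (N, d)
```

Note the quadratic (N, N) score matrix: this is the bottleneck that motivates the attention-free alternatives discussed next.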

As one of the prominent time-series transformers, Conformer utilizes the self-attention mechanism to capture long-range global contexts in speech data (Gulati et al., 2020). When combined with convolution modules, Conformer enhances self-attention by exploiting fine-grained local patterns. Although self-attention is widely successful, its quadratic complexity with respect to sequence length has spurred the exploration of attention-free modules such as Multi-Layer Perceptron (MLP) (Tolstikhin et al., 2021), implicit long convolution (Poli et al., 2023), and Recurrent Neural Network (RNN) (Peng et al., 2023; Sun et al., 2023). In particular, RNN-based attention modules have scaled up to 14 billion parameters while maintaining competitive performance with linear training and constant inference complexities. These modules are particularly well-suited for time-series modeling because they effectively capture sequential dependencies (Gu et al., 2022). In this study, TimelyGPT integrates the Retention mechanism and convolution modules to effectively capture both global and local contexts.

2.2. Position embedding in Transformer

Transformer relies on position embedding to capture temporal relations, since the self-attention mechanism alone does not inherently discern token order (Shaw et al., 2018). Absolute position embedding, which commonly employs sinusoidal functions, adds positional encoding directly to token embeddings. However, this method only encodes discrete position indexes, making it less effective for continuous timescales such as trend and periodic patterns in time-series data (Zeng et al., 2022). In contrast, speech transformers utilize relative position embedding to handle continuous time by encoding positional information relative to token distances (Gulati et al., 2020). Rotary Position Embedding (RoPE), prevalent in numerous large language models (Brown et al., 2020; Touvron et al., 2023; Penedo et al., 2023), applies rotation matrices to encode time information from relative distances (Su et al., 2022). Additionally, the RNN-based Transformer Receptance Weighted Key Value (RWKV) uses exponential decay to encode time information based on relative distance (Peng et al., 2023). Bridging these techniques, xPos embedding utilizes both rotation and exponential decay to effectively capture long-term dependencies (Sun et al., 2022).

One challenge for Transformer is extrapolation, i.e., forecasting sequences longer than those seen during training, due to the difficulty of generalizing position embeddings to unseen positions (Press et al., 2022). Encoder-decoder architectures often concatenate the input sequence with a zero-padded placeholder for the target sequence and predict all timesteps at once, while encoder-only models encode the input sequence for forecasting (Zhou et al., 2021; Nie et al., 2023). Both approaches struggle with extrapolation and rely heavily on their linear layers for forecasting (Li et al., 2023), limiting their effectiveness in LTSF tasks. To address this issue, Attention with Linear Biases (ALiBi) adjusts attention with penalties linearly correlated with token distances (Press et al., 2022). Building on this, xPos embedding employs exponential decay to assign penalties based on relative distances (Sun et al., 2022). Consequently, xPos can handle inference lengths up to eight times the training length while maintaining comparable performance. Our TimelyGPT extends xPos from the NLP domain to long-term forecasting in the time-series domain, exploring the underlying mechanisms that enable temporal extrapolation.

3. TimelyGPT Methodology

Our proposed TimelyGPT effectively pre-trains on unlabeled data using a next-token prediction task to learn temporal representations (Fig. 1). It first processes time-series inputs using a convolution-subsampling tokenizer for token embedding (Fig. 1a). To extract meaningful temporal patterns, TimelyGPT integrates three technical contributions. First, it utilizes extrapolatable xPos embedding to encode trend and periodic patterns (Fig. 1b, Section 3.1). Second, it utilizes the Retention module to capture global content (Fig. 1c, Section 3.2). Third, it deploys a convolution module to capture local content (Fig. 1d, Section 3.3). Integrating the Retention and convolution modules enables the modeling of interactions between global and local content.

3.1. Extrapolatable position embedding encodes temporal patterns

As our first contribution, TimelyGPT employs xPos to encode relative positional information into token embeddings based on the distance n − m between tokens n and m (Sun et al., 2022). Given an input embedding X ∈ ℝ^{N×d} for N tokens at d embedding dimensions, xPos is integrated into the n-th token embedding X_n through the rotation matrix e^{iθn} and exponential decay γ^n:

(2)  Q̃_n K̃_m = X_n W_Q (γ e^{iθ})^{n−m} X_m W_K = γ^{n−m} Q̂_n K̂_m,  where Q̂_n = X_n W_Q e^{iθn} and K̂_m = X_m W_K e^{−iθm}

where θ and γ denote position-dependent rotation and decay hyperparameters (Su et al., 2022; Sun et al., 2022). The exponential decay γ^{n−m} determines the intensity of remembering historical information, while the rotation matrix e^{iθn} captures oscillation frequencies. This decay mechanism effectively attenuates the influence of distant tokens, aiding in capturing long-term dependencies and enhancing extrapolation ability (Sun et al., 2022).

While initially designed for language modeling, xPos provides a compelling way for time-series modeling, mirroring seasonal-trend decomposition (Fig. 1c). Its exponential decay γ^{n−m} naturally concentrates on recent times while diminishing the influence of distant times, reflecting the trend momentum of time series. The rotation matrix e^{iθ(n−m)} captures the seasonal component of time series through sinusoidal oscillations.

In healthcare time-series, xPos embedding effectively encodes both trend and periodic patterns crucial for modeling continuous biosignals and irregular clinical records. For continuous biosignals, trend patterns such as body temperature and vital signs are key health indicators, while electrocardiograms (ECGs) exhibit periodic patterns reflecting the physiological rhythms of the human body. In irregularly-sampled clinical records, age-related susceptibility to illnesses is observed in longitudinal population studies using administrative health data (Ahuja et al., 2022; Song et al., 2022). Some EHRs also exhibit periodic patterns, especially for chronic diseases like COPD, which have alternating exacerbation and recovery cycles.

We hypothesize that xPos embedding can encode these trend and periodic patterns into token embeddings. By harnessing xPos, TimelyGPT can effectively model the long-term dependencies essential for time-series forecasting. In Sections 6.2, 6.3, and 6.4, we validated this hypothesis and explored the underlying mechanisms driving temporal extrapolation for forecasting beyond the training length.
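
The relative-distance property of Eq. (2) can be checked numerically. The sketch below is illustrative only: θ, γ, and the two toy feature vectors are arbitrary choices, and complex scalars stand in for the rotation matrix.

```python
# Hedged sketch of xPos: rotation e^{iθn} plus decay γ^n on queries, with the
# inverse applied to keys, so the pairwise score depends only on n − m.
import numpy as np

theta, gamma = 0.5, 0.97
q = np.array([0.8 + 0.0j, -0.3 + 0.0j])   # toy query features (complex-valued)
k = np.array([0.5 + 0.0j, 0.9 + 0.0j])    # toy key features

def xpos_q(q, n):
    return q * (gamma * np.exp(1j * theta)) ** n    # decay and rotate query

def xpos_k(k, m):
    return k * (gamma * np.exp(1j * theta)) ** (-m) # inverse decay and rotation

def score(n, m):
    # equals γ^{n−m} e^{iθ(n−m)} (q · k), as in Eq. (2)
    return (xpos_q(q, n) * xpos_k(k, m)).sum()

# Same distance n − m gives the same score; larger gaps are down-weighted,
# mimicking trend-like forgetting of distant timesteps.
assert np.allclose(score(7, 3), score(5, 1))
assert abs(score(9, 1)) < abs(score(4, 1))
```

The decaying magnitude plays the role of the trend component, while the complex phase oscillates with the distance, mirroring the seasonal component described above.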

3.2. Retention for continuous and irregularly-sampled time series

As our second contribution, we adapt the Retention mechanism to effectively handle continuous time-series data (Sun et al., 2023). The Retention mechanism based on xPos can be reformulated as an RNN to naturally model time-series data. Given the xPos embedding defined in Section 3.1, the forward pass of the Retention mechanism can be computed in parallel over all tokens with linear training complexity:

Q̂_n = X_n W_Q e^{iθn},  K̂_m = X_m W_K e^{−iθm},  V = X W_V

(3)  Ret(X) = (Q̂ K̂^⊤ ⊙ D) V,  D_{nm} = γ^{n−m} if n ≥ m, and 0 if n < m

where the decay matrix D ∈ ℝ^{N×N} and rotation matrix e^{iθ(n−m)} encode trend and periodic patterns into token embeddings, taking into account the distance n − m between tokens. When reformulated as an RNN, the Retention in Eq. 3 can be manifested in a recurrent forward pass with constant inference complexity. This reformulated RNN excels at capturing sequential dependencies from the time series. To handle long sequences, we use chunk-wise Retention by segmenting the sequence into multiple non-overlapping chunks (Fig. 1c). Consequently, chunk-wise Retention maintains linear complexity for long sequences. We provide details about the three Retention forward passes in Appendix B.2.
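
The equivalence between the parallel and recurrent forward passes can be verified on toy data. This is an illustrative check, not the authors' code: rotation is omitted to keep the sketch real-valued, and the sizes and seed are arbitrary.

```python
# Check that the parallel Retention of Eq. (3), (Q K^T ⊙ D) V with
# D_{nm} = γ^{n−m} for n ≥ m, matches the recurrent form
# S_n = γ S_{n−1} + K_n^T V_n, Ret(X_n) = Q_n S_n.
import numpy as np

rng = np.random.default_rng(1)
N, d, gamma = 6, 3, 0.9
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Parallel form: causal decay mask over all token pairs, O(N^2 d).
n_idx, m_idx = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n_idx >= m_idx, gamma ** np.maximum(n_idx - m_idx, 0), 0.0)
parallel = (Q @ K.T * D) @ V

# Recurrent form: one O(d^2) state update per timestep, O(N d^2) total.
S = np.zeros((d, d))
recurrent = np.zeros((N, d))
for n in range(N):
    S = gamma * S + np.outer(K[n], V[n])
    recurrent[n] = Q[n] @ S

assert np.allclose(parallel, recurrent)
```

The loop never materializes the (N, N) score matrix, which is what makes the recurrent pass attractive for long biosignal sequences.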

To accommodate irregularly-sampled time series, we modify the Retention mechanism as follows. Given N samples {s_1, …, s_N}, each sample s_n is represented as a tuple (x_n, t_n), consisting of an observation x_n and a timestep t_n. Given two samples s_n and s_m, the decay mask D is adapted according to the time gap Δt_{n,m} = t_n − t_m:

(4)  Ret(X) = (Q K^⊤ ⊙ D) V,  D_{nm} = γ^{Δt_{n,m}} if t_n ≥ t_m, and 0 if t_n < t_m
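
The adapted decay mask can be constructed directly. The timestamps and γ below are made-up illustrative values, not the paper's settings; the point is that longer gaps between observations are discounted more strongly.

```python
# Illustrative construction of the decay mask in Eq. (4) for irregularly-
# sampled timesteps: D_{nm} = γ^{Δt_{n,m}} when t_n ≥ t_m, else 0.
import numpy as np

gamma = 0.95
t = np.array([0.0, 1.0, 4.0, 10.0])      # irregular observation times
dt = t[:, None] - t[None, :]             # Δt_{n,m} = t_n − t_m for all pairs
D = np.where(dt >= 0, gamma ** np.maximum(dt, 0.0), 0.0)

# A 9-unit gap (t=10 vs t=1) is discounted far more than a 1-unit gap:
assert np.isclose(D[3, 1], gamma ** 9)
assert np.isclose(D[1, 0], gamma ** 1)
assert D[0, 2] == 0.0                    # future observations are masked out
```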


Figure 2. Two inference strategies for forecasting irregularly-sampled time series. (a) Trajectory-based inference. TimelyGPT autoregressively predicts the entire sequence at equal time intervals. The target intervals can then be taken from part of the inferred trajectory. (b) Time-specific inference. TimelyGPT directly predicts the target data point using historical hidden states and the gap between the target timestep and the last observed timestep.

For next-token pre-training, the Retention incorporates Δt_{n,n−1} into the recurrent state variable S_n ∈ ℝ^{d×d}:

S_n = γ^{Δt_{n,n−1}} S_{n−1} + K_n^⊤ V_n

(5)  Ret(X_n) = Q_n S_n

where the base case in this recurrent relation is S_1 = 0.

At inference time, to forecast irregularly-sampled time series, we consider two recurrent inference strategies, namely trajectory-based inference and time-specific inference (Fig. 2). Both strategies make predictions based on a look-up window. The former autoregressively predicts a trajectory at equal time intervals. The latter directly makes a prediction at a specific time point s_{n′} = (x_{n′}, t_{n′}). Specifically, knowing the target timestep t_{n′} and the last observed sample s_n = (x_n, t_n), TimelyGPT outputs the embedding of the target token Ret(X_{n′}) = Q_{n′} S_{n′}, taking into account the time gap Δt_{n′,n} = t_{n′} − t_n; the recurrent state then becomes S_{n′} = γ^{Δt_{n′,n}} S_n + K_n^⊤ V_n.
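
The time-specific update can be sketched as a single state transition. All dimensions and values below are hypothetical stand-ins (a random state and random projections rather than a trained model's); the sketch only shows that predicting far ahead costs the same one update as predicting one step ahead.

```python
# Hedged sketch of time-specific inference (Fig. 2b): decay the last recurrent
# state once by γ^{Δt} for the gap to the target timestep, then query it.
import numpy as np

rng = np.random.default_rng(2)
d, gamma = 4, 0.9

S_n = rng.standard_normal((d, d))        # state after the look-up window
K_n, V_n, Q_target = (rng.standard_normal(d) for _ in range(3))

def time_specific(S_n, K_n, V_n, Q_target, dt):
    """One O(1) jump: S_{n'} = γ^{Δt} S_n + K_n^T V_n, then Ret = Q_{n'} S_{n'}."""
    S_next = gamma ** dt * S_n + np.outer(K_n, V_n)
    return Q_target @ S_next

# Predicting 30 time units ahead costs the same single update as 1 unit ahead.
near = time_specific(S_n, K_n, V_n, Q_target, 1.0)
far = time_specific(S_n, K_n, V_n, Q_target, 30.0)
assert near.shape == (d,) and far.shape == (d,)
```

By contrast, trajectory-based inference would roll this update forward once per intermediate step, which is why its cost grows linearly with the forecast horizon.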

3.3. Convolution modules for local interaction

Convolution methods excel at identifying localized interactions from time series (LeCun and Bengio, 1998). As the first part of our third contribution, we propose a convolution-subsampling tokenizer for feature extraction from the raw time-series input (Fig. 1a). Briefly, it uses multiple 1-D convolution layers to condense the time dimension and extract local features of the time-series. The convolution-subsampling tokenizer consists of two 1-D convolution layers with kernel size 3 and stride 2, reducing the sequence length to 1/4. Unlike the prevalent patching technique, which merely segments adjacent timesteps and features (Nie et al., 2023), the convolution tokenizer effectively captures local temporal interactions. More details are provided in Appendix B.3.
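
The 4x subsampling described above can be illustrated with a bare-bones strided convolution. This sketch ignores channel mixing and learned filters (the smoothing kernel and toy signal are arbitrary); it only demonstrates how two kernel-3, stride-2 layers shrink the time dimension to roughly a quarter.

```python
# Minimal sketch of the convolution-subsampling tokenizer's length reduction.
import numpy as np

def conv1d(x, kernel, stride=2):
    """Valid 1-D convolution over a (T,) signal with the given stride."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(out_len)])

x = np.sin(np.linspace(0, 20, 2000))        # toy raw biosignal, T = 2000
kernel = np.array([0.25, 0.5, 0.25])        # illustrative smoothing kernel
tokens = conv1d(conv1d(x, kernel), kernel)  # two layers: T -> ~T/2 -> ~T/4
assert len(tokens) == 499                   # roughly 2000 / 4
```

Unlike fixed patching, each output token here is a weighted combination of overlapping neighborhoods, which is the "local temporal interaction" the tokenizer is meant to capture.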

As the second part of our third contribution, we propose a temporal convolution module using a depth-wise separable convolution (Chollet, 2017), sifting local temporal features from the time-series representations. As shown in Fig. 1d, this module starts with a layer normalization, followed by a 1-D depth-wise convolution and a point-wise convolution layer, with batch normalization and swish activation after the depth-wise convolution. Integrating convolution and attention allows TimelyGPT to extract global-local feature interactions (Wu et al., 2020; Gulati et al., 2020). By stacking multiple decoder layers, each with a convolution module, TimelyGPT discerns multi-scale features that characterize patterns across varying time scales (Tang et al., 2021).
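The core of the module, the depth-wise separable convolution, can be sketched in NumPy as follows. This is an illustrative sketch only: the layer/batch normalizations and the swish activation of the full module are omitted, and the function name and shapes are our choices.

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights):
    """Depth-wise separable 1-D convolution over a multichannel series.

    x          : (C, L) time-series representation
    dw_kernels : (C, k) one kernel per channel, k odd ("same" padding)
    pw_weights : (C_out, C) point-wise (1x1) channel-mixing weights
    """
    k = dw_kernels.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    # Depth-wise pass: each channel convolved with its own kernel, no mixing.
    # The kernel is flipped so np.convolve computes a cross-correlation.
    dw = np.stack([np.convolve(xp[c], dw_kernels[c][::-1], mode="valid")
                   for c in range(x.shape[0])])
    # Point-wise pass: a 1x1 convolution mixes channels at every timestep.
    return pw_weights @ dw
```

The factorization is what makes the module cheap: the depth-wise pass costs O(C·L·k) and the point-wise pass O(C_out·C·L), versus O(C_out·C·L·k) for a dense convolution (Chollet, 2017).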

3.4. Computational complexity

TimelyGPT with its efficient Retention mechanism achieves $O(N)$ training complexity and $O(1)$ inference complexity. In contrast, BERT and GPT incur $O(N^2)$ training complexity and $O(N)$ inference complexity (Katharopoulos et al., 2020). The vanilla attention mechanism in the Transformer, $\text{Attention}(X) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$, introduces a training complexity of $O(N^2 d)$. This quadratic computational bottleneck prevents standard Transformer models from modeling long sequences (i.e., $N \gg d$).

TimelyGPT achieves linear training complexity by following research on linear transformers (Katharopoulos et al., 2020). In the Retention mechanism, $\text{Ret}(X_n) = Q_n S_n$ with $S_n = K_n^{\top} V_n + \gamma S_{n-1}$, both $Q_n S_n$ and $K_n^{\top} V_n$ have $O(d^2)$ complexity. By recursively updating over $N$ timesteps, the total complexity becomes $O(Nd^2)$. For inference, TimelyGPT offers time-specific and trajectory-based methods. The trajectory-based inference recursively generates sequences at equally-spaced time intervals like the GPT model, incurring $O(N)$ inference complexity. In contrast, the time-specific inference directly predicts the target time point with $O(1)$ complexity.
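The equivalence between the $O(Nd^2)$ recurrence and the decayed attention sum it implements can be checked numerically. The sketch below uses a scalar decay and random projections; it illustrates the recurrence only, not the full multi-head Retention.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, gamma = 16, 4, 0.9
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

# Recurrent form: S_n = K_n^T V_n + gamma * S_{n-1}; Ret(X_n) = Q_n S_n
S = np.zeros((d, d))
rec_out = np.empty((N, d))
for n in range(N):
    S = np.outer(K[n], V[n]) + gamma * S   # O(d^2) per step -> O(N d^2) total
    rec_out[n] = Q[n] @ S

# Parallel form: Ret(X_n) = sum_{m <= n} gamma^(n-m) (Q_n . K_m) V_m
par_out = np.empty((N, d))
for n in range(N):
    w = np.array([gamma ** (n - m) * (Q[n] @ K[m]) for m in range(n + 1)])
    par_out[n] = w @ V[: n + 1]

assert np.allclose(rec_out, par_out)  # both forms agree
```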
Therefore, TimelyGPT achieves $O(N)$ training complexity and $O(1)$ inference complexity, making it computationally efficient and suitable for long sequences. We provide a detailed discussion of the computational bottleneck of the Transformer and efficient linear Transformers in Appendix A.1.

4. Data

4.1. Sleep-EDF dataset

The Sleep European Data Format (EDF) database, sourced from PhysioBank (Goldberger et al., 2000), contains sleep recordings from 153 healthy subjects (Kemp et al., 2000). These whole-night polysomnographic sleep recordings include 7 types of biosignals: electroencephalogram (EEG) from the Fpz-Cz and Pz-Oz electrode locations, electrooculogram (EOG), submental chin electromyogram (EMG), oro-nasal airflow, rectal body temperature, and an event marker. Both EEG and EOG signals were sampled at 100 Hz (i.e., 100 samples per second), while EMG and the other features were sampled at 1 Hz (i.e., 1 sample per second). Sleep patterns (hypnograms) were manually scored by trained technicians into five sleep stages. This biosignal dataset comprises a total of 1.2 billion timesteps, segmented into 300,700 sequences of 4,000 timesteps each, and provides large-scale continuous time-series data for training large models. In our experiment, we forecast all 7 biosignals.

4.2. PopHR database

The Population Health Record (PopHR) database hosts a massive amount of longitudinal claim data on health service use from the provincial government health insurer in Quebec, Canada (Régie de l’assurance maladie du Québec, RAMQ) (Shaban-Nejad et al., 2016; Yuan et al., 2018). In total, there are approximately 1.3 million participants in the PopHR database, representing a random 25% sample of the population in the metropolitan area of Montreal between 1998 and 2014. Cohort membership is maintained dynamically by removing deceased residents and actively enrolling newborns and immigrants. We extracted irregularly-sampled time series from the patient clinical records in the PopHR database. Specifically, we converted ICD-9 diagnostic codes to phenotype codes (PheCodes) using the expert-defined PheWAS catalog (Denny et al., 2013, 2010). We selected 315 unique PheCodes, each with over 50,000 token counts, and excluded patients who had fewer than 50 PheCode tokens. This resulted in a dataset of 489,000 patients, averaging 112 diagnosis records each.

5. Experiments

We first validated the scaling pattern of TimelyGPT, determining the optimal number of model parameters for different dataset sizes (Section 6.1). We then explored TimelyGPT’s extrapolation capabilities for long-term forecasting up to 6,000 timesteps in Sleep-EDF’s biosignal data, and analyzed extrapolation’s underlying mechanism through visualization (Section 6.2). Our evaluation extended forecasting to irregularly-sampled time series (Section 6.3). Furthermore, we conducted ablation studies to evaluate the contributions of various components (Section 6.4).

5.1. Pre-training and fine-tuning

During pre-training, TimelyGPT utilizes a next-token prediction task to learn general temporal representations from unlabeled data (Radford et al., 2019). Given a sequence with a [SOS] token, TimelyGPT predicts the subsequent tokens by shifting the sequence to the right. At the last layer, each token’s output representation is fed into a linear layer for next-token prediction. The pre-training loss is Mean Squared Error (MSE) for continuous signals (e.g., biosignal) and cross-entropy for discrete signals (e.g., diagnosis codes).
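The shifted-target setup can be illustrated with a toy NumPy example for the continuous case. The "model" here is a trivial copy-last-value predictor, purely to show how inputs, targets, and the MSE loss line up; it is not TimelyGPT.

```python
import numpy as np

seq = np.sin(np.linspace(0.0, 2.0 * np.pi, 101))  # toy continuous biosignal
inputs, targets = seq[:-1], seq[1:]               # shift right: predict x_{t+1} from x_t
preds = inputs                                    # placeholder "model" output
mse = float(np.mean((preds - targets) ** 2))      # pre-training loss (continuous case)
```

For discrete signals such as diagnosis codes, the linear head instead outputs logits over the vocabulary and `mse` is replaced by a cross-entropy loss.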

Among the Transformer baselines, PatchTST adopted a masking-based approach, masking 40% of its patches as zeros (Nie et al., 2023). CRT utilized a dropping-based pre-training, discarding up to 70% of patches (Zhang et al., 2023). For the Transformer models without established pre-training methods, we used a masking-based method that randomly masks 40% of timesteps (Zerveas et al., 2020). For downstream forecasting tasks, we employ end-to-end fine-tuning of the entire model, with the final linear layer producing the forecasts. All Transformer models performed 20 epochs of pre-training with MSE loss, followed by 5 epochs of end-to-end fine-tuning.
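The timestep-masking scheme used for the baselines without established pre-training amounts to zeroing out a random 40% of timestep columns and training the model to reconstruct them. A minimal sketch (the 40% ratio and column-wise masking follow the setup above; array shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 200))           # (channels, timesteps)
mask = rng.random(x.shape[1]) < 0.40    # select ~40% of timesteps
x_masked = x.copy()
x_masked[:, mask] = 0.0                 # masked timesteps are set to zero
# the model receives x_masked and is trained to reconstruct x at the masked positions
```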

5.2. Jointly forecasting multivariate biosignals from Sleep-EDF dataset

We utilized all seven features from the Sleep-EDF dataset for a multivariate forecasting task, applying standardization as preprocessing. The Sleep-EDF dataset was split into training (80%), validation (10%), and test (10%) sets. All models were pre-trained on the entire training set and fine-tuned on a randomly chosen 20% subset of the training data, with time-series data segmented into non-overlapping sequences. For pre-training, we chose an input length of 4,000 timesteps. For fine-tuning, we used a look-up window of 2,000 timesteps and varied forecasting windows of 720, 2,000, and 6,000 timesteps. We used mean absolute error (MAE) as the evaluation metric. We evaluated TimelyGPT against Informer (Zhou et al., 2021), Autoformer (Wu et al., 2022), FEDformer (Zhou et al., 2022), PatchTST (Nie et al., 2023), TimesNet (Wu et al., 2023), TS2Vec (Yue et al., 2022), and DLinear (Zeng et al., 2022). Based on the scaling law in Section 6.1, we set the model parameters for all transformers to around 18 million, with specific architectures and parameters detailed in Table S2.

5.3. Forecasting irregularly-sampled diagnostic codes from PopHR dataset

We assessed long-term forecasting of the irregularly-sampled time series extracted from the PopHR database. We divided the dataset into training (80%), validation (10%), and testing (10%) sets. We pre-trained on the entire training set and fine-tuned on a 20% subset of the training data. We used cross-entropy and top-$K$ recall to evaluate pre-training and fine-tuning, respectively. For forecasting, we set the look-up window to 50 timestamps and used the rest as the forecasting window, which can contain more than 100 timestamps (i.e., diagnosis codes).
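One common way to compute top-$K$ recall at a forecast step, which we assume here (the paper's exact aggregation over patients and steps may differ), is to take the $K$ highest-scoring PheCodes and measure what fraction of the observed codes they cover:

```python
import numpy as np

def top_k_recall(scores, truth, k=5):
    """scores: (n_codes,) predicted scores; truth: set of observed code indices."""
    top_k = set(np.argsort(scores)[::-1][:k])   # indices of the k highest scores
    return len(top_k & truth) / len(truth)      # fraction of truth covered
```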

For TimelyGPT, we separately evaluated the performance of trajectory-based and time-specific inference (Section 3.2). We compared against several transformer baselines, including Informer, FEDformer, Autoformer, and PatchTST, as well as models designed for irregularly-sampled time series, namely mTAND (Shukla and Marlin, 2021) and SeFT (Horn et al., 2020). Given that diagnoses are discrete values, there was no need to use the convolution-subsampling tokenizer for TimelyGPT. Furthermore, we specified a patch size of 2 for PatchTST, meaning that every two adjacent timestamps are projected into a single patch. Based on the scaling law in Section 6.1, we set the model parameters for all transformers to about 7.5 million, with specific architectures and parameters detailed in Table S2.

5.4. Model parameters

For all benchmark experiments, we tailored the architecture and parameters of TimelyGPT based on the scaling-law analysis (Section 6.1; Fig. 3). Specifically, TimelyGPT was configured with 18 million parameters for the Sleep-EDF dataset and 7.5 million parameters for the PopHR dataset. While different Transformer models may have unique optimal hyperparameters, optimizing each model's setup is computationally prohibitive with our current compute resources. For fairness of comparison, we compared TimelyGPT against all transformer baselines at the same model size (Table S2).

Image 3: Refer to caption

Figure 3. Test MAE of forecasting Sleep-EDF biosignals as a function of dataset sizes and parameter sizes. Both look-up and forecasting windows were set to 256 timesteps. TimelyGPT with more parameters tends to exhibit better performance when trained on larger datasets.

Image 4: Refer to caption

Figure 4. SleepEDF biosignal forecasting performances of TimelyGPT and seven state-of-the-art methods over various forecasting windows. a. MAE for the 8 methods evaluated over 3 forecasting windows (720, 2,000, and 6,000 timesteps). b. Cross-correlation scores for the same methods and forecasting windows. The detailed numerical results are summarized in Table 1.

6. Results

6.1. Scalability of TimelyGPT

We evaluated the scalability of TimelyGPT on the large-scale Sleep-EDF dataset to determine the optimal model parameters with respect to different dataset sizes (Kemp et al., 2000). We selected subsets of the Sleep-EDF dataset with timesteps ranging from $10^5$ to $10^9$, splitting each subset into training (80%), validation (10%), and testing (10%) sets. Both look-up and forecasting windows were set to 256 timesteps for this experiment. TimelyGPT's performance improves as parameter and dataset sizes increase (Fig. 3), which is attributed to its capacity to leverage more data, known as the scaling law for Transformers (Kaplan et al., 2020). We provide further discussion of the scaling patterns of existing Transformer models in Appendix A.3.

Table 1. Comparison of TimelyGPT as well as 7 baselines for long-term forecasting experiment on the large-scale SleepEDF dataset. Bold and underlined numbers indicate the best and second best results for each metric and window.

Image 5: Refer to caption

Figure 5. Predicted sequences of SleepEDF biosignals over 6,000 timesteps. Given a 2,000-timestep look-up window, we applied TimelyGPT (blue solid line) and 4 state-of-the-art methods (dashed lines) to predict the biosignals for the next 6,000 timesteps. The groundtruth biosignals are displayed as a red solid line. The two vertical lines demarcate the look-up window and the length of the pre-training sequences, respectively.

6.2. Forecasting multivariate Sleep-EDF biosignals

TimelyGPT achieved the best MAE in forecasting biosignals for all windows except the 720-timestep window (Table 1; Fig. 4a), for which PatchTST achieved the best MAE at 0.456 and TimelyGPT conferred comparable performance. DLinear was also effective for the 720-timestep forecasting window. However, as the forecasting window increased to 2,000 and 6,000 timesteps, both PatchTST and DLinear suffered performance drops due to their reliance on linear layers and inability to extrapolate beyond the training length. In contrast, pre-trained on 4,000 timesteps, TimelyGPT consistently maintained superior performance up to 6,000 timesteps given a short look-up window (i.e., prompt) of only 2,000 timesteps. Additionally, TimelyGPT consistently outperformed the other baselines across all three forecasting windows in terms of cross-correlation (Fig. 4b; Table 1). TimesNet was the second-best performer for these windows, but its performance declined as the window size grew due to the extrapolation issue. These results underscore TimelyGPT's extrapolation capabilities in long-term forecasting, aligning with findings in the NLP domain (Sun et al., 2022).
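The cross-correlation score reported in Fig. 4b can be illustrated with a zero-lag normalized variant. This definition is our assumption for illustration; the paper's exact lag handling may differ.

```python
import numpy as np

def cross_correlation(pred, true):
    """Zero-lag normalized cross-correlation between two 1-D series."""
    p = (pred - pred.mean()) / pred.std()
    t = (true - true.mean()) / true.std()
    return float(np.mean(p * t))  # 1.0 = perfect shape match, -1.0 = inverted
```

Unlike MAE, this score rewards matching the shape of the signal rather than its pointwise magnitude, which is why the two metrics are reported together.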

We visualized the biosignals predicted by TimelyGPT against the leading baselines (PatchTST and DLinear) and the ablated methods (GPT-2 and GPT-2 with RoPE), focusing on sleep stage transitions (Fig. 5). We utilized a 2,000-timestep look-up window and a 6,000-timestep forecasting window. Forecasting beyond 2,000 timesteps is marked as extrapolation, as it exceeds the training length. In the rectal temperature (i.e., a trend signal), TimelyGPT's forecast aligned well with the groundtruth, effectively capturing distinct trend patterns. Notably, the small bump in the prompt before the 1,000th timestep is a typical indicator of a temperature drop. Most models were able to capture it except DLinear, showing the benefits of pre-training. Beyond the training length of 4,000 timesteps, TimelyGPT demonstrated a clear advantage in accurately extrapolating the rise of the rectal temperature around the 7,000th timestep, while PatchTST and GPT fell behind. The superior extrapolation capabilities of TimelyGPT are attributable to its ability to capture long-term trends with the xPos embedding. In contrast, both PatchTST and vanilla GPT experienced a performance decline, likely due to their dependency on linear mapping as discussed in previous research (Li et al., 2023). Additionally, TimelyGPT exhibited superior extrapolation over the ablated baseline GPT+RoPE, highlighting its effective trend-pattern modeling for extrapolation. We also visualized the EEG periodic biosignal forecast and reached a similar conclusion (Fig. S3).

6.3. Forecasting patient diagnosis trajectory

Image 6: Refer to caption

Figure 6. The distribution of top-5 recall performance for TimelyGPT with two inference methods (time-specific and trajectory-based), compared to PatchTST and mTAND across three forecasting window sizes.

Image 7: Refer to caption

Figure 7. Visualization of a cancer patient’s medical trajectory from the PopHR dataset. a. Look-up and forecast windows. Matched predictions (solid circles) were identified when the top 5 predicted PheCodes contain the groundtruth. b. The top 5 predicted PheCodes for the final 5 timesteps of the subject.

We then applied TimelyGPT and the baseline methods to forecast 315 PheCodes for 489K patients from PopHR (Section 4.2). We evaluated performance using the average top-$K$ recall at each forecast window. TimelyGPT with time-specific inference outperformed the baselines, reaching the highest recall rates of 58.65% and 70.83% at $K=5$ and $K=10$, respectively (Table 2). At $K=15$, TimelyGPT achieved the second-highest recall with 82.69%. In addition, the time-specific inference outperformed the trajectory-based inference, highlighting the advantage of the time decay mechanism.

Table 2. Forecasting results of TimelyGPT and 6 baselines on PopHR's irregularly-sampled time series dataset. TimelyGPT with time-specific inference achieved the highest recall at $K=5$ and $K=10$, and the second highest at $K=15$, demonstrating its superior performance in long-term forecasting of irregularly-sampled time series.

We then examined the distributions of the top-5 recall rates at 3 forecast windows, comparing the two inference methods of TimelyGPT with the best transformer baseline PatchTST and the leading irregular time-series algorithm mTAND (Fig. 6). TimelyGPT's time-specific inference consistently outperformed trajectory-based inference as the forecasting window size increased. While both inference methods exhibited similar performance for predicting the first 50 timesteps, time-specific TimelyGPT demonstrated significantly better results beyond 50 timesteps. This improvement is likely because time-specific inference accounts for the evolving states and the query timestep in the time decay mechanism, enhancing its ability to predict the temporal evolution of healthcare trajectories over irregular intervals. As expected, all models experienced a performance decline when predicting further into the future because of the increasing uncertainty. Despite this, TimelyGPT maintained higher and more stable performance within the first 100 steps compared to PatchTST and mTAND. Although mTAND closely followed time-specific TimelyGPT for the first 50 timesteps, its performance declined drastically as the forecasting window increased, reflecting its difficulty with extrapolation. These findings highlight the utility of the proposed time-specific inference in leveraging the time-decay mechanism to handle irregularly-sampled time series for long-term forecasting.

We visualized the observed and predicted trajectory of a patient with neoplasm and genitourinary diseases (Fig. 7). TimelyGPT with time-specific inference produced a high top-5 recall rate of 85.7% on this patient. Indeed, most of the observed codes were among the top 5 codes predicted by the time-specific TimelyGPT. Zooming into the forecast window (Fig. 7b), TimelyGPT accurately predicted PheCode 590.0 (Pyelonephritis) three times around the age of 61. TimelyGPT also predicted PheCode 740.9 at age 61 with high probability, a code which appeared twice at ages 52 and 53 in the look-up window. Therefore, TimelyGPT demonstrates a promising direction for forecasting patient health states despite the challenges inherent in modeling irregularly-sampled longitudinal EHR data.

Table 3. Ablation results of TimelyGPT w/o specific components, showing forecasting performance for a 6,000-timestep window in the Sleep-EDF dataset and top-15 recall rate in the PopHR dataset.

6.4. Ablation study

To assess the contributions of the various components in TimelyGPT, we conducted ablation studies by omitting the key components, including the convolution-subsampling tokenizer, the temporal convolution module, exponential decay, and the RoPE relative position embedding. Notably, removing all components results in a vanilla GPT-2. Since the exponential decay in xPos is built on RoPE, removing RoPE also removes the decay, so the two cannot be ablated fully independently. Additionally, we also ablated the pre-training strategy by training TimelyGPT from scratch on the forecasting tasks. The ablation studies focused on downstream forecasting experiments using the Sleep-EDF and PopHR datasets, corresponding to continuous biosignals and irregularly-sampled time series, respectively. We conducted the ablation on long-term forecasting of 6,000 timesteps in the Sleep-EDF dataset and evaluated top-5 recall scores in the PopHR dataset.

As shown in Table 3, for the Sleep-EDF forecasting task, removing the RoPE component led to the most significant performance degradation (an MAE increase of 0.357). Removing the exponential decay also increased MAE by 0.134, demonstrating its benefit of encoding trend patterns for long-term forecasting. Together, these two ablation experiments show the importance of xPos as our first main contribution (Section 3.1). The integration of convolution modules helps TimelyGPT capture local features, although the benefit was smaller compared with the other components.

In forecasting irregularly-sampled time series, the exponential decay and RoPE components improved performance by 6.15% and 2.32%, respectively. The time decay mechanism encodes trend patterns into the modeling of patients' health trajectories, making it a promising approach for forecasting irregular clinical diagnoses. Pre-training decreased MAE by 0.066 for forecasting continuous biosignals in Sleep-EDF and increased the top-$K$ recall rate by 2.21% for forecasting irregularly-sampled diagnostic codes.

7. Conclusion and Future Work

TimelyGPT effectively forecasts long time-series sequences by utilizing the xPos embedding, recurrent attention, and convolution modules. For continuously monitored biosignals such as Sleep-EDF, TimelyGPT can accurately extrapolate up to 6,000 timesteps given only a 2,000-timestep prompt. Moreover, TimelyGPT also effectively forecasts irregularly-sampled time series by conditioning the recurrent Retention on the query time. In future work, we will perform a comprehensive, in-depth analysis of trajectory inference on EHR data, as it may have a profound impact on the future of patient care and early intervention. TimelyGPT is a causal model with unidirectional attention (Press et al., 2022), which may limit its expressiveness for time-series representation learning and could be improved via a bidirectional architecture. To enhance transfer learning, we will adapt TimelyGPT to out-of-distribution biosignals, further enhancing its utility for healthcare time series.

References

  • Ahuja et al. (2022) Yuri Ahuja, Yuesong Zou, Aman Verma, David Buckeridge, and Yue Li. 2022. MixEHR-Guided: A guided multi-modal topic modeling approach for large-scale automatic phenotyping using the electronic health record. Journal of biomedical informatics 134 (2022), 104190.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165[cs.CL]
  • Chollet (2017) François Chollet. 2017. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv:1610.02357[cs.CV]
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv:1901.02860[cs.LG]
  • Denny et al. (2013) Joshua Denny, Lisa Bastarache, Marylyn Ritchie, Robert Carroll, Raquel Zink, Jonathan Mosley, Julie Field, Jill Pulley, Andrea Ramirez, Erica Bowton, Melissa Basford, David Carrell, Peggy Peissig, Abel Kho, Jennifer Pacheco, Luke Rasmussen, David Crosslin, Paul Crane, Jyotishman Pathak, and Dan Roden. 2013. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature biotechnology 31 (11 2013). https://doi.org/10.1038/nbt.2749
  • Denny et al. (2010) Joshua C. Denny, Marylyn D. Ritchie, Melissa A. Basford, Jill M. Pulley, Lisa Bastarache, Kristin Brown-Gentry, Deede Wang, Dan R. Masys, Dan M. Roden, and Dana C. Crawford. 2010. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 26, 9 (03 2010), 1205–1210. https://doi.org/10.1093/bioinformatics/btq126
  • Eldele et al. (2021) Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan. 2021. Time-Series Representation Learning via Temporal and Contextual Contrasting. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. 2352–2359.
  • Goldberger et al. (2000) A.L. Goldberger, L. Amaral, L. Glass, J. Hausdorff, P.C. Ivanov, R. Mark, J.E. Mietus, G.B. Moody, C.K. Peng, and H.E. Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, 23 (2000), e215–e220. https://doi.org/10.1161/01.cir.101.23.e215
  • Gu et al. (2022) Albert Gu, Karan Goel, and Christopher Ré. 2022. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv:2111.00396[cs.LG]
  • Gulati et al. (2020) Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. 5036–5040. https://doi.org/10.21437/Interspeech.2020-3015
  • Horn et al. (2020) Max Horn, Michael Moor, Christian Bock, Bastian Rieck, and Karsten Borgwardt. 2020. Set Functions for Time Series. arXiv:1909.12064[cs.LG]
  • Jiang et al. (2023) Jiawei Jiang, Chengkai Han, Wayne Xin Zhao, and Jingyuan Wang. 2023. PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction. arXiv:2301.07945[cs.LG]
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361[cs.LG]
  • Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236[cs.LG]
  • Kemp et al. (2000) B. Kemp, A.H. Zwinderman, B. Tuk, H.A.C. Kamphuisen, and J.J.L. Oberye. 2000. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Transactions on Biomedical Engineering 47, 9 (2000), 1185–1194. https://doi.org/10.1109/10.867928
  • LeCun and Bengio (1998) Yann LeCun and Yoshua Bengio. 1998. Convolutional Networks for Images, Speech, and Time Series. MIT Press, Cambridge, MA, USA, 255–258.
  • Li et al. (2023) Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. 2023. Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping. arXiv:2305.10721[cs.LG]
  • Liu et al. (2021) Andy T. Liu, Shang-Wen Li, and Hung yi Lee. 2021. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 2351–2366. https://doi.org/10.1109/taslp.2021.3095662
  • Liu et al. (2020) Andy T. Liu, Shu wen Yang, Po-Han Chi, Po chun Hsu, and Hung yi Lee. 2020. Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. https://doi.org/10.1109/icassp40776.2020.9054458
  • Ma et al. (2023) Qianli Ma, Zhen Liu, Zhenjing Zheng, Ziyang Huang, Siying Zhu, Zhongzhong Yu, and James T. Kwok. 2023. A Survey on Time-Series Pre-Trained Models. arXiv:2305.10716[cs.LG]
  • Nie et al. (2023) Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2023. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. arXiv:2211.14730[cs.LG]
  • Padhi et al. (2021) Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef Mroueh, Pierre Dognin, Jerret Ross, Ravi Nair, and Erik Altman. 2021. Tabular Transformers for Modeling Multivariate Time Series. arXiv:2011.01843[cs.LG]
  • Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. arXiv:2306.01116[cs.CL]
  • Peng et al. (2023) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, and Rui-Jie Zhu. 2023. RWKV: Reinventing RNNs for the Transformer Era. arXiv:2305.13048[cs.CL]
  • Phan et al. (2021) Huy Phan, Oliver Y. Chen, Minh C. Tran, Philipp Koch, Alfred Mertins, and Maarten De Vos. 2021. XSleepNet: Multi-View Sequential Model for Automatic Sleep Staging. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1–1. https://doi.org/10.1109/tpami.2021.3070057
  • Poli et al. (2023) Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. 2023. Hyena Hierarchy: Towards Larger Convolutional Language Models. arXiv:2302.10866[cs.LG]
  • Press et al. (2022) Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv:2108.12409[cs.CL]
  • Radford et al. (2022) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356[eess.AS]
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. https://api.semanticscholar.org/CorpusID:160025533
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683[cs.LG]
  • Reiss et al. (2019) Attila Reiss, Ina Indlekofer, Philip Schmidt, and Kristof Van Laerhoven. 2019. Deep PPG: Large-Scale Heart Rate Estimation with Convolutional Neural Networks. Sensors 19, 14 (2019). https://doi.org/10.3390/s19143079
  • Shaban-Nejad et al. (2016) Arash Shaban-Nejad, Maxime Lavigne, Anya Okhmatovskaia, and David Buckeridge. 2016. PopHR: a knowledge-based platform to support integration, analysis, and visualization of population health data: The Population Health Record (PopHR). Annals of the New York Academy of Sciences 1387 (10 2016). https://doi.org/10.1111/nyas.13271
  • Shao et al. (2022) Zezhi Shao, Zhao Zhang, Fei Wang, and Yongjun Xu. 2022. Pre-training Enhanced Spatial-temporal Graph Neural Network for Multivariate Time Series Forecasting. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM. https://doi.org/10.1145/3534678.3539396
  • Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, 464–468. https://doi.org/10.18653/V1/N18-2074
  • Shukla and Marlin (2021) Satya Narayan Shukla and Benjamin M. Marlin. 2021. Multi-Time Attention Networks for Irregularly Sampled Time Series. arXiv:2101.10318[cs.LG]
  • Song et al. (2023) Yonghao Song, Qingqing Zheng, Bingchuan Liu, and Xiaorong Gao. 2023. EEG Conformer: Convolutional Transformer for EEG Decoding and Visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2023), 710–719. https://doi.org/10.1109/TNSRE.2022.3230250
  • Song et al. (2022) Ziyang Song, Yuanyi Hu, Aman Verma, David L. Buckeridge, and Yue Li. 2022. Automatic Phenotyping by a Seed-guided Topic Model. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery, New York, NY, USA, 4713–4723. https://doi.org/10.1145/3534678.3542675
  • Stirling et al. (2020) Rachel Stirling, Mark Cook, David Grayden, and Pip Karoly. 2020. Seizure forecasting and cyclic control of seizures. Epilepsia 62 Suppl 1 (07 2020). https://doi.org/10.1111/epi.16541
  • Su et al. (2022) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2022. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864[cs.CL]
  • Sun et al. (2023) Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive Network: A Successor to Transformer for Large Language Models. arXiv:2307.08621[cs.CL]
  • Sun et al. (2022) Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. 2022. A Length-Extrapolatable Transformer. arXiv:2212.10554[cs.CL]
  • Tang et al. (2021) Wensi Tang, Guodong Long, Lu Liu, Tianyi Zhou, Michael Blumenstein, and Jing Jiang. 2021. Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification. In International Conference on Learning Representations.
  • Tolstikhin et al. (2021) Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. 2021. MLP-Mixer: An all-MLP Architecture for Vision. arXiv:2105.01601[cs.CV]
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971[cs.CL]
  • Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762[cs.CL]
  • Woo et al. (2022) Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. 2022. ETSformer: Exponential Smoothing Transformers for Time-series Forecasting. arXiv:2202.01381[cs.LG]
  • Wu et al. (2023) Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2023. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In International Conference on Learning Representations.
  • Wu et al. (2022) Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. 2022. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. arXiv:2106.13008[cs.LG]
  • Wu et al. (2020) Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite Transformer with Long-Short Range Attention. arXiv:2004.11886[cs.CL]
  • Yang et al. (2020) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2020. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237[cs.CL]
  • Yuan et al. (2018) Mengru Yuan, Guido Powell, Maxime Lavigne, Anya Okhmatovskaia, and David Buckeridge. 2018. Initial Usability Evaluation of a Knowledge-Based Population Health Information System: The Population Health Record (PopHR). AMIA … Annual Symposium proceedings. AMIA Symposium 2017 (04 2018), 1878–1884.
  • Yue et al. (2022) Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. 2022. TS2Vec: Towards Universal Representation of Time Series. Proceedings of the AAAI Conference on Artificial Intelligence 36 (Jun. 2022), 8980–8987. https://doi.org/10.1609/aaai.v36i8.20881
  • Zeng et al. (2022) Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. 2022. Are Transformers Effective for Time Series Forecasting? arXiv:2205.13504[cs.AI]
  • Zerveas et al. (2020) George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. 2020. A Transformer-based Framework for Multivariate Time Series Representation Learning. arXiv:2010.02803[cs.LG]
  • Zhai et al. (2022) Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2022. Scaling Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12104–12113.
  • Zhang et al. (2023) Wenrui Zhang, Ling Yang, Shijia Geng, and Shenda Hong. 2023. Self-Supervised Time Series Representation Learning via Cross Reconstruction Transformer. arXiv:2205.09928[cs.LG]
  • Zhang et al. (2022) Xiang Zhang, Marko Zeman, Theodoros Tsiligkaridis, and Marinka Zitnik. 2022. Graph-Guided Network for Irregularly Sampled Multivariate Time Series. arXiv:2110.05357[cs.LG]
  • Zhao et al. (2022) Liang Zhao, Min Gao, and Zongwei Wang. 2022. ST-GSP: Spatial-Temporal Global Semantic Representation Learning for Urban Flow Prediction. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (Virtual Event, AZ, USA) (WSDM ’22). Association for Computing Machinery, New York, NY, USA, 1443–1451. https://doi.org/10.1145/3488560.3498444
  • Zhou et al. (2021) Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. arXiv:2012.07436[cs.LG]
  • Zhou et al. (2022) Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. 2022. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. arXiv:2201.12740[cs.LG]

Appendix A Revisiting Transformers

A.1. Efficient attention in Transformer

Transformer models have found extensive applications in both the Natural Language Processing and Computer Vision domains (Vaswani et al., 2023). In the vanilla self-attention mechanism, the query, key, and value matrices are denoted as $\bm{Q}, \bm{K}, \bm{V} \in \mathbb{R}^{N \times d}$. The output embedding for the $n$-th token is

$$\bm{O}_n = \frac{\sum_{m}^{N} \text{sim}(\bm{Q}_n, \bm{K}_m)\,\bm{V}_m}{\sum_{m}^{N} \text{sim}(\bm{Q}_n, \bm{K}_m)},$$

where the similarity function is the softmax of the inner product, $\text{sim}(\bm{Q}_n, \bm{K}_m) = \exp(\bm{Q}_n \bm{K}_m^\top / \sqrt{d})$. The self-attention mechanism, also known as a token-mixer, integrates information from every token and thus captures global-range interactions. However, computing the dot product $\bm{Q}_n \bm{K}_m^\top$ before the softmax operation incurs a computational complexity of $O(N^2 d)$. As sequence length increases, this quadratic complexity becomes a bottleneck, making it challenging to train on longer sequences. Many studies have been proposed to address the quadratic cost of self-attention. Linear attention replaces the softmax term $\text{sim}(\bm{Q}_n, \bm{K}_m)$ with $\phi(\bm{Q}_n)\phi(\bm{K}_m^\top)$ for a nonlinear kernel function $\phi(\cdot)$ (Katharopoulos et al., 2020), avoiding the quadratic computation.
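The quadratic-versus-linear contrast above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the feature map `phi` is a simple ReLU(x)+1 stand-in for the ELU(x)+1 kernel of Katharopoulos et al. (2020), and all shapes are arbitrary.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Vanilla self-attention: materializes the N x N score matrix, O(N^2 d)."""
    d = Q.shape[-1]
    scores = np.exp(Q @ K.T / np.sqrt(d))           # sim(Q_n, K_m)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1.0):
    """Kernelized attention: sim(Q_n, K_m) -> phi(Q_n) phi(K_m)^T, cost O(N d^2)."""
    S = phi(K).T @ V                                 # (d, d) summary; no N x N matrix
    Z = phi(K).sum(axis=0)                           # (d,) normalizer
    return (phi(Q) @ S) / (phi(Q) @ Z)[:, None]

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)
```

The key point is that the linear variant never forms the $N \times N$ score matrix; it compresses the keys and values into the $d \times d$ summary `S` before the queries touch them.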

Recent research has explored alternatives to the attention token-mixer, including the Multi-Layer Perceptron (MLP) (Tolstikhin et al., 2021), convolution (Poli et al., 2023), and RNNs (Peng et al., 2023; Sun et al., 2023). In particular, RNN-variant models like RWKV and RetNet have successfully scaled to more than 14 billion parameters, yielding performance comparable to conventional transformers. A fascinating connection between linear attention and RNNs has been identified (Katharopoulos et al., 2020), making an RNN-based token mixer as efficient as linear attention. The output embedding of linear attention can be recast as an RNN:

$$\bm{O}_n = \frac{\phi(\bm{Q}_n)\sum_{m}^{N}\phi(\bm{K}_m^\top)\bm{V}_m}{\phi(\bm{Q}_n)\sum_{m}^{N}\phi(\bm{K}_m^\top)} = \frac{\phi(\bm{Q}_n)\,\bm{S}_n}{\phi(\bm{Q}_n)\,\bm{Z}_n}, \quad \text{where} \quad \bm{S}_n = \sum_{m}^{N}\phi(\bm{K}_m^\top)\bm{V}_m, \quad \bm{Z}_n = \sum_{m}^{N}\phi(\bm{K}_m^\top).$$

Thus, the output embedding $\bm{O}_n$ depends on both $\bm{S}_n$ and $\bm{Z}_n$, which are incrementally updated through cumulative sums. The RNN-based token-mixer is therefore not only competitive in performance but also offers linear training complexity and constant-cost inference. By employing an exponential decay mechanism, it diminishes the influence of distant positions, transitioning from "token-mixing" to "time-mixing". Considering RNNs' historical effectiveness in the time-series and audio domains, they stand out as an excellent choice for temporal modeling.
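The recurrent view can be sketched as follows: `S` and `Z` play the roles of the cumulative sums $\bm{S}_n$ and $\bm{Z}_n$, updated one timestep at a time. This is a minimal causal sketch, again assuming a hypothetical ReLU+1 feature map rather than any particular paper's implementation.

```python
import numpy as np

def recurrent_linear_attention(Q, K, V, phi):
    """Causal linear attention as an RNN: S and Z are running cumulative sums."""
    N, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # S_n = sum_{m<=n} phi(K_m)^T V_m
    Z = np.zeros(d)                  # Z_n = sum_{m<=n} phi(K_m)^T
    out = np.empty_like(V)
    for n in range(N):
        S += np.outer(phi(K[n]), V[n])   # constant-cost state update
        Z += phi(K[n])
        out[n] = (phi(Q[n]) @ S) / (phi(Q[n]) @ Z)
    return out

phi = lambda x: np.maximum(x, 0) + 1.0   # simple positive feature map (assumption)
rng = np.random.default_rng(0)
N, d = 32, 4
Q, K, V = rng.normal(size=(3, N, d))
out = recurrent_linear_attention(Q, K, V, phi)
```

Because each step only touches the fixed-size state `(S, Z)`, inference cost per token is constant regardless of how long the sequence grows.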

A.2. Time-series Transformer

Transformers are increasingly applied to LTSF tasks, owing to their capability to capture long-term temporal dependencies (Zhou et al., 2021; Wu et al., 2022; Zhou et al., 2022; Woo et al., 2022; Nie et al., 2023). Researchers have modified transformers with custom attention modules to address complex temporal dependencies (Zhou et al., 2021; Wu et al., 2022; Zhou et al., 2022). Studies such as (Wu et al., 2022; Zhou et al., 2022; Woo et al., 2022) have introduced time-decomposition techniques into attention mechanisms to bolster modeling capability. The majority of studies adopt an encoder-decoder architecture coupled with a one-forward prediction framework (Zhou et al., 2021), in which the decoder takes a concatenated input of the context (or prompt) and placeholder forecasting windows, directly generating the resulting embedding without autoregressive decoding. These models thereby avoid the error accumulation seen in autoregressive frameworks, but their performance aligns closely with that of simple linear models (Zeng et al., 2022). Encoder-only models, like PatchTST, perform forecasting from the encoded embedding with the help of a linear layer (Nie et al., 2023). Additionally, self-supervised representation learning techniques for time series, such as TS2Vec and TimesNet, offer valuable representation learning capabilities for forecasting tasks (Yue et al., 2022; Wu et al., 2023).

A.3. Transformer scaling law in time-series

Despite the broad application of transformer-based models to time-series data such as speech (Gulati et al., 2020; Radford et al., 2022), biosignals (Song et al., 2023), and traffic flow (Shao et al., 2022; Jiang et al., 2023), their effectiveness in capturing temporal dependencies for the LTSF task has been limited, often underperforming linear models (Zeng et al., 2022). As Table S1 indicates, time-series transformer models often have far more parameters than timesteps in their datasets, with only two exceptions, namely the large-size Conformer and CRT. Such disparities imply that many transformers may be over-parameterized, leading to highly variable performance. In Section 6.1, our study validates the Transformer scaling law in the time-series domain (i.e., scaling up both model parameters and dataset size improves performance) (Kaplan et al., 2020; Zhai et al., 2022). For all benchmark experiments, our proposed TimelyGPT effectively pre-trains on large-scale data with model parameters aligned to this scaling law.

Table S1. The model parameters and utilized datasets of time-series transformers and comparison methods. These setups are sourced from the original papers and default implementations. Over-parameterization indicates that model parameters greatly exceed ($\gg$) the dataset size (in timesteps).


Figure S1. The xPos embedding diminishes distant temporal information according to the relative distance, enabling decomposition to capture both trend and periodic dynamics in time-series data.

Appendix B Details about TimelyGPT

B.1. From absolute to relative position embedding

Unlike RNNs or CNNs, the Transformer model requires a positional embedding: its permutation-invariant self-attention mechanism cannot capture input order, making it challenging to differentiate tokens at different positions. The solutions fall into two categories: (1) incorporating position information into the inputs, i.e., absolute position embedding; and (2) modifying the attention matrix to distinguish tokens at different positions, i.e., relative position embedding.

In absolute position embedding, the representation of a given token $n$ consists of a word embedding $\bm{X}_n$ and a position embedding $\bm{P}_n$. The self-attention mechanism is expressed as:

$$\bm{Q}_n = (\bm{X}_n + \bm{P}_n)\bm{W}_Q, \quad \bm{K}_n = (\bm{X}_n + \bm{P}_n)\bm{W}_K, \quad \bm{V}_n = (\bm{X}_n + \bm{P}_n)\bm{W}_V \tag{6}$$

$$\bm{A}_{n,m} = \text{softmax}(\bm{Q}_n \bm{K}_m^\top), \quad \bm{O}_n = \sum_{m} \bm{A}_{n,m} \bm{V}_m$$

where $\bm{A}_{n,m}$ is the attention score between tokens $n$ and $m$ without scaling. The inner product $\bm{Q}_n \bm{K}_m^\top$ and the output embedding $\bm{O}_n$ can be expanded as follows:

$$\begin{aligned}
\bm{Q}_n \bm{K}_m^\top &= (\bm{X}_n + \bm{P}_n)\bm{W}_Q \left((\bm{X}_m + \bm{P}_m)\bm{W}_K\right)^\top \\
&= (\bm{X}_n + \bm{P}_n)\bm{W}_Q \bm{W}_K^\top (\bm{X}_m + \bm{P}_m)^\top \\
&= \underbrace{\bm{X}_n \bm{W}_Q \bm{W}_K^\top \bm{X}_m^\top}_{\text{token-token}}
 + \underbrace{\bm{X}_n \bm{W}_Q \bm{W}_K^\top \bm{P}_m^\top}_{\text{token-position}}
 + \underbrace{\bm{P}_n \bm{W}_Q \bm{W}_K^\top \bm{X}_m^\top}_{\text{position-token}}
 + \underbrace{\bm{P}_n \bm{W}_Q \bm{W}_K^\top \bm{P}_m^\top}_{\text{position-position}}
\end{aligned} \tag{7}$$

$$\bm{O}_n = \sum_{m} \text{softmax}\left((\bm{X}_n \bm{W}_Q + \bm{P}_n \bm{W}_Q)(\bm{W}_K^\top \bm{X}_m^\top + \bm{W}_K^\top \bm{P}_m^\top)\right)(\bm{X}_m + \bm{P}_m)\bm{W}_V \tag{8}$$

where attention arises from four types of interactions: (1) token-token; (2) token-position; (3) position-token; and (4) position-position. However, absolute position embedding only incorporates fixed position information, neglecting the relative positional difference between tokens $n$ and $m$.
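The four-term decomposition in Eq. (7) can be verified numerically. The sketch below uses random matrices; the names `X`, `P`, `W_Q`, `W_K` and the sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 16, 8                      # illustrative sizes
X = rng.normal(size=(N, d))       # token embeddings X_n
P = rng.normal(size=(N, d))       # absolute position embeddings P_n
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

Q = (X + P) @ W_Q
K = (X + P) @ W_K
full = Q @ K.T                    # attention logits Q_n K_m^T

A = W_Q @ W_K.T                   # shared bilinear form
decomposed = (X @ A @ X.T         # token-token
            + X @ A @ P.T         # token-position
            + P @ A @ X.T         # position-token
            + P @ A @ P.T)        # position-position
```

Expanding the product $(\bm{X}+\bm{P})\bm{W}_Q \bm{W}_K^\top (\bm{X}+\bm{P})^\top$ yields exactly these four matrices, so `full` and `decomposed` agree up to floating-point rounding.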

In the realm of audio processing, prevalent transformers like Conformer (Gulati et al., 2020) incorporate relative positional information through the T5 position embedding (Raffel et al., 2020). Notably, the T5 model assumes minimal interaction between tokens and positions, excluding the token-position and position-token terms from the attention matrix:

$$\bm{Q}_n \bm{K}_m^\top = \bm{X}_n \bm{W}_Q \bm{W}_K^\top \bm{X}_m^\top + \beta_{n,m} \tag{9}$$

where the position-position interaction term $\bm{P}_n \bm{W}_Q \bm{W}_K^\top \bm{P}_m^\top$ is replaced with a trainable bias $\beta_{n,m}$ that depends on positions $n$ and $m$. The T5 position embedding follows Transformer-XL in omitting the position term $\bm{P}_m \bm{W}_V$ from the attentive aggregation (Dai et al., 2019; Yang et al., 2020). As a result, the relative position embedding is only added to the dot product $\bm{Q}\bm{K}^\top$:

$$\bm{O}_n = \sum_{m} \text{softmax}\left(\bm{X}_n \bm{W}_Q \bm{W}_K^\top \bm{X}_m^\top + \beta_{n,m}\right) \bm{X}_m \bm{W}_V \tag{10}$$
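A hedged sketch of this design: a bias table with one scalar per relative offset is added to the attention logits, and no position term touches the values. Note that the real T5 embedding buckets offsets and learns per-head biases; the simplification below keeps only the structure described above, with all names and sizes illustrative.

```python
import numpy as np

def t5_style_attention(X, W_Q, W_K, W_V, bias_table):
    """Attention with logits X W_Q W_K^T X^T + beta_{n,m}, where beta depends
    only on the relative offset n - m; values carry no position term."""
    N = X.shape[0]
    logits = (X @ W_Q) @ (X @ W_K).T
    offsets = np.arange(N)[:, None] - np.arange(N)[None, :]  # n - m in [-(N-1), N-1]
    beta = bias_table[offsets + N - 1]                       # one scalar per offset
    logits = logits + beta
    w = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ (X @ W_V)

rng = np.random.default_rng(2)
N, d = 12, 6
X = rng.normal(size=(N, d))
W_Q, W_K, W_V = rng.normal(size=(3, d, d))
bias_table = rng.normal(size=2 * N - 1)                      # stand-in for learned biases
out = t5_style_attention(X, W_Q, W_K, W_V, bias_table)
```

Because `beta` is indexed only by `n - m`, two token pairs at the same relative distance always receive the same positional contribution, which is the defining property of a relative position embedding.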

The RoPE technique leverages the properties of rotation matrices to model positional information (Su et al., 2022). To incorporate relative position information into the queries $\bm{Q}$ and keys $\bm{K}$, the method seeks functions $f_{\bm{Q}}(\bm{Q}, \cdot)$ and $f_{\bm{K}}(\bm{K}, \cdot)$ that satisfy the following invariance with respect to relative distance:

$$\left\langle{\bm{Q}}_n,{\bm{K}}_m\right\rangle=\left\langle f_{{\bm{Q}}}({\bm{Q}},n),f_{{\bm{K}}}({\bm{K}},m)\right\rangle=g({\bm{Q}},{\bm{K}},m-n), \tag{11}$$

where $g$ is a function that depends only on the relative distance $m-n$, and ${\bm{Q}}={\bm{X}}{\bm{W}}_Q$ and ${\bm{K}}={\bm{X}}{\bm{W}}_K$ denote the token embeddings for the queries and keys matrices, respectively. RoPE defines the function $f$ using a $d$-dimensional rotation matrix ${\bm{R}}$:

$$f_{{\bm{Q}}}({\bm{Q}},n)={\bm{R}}^{d}_{\Theta,n}({\bm{X}}_n{\bm{W}}_Q),\quad f_{{\bm{K}}}({\bm{K}},m)={\bm{R}}^{d}_{\Theta,m}({\bm{X}}_m{\bm{W}}_K) \tag{12}$$

With a given hidden size $d$, the block-diagonal matrix ${\bm{R}}^{d}_{\Theta,n}$ contains multiple rotation matrices $({\bm{R}}^{(1)}_{n,\theta_1},\dots,{\bm{R}}^{(d/2)}_{n,\theta_{d/2}})$ on its diagonal:

$${\bm{R}}^{d}_{\Theta,n}=\begin{bmatrix}{\bm{R}}^{(1)}_{n,\theta_1}&&\\&\ddots&\\&&{\bm{R}}^{(d/2)}_{n,\theta_{d/2}}\end{bmatrix},\quad {\bm{R}}^{(i)}_{n,\theta_i}=\begin{bmatrix}\cos n\theta_i&-\sin n\theta_i\\ \sin n\theta_i&\cos n\theta_i\end{bmatrix} \tag{13}$$

where the rotation hyperparameter $\theta_i=10000^{-2(i-1)/d}$. In RoPE, any even-dimensional representation can be built by placing multiple 2-dimensional rotation matrices diagonally within the ${\bm{R}}^{d}_{\Theta,n}$ matrix, expanding the hidden size from 2 dimensions to $d$ dimensions. As ${\bm{R}}^{d}_{\Theta,m-n}=({\bm{R}}^{d}_{\Theta,n})^{\top}{\bm{R}}^{d}_{\Theta,m}$, RoPE satisfies the property outlined in Eq. 11:

$$\begin{aligned}
\left\langle{\bm{Q}}_n,{\bm{K}}_m\right\rangle&=\sum_{i=1}^{d/2}\left\langle{\bm{Q}}_n[2i-1:2i],{\bm{K}}_m[2i-1:2i]\right\rangle\\
&=\sum_{i=1}^{d/2}{\bm{R}}^{d}_{\theta_i,m-n}\left\langle({\bm{X}}_n{\bm{W}}_Q)[2i-1:2i],({\bm{X}}_m{\bm{W}}_K)[2i-1:2i]\right\rangle
\end{aligned} \tag{14}$$

In RoPE, relative position information is added to the inner product ${\bm{Q}}{\bm{K}}^{\top}$ by rotating the angles of the queries and keys matrices. Recently, (Sun et al., 2022) argued that the sinusoids used in the rotation matrices do not change monotonically; instead, they oscillate dramatically as the relative distance increases. This limitation hinders RoPE's ability to handle sequences of extended lengths. To address it, (Sun et al., 2022) proposed xPos, which preserves the advantages of RoPE and behaves stably over long-term dependencies by measuring position monotonicity.
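Under the standard $\theta_i$ schedule above, the relative-position property of Eq. 11 can be checked numerically. The minimal NumPy sketch below (the helper `rope_rotate` is our illustration, not the authors' implementation) rotates queries and keys by their absolute positions and shows that the inner product depends only on $m-n$:

```python
import numpy as np

def rope_rotate(x, pos):
    """Apply the block-diagonal rotation R^d_{Theta,pos} to a d-dim vector (d even)."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)   # theta_i = 10000^{-2(i-1)/d}
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = cos * x1 - sin * x2   # 2-D rotation applied to each pair of dims
    out[1::2] = sin * x1 + cos * x2
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# <q_n, k_m> depends only on the relative distance m - n (Eq. 11)
a = rope_rotate(q, 3) @ rope_rotate(k, 7)     # n=3,  m=14-10=... here m-n = 4
b = rope_rotate(q, 10) @ rope_rotate(k, 14)   # n=10, m=14, again m-n = 4
assert np.allclose(a, b)
```

Because each 2-D block is a pure rotation, the transform also preserves vector norms, which keeps attention scores in a stable range.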

B.2. Equivalence of three forward-pass Retention

According to Section 3.2, the parallel forward-pass is equivalent to the recurrent forward-pass. With the initial state variable ${\bm{S}}_0=0$, the recurrent forward-pass can be expressed as follows:

$$\begin{aligned}
\textbf{Recurrent: }&{\bm{S}}_n=\underbrace{{\bm{K}}_n^{\top}{\bm{V}}_n}_{\text{Single-token}}+\gamma{\bm{S}}_{n-1},\quad \text{Ret}({\bm{X}}_n)={\bm{Q}}_n{\bm{S}}_n\\
\implies\ &{\bm{S}}_n=\sum_{m}^{n}\gamma^{n-m}{\bm{K}}_m^{\top}{\bm{V}}_m,\quad \text{Ret}({\bm{X}}_n)={\bm{Q}}_n\sum_{m}^{n}\gamma^{n-m}{\bm{K}}_m^{\top}{\bm{V}}_m
\end{aligned} \tag{15}$$

where $\text{Ret}({\bm{X}}_n)$ calculates the Retention at the single timestep $n$ by considering all timesteps $m$ up to the current time. It corresponds to the $n$-th timestep (row) of the parallel forward-pass of Retention.
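This equivalence can be verified numerically. The following minimal NumPy sketch (our illustration, not the paper's code) computes Retention both ways for a single head, using the decay mask $D_{nm}=\gamma^{n-m}$ for $n\geq m$ and $0$ otherwise, and checks that the outputs agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4
gamma = 0.9
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

# Recurrent forward-pass: S_n = K_n^T V_n + gamma * S_{n-1}, Ret(X_n) = Q_n S_n
S = np.zeros((d, d))
rec_out = np.zeros((N, d))
for n in range(N):
    S = np.outer(K[n], V[n]) + gamma * S
    rec_out[n] = Q[n] @ S

# Parallel forward-pass: Ret(X) = (Q K^T ⊙ D) V with decay mask D[n, m] = gamma^(n-m)
D = np.array([[gamma ** (n - m) if n >= m else 0.0 for m in range(N)]
              for n in range(N)])
par_out = (Q @ K.T * D) @ V

assert np.allclose(rec_out, par_out)
```

The recurrent pass needs only an O(d×d) state per step, while the parallel pass trades memory for full-sequence parallelism; both produce identical outputs.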

$$\begin{aligned}
\textbf{Recurrent: }&\text{Ret}({\bm{X}}_n)={\bm{Q}}_n\sum_{m}^{n}\gamma^{n-m}{\bm{K}}_m^{\top}{\bm{V}}_m\\
\implies\textbf{Parallel: }&\text{Ret}({\bm{X}}_n)=\underbrace{{\bm{Q}}_n}_{1\times d_{qk}}\underbrace{{\bm{K}}_{m\leq n}^{\top}}_{d_{qk}\times n}\odot\underbrace{{\bm{D}}_{m\leq n}}_{n\times n}\underbrace{{\bm{V}}_{m\leq n}}_{n\times d_v}
\end{aligned} \tag{16}$$

When the recurrent forward-pass traverses all timesteps, the parallel and recurrent forward-passes of Retention become identical. With the parallel and recurrent forward-passes established, we aim to show the equivalence between the chunk-wise forward-pass and the other two. The computation of chunk-wise Retention involves both a parallel intra-chunk and a recurrent inter-chunk computation as follows:

$$\begin{aligned}
\textbf{Chunk-wise: }&\text{Ret}({\bm{X}}_{[i]})=\underbrace{({\bm{Q}}_{[i]}{\bm{K}}_{[i]}^{\top}\odot{\bm{D}}){\bm{V}}_{[i]}}_{\text{Intra-chunk}}+\underbrace{({\bm{Q}}_{[i]}{\bm{S}}_{[i-1]})\odot\zeta}_{\text{Inter-chunk}}\\
&{\bm{S}}_{[i]}=\underbrace{{\bm{K}}_{[i]}^{\top}({\bm{V}}_{[i]}\odot{\bm{D}}_B)}_{\text{Current chunk}}+\underbrace{\gamma^{B}{\bm{S}}_{[i-1]}}_{\text{Past chunk}},\quad \zeta_j=\gamma^{j}
\end{aligned} \tag{17}$$

where $\zeta=[\gamma^{1},\gamma^{2},\ldots,\gamma^{B}]^{\top}$ is a column vector of time-decay scaling factors for the inter-chunk attention between the current chunk $[i]$ and the previous chunk $[i-1]$. Specifically, $\gamma^{j}$ is the scaling factor for the $j$-th row of chunk $[i]$ measured from the last row of chunk $[i-1]$, such that the bigger the index $j$, the smaller the value $\gamma^{j}$. Therefore, Retention recursively aggregates information from the $i$-th chunk (i.e., intra-chunk embedding) and the previous chunks (i.e., inter-chunk embedding).

The per-chunk state variable ${\bm{S}}_{[i]}$ combines current-chunk information with past-chunk information. The current-chunk information ${\bm{K}}_{[i]}^{\top}{\bm{V}}_{[i]}$ decays by ${\bm{D}}_B$, the last row of the decay matrix ${\bm{D}}$. The past-chunk information ${\bm{S}}_{[i-1]}$ is decayed with respect to the chunk size $B$. Starting from the initial state variable ${\bm{S}}_{[i=0]}=0$, the state is computed recurrently given the chunk size $B$:

$${\bm{S}}_{[i]}={\bm{K}}_{[i]}^{\top}({\bm{V}}_{[i]}\odot{\bm{D}}_B)+\gamma^{B}{\bm{S}}_{[i-1]}=\sum_{m=1}^{B}\gamma^{B-m}{\bm{K}}_m^{\top}{\bm{V}}_m+\gamma^{B}{\bm{S}}_{[i-1]} \tag{18}$$

Moreover, the update of the state variable ${\bm{S}}_{[i]}$ can be reformulated in parallel. The first term represents the information of the current chunk, and the second term represents the past-chunk information decayed by the chunk size $B$. Consequently, ${\bm{S}}_{[i-1]}$ represents the state information from the beginning up to the $(i-1)$-th chunk, and we can expand the inter-chunk term of chunk-wise Retention:

$$\begin{aligned}
{\bm{S}}_{[i-1]}&=\sum_{m=1}^{Bi}\gamma^{Bi-m}{\bm{K}}_m^{\top}{\bm{V}}_m={\bm{K}}_{1:Bi}^{\top}\odot{\bm{D}}_{1:Bi}\,{\bm{V}}_{1:Bi}\\
\underbrace{({\bm{Q}}_{[i]}{\bm{S}}_{[i-1]})\odot\zeta}_{\text{Inter-chunk}}&=\bigl({\bm{Q}}_{Bi:B(i+1)}{\bm{K}}_{1:Bi}^{\top}\odot{\bm{D}}_{1:Bi}\,{\bm{V}}_{1:Bi}\bigr)\odot\zeta\\
&={\bm{Q}}_{Bi:B(i+1)}{\bm{K}}_{1:Bi}^{\top}\odot{\bm{D}}_{Bi:B(i+1)}\,{\bm{V}}_{1:Bi}
\end{aligned} \tag{19}$$

where the intra-chunk computation updates each row of the lower triangular matrix (highlighted in green in Fig. 1c). Together, the recurrent inter-chunk computation and the parallel intra-chunk computation (highlighted in purple in Fig. 1c) complete the chunk-wise forward-pass of Retention.
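The chunk-wise forward-pass can likewise be checked against the full parallel computation. The sketch below (our illustration, assuming a sequence length divisible by the chunk size $B$) implements Eqs. 17-18 with a parallel intra-chunk term and a recurrent inter-chunk state, and verifies that the result matches parallel Retention over the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B = 8, 4, 4        # sequence length N divisible by chunk size B
gamma = 0.9
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

# Reference: parallel Retention over the full sequence, D[n, m] = gamma^(n-m) for n >= m
D_full = np.array([[gamma ** (n - m) if n >= m else 0.0 for m in range(N)]
                   for n in range(N)])
ref = (Q @ K.T * D_full) @ V

D = D_full[:B, :B]                    # intra-chunk decay mask
zeta = gamma ** np.arange(1, B + 1)   # inter-chunk decay, zeta_j = gamma^j
D_B = (gamma ** (B - np.arange(1, B + 1)))[:, None]  # last row of D: gamma^(B-m)
S = np.zeros((d, d))                  # cross-chunk state, S_[0] = 0
out = np.zeros((N, d))
for i in range(N // B):
    q, k, v = (M[i * B:(i + 1) * B] for M in (Q, K, V))
    # Eq. 17: parallel intra-chunk term plus decayed inter-chunk term
    out[i * B:(i + 1) * B] = (q @ k.T * D) @ v + (q @ S) * zeta[:, None]
    # Eq. 18: current chunk decayed by D_B, past state decayed by gamma^B
    S = k.T @ (v * D_B) + gamma ** B * S

assert np.allclose(out, ref)
```

Each chunk costs O(B²) attention within the chunk plus an O(d²) state update, which is how chunk-wise Retention trades off the parallel and recurrent regimes.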


Figure S2. Schematic of the TimelyGPT Pre-Training Process

B.3. TimelyGPT pre-training overview

For TimelyGPT pre-training, we illustrate the full process of input processing, model training, and next-token prediction in Fig. S2. A time-series input with $T$ timesteps and $V$ variates is tokenized via a convolution-subsampling module. This tokenizer typically comprises two 1-D convolution layers with a kernel size of 3 and a stride of 2. It produces a sequence of tokens of shape $N\times V$, effectively reducing the sequence length to one quarter, i.e., $N=T/4$. The sequence of tokens is projected into an input embedding of shape $N\times d$ with a linear projection layer. The input embedding is then passed through $L$ generative decoder layers, where the Retention mechanism takes the input embedding segmented into multiple chunks. Finally, the output embedding of shape $N\times d$ is passed through an output projection layer, which generates a sequence of tokens of shape $N\times V$ for next-token prediction.
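To make the length reduction concrete, the sketch below applies a naive valid 1-D convolution with kernel size 3 and stride 2 twice; the helper `conv1d` and its fixed averaging kernel are illustrative assumptions, not the paper's learned tokenizer. Each stride-2 layer roughly halves the sequence, giving about $T/4$ tokens:

```python
import numpy as np

def conv1d(x, kernel, stride):
    """Naive valid 1-D convolution along the time axis; x has shape (T, V)."""
    k = len(kernel)
    n_out = (x.shape[0] - k) // stride + 1
    return np.stack([(x[t * stride:t * stride + k] * kernel[:, None]).sum(axis=0)
                     for t in range(n_out)])

T, V = 1000, 3
x = np.random.default_rng(0).normal(size=(T, V))
kernel = np.ones(3) / 3.0   # illustrative fixed weights; the real layers are learned

# Two stride-2 layers: 1000 -> 499 -> 249 timesteps, roughly T/4
tokens = conv1d(conv1d(x, kernel, 2), kernel, 2)
# tokens.shape == (249, 3)
```

With "valid" boundaries the length is $\lfloor (T-3)/2 \rfloor + 1$ per layer, so the result is close to, but slightly under, exactly $T/4$.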

Appendix C Experiment Summary

Table S2. Configurations of TimelyGPT, transformer baselines, and recurrent models across different datasets

We summarize the model architecture setup for TimelyGPT and the other baselines used in the experiments in Table S2. Additionally, we provide a visualization of the forecasting experiment on the periodic signal (EEG Pz-Oz) in Fig. S3.


Figure S3. Example of forecasting experiments on the periodic signal (EEG Pz-Oz). a. The ground truth of the EEG Pz-Oz signal. Forecasting results are shown between timesteps 520 and 720 (b), 1800 and 2000 (c), and 5800 and 6000 (d). TimelyGPT is able to forecast the periodic signals up to 6,000 timesteps owing to its extrapolation capability.
