Title: From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models
URL Source: https://arxiv.org/html/2505.00033
(April 29, 2025)
Abstract
We propose a novel spectral generative modeling framework for natural language processing that jointly learns a global time-varying Fourier dictionary and per-token mixing coefficients, replacing the ubiquitous self-attention mechanism in transformer architectures. By enforcing reconstruction losses in both the time domain (embedding reconstruction) and the frequency domain (via Short-Time Fourier Transform magnitude matching) alongside a standard language modeling objective, and fitting a Gaussian Mixture Model (GMM) prior over the learned mixing vectors, our approach achieves competitive perplexity and generation quality on standard benchmarks such as WikiText-2 and Penn Treebank. In contrast to $\mathcal{O}(L^2)$ self-attention, our method operates with $\mathcal{O}(KL)$ complexity, where $K \ll L$ is the dictionary size, delivering substantial efficiency gains. We demonstrate that spectral dictionary models can achieve competitive performance compared to transformer baselines while significantly reducing inference latency and memory footprint, offering a compelling alternative for scalable language modeling.
1 Introduction
The advent of the Transformer architecture [18] revolutionized sequence modeling by replacing recurrent and convolutional operations with self-attention mechanisms that directly capture dependencies across arbitrary token distances. Building on this foundation, bi-directional encoders like BERT [6] and autoregressive language models such as the GPT series [16] have achieved state-of-the-art results on a wide range of natural language processing tasks. These models rely on the full $L \times L$ attention matrix, where $L$ is the input sequence length, to compute pairwise interactions between tokens. Although highly expressive, this quadratic complexity in both computation and memory becomes prohibitive when scaling to very long contexts, such as entire documents or long code sequences [3, 20].
To mitigate the cost of full self-attention, a variety of approximations have been proposed. Sparse attention patterns exploit locality or fixed windowing, as in Longformer [2] and the block-sparse model of Child et al. [3]; kernel-based methods like Performer [4] use randomized feature maps to approximate softmax attention in linear time; low-rank factorization approaches such as Linformer [19] and linearized attention via kernel methods [9] project keys and queries into subspaces of dimension $K \ll L$. Other innovations include locality-sensitive hashing in Reformer [13] and learned mixture-of-experts routing to sparsify computation across heads.
Parallel to these, spectral mixing approaches replace learned attention maps with fixed or learned transforms in the Fourier domain. FNet [14] demonstrated that a single global Fourier transform can approximate the mixing power of self-attention, yielding $\mathcal{O}(L \log L)$ complexity but with limited adaptability to specific token interactions. Motivated by the efficiency of spectral methods and the expressivity of learned transforms, we propose a fully spectral generative model that learns a global dictionary of $K$ complex-valued Fourier atoms whose amplitude, frequency, and phase parameters adapt dynamically across sequence positions.
In our Spectral Dictionary Generative Model (SDGM), a lightweight convolutional encoder computes per-token mixing coefficients that weight the contribution of each atom to the embedding reconstruction. We train the model end-to-end to reconstruct original token embeddings via a combined loss: mean-squared error (MSE) in the time (embedding) domain, Short-Time Fourier Transform (STFT) magnitude loss to preserve local frequency structure, and a standard language modeling loss. After training, we flatten the learned mixing coefficient vectors across tokens and fit a Gaussian Mixture Model (GMM), enabling rich, multimodal sampling for text generation. By choosing $K \ll L$, SDGM achieves $\mathcal{O}(KL)$ time and memory complexity per sequence, dramatically reducing resource requirements compared to full attention, while achieving competitive perplexities on standard language modeling benchmarks.
Our contributions are as follows:
- We introduce a novel spectral dictionary architecture that learns interpretable Fourier atoms parameterized by amplitude, frequency, and phase, enabling efficient global mixing with linear complexity ($\mathcal{O}(KL)$).
- We propose a dual-domain reconstruction objective, combining time-domain MSE with frequency-domain STFT magnitude loss, alongside a standard language modeling loss, to ensure both embedding fidelity and predictive performance.
- We demonstrate that fitting a GMM to the learned mixing vectors yields a latent distribution suitable for text generation, complementing the autoregressive nature of the model.
- We validate that SDGM achieves competitive perplexity compared to Transformer baselines on PTB and WikiText-2 while offering significant reductions in memory footprint and inference latency.
2 Related Work
Fourier and Spectral Methods
Fourier transforms have been employed in efficient sequence modeling[14, 8], often as submodules within attention blocks or as fixed transformations replacing the attention mechanism entirely. FNet[14] used unparameterized 2D Fourier transforms. Our work extends spectral approaches by learning an explicit, parameterized Fourier dictionary optimized end-to-end specifically for language modeling reconstruction and generation. While spectral methods have found applications in image generation[11] and wavelet transforms have been explored as attention alternatives[12], our approach uniquely adapts spectral dictionary learning, with learnable sinusoidal parameters and per-token coefficients, to the sequential nature of language.
Attention Alternatives
Numerous techniques aim to reduce attention's $\mathcal{O}(L^2)$ computational cost, including sparse attention [2, 3], kernel methods like Performer [4], low-rank projections like Linformer [19], and hashing methods like Reformer [13]. Unlike these approaches, which primarily approximate the standard attention mechanism, SDGM replaces attention entirely with a learned spectral mixing paradigm, offering a fundamentally different approach to sequence interaction modeling.
Dictionary Learning
Classical dictionary learning, prominent in vision and audio processing [1], often involves learning an overcomplete basis (dictionary) and finding sparse representations (codes) for signals, typically optimized via alternating minimization or algorithms like K-SVD. We adapt dictionary learning concepts to NLP by learning continuous, parameterized Fourier atoms and soft mixing coefficients within an autoencoder-like framework trained with gradient descent. Unlike traditional sparse coding, which often enforces $L_1$ regularization on codes, our approach models the distribution of the learned mixing coefficients using a GMM, facilitating generative sampling.
3 Mathematical Formulation
In this section, we provide a comprehensive derivation of our Spectral Dictionary Generative Model (SDGM). We begin by defining the embedding sequence, then progress through dictionary parameterization, coefficient encoding, reconstruction decoding, loss formulation, and latent prior modeling. Figure 1 illustrates the end-to-end flow of the SDGM architecture, showing how raw tokens are progressively transformed into an output distribution.
Figure 1: Architecture of the Spectral Dictionary Generative Model. First, the embedding layer maps each input token $w_{b,t}$ to a continuous vector $\mathbf{x}_{b,t} = E(w_{b,t})$. Next, the mixing encoder applies a one-dimensional convolution to produce soft coefficients $C_{b,t,k}$. The spectral dictionary holds $K$ learnable Fourier atoms parameterized by amplitude $a_{k,d}$, frequency $f_{k,d}$, and phase $\phi_{k,d}$, which generate basis vectors $S_{k,t,d}$. The spectral decoder then reconstructs embeddings via $\hat{X}_{b,t,d} = \sum_{k=1}^{K} C_{b,t,k}\, S_{k,t,d}$. Finally, the pointer-generator head combines each reconstructed vector $\hat{\mathbf{x}}_{b,t}$ with a context vector $\mathbf{c}_{b,t}$ to compute a mixture of vocabulary and copy distributions for token prediction.
3.1 Token Embeddings and Notation
Let $\mathbf{W} = [w_1, w_2, \dots, w_L]$ be an input sequence of $L$ tokens. We first map these tokens to continuous vector representations using an embedding lookup table $E$. For a mini-batch of $B$ sequences, let $\mathbf{X}^{(b)} = [\mathbf{x}_{b,1}, \mathbf{x}_{b,2}, \dots, \mathbf{x}_{b,L}] \in \mathbb{R}^{D \times L}$ denote the sequence of $L$ token embeddings for batch item $b$, where each embedding $\mathbf{x}_{b,t} \in \mathbb{R}^D$ has dimension $D$:

$$\mathbf{x}_{b,t} = E(w_{b,t}). \tag{1}$$

For clarity, we omit the batch index $b$ in subsequent notation unless explicitly needed.
3.2 Global Spectral Dictionary Parameterization
We learn a set of $K$ spectral atoms, each atom $k \in \{1, \dots, K\}$ parameterized by three matrices:

- Amplitude: $\mathbf{A} \in \mathbb{R}^{K \times D}$ with entries $a_{k,d}$,
- Frequency: $\mathbf{F} \in \mathbb{R}^{K \times D}$ with entries $f_{k,d}$,
- Phase: $\boldsymbol{\Phi} \in \mathbb{R}^{K \times D}$ with entries $\phi_{k,d}$.

These parameters define a time-varying sinusoidal basis. For a discrete time index $t \in \{1, 2, \dots, L\}$, the $d$-th feature of atom $k$ is given by:

$$S_k(t)_d = a_{k,d} \sin\!\Bigl(2\pi f_{k,d} \frac{t}{L} + \phi_{k,d}\Bigr). \tag{2}$$

We collect all atoms into a tensor $S \in \mathbb{R}^{K \times L \times D}$ where $S_{k,t,d} = S_k(t)_d$. Equivalently, by defining the normalized time vector $\mathbf{t} = [1/L, 2/L, \dots, 1] \in \mathbb{R}^L$, we can write in vectorized form for fixed atom $k$:

$$S_k = \mathbf{a}_k \odot \sin\bigl(2\pi (\mathbf{f}_k \otimes \mathbf{t}) + \boldsymbol{\phi}_k \otimes \mathbf{1}_L\bigr), \tag{3}$$

where $\odot$ denotes element-wise multiplication and $\otimes$ the outer product. This dictionary is shared across all sequences in a batch and represents a global basis learned from the data.
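As a concrete illustration, the atom construction of Eqs. (2)-(3) can be sketched in a few lines of NumPy. This is a minimal sketch: the function name and the small shapes are ours, not from the paper, and the random parameters stand in for learned ones.

```python
import numpy as np

def build_spectral_dictionary(A, F, Phi, L):
    """Evaluate K Fourier atoms at L normalized time steps (Eq. 2).

    A, F, Phi: (K, D) amplitude, frequency, and phase matrices.
    Returns S of shape (K, L, D) with
    S[k, t-1, d] = a_{k,d} * sin(2*pi * f_{k,d} * t/L + phi_{k,d}).
    """
    t = np.arange(1, L + 1) / L  # normalized time vector [1/L, ..., 1]
    # Broadcast to (K, L, D): frequencies scale time, phases shift, amplitudes scale.
    angles = 2 * np.pi * F[:, None, :] * t[None, :, None] + Phi[:, None, :]
    return A[:, None, :] * np.sin(angles)

K, L, D = 8, 16, 4
rng = np.random.default_rng(0)
A, F, Phi = rng.normal(size=(3, K, D))  # stand-ins for learned parameters
S = build_spectral_dictionary(A, F, Phi, L)
assert S.shape == (K, L, D)
```

Because the dictionary depends only on $(\mathbf{A}, \mathbf{F}, \boldsymbol{\Phi})$ and $L$, it is computed once per sequence length and shared across the batch.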
3.3 Mixing Coefficient Encoding
To capture how each atom's contribution varies dynamically based on the input sequence context, we employ a lightweight convolutional encoder. This encoder takes the sequence of input embeddings $\mathbf{X} \in \mathbb{R}^{B \times L \times D}$ (transposed for Conv1D compatibility if needed) and produces per-token mixing coefficients:

$$C = \sigma_{\text{act}}(\mathrm{Conv1D}(\mathbf{X})) \in \mathbb{R}^{B \times L \times K}. \tag{4}$$

Here, $\mathrm{Conv1D}$ is a 1D convolution (typically causal for autoregressive tasks) with appropriate kernel size $w$ and padding, mapping the $D$-dimensional embeddings to $K$ coefficients per time step $t$. $\sigma_{\text{act}}$ is a suitable activation function (e.g., ReLU or identity, depending on whether non-negativity is desired). Optionally, the coefficients $C_{b,t,:}$ can be normalized (e.g., via Softmax) across the dictionary dimension $k$. These coefficients $C_{b,t,k}$ represent the learned "importance" or "weight" of atom $k$ at position $t$ for sequence $b$.
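Eq. (4) amounts to a small causal convolution over the embedding sequence. The following NumPy sketch illustrates one possible instantiation; the ReLU choice for $\sigma_{\text{act}}$, the kernel size, and all shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def mixing_encoder(X, W, b):
    """Causal Conv1D mapping embeddings (B, L, D) to coefficients (B, L, K), Eq. (4).

    W: (kernel, D, K) filter taps; b: (K,) bias.
    Left padding ensures position t only sees x_{<= t} (causal).
    """
    kernel = W.shape[0]
    L = X.shape[1]
    Xp = np.pad(X, ((0, 0), (kernel - 1, 0), (0, 0)))  # left-pad for causality
    C = sum(np.einsum('bld,dk->blk', Xp[:, i:i + L, :], W[i]) for i in range(kernel))
    return np.maximum(C + b, 0.0)  # ReLU as sigma_act (keeps coefficients non-negative)

rng = np.random.default_rng(0)
B, L, D, K = 2, 10, 6, 4
X = rng.normal(size=(B, L, D))
W, b = rng.normal(size=(3, D, K)), np.zeros(K)
C = mixing_encoder(X, W, b)
assert C.shape == (B, L, K)
```

The causal padding matters for the autoregressive setting: coefficients at position $t$ must not depend on future embeddings.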
3.4 Spectral Reconstruction Decoder
Given the mixing coefficients $C$ and the global spectral dictionary $S$, we reconstruct the embedding sequence $\hat{\mathbf{X}}$ via a weighted sum of the spectral atoms at each time step $t$ and for each feature dimension $d$:

$$\hat{X}_{b,t,d} \coloneqq \sum_{k=1}^{K} C_{b,t,k}\, S_{k,t,d}. \tag{5}$$

This operation performs the dynamic mixing of the global atoms based on the sequence-specific coefficients. In tensor notation, with $S$ shaped $K \times L \times D$, this corresponds to a bilinear mapping, efficiently computed using the Einstein summation convention:

$$\hat{\mathbf{X}} = \mathrm{einsum}(\texttt{'blk,kld->bld'},\, C,\, S), \tag{6}$$

where dimensions correspond to (Batch, Length, Dictionary) for $C$ and (Dictionary, Length, Dimension) for $S$, resulting in $\hat{\mathbf{X}}$ with shape (Batch, Length, Dimension). This decoding step has a computational complexity of $\mathcal{O}(B \cdot K \cdot L \cdot D)$. For fixed $K$ and $D$, this is $\mathcal{O}(BL)$, or $\mathcal{O}(KL)$ per sequence when accounting for $K$, i.e., linear in the sequence length $L$.
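The einsum of Eq. (6) translates directly into NumPy; the short sketch below (with illustrative shapes and random stand-ins for $C$ and $S$) also verifies that it matches the explicit sum of Eq. (5):

```python
import numpy as np

# Random stand-ins: C (Batch, Length, Dictionary), S (Dictionary, Length, Dimension).
rng = np.random.default_rng(1)
B, L, K, D = 2, 16, 8, 4
C = rng.normal(size=(B, L, K))
S = rng.normal(size=(K, L, D))

# Eq. (6): weighted sum over the dictionary axis; cost O(B*K*L*D), linear in L.
X_hat = np.einsum('blk,kld->bld', C, S)
assert X_hat.shape == (B, L, D)

# Spot-check one entry against the explicit sum of Eq. (5).
manual = sum(C[0, 3, k] * S[k, 3, 2] for k in range(K))
assert np.allclose(X_hat[0, 3, 2], manual)
```

Note that, unlike attention, no $L \times L$ intermediate is ever materialized; memory scales with $KL$ rather than $L^2$.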
3.5 Training Objective
We train the SDGM end-to-end by minimizing a composite loss function $\mathcal{L}$ that combines reconstruction fidelity in both the time and frequency domains with a standard language modeling objective:

$$\mathcal{L} = \alpha\, \mathcal{L}_{\text{time}} + \beta\, \mathcal{L}_{\text{freq}} + \gamma\, \mathcal{L}_{\text{NLL}} + \delta\, \mathcal{L}_{\text{prior}}. \tag{7}$$

The components are:

- Time-domain MSE Loss ($\mathcal{L}_{\text{time}}$): Penalizes the difference between the reconstructed embeddings $\hat{\mathbf{X}}$ and the original embeddings $\mathbf{X}$:

$$\mathcal{L}_{\text{time}} \coloneqq \frac{1}{B \cdot L \cdot D} \|\hat{\mathbf{X}} - \mathbf{X}\|_F^2, \tag{8}$$

where $\|\cdot\|_F$ denotes the Frobenius norm.
- Frequency-domain STFT Loss ($\mathcal{L}_{\text{freq}}$): Encourages the reconstructed sequence to match the original sequence in terms of local frequency content by minimizing the difference between their STFT magnitudes:

$$\mathcal{L}_{\text{freq}} \coloneqq \frac{1}{B \cdot F \cdot T \cdot D} \bigl\| |\mathrm{STFT}(\hat{\mathbf{X}})| - |\mathrm{STFT}(\mathbf{X})| \bigr\|_F^2, \tag{9}$$

where $\mathrm{STFT}(\cdot)$ computes the Short-Time Fourier Transform independently for each feature channel $d$, resulting in a complex spectrogram, $|\cdot|$ takes the magnitude, and $F, T$ are the frequency and time dimensions of the STFT output.
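A per-channel STFT magnitude loss along the lines of Eq. (9) can be sketched with SciPy. This is an illustrative implementation, not the paper's code: the window length `nperseg` is an arbitrary choice, and `np.mean` over all elements supplies the $1/(B \cdot F \cdot T \cdot D)$ normalization.

```python
import numpy as np
from scipy.signal import stft

def stft_mag_loss(X_hat, X, nperseg=8):
    """MSE between per-channel STFT magnitudes of (B, L, D) sequences, Eq. (9).

    The STFT runs along the length axis for each (batch, channel) pair;
    np.mean over all elements gives the 1/(B*F*T*D) normalization.
    """
    _, _, Z_hat = stft(np.moveaxis(X_hat, 1, -1), nperseg=nperseg)  # (B, D, F, T)
    _, _, Z = stft(np.moveaxis(X, 1, -1), nperseg=nperseg)
    return float(np.mean((np.abs(Z_hat) - np.abs(Z)) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 32, 3))
assert stft_mag_loss(X, X) == 0.0       # identical sequences: zero loss
assert stft_mag_loss(0.5 * X, X) > 0.0  # scaled magnitudes: positive loss
```

Comparing magnitudes rather than complex spectrograms makes the loss invariant to per-frame phase, so it constrains local frequency content without over-penalizing small temporal shifts.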
- Language Modeling Loss ($\mathcal{L}_{\text{NLL}}$): Standard Negative Log-Likelihood (NLL) loss for autoregressive prediction. The model predicts the next token $w_{b,t+1}$ based on the reconstructed representation $\hat{\mathbf{x}}_{b,t}$ (or some derived hidden state), using a final prediction head (e.g., a linear layer with Softmax, or the pointer-generator described later):

$$\mathcal{L}_{\text{NLL}} \coloneqq -\frac{1}{B \cdot L} \sum_{b=1}^{B} \sum_{t=1}^{L} \log P(w_{b,t} \mid \hat{\mathbf{x}}_{b,<t}, \mathbf{w}_{b,<t}), \tag{10}$$

where the exact formulation depends on the specific autoregressive setup and prediction head used.
- GMM Prior Loss ($\mathcal{L}_{\text{prior}}$): After flattening $C \in \mathbb{R}^{B \times L \times K}$ into $\{\mathbf{z}_n\}_{n=1}^{N}$ with $N = B \cdot L$, we compute

$$\mathcal{L}_{\text{prior}} = -\frac{1}{N} \sum_{n=1}^{N} \log p_{\mathrm{GMM}}(\mathbf{z}_n), \tag{12}$$

where $p_{\mathrm{GMM}}(\mathbf{z}) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\mathbf{z}; \mu_m, \Sigma_m)$ is the fitted mixture over mixing-vector space.
Thus, the full training objective, capturing fidelity in both time and frequency domains together with the language modeling objective and the GMM prior, is:

$$\mathcal{L} = \alpha \underbrace{\|\hat{\mathbf{X}} - \mathbf{X}\|_F^2}_{\mathcal{L}_{\text{time}}} + \beta \underbrace{\bigl\| |\mathrm{STFT}(\hat{\mathbf{X}})| - |\mathrm{STFT}(\mathbf{X})| \bigr\|_F^2}_{\mathcal{L}_{\text{freq}}} + \gamma\, \underbrace{\mathcal{L}_{\text{NLL}}}_{\text{LM loss}} + \delta\, \mathcal{L}_{\text{prior}}. \tag{13}$$

The hyperparameters $\alpha, \beta, \gamma \geq 0$ balance reconstruction fidelity against predictive performance, while $\delta$ controls the strength of the GMM prior regularizer, guiding the learned coefficients toward regions of high prior density. As before, $\|\cdot\|_F$ is the Frobenius norm and the STFT is applied independently to each feature channel.
3.6 Latent Prior Fitting
After the model parameters (embedding table $E$, dictionary parameters $\mathbf{A}, \mathbf{F}, \boldsymbol{\Phi}$, Conv1D weights) have converged, we analyze the distribution of the learned mixing coefficients. We collect all per-position coefficient vectors $C_{b,t,:} \in \mathbb{R}^K$ from the training (or validation) set, flatten them into a large matrix $\mathbf{Z} \in \mathbb{R}^{N \times K}$ (where $N = B \times L \times \#\text{batches}$), and fit a Gaussian Mixture Model (GMM) to this data:

$$p(\mathbf{z}) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m), \tag{14}$$

where $M$ is the number of mixture components, $\pi_m$ are the mixture weights ($\sum_m \pi_m = 1$), $\boldsymbol{\mu}_m \in \mathbb{R}^K$ are the component means, and $\boldsymbol{\Sigma}_m$ are the component covariance matrices (often assumed diagonal for simplicity, $\boldsymbol{\Sigma}_m = \mathrm{diag}(\sigma_{m,1}^2, \dots, \sigma_{m,K}^2)$). This GMM captures the empirical distribution of activation patterns over the spectral dictionary.
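Fitting Eq. (14), evaluating the prior loss of Eq. (12), and the sampling step used later for generation all map directly onto scikit-learn's `GaussianMixture`. The sketch below uses synthetic stand-in coefficients with illustrative shapes; the diagonal-covariance choice follows the simplification mentioned above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the flattened mixing coefficients Z (N, K):
# four well-separated clusters mimic a multimodal coefficient distribution.
rng = np.random.default_rng(0)
N, K, M = 400, 8, 4
Z = np.concatenate([rng.normal(loc=3.0 * m, scale=0.1, size=(N // M, K))
                    for m in range(M)])

# Fit the diagonal-covariance GMM of Eq. (14).
gmm = GaussianMixture(n_components=M, covariance_type='diag', random_state=0).fit(Z)

# Eq. (12): the prior loss is the negative mean log-likelihood under the mixture.
prior_loss = -gmm.score(Z)

# Generation (Section 3.7, step 1): draw fresh mixing vectors from the prior.
z_new, _ = gmm.sample(n_samples=5)
assert z_new.shape == (5, K)
```

Sampling from the fitted mixture, rather than a single Gaussian, lets generation reach the distinct modes of coefficient space that different token contexts occupy.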
3.7 Token Generation
For autoregressive text generation, we generate one token at a time for $t=1,2,\dots,L'$ (the desired output length) by sampling from the learned spectral prior and decoding through the dictionary and pointer-generator head:
1. **Sample Mixing Vector.** Draw a coefficient vector $\mathbf{z}_{t}\in\mathbb{R}^{K}$ from the fitted Gaussian Mixture Model prior:
$$\mathbf{z}_{t}\sim p(\mathbf{z})=\sum_{m=1}^{M}\pi_{m}\,\mathcal{N}(\mathbf{z};\boldsymbol{\mu}_{m},\boldsymbol{\Sigma}_{m}).$$
2. **Decode to Embedding.** Reconstruct the $D$-dimensional embedding for step $t$ by mixing the $K$ spectral atoms evaluated at time $t$:
$$\hat{\mathbf{x}}_{t}=\sum_{k=1}^{K}z_{t,k}\,S_{k}(t),\qquad S_{k}(t)\in\mathbb{R}^{D}.$$
3. **Compute Token Distribution.** Use the reconstructed embedding $\hat{\mathbf{x}}_{t}$ together with an autoregressive context vector $\mathbf{c}_{t}$ to produce a mixture of vocabulary generation and copying:
$$p_{\mathrm{gen}}=\sigma\bigl(\mathbf{w}_{\mathrm{gen}}^{\top}[\hat{\mathbf{x}}_{t};\mathbf{c}_{t}]\bigr), \qquad (15)$$
$$P\bigl(w\mid\hat{\mathbf{x}}_{t},\mathbf{c}_{t}\bigr)=p_{\mathrm{gen}}\,P_{\mathrm{vocab}}\bigl(w\mid\hat{\mathbf{x}}_{t},\mathbf{c}_{t}\bigr)+(1-p_{\mathrm{gen}})\,P_{\mathrm{copy}}\bigl(w\mid\text{context}\bigr). \qquad (16)$$
Here, $\sigma$ is the logistic sigmoid, $P_{\mathrm{vocab}}$ is the standard softmax over the fixed vocabulary, and $P_{\mathrm{copy}}$ attends over previously generated or input tokens.
4. **Sample or Select Token.** Draw the next token $w_{t}$ from the resulting distribution,
$$w_{t}\sim P\bigl(w\mid\hat{\mathbf{x}}_{t},\mathbf{c}_{t}\bigr),$$
or choose $\arg\max_{w}P(w\mid\hat{\mathbf{x}}_{t},\mathbf{c}_{t})$. Append $w_{t}$ to the output sequence and update the context $\mathbf{c}_{t+1}$ (e.g., via the same Conv1D encoder or an RNN state) for the next time step.
This two-step procedure, sampling spectral coefficients and then decoding to tokens, yields fluent, autoregressive text without relying on self-attention, instead leveraging the global Fourier dictionary and the expressive GMM latent prior.
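A minimal numpy sketch of this generation loop, under simplifying assumptions: toy sizes, a single standard Gaussian standing in for the fitted GMM prior, and the pointer-generator copy branch and context vector omitted so that only the vocabulary softmax remains. All parameter names are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper uses D = 512, K = 256, and a 30k-word vocabulary.
D, K, V, L_out = 32, 8, 100, 5

# Hypothetical learned parameters: per-(atom, dimension) amplitudes,
# frequencies, and phases; an embedding table E; and a single Gaussian
# (mu, sigma) standing in for the fitted GMM prior.
A = rng.uniform(0.5, 1.5, size=(K, D))
F = rng.uniform(0.01, 0.5, size=(K, D))
Phi = rng.uniform(0.0, 2 * np.pi, size=(K, D))
E = rng.normal(size=(V, D))
mu, sigma = np.zeros(K), np.ones(K)

def atoms(t):
    # S_k(t) in R^D for every atom k: explicit sinusoids.
    return A * np.sin(2 * np.pi * F * t + Phi)  # shape (K, D)

tokens = []
for t in range(1, L_out + 1):
    z_t = rng.normal(mu, sigma)              # 1. sample mixing vector
    x_hat = z_t @ atoms(t)                   # 2. decode: sum_k z_{t,k} S_k(t)
    logits = E @ x_hat                       # 3. vocabulary scores
    p = np.exp(logits - logits.max())
    p /= p.sum()                             #    softmax over the vocabulary
    tokens.append(int(rng.choice(V, p=p)))   # 4. sample the next token

print(tokens)  # five token ids in [0, V)
```

Note that no attention matrix appears anywhere in the loop: each step costs $\mathcal{O}(KD)$ for the atom mixing plus the vocabulary projection.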
4 Experimental Evaluation
4.1 Datasets and Baselines
We evaluate SDGM on two standard language modeling benchmarks:
- WikiText-2: Contains approximately 2M training tokens, 218k validation tokens, and 246k test tokens, drawn from verified Good and Featured articles on Wikipedia.
- Penn Treebank (PTB): Comprises around 1M training tokens, 70k validation tokens, and 80k test tokens from the Wall Street Journal corpus.
We use canonical preprocessing for both datasets, converting text to lowercase, removing non-printable characters, and tokenizing with a 30,000-word vocabulary for WikiText-2 and a 10,000-word vocabulary for PTB.
We compare against three strong baselines:
- Transformer-XL [5]: Extends self-attention with segment-level recurrence.
- GPT-2 Small [17]: An autoregressive decoder-only model.
- Linformer [19]: Approximates full attention via low-rank projections.
All baselines are retrained under identical data splits and tokenization schemes to ensure a fair comparison.
4.2 Implementation Details
Our SDGM implementation uses PyTorch [15] and trains on a single NVIDIA V100 GPU (16 GB). We set embedding dimension $D=512$, dictionary size $K=256$, and sequence length $L=128$.
For the STFT computation, we use FFT size $n_{\text{fft}}=256$, hop length 64, and a Hann window of length 256. Loss hyperparameters are set to $(\alpha,\beta,\gamma)=(1.0,0.5,0.1)$ to balance time-domain MSE, frequency-domain MSE, and masked LM loss.
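The weighted composite objective can be sketched in numpy for a single embedding dimension. This is an illustrative stand-in for the torch.stft-based implementation, with toy FFT/hop sizes rather than the paper's 256/64, and a placeholder constant for the masked LM loss:

```python
import numpy as np

def stft_mag(x, n_fft=8, hop=4):
    """Magnitude STFT of a 1-D signal with a Hann window (toy sizes)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

# Loss weights (alpha, beta, gamma) = (1.0, 0.5, 0.1) from the paper.
alpha, beta, gamma = 1.0, 0.5, 0.1

rng = np.random.default_rng(0)
x = rng.normal(size=64)                  # one embedding dimension over time
x_hat = x + 0.1 * rng.normal(size=64)    # its (noisy) reconstruction

l_time = np.mean((x - x_hat) ** 2)                      # time-domain MSE
l_freq = np.mean((stft_mag(x) - stft_mag(x_hat)) ** 2)  # STFT-magnitude MSE
l_nll = 3.2                              # placeholder for the masked LM loss

total = alpha * l_time + beta * l_freq + gamma * l_nll
print(round(float(total), 4))
```

Setting `beta = 0` or `gamma = 0` in this expression corresponds to the ablation variants evaluated below.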
We optimize with Adam [10] using learning rate $10^{-3}$ and weight decay $10^{-5}$, batch size 32, and gradient clipping at norm 1.0. Models are trained for up to 10 epochs with early stopping based on validation perplexity (no improvement for two consecutive epochs). Random seeds are fixed across PyTorch, NumPy, and Python's RNG to ensure reproducibility.
4.3 Evaluation Metrics
We evaluate model performance using the following metrics:
- Perplexity (PPL): Exponentiated average negative log-likelihood per token, $\exp(\mathcal{L}_{\mathrm{NLL}})$, computed on validation and test sets. Lower is better.
- Inference Speed: Tokens generated per second (tok/s) during autoregressive sampling on the target GPU. Higher is better.
- Parameter Count: Total number of trainable parameters (in millions, M). Lower indicates a more compact model.
- Memory Footprint: Peak GPU memory usage (in gigabytes, GB) during inference. Lower is better.
- Embedding Fidelity: Average cosine similarity between reconstructed embeddings $\hat{\mathbf{X}}$ and original embeddings $\mathbf{X}$ on the validation set. Higher indicates better reconstruction quality.
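The perplexity and embedding-fidelity metrics reduce to short computations; a sketch with made-up numbers (not the paper's reported results):

```python
import numpy as np

# Perplexity: exponentiated mean per-token negative log-likelihood (nats).
nll = np.array([2.1, 3.0, 2.5, 2.8])     # hypothetical per-token NLLs
ppl = np.exp(nll.mean())                 # exp(2.6) ~= 13.46

# Embedding fidelity: mean cosine similarity between original embedding
# rows X and their reconstructions X_hat.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))
X_hat = X + 0.05 * rng.normal(size=(10, 16))
cos = np.sum(X * X_hat, axis=1) / (
    np.linalg.norm(X, axis=1) * np.linalg.norm(X_hat, axis=1))
print(round(float(ppl), 2), round(float(cos.mean()), 3))
```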
We also perform ablation studies by systematically removing components of our composite loss function: setting $\beta=0$ zeroes the $\mathcal{L}_{\text{freq}}$ contribution, and $\gamma=0$ zeroes the $\mathcal{L}_{\text{NLL}}$ contribution.
4.4 Results
Table 1 presents the main results comparing SDGM against baselines on WikiText-2 and PTB, along with ablation study results.
Table 1: Comparison of model size, perplexity (lower is better), inference throughput (higher is better), and memory usage (lower is better) on validation sets. Ablation variants omit the frequency-domain STFT loss ($\beta=0$) and the NLL loss ($\gamma=0$) during training, respectively.
As shown in Table 1, our proposed SDGM achieves validation perplexities of 31.2 on WikiText-2 and 57.1 on PTB. This performance is highly competitive, closely matching Transformer-XL and approaching GPT-2 Small, while significantly outperforming Linformer on these benchmarks. Crucially, SDGM achieves this with substantially fewer parameters (22.8M) than all baselines, particularly GPT-2 Small (an 80% reduction). It also exhibits significantly lower memory usage (6.5 GB vs. 8.7–12.5 GB) and higher inference throughput (2100 tok/s vs. 1200–1800 tok/s), demonstrating the practical benefits of its $\mathcal{O}(KL)$ complexity.
The ablation studies underscore the importance of the proposed training objectives. Removing the frequency-domain STFT loss ($\beta=0$) increases perplexity notably (e.g., from 31.2 to 33.5 on WikiText-2), indicating that matching spectral characteristics aids language modeling performance. Removing the language modeling objective itself ($\gamma=0$) during training severely degrades perplexity, confirming its necessity, although the model can still be trained solely on reconstruction.
We also measured the average cosine similarity between reconstructed and original embeddings on the WikiText-2 validation set. The full SDGM achieves a cosine similarity of 0.92, compared to 0.88 for the variant trained without the frequency-domain loss ($\beta=0$). This suggests that the STFT objective not only improves perplexity but also enhances the fidelity of the learned embedding reconstructions.
5 Discussion
The experimental results demonstrate that the Spectral Dictionary Generative Model (SDGM) offers a compelling and efficient alternative to self-attention for sequence modeling in NLP. By leveraging a learnable global Fourier dictionary, parameterized by time-varying amplitude, frequency, and phase specific to each feature dimension, our model can effectively capture complex patterns in language data. The per-token mixing coefficients, learned via a lightweight convolutional encoder, allow the model to dynamically combine these global atoms based on local context.
The $\mathcal{O}(KL)$ complexity (where $K\ll L$) provides significant computational and memory advantages over the $\mathcal{O}(L^{2})$ complexity of standard self-attention. Our empirical results confirm this: SDGM uses approximately 36% less GPU memory during inference than Transformer-XL and achieves up to 1.5–1.75× higher token throughput than Transformer-XL and GPT-2 Small, respectively. This efficiency makes SDGM particularly promising for applications involving long sequences or deployment on resource-constrained hardware.
The ablation studies highlight the synergistic benefits of our composite loss function. The frequency-domain STFT loss ($\mathcal{L}_{\text{freq}}$) demonstrably improves both perplexity and embedding reconstruction fidelity, confirming the value of spectral supervision. The standard language modeling loss ($\mathcal{L}_{\text{NLL}}$) remains crucial for achieving strong predictive performance.
The use of a Gaussian Mixture Model (GMM) fitted to the learned mixing coefficients provides a structured way to model the latent space. Sampling from this GMM during generation allows the model to leverage the learned distribution of atom activation patterns. However, sampling coefficients independently at each time step from the aggregate GMM $p(\mathbf{z})$ is a simplification. While the time-varying nature of the dictionary atoms $S_{k}(t)$ provides inherent temporal structure, this generation method might not fully capture longer-range dependencies encoded in the sequence of coefficients. Exploring methods for autoregressive prediction or sampling of coefficient sequences could be a valuable direction for future work.
Interpretability is another potential advantage. The learned atoms $S_{k}(t)_{d}$ are explicit sinusoids, potentially allowing for analysis of the frequencies and phases learned by the model, although further investigation is needed to connect these parameters to linguistic structures.
Several limitations and future directions warrant consideration:
- Scalability: While showing promise on medium-sized corpora, SDGM's performance scaling to massive datasets (e.g., billions of tokens) needs further investigation.
- Fixed Dictionary Size: The dictionary size $K$ is a fixed hyperparameter. Exploring adaptive or dynamic mechanisms for determining $K$ could potentially improve the capacity-efficiency trade-off.
- Generation Coherence: As noted, the independent sampling of mixing coefficients from the GMM might limit temporal coherence in generation. Investigating autoregressive prediction of coefficients or structured latent variable models could enhance generation quality.
- Integration and Hybrid Models: Exploring SDGM as a component within larger architectures, perhaps replacing attention in specific layers or combining it with other mechanisms, could yield further benefits.
6 Conclusion
We have presented the Spectral Dictionary Generative Model (SDGM), a novel architecture for language modeling that replaces self-attention with a learned global Fourier dictionary and sequence-specific mixing coefficients. By optimizing a composite objective including time-domain reconstruction, frequency-domain spectral matching, and standard language modeling loss, SDGM achieves competitive perplexity on benchmark datasets like WikiText-2 and PTB. Notably, it does so with significantly fewer parameters, lower memory consumption, and faster inference speed compared to traditional Transformer baselines, owing to its $\mathcal{O}(KL)$ complexity.
The key innovations include the parameterization of learnable spectral atoms, the dual-domain training objective, and the use of a GMM prior over mixing coefficients for generation. Our results suggest that learned spectral dictionary methods represent a viable and highly efficient paradigm for sequence modeling in NLP. This approach opens avenues for developing powerful language models suitable for long-context processing and deployment in resource-constrained environments.
Future work includes scaling SDGM to larger datasets, enhancing the generative modeling of coefficient sequences, exploring the interpretability of the learned spectral atoms, and potentially integrating SDGM components into hybrid architectures.
References
- [1] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
- [2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2010–2022, 2020.
- [3] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In Advances in Neural Information Processing Systems, volume 32, pages 1179–1188, 2019.
- [4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Łukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with Performers. In International Conference on Learning Representations, 2021.
- [5] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019.
- [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
- [7] John Fader, Xin Lee, and Michael Tang. Fourier-recurrent neural networks for long-range time series modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
- [8] Luka Frantar, Damian Novak, and Robert Kalman. Linear-time transformers via structured Fourier kernel approximation. In Proceedings of the International Conference on Machine Learning, 2023.
- [9] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, pages 5156–5165, 2020.
- [10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.
- [11] Andrew Kiruluta. Spectral dictionary learning for generative image modeling. In review, 2025.
- [12] Andrew Kiruluta, Priscilla Burity, and Samantha Williams. Learnable multi-scale wavelet transformer: A novel alternative to self-attention. arXiv:2504.03821, 2025.
- [13] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In Proceedings of the International Conference on Learning Representations, 2020.
- [14] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. FNet: Mixing tokens with Fourier transforms. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 3816–3823. Association for Computational Linguistics, July 2022.
- [15] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martín Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8024–8035, 2019.
- [16] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training, 2018. OpenAI Blog.
- [17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019. https://openai.com/blog/better-language-models.
- [18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
- [19] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv:2006.04768 [cs.CL], 2020.
- [20] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, volume 33, pages 17283–17297, 2020.