Title: Converting Transformers into DGNNs Form
URL Source: https://arxiv.org/html/2502.00585
License: CC BY 4.0 | arXiv:2502.00585v3 [cs.LG] 03 Mar 2025

Authors: Jie Zhang (National Central University, Taiwan; hazdzz@g.ncu.edu.tw), Mao-Hsuan Mao (National Central University, Taiwan; mmh.nuss@gmail.com), Bo-Wei Chiu (National Central University, Taiwan; h23468270@g.ncu.edu.tw), Min-Te Sun (National Central University, Taiwan; msun@csie.ncu.edu.tw)

Abstract
Recent advances in deep learning have established Transformer architectures as the predominant modeling paradigm. Central to the success of Transformers is the self-attention mechanism, which scores the similarity between query and key matrices to modulate a value matrix. This operation bears striking similarities to digraph convolution, prompting an investigation into whether digraph convolution could serve as an alternative to self-attention. In this study, we formalize this concept by introducing a synthetic unitary digraph convolution based on the digraph Fourier transform. The resulting model, which we term Converter, effectively converts a Transformer into a Directed Graph Neural Network (DGNN) form. We have tested Converter on the Long-Range Arena benchmark, long document classification, and DNA sequence-based taxonomy classification. Our experimental results demonstrate that Converter achieves superior performance while maintaining computational efficiency and architectural simplicity, which establishes it as a lightweight yet powerful Transformer variant.
1 Introduction
Through the diligent efforts of researchers, Transformers (Vaswani et al., 2017) have played a crucial role in addressing natural language processing tasks (Devlin et al., 2019) since their inception. At the heart of Transformers is the self-attention mechanism that utilizes the scaled dot-product of a query matrix and a key matrix to generate similarity scores, which guide a value matrix. This enables Transformers to capture long-range dependencies and perform parallel computation. Recently, Transformers have even dominated computer vision (Dosovitskiy et al., 2021) and the biology domain (Nambiar et al., 2020).
Because the softmax function binds a query matrix and a key matrix together to compute attention scores (Vaswani et al., 2017), the high computational cost of self-attention hinders its ability to handle large datasets. Recently, researchers have focused on developing self-attention alternatives with lower time complexity. Approximating the softmax function via kernel functions is a popular choice (Tsai et al., 2019; Choromanski et al., 2021; Qin et al., 2022). However, this may cause the attention matrix in each attention layer to become low-rank. In this situation, the expressive capability of Transformers significantly declines (Dong et al., 2021).
The necessity of the softmax function in self-attention has been questioned. First, the softmax function does not have sufficient ability to express the true data distribution, which constrains the representational capacity of language models and leads to the softmax bottleneck issue (Yang et al., 2018). Moreover, the softmax function inherently fails at robust reasoning across all inputs due to coefficient dispersion with increasing input elements (Veličković et al., 2024).
Therefore, replacing self-attention has become another well-known option (Tay et al., 2021a; Lee-Thorp et al., 2022; Yu et al., 2022a). However, achieving high-rank or full-rank attention matrices while maintaining a time complexity lower than quadratic is a challenge. We address this challenge via synthetic digraph convolution. Since self-attention is closely related to digraph convolution (Zaheer et al., 2020), why not replace self-attention with digraph convolution? This work is based on this hypothesis. We synthesize a unitary digraph convolution called Synvolution, which achieves high performance while maintaining linearithmic time complexity for long sequences. We also apply the kernel polynomial method (Silver and Röder, 1994; Wang, 1994; Wang and Zunger, 1994; Vijay et al., 2004; Weiße et al., 2006; Weiße and Fehske, 2008) as an alternative technique for the multi-head operation. We refer to Synvolution with the kernel polynomial method as Kernelution. Supported by the theoretical foundation of the kernel polynomial method, the filter of Kernelution can function as any corresponding unitary filter required by distinct datasets.
In this work, we introduce Converter, a Transformer that is converted into a DGNN form. We have evaluated Converter on the Long-Range Arena benchmark (Tay et al., 2021b), long document classification (Yu et al., 2022a), and DNA sequence-based taxonomy classification (Yu et al., 2022a). The experimental results demonstrate that Converter outperforms previous Transformer variants. This demonstrates that Converter is a lightweight, efficient, and powerful neural network. Our key contributions are summarized as follows:
• We propose Synvolution, a novel self-attention alternative with linearithmic time complexity.

• We apply the kernel polynomial method as an alternative to the multi-head operation and propose the kernel polynomial loss to simulate a dynamic kernel. We name Synvolution with the kernel polynomial method Kernelution.

• We propose Gated Feed-Forward Networks in place of the vanilla Feed-Forward Networks for complex-valued input and real-valued output.

• We apply PostNorm with ScaleNorm for both real-valued and complex-valued tensors.
2 Related Work
Self-Attention Alternatives. Central to self-attention is the scaled dot-product operation performed on a pair of query and key matrices, yielding an affinity matrix with a softmax function. Due to its high time complexity, various alternative methodologies have been proposed to approximate or replace scaled dot-product attention. Approximate approaches typically involve sparse (Child et al., 2019; Kitaev et al., 2020; Zaheer et al., 2020), low-rank (Choromanski et al., 2021), or sparse + low-rank (Chen et al., 2021a) techniques. For replacements, common approaches include convolution (Yu et al., 2022b), pooling (Yu et al., 2022c), and discrete Fourier transform (Lee-Thorp et al., 2022). Recently, a spatial construction method (Tay et al., 2021a; Yu et al., 2022a) has emerged, utilizing the inverse process of matrix decomposition to synthesize attention matrices. This approach has inspired us to propose a synthetic attention.
Multi-Head Attention. Multi-head attention may not be more efficient than single-head attention. Michel et al. (2019) discover that over 50% of attention heads can be pruned during the testing phase. Similarly, Voita et al. (2019) conclude that only a small subset of heads is critical for translation tasks. Furthermore, Cordonnier et al. (2020) identify redundant feature representations in multi-head self-attention. Additionally, Bhojanapalli et al. (2020) observe that a large number of heads causes a low-rank bottleneck, which restricts the representation capacity of Transformers. Based on these observations, we focus on single-head attention.
Digraph Fourier Transform. The Digraph Fourier Transform serves as the cornerstone for digraph convolution, based on the fundamental assumption that it requires a diagonalizable digraph shift operator (Isufi et al., 2024). There are two distinct categories of methods that aim to achieve the digraph Fourier transform. The first category involves replacing the eigendecomposition with an alternative matrix decomposition to build orthogonal or unitary bases (Sandryhaila and Moura, 2013a, b, 2014; Singh et al., 2016), while the second entails spatially constructing a normal digraph shift operator (Chung, 2005; Fanuel et al., 2017, 2018). We draw inspiration from the two categories of methods and propose a unique one that synergizes both approaches.
Structured Matrices. In random and linear projections, structured matrices aim to reduce time complexity. One major application in random projection is improving the Johnson–Lindenstrauss transform (Ailon and Chazelle, 2006; Dasgupta et al., 2010). Another prominent application of structured matrices is random features (Le et al., 2013; Yu et al., 2016). By replacing random entries with learnable parameters, linear projection with structured matrices has been developed in recent years. A well-known example is the Adaptive Fastfood transform (Yang et al., 2015), which extends the vanilla Fastfood transform (Le et al., 2013) by incorporating learnable diagonal matrices. Building on these developments, Moczulski et al. (2016) unify these structured matrices under the SELL matrix family and introduce ACDC and AFDF matrices. These approaches enlighten us to design structured unitary matrices with sub-quadratic time complexity.
3 Background and Preliminary

3.1 Self-Attention and Multi-Head Self-Attention
Given an input signal $\mathbf{X} \in \mathbb{R}^{N \times D}$, the self-attention (SA) (Vaswani et al., 2017) is defined as

$$\mathrm{SA}(\mathbf{X}) = \mathrm{Softmax}\!\left(\frac{\mathbf{X}\mathbf{W}_Q (\mathbf{X}\mathbf{W}_K)^{\mathrm{T}}}{\tau}\right) \mathbf{X}\mathbf{W}_V, \qquad (1)$$

where $\tau = \sqrt{D_h}$ is the temperature parameter, the softmax function is applied row-wise, $\mathrm{T}$ represents the transpose operation, and $\mathbf{X}\mathbf{W}_Q \in \mathbb{R}^{N \times D_h}$, $\mathbf{X}\mathbf{W}_K \in \mathbb{R}^{N \times D_h}$, and $\mathbf{X}\mathbf{W}_V \in \mathbb{R}^{N \times D_h}$ are referred to as the query, key, and value matrices, in which $\mathbf{W}_Q \in \mathbb{R}^{D \times D_h}$, $\mathbf{W}_K \in \mathbb{R}^{D \times D_h}$, and $\mathbf{W}_V \in \mathbb{R}^{D \times D_h}$ are learnable parameters. Conventionally, Multi-Head Self-Attention (MHSA) with $H$ heads is commonly chosen to enhance performance. It is defined as

$$\mathrm{MHSA}(\mathbf{X}) = \left(\big\Vert_{h=1}^{H} \mathrm{SA}(\mathbf{X})_h\right) \mathbf{W}_O, \qquad (2)$$

where $\Vert$ represents the concatenation operation, and $\mathbf{W}_O \in \mathbb{R}^{H D_h \times D}$ is a learnable parameter. Here, $D_h = D / H$.
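To make Equations (1) and (2) concrete, here is a minimal NumPy sketch. The shapes and the $\tau = \sqrt{D_h}$ temperature follow the definitions above; the random weights in the usage example are placeholders for the learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Eq. (1): Softmax(X W_Q (X W_K)^T / tau) X W_V, with tau = sqrt(D_h).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    tau = np.sqrt(Q.shape[-1])
    return softmax(Q @ K.T / tau) @ V

def multi_head_self_attention(X, heads, W_O):
    # Eq. (2): concatenate H single-head outputs along features, project by W_O.
    outs = [self_attention(X, *head) for head in heads]
    return np.concatenate(outs, axis=-1) @ W_O
```

Each row of the attention matrix sums to one, which is exactly the right-stochastic property exploited in Section 4.1.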
3.2 Digraph Signal Processing
In this work, we follow the definitions of digraph signal processing (DGSP) (Isufi et al., 2024; Sandryhaila and Moura, 2013a, b, 2014). A digraph is represented as $G = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ denotes the set of vertices with $|\mathcal{V}| = N$, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ represents the set of edges. A digraph adjacency matrix is denoted by $\mathbf{A} \in \mathbb{C}^{N \times N}$, where each element corresponds to an edge, and the modulus of the weight signifies the degree of the edge.
In DGSP, a digraph shift operator (DGSO) 𝐒 𝐺 is a matrix defining the manner in which a digraph signal transitions from one node to its neighboring nodes based on the underlying digraph topology. More precisely, a DGSO constitutes a local operator that substitutes the digraph signal value at each node with a linear combination of its neighboring nodes’ values. It is a fundamental assumption to employ a diagonalizable (normalized) digraph adjacency or Laplacian matrix as a DGSO to perform a digraph convolution. In this situation, every digraph signal can be represented as a linear combination of the eigenvectors of the DGSO.
A digraph filter $\mathcal{H}_\theta(\mathbf{S}_G) \in \mathbb{C}^{N \times N}$ is a function of the DGSO termed the digraph frequency response function, where eigenvalues are perceived as digraph frequencies. In DGSP, a widely adopted kind of digraph filter is based on polynomials; such a digraph filter is defined as $\mathcal{H}_\theta(\mathbf{S}_G) = \sum_{k=0}^{K} \theta_k \mathbf{S}_G^k$, where $\theta_k$ is the corresponding coefficient. This form of digraph filter is called the Finite Impulse Response (FIR) filter. A typical FIR filter is based on the Chebyshev polynomials of the first kind (Hammond et al., 2011). For input $x \in [-1, 1]$, through the three-term recurrence relation, the Chebyshev polynomials are obtained as $T_k(x) = 2x \cdot T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. Combining a polynomial-based filter, the procedure of the digraph convolution (DGConv) can be articulated as

$$\mathrm{DGConv}(\mathbf{S}_G, \mathbf{X}) = \mathbf{U}^{-1}\left[\mathcal{H}_\theta(\mathbf{\Lambda}) \odot (\mathbf{U}\mathbf{X})\right], \qquad (3)$$

where $\mathbf{U} \in \mathbb{C}^{N \times N}$ is the eigenvector matrix of $\mathbf{S}_G$, $\hat{\mathbf{X}} = \mathbf{U}\mathbf{X} \in \mathbb{C}^{N \times D}$ represents the digraph Fourier transform applied to the input signals, and $\mathbf{X} = \mathbf{U}^{-1}\hat{\mathbf{X}} \in \mathbb{C}^{N \times D}$ denotes the inverse transform.
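A small numerical sketch of Equation (3), with NumPy's dense eigendecomposition standing in for the digraph Fourier transform (illustration only; `h_theta` here is an arbitrary frequency-response function). With $h_\theta(\lambda) = \lambda$ the filter reduces to one application of the shift operator, which gives a quick sanity check.

```python
import numpy as np

def dgconv(S_G, X, h_theta):
    # Eq. (3): U^{-1} [ H_theta(Lambda) ⊙ (U X) ].
    # Following the text's notation, U is the analysis basis, so
    # S_G = V diag(lam) V^{-1} with V = U^{-1} holding the eigenvectors
    # of the DGSO in its columns. Assumes S_G is diagonalizable.
    lam, V = np.linalg.eig(S_G)
    X_hat = np.linalg.solve(V, X)            # digraph Fourier transform: U X
    Y_hat = h_theta(lam)[:, None] * X_hat    # element-wise spectral filtering
    return V @ Y_hat                         # inverse transform
```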
4 Proposed Method

Figure 1: Converter architecture.

4.1 Synvolution
Let $\mathbf{A} = \mathbf{X}\mathbf{W}_Q(\mathbf{X}\mathbf{W}_K)^{\mathrm{T}} / \tau \in \mathbb{R}^{n \times n}$ be an affinity matrix. The attention matrix $\mathcal{A} = \mathrm{softmax}(\mathbf{A})$ in Equation 1 can be reformulated as a right stochastic normalized affinity form $\mathrm{SA}(\mathbf{X}) = \tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}\mathbf{X}\mathbf{W}_V$, where $\tilde{\mathbf{A}} = \exp(\mathbf{A}) \in \mathbb{R}^{n \times n}$ is defined as the affinity matrix after the element-wise exponentiation operation $\exp(\cdot)$, and $\tilde{\mathbf{D}}_{u,u} = \sum_{v} \tilde{\mathbf{A}}_{u,v}$, with $\tilde{\mathbf{D}} \in \mathbb{R}^{n \times n}$, is the corresponding degree matrix. When treating $\tilde{\mathbf{A}}$ as a digraph adjacency matrix, we find that self-attention closely resembles digraph convolution. First, each element in either an attention matrix or a DGSO can be considered a similarity from a source entity to a target entity. Second, both self-attention and digraph convolution can degenerate to a graph convolution form. For self-attention, this occurs when the query matrix equals the key matrix in each head, resulting in unidirectional symmetric self-attention. Similarly, for digraph convolution, graph convolution can be achieved by symmetrizing the adjacency matrix of a digraph. Third, the softmax function in self-attention results in a row-wise normalized digraph adjacency form.
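The right-stochastic reformulation is easy to verify numerically: the row-wise softmax of an affinity matrix $\mathbf{A}$ coincides with $\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$, where $\tilde{\mathbf{A}} = \exp(\mathbf{A})$. A small sketch:

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax; the max-shift is for numerical stability only
    # and cancels in the ratio, so it does not change the result.
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def normalized_affinity(A):
    # D̃^{-1} Ã with Ã = exp(A) and D̃ the diagonal row-sum (degree) matrix.
    A_tilde = np.exp(A)
    D_tilde_inv = np.diag(1.0 / A_tilde.sum(axis=1))
    return D_tilde_inv @ A_tilde
```

The resulting matrix is right stochastic: every row sums to one, matching the digraph-adjacency reading of the attention matrix.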
Since digraph convolution closely resembles self-attention, we investigated replacing self-attention with digraph convolution. Under this hypothesis, a Transformer can be converted into a DGNN form. Based on this insight, we propose Converter. In this work, we decide to construct the DGSO directly. We develop a learnable unitary matrix as a DGSO through the inverse process of eigendecomposition. Our method consists of two phases. In the first phase, we synthesize the required eigenvalues through the following process.
$$e^{i\mathbf{\Lambda}} = \exp\left[i \cdot \mathrm{diag}\left(\mathrm{pool}_{\mathrm{avg}}\left[\mathrm{SIREN}(\mathbf{X})\right]\right)\right]. \qquad (4)$$
Here, SIREN represents a 2-layer MLP with the sine function (Sitzmann et al., 2020), $\mathrm{pool}_{\mathrm{avg}}(\cdot)$ is a 1D global average pooling, and $\mathrm{diag}(\cdot)$ is a diagonalization operation. We adopt the sine function because it demonstrates a remarkable ability in signal processing (Sitzmann et al., 2020).
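A minimal sketch of the eigenvalue-synthesis step in Equation (4). The SIREN weights below are random placeholders for the learned ones, and pooling over the feature axis is an assumption made for illustration; the essential property is that the synthesized eigenvalues land on the unit circle.

```python
import numpy as np

def siren(X, W1, b1, W2, b2, omega0=30.0):
    # 2-layer MLP with sine activations (Sitzmann et al., 2020).
    return np.sin(omega0 * (np.sin(omega0 * (X @ W1 + b1)) @ W2 + b2))

def synthesize_eigenvalues(X, W1, b1, W2, b2):
    # Eq. (4): e^{i Lambda} = exp(i · diag(pool_avg[SIREN(X)]));
    # the diagonal is returned as a length-N vector of unit-modulus eigenvalues.
    theta = siren(X, W1, b1, W2, b2).mean(axis=1)  # global average pooling
    return np.exp(1j * theta)
```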
In the second phase, we focus on constructing the necessary unitary eigenvector matrix through the inverse process of LQ factorization. Based on the Givens rotation method (Givens, 1958), an arbitrary square matrix $\mathbf{\Phi} \in \mathbb{C}^{N \times N}$ can be decomposed into a product of a lower triangular matrix and Givens rotation matrices. Hence, we have

$$\mathbf{\Phi} = \mathbf{L}\mathbf{Q} = \mathbf{L}\left(\prod_{j=N}^{2} \prod_{i=j-1}^{1} \mathbf{G}_{i,j}\right), \qquad (5)$$

where $\mathbf{L} \in \mathbb{C}^{N \times N}$ is a lower triangular matrix, and $\mathbf{G}_{i,j} \in \mathbb{C}^{N \times N}$ is a Givens rotation matrix that resembles an identity matrix with the exception of the elements

$$\begin{bmatrix} G_{ii} & G_{ij} \\ G_{ji} & G_{jj} \end{bmatrix} = \begin{bmatrix} \bar{c} & -s \\ \bar{s} & c \end{bmatrix} = \begin{bmatrix} e^{-i\left(\frac{\alpha+\beta}{2}\right)}\cos\left(\frac{\gamma}{2}\right) & -e^{i\left(\frac{\alpha-\beta}{2}\right)}\sin\left(\frac{\gamma}{2}\right) \\ e^{-i\left(\frac{\alpha-\beta}{2}\right)}\sin\left(\frac{\gamma}{2}\right) & e^{i\left(\frac{\alpha+\beta}{2}\right)}\cos\left(\frac{\gamma}{2}\right) \end{bmatrix}, \qquad (6)$$

which is characterized by parameters $\alpha, \beta, \gamma \in [0, 2\pi]$. This methodology necessitates $\frac{N(N-1)}{2}$ pairs of Givens rotation matrices, i.e., it requires $\mathcal{O}(N^2)$ space complexity. By reorganizing Givens rotation matrices, inserting permutation matrices, and repeating the pattern, we have
$$\mathbf{\Phi} = \mathbf{L}\left(\prod_{l=1}^{L}\left(\prod_{i=N-1}^{1}\mathbf{G}_{i,i+1}^{(l)}\right)\left(\prod_{j=1}^{N-1}\mathbf{G}_{j,j+1}^{(l)}\right)\mathbf{P}^{(l)}\right) = \mathbf{L}\left(\prod_{l=1}^{L}\mathbf{H}_{\mathrm{l}}^{(l)}\mathbf{H}_{\mathrm{u}}^{(l)}\mathbf{P}^{(l)}\right). \qquad (7)$$

Here, $\mathbf{H}_{\mathrm{l}}^{(l)} \in \mathbb{C}^{N \times N}$ is a lower unitary Hessenberg matrix, $\mathbf{H}_{\mathrm{u}}^{(l)} \in \mathbb{C}^{N \times N}$ is an upper unitary Hessenberg matrix, and $\mathbf{P}^{(l)} \in \mathbb{R}^{N \times N}$ is a permutation matrix that is either learnable (Mena et al., 2018), fixed (Dao et al., 2022), or even an identity matrix $\mathbf{I}_N \in \mathbb{R}^{N \times N}$.
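The building block of Equations (5)–(7) is the complex Givens rotation of Equation (6). Below is a short sketch that constructs one factor and checks unitarity; since unitary matrices are closed under multiplication, any product of such factors (together with a unitary diagonal) stays unitary, which is what the parametrization exploits.

```python
import numpy as np

def givens(N, i, j, alpha, beta, gamma):
    # Complex Givens rotation G_{i,j} of Eq. (6): an identity matrix
    # except for the 2x2 block at rows/columns (i, j).
    c = np.exp(1j * (alpha + beta) / 2) * np.cos(gamma / 2)
    s = np.exp(1j * (alpha - beta) / 2) * np.sin(gamma / 2)
    G = np.eye(N, dtype=complex)
    G[i, i], G[i, j] = np.conj(c), -s
    G[j, i], G[j, j] = np.conj(s), c
    return G
```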
We refer to Equation 7 as the order-$L$ LHHP parametrization, $L$-LHHP for short, denoted by $\mathbf{\Phi}_{L\text{-LHHP}}$. In particular, when the lower triangular matrix $\mathbf{L}$ degenerates to a diagonal matrix $\mathbf{D}$, we term this pattern the order-$L$ DHHP parametrization, $L$-DHHP for short, denoted by $\mathbf{\Phi}_{L\text{-DHHP}}$. It requires $2L(N-1)$ pairs of Givens rotation matrices, which means the space complexity is $\mathcal{O}(LN)$. We observe that each unitary factor matrix resulting from the multiplication of lower and upper unitary Hessenberg matrices in the order-$L$ DHHP parametrization is dense rather than sparse, unlike the schemes proposed in (Khalitov et al., 2022). Since our method is based on the Givens rotation method, we make Assumption 1. Under this assumption, we can establish the following propositions.
Assumption 1. For constructing an arbitrary $N \times N$ dense unitary matrix, at most $\lceil \frac{N}{4} \rceil$ orders are sufficient for $L$-DHHP.

Proposition 2. $L$-DHHP captures the discrete unitary transforms, including the discrete Fourier transform (DFT), the discrete Walsh–Hadamard transform (DWHT), the discrete cosine transform (DCT), the discrete sine transform (DST), and their inverses exactly.

Proposition 3. Given an input signal $\mathbf{x} \in \mathbb{C}^N$ and an output signal $\mathbf{y} \in \mathbb{C}^N$, the time complexity of $L$-DHHP as a discrete unitary transform with the fast implementation $\mathbf{y} = \mathbf{\Phi}\mathbf{x}$ is $\mathcal{O}(LN \log N)$.

Proposition 4. $L$-DHHP is full-rank if and only if the diagonal matrix $\mathbf{D}$ is unitary.
We refer to this self-attention alternative as Synvolution:

$$\mathrm{Synv}(\mathbf{X}\mathbf{W}_V) = \mathbf{\Phi}^{-1}\left[\exp(i\mathbf{\Lambda}) \odot (\mathbf{\Phi}\mathbf{X}\mathbf{W}_V)\right], \qquad (8)$$

where $\mathbf{X}\mathbf{W}_V \in \mathbb{C}^{N \times D}$ is denoted as the value matrix. Unlike FFT-based convolution (Mathieu et al., 2013), where the discrete unitary matrix is fixed and data-independent, the required parameters in $L$-DHHP are learnable and data-dependent. We adopt a similar processing method to that described in Equation 4 to obtain the synthetic eigenvector matrix. For convenience, we set $L = 1$ and $\mathbf{P}^{(1)} = \mathbf{I}_N$, and require $\mathbf{D}$ to be unitary, so as to obtain a dense unitary matrix that serves as the desired unitary eigenvector matrix. More details about the fast implementation of $1$-DHHP as a discrete unitary transform are provided in the appendix.
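As an illustration of Equation (8), the sketch below uses the unitary DFT as a fixed stand-in for $\mathbf{\Phi}$ (in Converter, $\mathbf{\Phi}$ is the learnable, data-dependent $1$-DHHP matrix). Because $\mathbf{\Phi}$ is unitary and the eigenvalues have unit modulus, Synvolution preserves the Frobenius norm of its input.

```python
import numpy as np

def synvolution(XV, theta):
    # Eq. (8) sketch: Phi^{-1} [ exp(i Lambda) ⊙ (Phi X W_V) ].
    # The unitary DFT along the token axis stands in for the learned
    # L-DHHP matrix Phi; theta holds the eigenvalue angles (diag of Lambda).
    N = XV.shape[0]
    X_hat = np.fft.fft(XV, axis=0) / np.sqrt(N)      # unitary forward transform
    Y_hat = np.exp(1j * theta)[:, None] * X_hat      # unit-modulus spectral filter
    return np.sqrt(N) * np.fft.ifft(Y_hat, axis=0)   # unitary inverse transform
```

With all angles set to zero the filter is the identity, and for any angles the output norm equals the input norm, mirroring the unitarity argument above.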
4.2 Kernelution

Figure 2: Illustration of the entire Kernelution process.

4.2.1 Chebyshev Polynomial Interpolation
The multi-head operation, a common approach to enhancing performance in Transformers, lacks solid theoretical support. In contrast, FIR filters have theoretical support in spectral graph theory (Chung, 1997). Let $f(x)$ be the target function; our goal is to approximate it with the smallest round-off error. Directly manipulating orthogonal polynomials to filter complex-valued signals is challenging, but using them to represent the argument function of signals is straightforward. To achieve this, we can choose an arbitrary orthogonal polynomial basis such as the Bernstein basis, the Jacobi basis (including the Chebyshev, Gegenbauer, Legendre, and Zernike bases), or even the monomial basis. Consider the Chebyshev basis as an example. Given an arbitrary continuous function $f(x) \in C([-1, 1])$ and a truncated Chebyshev polynomial $p$ of order $K$, the target function $f(x)$ can be approximated as
$$f(x) \approx p(x) = \frac{1}{2}\mu_0 + \sum_{k=1}^{K} \mu_k T_k(x), \qquad (9)$$

where $\mu_k \approx \frac{2}{K+1}\sum_{j=0}^{K} f(x_j) T_k(x_j)$ is the Chebyshev coefficient, and $x_j$ is the sampling Chebyshev node. This technique is termed Chebyshev polynomial interpolation (CPI) (Trefethen, 2019). Operating on Chebyshev polynomial interpolation is considerably straightforward since Chebyshev polynomials are isomorphic to Fourier series. For differentiable or analytic functions, we have the following theorems.
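A compact sketch of Equation (9): sample $f$ at the Chebyshev nodes, form the coefficients $\mu_k$, and evaluate the interpolant via $T_k(x) = \cos(k \arccos x)$. For a smooth function such as $e^x$, the error decays rapidly with $K$.

```python
import numpy as np

def chebyshev_interpolant(f, K):
    # Eq. (9): p(x) = mu_0/2 + sum_{k=1}^K mu_k T_k(x), with
    # mu_k ≈ 2/(K+1) sum_j f(x_j) T_k(x_j) at the Chebyshev nodes x_j.
    j = np.arange(K + 1)
    x_j = np.cos(np.pi * (j + 0.5) / (K + 1))   # Chebyshev nodes (first kind)
    k = np.arange(K + 1)
    T = np.cos(np.outer(k, np.arccos(x_j)))     # T_k(x_j), one row per order k
    mu = (2.0 / (K + 1)) * T @ f(x_j)

    def p(x):
        Tx = np.cos(np.outer(k, np.arccos(np.clip(x, -1.0, 1.0))))
        return mu[0] / 2 + mu[1:] @ Tx[1:]

    return p
```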
Theorem 5 (CPI for differentiable functions (Trefethen, 2019)). Let $\upsilon \geq 0$ and $\kappa > \upsilon$ be integers. Consider a function $f(x)$ whose derivatives up to order $\upsilon - 1$ are absolutely continuous on $[-1, 1]$, and suppose $\left\| \frac{\mathrm{d}^{\upsilon} f(x)}{\mathrm{d}x^{\upsilon}} \right\|_1 = \Upsilon$. For the $\kappa$-th degree Chebyshev interpolant $p(x)$, the following bounds hold: (1) $\|\mu_\kappa\| \leq \frac{2\Upsilon}{\pi(\kappa - \upsilon)^{\upsilon+1}}$. (2) $\|f(x) - p(x)\| \leq \frac{4\Upsilon}{\pi\upsilon(\kappa - \upsilon)^{\upsilon}}$.
Theorem 6 (CPI for analytic functions (Trefethen, 2019)). Let $\kappa \geq 1$ be an integer and $f(x)$ an analytic function on $[-1, 1]$ that extends analytically to the open Bernstein ellipse $E_\rho$ with $\|f(x)\| \leq M$ for some $M$. For the $\kappa$-th degree Chebyshev interpolant $p(x)$, the following bounds hold: (1) $\|\mu_0\| \leq M$. (2) $\|\mu_\kappa\| \leq 2M\rho^{-\kappa}$. (3) $\|f(x) - p(x)\| \leq \frac{4M\rho^{-\kappa}}{\rho - 1}$.
Both Theorem 5 and Theorem 6 tell us that we can utilize the Chebyshev polynomial filter to approximate any continuous target function in $C([-1, 1])$ with a small round-off error.
4.2.2 Kernel Polynomial Method
In reality, the target function may be discontinuous or singular within the polynomial interpolation interval. In this situation, the accuracy of the Chebyshev polynomial interpolation reduces to $\mathcal{O}(1)$ near discontinuities or singularities. Sufficiently far away from discontinuities or singularities, the convergence is slowed to $\mathcal{O}(K^{-1})$. During the approximation process, oscillations are present near discontinuities or singularities, and they do not diminish as $K \to \infty$. This type of oscillation is termed the Gibbs oscillation, and this situation is known as the Gibbs phenomenon (Hewitt and Hewitt, 1979).
To mitigate Gibbs oscillations, we apply a Gibbs damping factor $g_k$, represented as a function of $\frac{k}{K+1}$, to each term of the Chebyshev polynomials. For any $f(x)$, we have

$$f(x) \approx p_{\mathrm{KP}}(x) = \frac{1}{2}g_0\mu_0 + \sum_{k=1}^{K} g_k \mu_k T_k(x). \qquad (10)$$
This modification of the Chebyshev coefficients is equivalent to the convolution of $p(x)$ with a kernel $\mathcal{K}(x, x_0) = \frac{2}{\pi\sqrt{1-x^2}}\left(\frac{1}{2}g_0 + \sum_{k=1}^{K} g_k T_k(x) T_k(x_0)\right)$ such that $p_{\mathrm{KP}}(x) = \int_{-1}^{1} \mathcal{K}(x, x_0) f(x_0)\,\mathrm{d}x_0$. Thus, this method is also called the kernel polynomial method. It is widely employed in computational physics for calculating the density of states and other spectral properties of large quantum systems.
Gibbs damping factors are a family of coefficients that satisfy three conditions: (1) $g_k > 0$; (2) $g_0 = 1$; (3) $\lim_{K \to \infty} g_1 = 1$. Conditions (1) and (2) are particularly valuable in real-world applications (Weiße et al., 2006; Weiße and Fehske, 2008). The first condition ensures that approximations of positive quantities remain positive, while the second conserves the integral of the expanded function, $\int_{-1}^{1} p_{\mathrm{KP}}(x)\,\mathrm{d}x = \int_{-1}^{1} f(x)\,\mathrm{d}x$. Notably, $g_k = 1$ is the simplest Gibbs damping factor, attributed to the Dirichlet kernel. More details about Gibbs damping factors are in the appendix.
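As a concrete example of damping factors satisfying these conditions, the sketch below uses the Jackson kernel, a standard choice in the kernel polynomial literature (Weiße et al., 2006), and compares damped versus undamped Chebyshev expansions of the sign function: damping suppresses the Gibbs overshoot near the discontinuity.

```python
import numpy as np

def jackson_damping(K):
    # Jackson-kernel Gibbs damping factors g_0..g_K for K+1 Chebyshev
    # moments (Weiße et al., 2006); g_0 = 1 and g_k > 0 for all k <= K.
    M = K + 1
    k = np.arange(M)
    q = np.pi / (M + 1)
    return ((M - k + 1) * np.cos(q * k) + np.sin(q * k) / np.tan(q)) / (M + 1)

def damped_expansion(f, K, x, damped=True):
    # Eq. (10): p_KP(x) = g_0 mu_0 / 2 + sum_{k=1}^K g_k mu_k T_k(x).
    j = np.arange(K + 1)
    x_j = np.cos(np.pi * (j + 0.5) / (K + 1))   # Chebyshev nodes
    k = np.arange(K + 1)
    T = np.cos(np.outer(k, np.arccos(x_j)))
    mu = (2.0 / (K + 1)) * T @ f(x_j)
    g = jackson_damping(K) if damped else np.ones(K + 1)
    Tx = np.cos(np.outer(k, np.arccos(x)))
    return g[0] * mu[0] / 2 + (g[1:] * mu[1:]) @ Tx[1:]
```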
Clearly, finding an appropriate kernel is crucial for approximation, as it determines whether the round-off error is minimized. As indicated in (Weiße et al., 2006; Weiße and Fehske, 2008), kernel choices are data-dependent. More specifically, given a target function, we need to match an appropriate kernel and manually tune its hyperparameters (if the kernel has any) based on experience. Since the target function is unknown, we relax each $\mu_k$ into a learnable parameter $w_k$. The effectiveness of the Gibbs damping factors lies in their ability to reduce the weight of each term of the Chebyshev coefficients, thereby mitigating the contributions of higher-order terms. Based on this observation, and in order to prevent over-fitting, we propose the following loss function, which we name the kernel polynomial loss (KPL):
$$\mathcal{L}_{\mathrm{KP}} = \int_{-1}^{1} \left|\frac{\mathrm{d}f(x)}{\mathrm{d}x}\right|^2 \mathrm{d}x \approx \sum_{k=1}^{K} \pi k^2 |w_k|^2. \qquad (11)$$
This results in an intuitive penalty applied to the Chebyshev coefficients, with higher-order Chebyshev coefficients incurring greater penalties than lower ones. It causes Chebyshev polynomial interpolation with the kernel polynomial loss to simulate the kernel polynomial method with a learnable kernel. We apply the kernel polynomial method to Synvolution, which yields what we call Kernelution. The corresponding formula is defined as
$$\mathrm{Kern}(\mathbf{X}\mathbf{W}_V) = \mathbf{\Phi}^{-1}\left[\exp\left(i \cdot p_{\mathrm{KP}}(\mathbf{\Lambda})\right) \odot (\mathbf{\Phi}\mathbf{X}\mathbf{W}_V)\right]. \qquad (12)$$
It is worth noting that the kernel polynomial method is not the only operation compatible with Synvolution. Depending on practical requirements, Synvolution can also be made compatible with the multi-head operation, similar to other attention mechanisms. This means Synvolution can be equipped as a substitute for self-attention in Transformer-based models.
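The kernel polynomial loss of Equation (11) is a direct sum over the learnable coefficients $w_1, \dots, w_K$ ($w_0$ is left unpenalized, matching the $k \geq 1$ sum). A minimal sketch:

```python
import numpy as np

def kernel_polynomial_loss(w):
    # Eq. (11): L_KP ≈ sum_{k=1}^K pi * k^2 * |w_k|^2, where w = (w_0, ..., w_K)
    # holds the relaxed Chebyshev coefficients; higher orders pay a k^2 penalty.
    k = np.arange(1, len(w))
    return float(np.sum(np.pi * k**2 * np.abs(w[1:]) ** 2))
```

The quadratic-in-$k$ weighting is what pushes mass out of the high-order terms, imitating the damping role that a fixed kernel would otherwise play.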
4.3 Gated Feed-Forward Network and PostScaleNorm
Both Synvolution and Kernelution effectively represent direction and model the relationships between feature tokens in the spectral domain. A tricky problem is that the output of either Synvolution or Kernelution is complex-valued, whereas the labels are real-valued. This conflict motivates us to design a layer that maps a complex-valued tensor to a real-valued tensor. We propose a Gated Feed-Forward Network (GFFN) to solve this issue:
$$\mathrm{GFFN}(\mathbf{X}) = \left[\mathrm{softplus}\left(\Re(\mathbf{X})\mathbf{W}_{\Re}\right) \odot \tanh\left(\Im(\mathbf{X})\mathbf{W}_{\Im}\right)\right]\mathbf{W}_O, \qquad (13)$$

where $\mathbf{W}_{\Re} \in \mathbb{R}^{D \times D_{\mathrm{hid}}}$, $\mathbf{W}_{\Im} \in \mathbb{R}^{D \times D_{\mathrm{hid}}}$, and $\mathbf{W}_O \in \mathbb{R}^{D_{\mathrm{hid}} \times D}$ are trainable weight matrices. We let the real part learn the magnitude and the imaginary part learn the sign. Besides, we apply the PostNorm architecture (Wang et al., 2019) with ScaleNorm (Nguyen and Salazar, 2019) across the whole model, namely PostScaleNorm. Specifically, we apply $\mathrm{ScaleNorm}\left(\mathbf{Z} + \zeta \cdot \Re(\mathbf{Z}) + (1 - \zeta) \cdot \Im(\mathbf{Z})\right)$ for a complex-valued signal $\mathbf{Z}$, where $\zeta \in [0, 1]$ is a learnable parameter.
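A sketch of Equation (13), with random weights standing in for the trained ones. The softplus branch is strictly positive (magnitude) and the tanh branch lies in $(-1, 1)$ (sign), so the output is real-valued by construction.

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), computed stably.
    return np.logaddexp(0.0, x)

def gffn(Z, W_re, W_im, W_o):
    # Eq. (13): [softplus(Re(Z) W_re) ⊙ tanh(Im(Z) W_im)] W_o.
    # Maps a complex-valued input to a real-valued output.
    gate = softplus(Z.real @ W_re) * np.tanh(Z.imag @ W_im)
    return gate @ W_o
```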
5 Experiments
In this section, we test the scalability and performance of Converter in three different domains: (1) the Long-Range Arena benchmark, (2) long document classification, and (3) DNA sequence-based taxonomy classification. We conducted all experiments on an NVIDIA DGX-1 equipped with two 20-core Intel Xeon E5-2698 v4 CPUs @ 2.2 GHz, 512 GB of RAM, and 8 NVIDIA Tesla V100 GPUs, each with 16 GB of GPU memory. The code is implemented in PyTorch (Paszke et al., 2019). Following Neishi and Yoshinaga (2019), we adopt a 2-layer GRU for position embedding, denoted as RPE. We adopt the AdamW optimizer (Loshchilov and Hutter, 2019) and apply cross-validation to report the best hyperparameters. We apply the following loss function as the metric to evaluate our model.
$$\mathcal{L} = (1 - \eta) \cdot \mathcal{L}_{\mathrm{CE}} + \eta \cdot \mathcal{L}_{\mathrm{KP}}. \qquad (14)$$

Here, $\eta \in [0, 1)$ is a tunable hyperparameter that needs to be selected manually, and $\mathcal{L}_{\mathrm{CE}}$ denotes the cross-entropy loss.
5.1 Long-Range Arena Benchmark

Table 1: Accuracy results (%) on the Long-Range Arena benchmark. The best result is in bold and the second best is underlined.

| Model | ListOps ↑ | Text ↑ | Retrieval ↑ | Image ↑ | Pathfinder ↑ | Avg. ↑ |
|---|---|---|---|---|---|---|
| Vanilla Trans. (Vaswani et al., 2017) | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | 54.39 |
| Sparse Trans. (Child et al., 2019) | 17.07 | 63.58 | 59.59 | 44.24 | 71.71 | 51.24 |
| Reformer (Kitaev et al., 2020) | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | 50.67 |
| Longformer (Beltagy et al., 2020) | 35.63 | 62.85 | 56.89 | 42.22 | 69.71 | 53.46 |
| Linformer (Wang et al., 2020) | 35.70 | 53.94 | 52.27 | 38.56 | 76.34 | 51.36 |
| BigBird (Zaheer et al., 2020) | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | 55.01 |
| Linear Trans. (Katharopoulos et al., 2020) | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | 50.55 |
| Sinkhorn Trans. (Tay et al., 2020) | 33.67 | 61.20 | 53.83 | 41.23 | 67.45 | 51.29 |
| Performer (Choromanski et al., 2021) | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | 51.41 |
| Synthesizer (Tay et al., 2021a) | 36.99 | 61.68 | 54.67 | 41.61 | 69.45 | 52.88 |
| Nyströmformer (Xiong et al., 2021) | 37.15 | 65.52 | <u>79.56</u> | 41.58 | 70.94 | 58.95 |
| Luna-256 (Ma et al., 2021) | 37.98 | 65.78 | <u>79.56</u> | <u>47.86</u> | 78.55 | <u>61.95</u> |
| FNet (Lee-Thorp et al., 2022) | 35.33 | 65.11 | 59.61 | 38.67 | 77.80 | 55.30 |
| cosFormer (Qin et al., 2022) | 37.90 | 63.41 | 61.36 | 43.17 | 70.33 | 55.23 |
| Paramixer (Chord) (Yu et al., 2022a) | <u>39.71</u> | <u>78.87</u> | 78.73 | 44.68 | <u>79.16</u> | 58.91 |
| Converter (ours) | **60.38** | **86.44** | **83.41** | **61.02** | **88.43** | **75.94** |
The Long-Range Arena (LRA) (Tay et al., 2021b) is a public benchmark established with the aim of evaluating the ability of efficient Transformers to model long-sequence data. This benchmark contains five multi-class classification tasks from distinct domains, including ListOps (Nangia and Bowman, 2018), Text (Maas et al., 2011), Retrieval (Radev et al., 2009), Image (Krizhevsky, 2009), and Pathfinder (Linsley et al., 2018; Kim et al., 2020). ListOps consists of digits, operators such as MAX, MEAN, MEDIAN, and SUM_MOD, and brackets. Each operator in a sequence processes the items in a list and outputs a digit. Text consists of sequences represented at the byte/character-level, which significantly increases its difficulty. In this task, models must classify each review as positive or negative, making it a binary classification task. Retrieval is similar to the Text task with a byte/character-level setting. Image is the CIFAR-10 task (Krizhevsky, 2009) for image classification. The input data consists of sequences of pixels derived from flattening 32 × 32 images into a 1D array with the length of 1024. Pathfinder is motivated by cognitive psychology (Houtkamp and Roelfsema, 2010). In this task, a synthetic image measures 32 × 32 pixels and features two highlighted endpoints depicted as circles, connected by a dashed path. Each image contains distractor paths, adding complexity. The models must determine whether a dashed path connects the two highlighted endpoints. As in the Image task, the input has to be converted into a sequence with length 1024.
In the interest of ensuring a fair comparison, we follow the experiment settings outlined in (Tay et al., 2021b) and evaluate Converter on the aforementioned tasks. For baselines, we include the vanilla Transformer (Vaswani et al., 2017) and 14 Transformer variants: Sparse Transformer (Child et al., 2019), Reformer (Kitaev et al., 2020), Longformer (Beltagy et al., 2020), Linformer (Wang et al., 2020), BigBird (Zaheer et al., 2020), Linear Transformer (Katharopoulos et al., 2020), Sinkhorn Transformer (Tay et al., 2020), Performer (Choromanski et al., 2021), Synthesizer (Tay et al., 2021a), Nyströmformer (Xiong et al., 2021), Luna (Ma et al., 2021), FNet (Lee-Thorp et al., 2022), cosFormer (Qin et al., 2022), and Paramixer (Yu et al., 2022a).
As shown in Table 1, Converter consistently surpasses all baseline models across all five tasks with the best classification accuracy: ListOps (60.38%), Text (86.44%), Retrieval (83.41%), Image (61.02%), and Pathfinder (88.43%). Converter attains an average accuracy of 75.94%, substantially outperforming the second-best model, Luna-256 (61.95%), by a margin of nearly 14 percentage points. Notably, on the challenging ListOps task, Converter (60.38%) surpasses the second-best performer, Paramixer (39.71%), by more than 20 percentage points, demonstrating its superior capability in handling structured sequential data. Furthermore, on the Image classification task, Converter (61.02%) significantly outperforms the runner-up, Luna-256 (47.86%), showcasing its exceptional ability in visual feature extraction. These experimental results confirm the comprehensive advantages of Converter in processing long-sequence tasks.
5.2 Long Document Classification
This task aims to evaluate the capability of Converter in modeling complex long-term dependencies for NLP tasks. We utilized a publicly available dataset collected from arXiv (Liu et al., 2018). Following Yu et al. (2022a), we selected four document categories: cs.AI, cs.NE, math.AC, and math.GR, yielding a dataset of 11956 documents. The documents were encoded at the character level, and all comparative models employed zero padding to achieve uniform length. The dataset was partitioned into 60% training, 20% validation, and 20% test sets. Similar to the LRA benchmark, we created two standardized versions of the dataset through truncation: LongDoc16K with sequences of 16384 tokens and LongDoc32K with sequences of 32768 tokens. We then evaluated Converter and various Transformer-based architectures on these datasets.
Table 2: Accuracy results (%) on long document classification. The best result is in bold.

| Model | LongDoc16K ↑ | LongDoc32K ↑ |
| --- | --- | --- |
| Vanilla Transformer | 68.39 | 70.84 |
| Linformer (Layerwise) | 49.03 | 45.00 |
| Performer | 62.26 | 65.65 |
| Synthesizer (Dense) | 76.05 | 77.02 |
| Nyströmformer | 67.26 | 69.27 |
| FNet | 44.92 | 46.94 |
| cosFormer | 64.44 | 67.26 |
| Paramixer (Chord) | 79.60 | 74.76 |
| Converter | **81.77** | **82.34** |
Table 2 shows that Converter surpasses all baseline models. Notably, the synthetic-attention Transformer variants (Synthesizer, Paramixer, and Converter) outperform both the self-attention-approximation variants (Linformer, Performer, and Nyströmformer) and the vanilla Transformer. Meanwhile, FNet, which replaces attention with a parameter-free Fourier token-mixing, performs poorly on long document classification. This suggests that high-rank or full-rank attention plays a crucial role in modeling long-term dependencies.
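The rank argument can be made concrete with a small numerical check (illustrative only; the dimensions are arbitrary and not taken from the paper): a Linformer-style product of an n×k and a k×n factor can never exceed rank k, whereas a unitary mixing matrix, like a unitary digraph shift operator, always has full rank n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 8

# Low-rank mixing: an n x k projection followed by a k x n expansion.
low_rank_mix = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))

# Full-rank mixing: a random orthogonal (real unitary) matrix via QR.
full_rank_mix, _ = np.linalg.qr(rng.standard_normal((n, n)))

print(np.linalg.matrix_rank(low_rank_mix))   # 8
print(np.linalg.matrix_rank(full_rank_mix))  # 64
```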
5.3 DNA Sequence-based Taxonomy Classification
We evaluated Converter against other models using biological data. We obtained cDNA sequences and their taxonomic labels from Ensembl and designed two binary classification tasks. As in the long document classification task, each dataset was truncated to a fixed length of 16384 and partitioned into 60% training, 20% validation, and 20% test sets. The first dataset, Ensembl (B/S), focuses on vertebrate organisms, comparing sequences from the genera Bos and Sus. While the classes are nearly balanced (51973 Bos and 50027 Sus sequences), this dataset is particularly challenging due to extreme variations in sequence length (ranging from 63 to 447010 bases). The second dataset, Ensembl (M/R), represents our most computationally intensive genus-level classification task. It compares genes from Mus and Rattus and features significant class imbalance with a nearly 2:1 ratio (275636 Mus and 133310 Rattus sequences). Sequence lengths vary substantially, spanning from 32 to 261093 bases.
Table 3: Accuracy results (%) on DNA sequence-based taxonomy classification. The best result is in bold.

| Model | Ensembl (B/S) ↑ | Ensembl (M/R) ↑ |
| --- | --- | --- |
| Vanilla Transformer | 66.71 | 58.50 |
| Linformer (Layerwise) | 62.76 | 51.63 |
| Performer | 63.18 | 55.16 |
| Synthesizer (Dense) | 66.07 | 56.34 |
| Nyströmformer | 66.15 | 58.51 |
| FNet | 65.70 | 56.30 |
| cosFormer | 65.73 | 56.30 |
| Paramixer (Chord) | 66.77 | 56.37 |
| Converter | **84.59** | **59.49** |
As shown in Table 3, our model achieves strong performance. On Ensembl (B/S), Converter is 17.82 percentage points more accurate than the runner-up Paramixer. On Ensembl (M/R), it exceeds Nyströmformer by 0.98 percentage points. Moreover, our model surpasses the vanilla Transformer on both tasks.
5.4 Ablation Studies
We study the influence of the different mechanisms used in Converter by ablating the corresponding components. Table 4 records the results of Converter equipped with distinct components on the Long-Range Arena benchmark. NoPE denotes no position embedding, APE a learnable absolute position embedding (Gehring et al., 2017), and SPE the sinusoidal position embedding (Vaswani et al., 2017). The ablation studies highlight the importance of RPE (Neishi and Yoshinaga, 2019): Converter achieves the highest performance with RPE, followed by APE and then SPE, while the NoPE variant yields the lowest results. This finding contradicts recent work on NoPE (Haviv et al., 2022; Chi et al., 2023; Kazemnejad et al., 2023). Both versions of Converter with Kernelution (with and without the kernel polynomial loss) outperform Converter with Synvolution on all tasks.
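For reference, the two non-relative position embedding baselines in the ablation can be sketched as follows. This is a pure-Python illustration under our own conventions: SPE follows the sinusoidal formula of Vaswani et al. (2017), while APE is simply a position-indexed table of free parameters, shown here as a randomly initialized matrix.

```python
import math
import random

def sinusoidal_pe(n: int, d: int) -> list:
    """SPE: fixed sin/cos features, one row per position (Vaswani et al., 2017)."""
    pe = [[0.0] * d for _ in range(n)]
    for pos in range(n):
        for i in range(0, d, 2):
            angle = pos / (10000 ** (i / d))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def learnable_ape(n: int, d: int, seed: int = 0) -> list:
    """APE: a trainable lookup table; here just a random init for illustration."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(n)]
```

SPE is fixed at initialization, whereas the APE table would be updated by the optimizer during training.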
Table 4: Ablation studies of Converter on the LRA benchmark. The best result is in bold.

| Method | ListOps ↑ | Text ↑ | Retrieval ↑ | Image ↑ | Pathfinder ↑ |
| --- | --- | --- | --- | --- | --- |
| Converter | **60.38** | **86.44** | **83.41** | **61.02** | **88.43** |
| w/ NoPE | 36.44 | 62.31 | 66.85 | 41.01 | 77.48 |
| w/ SPE | 37.45 | 71.15 | 79.52 | 46.89 | 80.55 |
| w/ APE | 39.80 | 79.20 | 79.82 | 48.38 | 80.89 |
| w/ Synv. | 57.96 | 82.89 | 82.39 | 58.23 | 87.29 |
| w/o KPL | 59.32 | 83.60 | 83.11 | 59.75 | 88.18 |

6 Conclusion and Future Work
In this work, we introduce Converter, a Transformer variant that replaces self-attention with a synthetic unitary digraph convolution called Synvolution. By reversing eigendecomposition and LQ factorization, we synthesize a unitary digraph shift operator with learnable eigenvalues and eigenvectors. Our fast 1-DHHP implementation achieves linearithmic time complexity while preserving the full-rank property of Synvolution. We further enhance Synvolution by incorporating the kernel polynomial method, resulting in Kernelution, and propose a kernel polynomial loss to enable dynamic kernel adaptation during training.
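In summary form (our notation, paraphrasing the construction rather than reproducing the paper's equations), the operator at the heart of Converter is

```latex
% Synthetic unitary digraph shift operator (sketch, our notation):
% U is the unitary eigenvector matrix realized by the fast 1-DHHP,
% \lambda the learnable eigenvalues, X the input sequence features.
S = U \operatorname{diag}(\lambda)\, U^{*}, \qquad U^{*} U = I,
\qquad \operatorname{Synv}(X) = S X .
% Kernelution replaces \lambda with a damped Chebyshev expansion
% \sum_{k=0}^{K} g_k \mu_k T_k(\lambda) (cf. Algorithm 3).
```

Because $U$ is unitary, $S$ is full rank by construction, which is the property the ablations and long-document experiments above rely on.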
We evaluate Converter on the Long-Range Arena benchmark, long document classification, and DNA sequence-based taxonomy classification. The experimental results demonstrate the strong performance of Converter. This work takes a solid step toward applying digraph convolution to large datasets within the Transformer architecture. Future work will focus on developing the decoder component for cross-attention. We believe our synthetic unitary digraph convolution will inspire further research into the relationships among self-attention, digraph convolution, and convolution, opening new possibilities for designing more powerful and efficient neural network architectures that combine the strengths of these approaches.
References

Vaswani et al. [2017] Attention Is All You Need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, 2017.
Devlin et al. [2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, pages 4171–4186, 2019.
Dosovitskiy et al. [2021] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
Nambiar et al. [2020] Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks. In ACM-BCB, 2020.
Tsai et al. [2019] Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel. In EMNLP-IJCNLP, pages 4344–4353, 2019.
Choromanski et al. [2021] Rethinking Attention with Performers. In ICLR, 2021.
Qin et al. [2022] cosFormer: Rethinking Softmax In Attention. In ICLR, 2022.
Dong et al. [2021] Attention is not all you need: pure attention loses rank doubly exponentially with depth. In ICML, pages 2793–2803, 2021.
Yang et al. [2018] Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR, 2018.
Veličković et al. [2024] softmax is not enough (for sharp out-of-distribution). arXiv:2410.01104, 2024.
Tay et al. [2021a] Synthesizer: Rethinking Self-Attention for Transformer Models. In ICML, pages 10183–10192, 2021.
Lee-Thorp et al. [2022] FNet: Mixing Tokens with Fourier Transforms. In NAACL-HLT, pages 4296–4313, 2022.
Yu et al. [2022a] Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention. In CVPR, pages 681–690, 2022.
Zaheer et al. [2020] Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems 33, pages 17283–17297, 2020.
Silver and Röder [1994] Densities of States of Mega-Dimensional Hamiltonian Matrices. International Journal of Modern Physics C, 5(4):735–753, 1994.
Wang [1994] Calculating the density of states and optical-absorption spectra of large quantum systems by the plane-wave moments method. Physical Review B, 49:10154–10158, 1994.
Wang and Zunger [1994] Dielectric Constants of Silicon Quantum Dots. Physical Review Letters, 73:1039–1042, 1994.
Vijay et al. [2004] Scattering and Bound States: A Lorentzian Function-Based Spectral Filter Approach. The Journal of Physical Chemistry A, 108(41):8987–9003, 2004.
Weiße et al. [2006] The kernel polynomial method. Reviews of Modern Physics, 78:275–306, 2006.
Weiße and Fehske [2008] Chebyshev Expansion Techniques. Springer Berlin Heidelberg, pages 545–577, 2008.
Tay et al. [2021b] Long Range Arena: A Benchmark for Efficient Transformers. In ICLR, 2021.
Child et al. [2019] Generating Long Sequences with Sparse Transformers. arXiv:1904.10509, 2019.
Kitaev et al. [2020] Reformer: The Efficient Transformer. In ICLR, 2020.
Chen et al. [2021a] Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems 34, pages 17413–17426, 2021.
Yu et al. [2022b] MetaFormer Baselines for Vision. arXiv:2210.13452, 2022.
Yu et al. [2022c] MetaFormer Is Actually What You Need for Vision. In CVPR, pages 10819–10829, 2022.
Michel et al. [2019] Are Sixteen Heads Really Better than One? In Advances in Neural Information Processing Systems 32, pages 14014–14024, 2019.
Voita et al. [2019] Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In ACL, pages 5797–5808, 2019.
Cordonnier et al. [2020] Multi-Head Attention: Collaborate Instead of Concatenate. arXiv:2006.16362, 2020.
Bhojanapalli et al. [2020] Low-Rank Bottleneck in Multi-head Attention Models. In ICML, pages 864–873, 2020.
Isufi et al. [2024] Graph filters for signal processing and machine learning on graphs. IEEE Transactions on Signal Processing, pages 1–32, 2024.
Sandryhaila and Moura [2013a] Discrete Signal Processing on Graphs. IEEE Transactions on Signal Processing, 61(7):1644–1656, 2013.
Sandryhaila and Moura [2013b] Discrete signal processing on graphs: Graph Fourier transform. In ICASSP, pages 6167–6170, 2013.
Sandryhaila and Moura [2014] Discrete Signal Processing on Graphs: Frequency Analysis. IEEE Transactions on Signal Processing, 62(12):3042–3054, 2014.
Singh et al. [2016] Graph Fourier transform based on directed Laplacian. In SPCOM, pages 1–5, 2016.
Chung [2005] Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics, 9(1):1–19, 2005.
Fanuel et al. [2017] Magnetic eigenmaps for community detection in directed networks. Physical Review E, 95:022302, 2017.
Fanuel et al. [2018] Magnetic Eigenmaps for the visualization of directed networks. Applied and Computational Harmonic Analysis, 44(1):189–199, 2018.
Ailon and Chazelle [2006] Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, pages 557–563, 2006.
Dasgupta et al. [2010] A sparse Johnson-Lindenstrauss transform. In STOC, pages 341–350, 2010.
Le et al. [2013] Fastfood — Approximating Kernel Expansions in Loglinear Time. In ICML, pages 244–252, 2013.
Yu et al. [2016] Orthogonal Random Features. In Advances in Neural Information Processing Systems 29, pages 1975–1983, 2016.
Yang et al. [2015] Deep Fried Convnets. In ICCV, 2015.
Moczulski et al. [2016] ACDC: A Structured Efficient Linear Layer. In ICLR, 2016.
Hammond et al. [2011] Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
Sitzmann et al. [2020] Implicit Neural Representations with Periodic Activation Functions. In Advances in Neural Information Processing Systems 33, pages 7462–7473, 2020.
Givens [1958] Computation of Plain Unitary Rotations Transforming a General Matrix to Triangular Form. Journal of the Society for Industrial and Applied Mathematics, 6(1):26–50, 1958.
Mena et al. [2018] Learning Latent Permutations with Gumbel-Sinkhorn Networks. In ICLR, 2018.
Dao et al. [2022] Monarch: Expressive Structured Matrices for Efficient and Accurate Training. In ICML, pages 4690–4721, 2022.
Khalitov et al. [2022] Sparse factorization of square matrices with application to neural attention modeling. Neural Networks, 152:160–168, 2022.
Mathieu et al. [2013] Fast Training of Convolutional Networks through FFTs. In ICLR, 2013.
Chung [1997] Spectral Graph Theory. American Mathematical Society, 1997.
Trefethen [2019] Approximation Theory and Approximation Practice, Extended Edition. SIAM, 2019.
Hewitt and Hewitt [1979] The Gibbs-Wilbraham phenomenon: An episode in Fourier analysis. Archive for History of Exact Sciences, 21(2):129–160, 1979.
Wang et al. [2019] Learning Deep Transformer Models for Machine Translation. In ACL, pages 1810–1822, 2019.
Nguyen and Salazar [2019] Transformers without Tears: Improving the Normalization of Self-Attention. In IWSLT, 2019.
Paszke et al. [2019] PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
Neishi and Yoshinaga [2019] On the Relation between Position Information and Sentence Length in Neural Machine Translation. In CoNLL, pages 328–338, 2019.
Loshchilov and Hutter [2019] Decoupled Weight Decay Regularization. In ICLR, 2019.
Beltagy et al. [2020] Longformer: The Long-Document Transformer. arXiv:2004.05150, 2020.
Wang et al. [2020] Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768, 2020.
Katharopoulos et al. [2020] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In ICML, pages 5156–5165, 2020.
Tay et al. [2020] Sparse Sinkhorn Attention. In ICML, pages 9438–9447, 2020.
Xiong et al. [2021] Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16):14138–14148, 2021.
Ma et al. [2021] Luna: Linear Unified Nested Attention. In Advances in Neural Information Processing Systems, 2021.
Nangia and Bowman [2018] ListOps: A Diagnostic Dataset for Latent Tree Learning. In NAACL Student Research Workshop, pages 92–99, 2018.
Maas et al. [2011] Learning Word Vectors for Sentiment Analysis. In ACL-HLT, pages 142–150, 2011.
Radev et al. [2009] The ACL Anthology Network Corpus. In NLPIR4DL, pages 54–61, 2009.
Krizhevsky [2009] Learning Multiple Layers of Features from Tiny Images. Technical report, 2009.
Linsley et al. [2018] Learning long-range spatial dependencies with horizontal gated recurrent units. In Advances in Neural Information Processing Systems 31, 2018.
Kim et al. [2020] Disentangling neural mechanisms for perceptual grouping. In ICLR, 2020.
Houtkamp and Roelfsema [2010] Parallel and serial grouping of image elements in visual perception. Journal of Experimental Psychology: Human Perception and Performance, 36(6):1443–1459, 2010.
Liu et al. [2018] Long length document classification by local convolutional feature aggregation. Algorithms, 11(8), 2018.
Gehring et al. [2017] Convolutional Sequence to Sequence Learning. In ICML, pages 1243–1252, 2017.
Haviv et al. [2022] Transformer Language Models without Positional Encodings Still Learn Positional Information. In Findings of EMNLP, pages 1382–1390, 2022.
Chi et al. [2023] Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings. In ACL, pages 1183–1193, 2023.
Kazemnejad et al. [2023] The Impact of Positional Encoding on Length Generalization in Transformers. In Advances in Neural Information Processing Systems 36, pages 24892–24928, 2023.
Cao [2021] Choose a Transformer: Fourier or Galerkin. In Advances in Neural Information Processing Systems, 2021.
Shen et al. [2021] Efficient Attention: Attention With Linear Complexities. In WACV, pages 3531–3539, 2021.
Zhang et al. [2021] Sparse Attention with Linear Units. In EMNLP, pages 6507–6520, 2021.
Koohpayegani and Pirsiavash [2024] SimA: Simple Softmax-Free Attention for Vision Transformers. In WACV, pages 2607–2617, 2024.
Chen et al. [2021b] Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method. In Advances in Neural Information Processing Systems 34, pages 2122–2135, 2021.
Nickolls et al. [2008] Scalable Parallel Programming with CUDA. Queue, 6(2):40–53, 2008.
Fejér [1904] Untersuchungen über Fouriersche Reihen. Mathematische Annalen, 58:51–69, 1904.
Lanczos [1966] Discourse on Fourier series. Oliver and Boyd, 1966.
Vekić and White [1993] Smooth boundary conditions for quantum lattice systems. Physical Review Letters, 71:4283–4286, 1993.

Appendix A Pseudocode for Synvolution and Kernelution

Algorithm 1 PyTorch-like pseudocode for Parallel Scan.
```python
# b: batch size, n: length, d: feature dimension
class PScan(torch.autograd.Function):
    @staticmethod
    def expand_(A, X):
        # In-place up-sweep: combine adjacent pairs, recurse on the halves.
        if A.size(1) == 1:
            return
        T = 2 * (A.size(1) // 2)
        Aa = A[:, :T].view(A.size(0), T // 2, 2, -1)
        Xa = X[:, :T].view(X.size(0), T // 2, 2, -1)
        Xa[:, :, 1].add_(Aa[:, :, 1] * Xa[:, :, 0])
        Aa[:, :, 1].mul_(Aa[:, :, 0])
        PScan.expand_(Aa[:, :, 1], Xa[:, :, 1])
        Xa[:, 1:, 0].add_(Aa[:, 1:, 0] * Xa[:, :-1, 1])
        Aa[:, 1:, 0].mul_(Aa[:, :-1, 1])
        if T < A.size(1):
            X[:, -1].add_(A[:, -1] * X[:, -2])
            A[:, -1].mul_(A[:, -2])

    @staticmethod
    def acc_rev_(A, X):
        # Reverse accumulation for the backward pass (conjugated coefficients).
        if X.size(1) == 1:
            return
        T = 2 * (X.size(1) // 2)
        Aa = A[:, -T:].view(A.size(0), T // 2, 2, -1)
        Xa = X[:, -T:].view(X.size(0), T // 2, 2, -1)
        Xa[:, :, 0].add_(Aa[:, :, 1].conj() * Xa[:, :, 1])
        B = Aa[:, :, 0].clone()
        B[:, 1:].mul_(Aa[:, :-1, 1].conj())
        PScan.acc_rev_(B, Xa[:, :, 0])
        Xa[:, :-1, 1].add_(Aa[:, 1:, 0].conj() * Xa[:, 1:, 0])
        if T < A.size(1):
            X[:, 0].add_(A[:, 1].conj() * X[:, 1])

    @staticmethod
    def forward(ctx, A, X, Y_init):
        ctx.A = A[:, :, None].clone()
        ctx.Y_init = Y_init[:, None, :].clone()
        ctx.A_star = ctx.A.clone()
        ctx.X_star = X.clone()
        PScan.expand_(ctx.A_star, ctx.X_star)
        return ctx.A_star * ctx.Y_init + ctx.X_star

    @staticmethod
    def backward(ctx, grad_output):
        U = grad_output * ctx.A_star.conj()
        A = ctx.A.clone()
        R = grad_output.clone()
        PScan.acc_rev_(A, R)
        Q = ctx.Y_init.expand_as(ctx.X_star).clone()
        Q[:, 1:].mul_(ctx.A_star[:, :-1].conj()).add_(ctx.X_star[:, :-1])
        grad_A = (Q.conj() * R).sum(-1)
        return grad_A, R, U.sum(dim=1)

pscan = PScan.apply
```
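Algorithm 1 appears to parallelize the first-order linear recurrence y_t = a_t · y_{t-1} + x_t over the length dimension. A sequential pure-Python reference (our own, useful for sanity-checking a scan implementation; not part of the paper's code) is:

```python
def sequential_scan(a, x, y_init):
    """Reference for y_t = a_t * y_{t-1} + x_t, with y_{-1} = y_init."""
    ys, y = [], y_init
    for a_t, x_t in zip(a, x):
        y = a_t * y + x_t
        ys.append(y)
    return ys
```

A parallel scan such as `pscan` should match this output element-wise while running in O(log n) depth instead of O(n).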
Algorithm 2 presents the PyTorch-like pseudocode for 1-DHHP as a discrete unitary transform and its inverse transform, while Algorithm 3 demonstrates the PyTorch-like pseudocode for Synvolution and Kernelution.
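The kernel polynomial step inside Algorithm 3 evaluates a damped Chebyshev series Σ_k g_k μ_k T_k(x) via the three-term recurrence T_{k+1}(x) = 2x T_k(x) − T_{k-1}(x). A scalar pure-Python sketch (our own illustration; coefficient names mirror Algorithm 3, and as there the k = 0 term is taken undamped):

```python
def chebyshev_series(x: float, g: list, mu: list) -> float:
    """Evaluate sum_k g[k] * mu[k] * T_k(x) with the Chebyshev recurrence."""
    t_prev, t_curr = 1.0, x   # T_0(x) = 1, T_1(x) = x
    total = mu[0]             # k = 0 term, undamped as in Algorithm 3
    if len(mu) > 1:
        total += g[1] * mu[1] * t_curr
    for k in range(2, len(mu)):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
        total += g[k] * mu[k] * t_curr
    return total
```

The damping factors g_k (e.g. Jackson or Lanczos coefficients) suppress the Gibbs oscillations of a truncated expansion.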
Algorithm 2 PyTorch-like pseudocode for 1 -DHHP as a discrete unitary transform and its inverse transform.
```python
# b: batch size, n: length, d: feature dimension
# m: permutation mapping dimension

def dhhp_trans(transform, x,
               g_l_ii, g_l_ij, g_l_ji, g_l_jj,
               g_u_ii, g_u_ij, g_u_ji, g_u_jj, diag):
    if transform is True:
        # Forward transform: apply the permutation reshape first.
        x = x.reshape(b, n // m, m, d).transpose(1, 2).reshape(b, n, d)
    else:
        # Inverse transform: apply the diagonal phase first.
        x = torch.einsum('bn,bnd->bnd', diag, x)
    # Upper factor, computed via a flipped parallel scan.
    x_, y = torch.zeros_like(x), torch.zeros_like(x)
    x_[:, :-1, :] = g_u_ii.unsqueeze(-1) * x[:, :-1, :]
    x_ = x_.flip(1)
    g_u_ij = g_u_ij.flip(1)
    g_u_ij = torch.cat([g_u_ij, torch.zeros_like(g_u_ij[:, :1])], dim=1)
    p_u = torch.zeros_like(g_u_ij)
    p_u[:, 1:] = g_u_ij[:, :-1].clone()
    h_u_init = x_[:, 0, :].clone()
    h_u = pscan(p_u, x_, h_u_init)
    h_u = h_u.flip(1)
    y[:, 1:, :] = g_u_ji.unsqueeze(-1) * x[:, :-1, :] + g_u_jj.unsqueeze(-1) * h_u[:, 1:, :]
    y[:, 0, :] = h_u[:, 0, :]
    # Lower factor, computed via a forward parallel scan.
    y_, z = torch.zeros_like(y), torch.zeros_like(y)
    y_[:, 1:, :] = g_l_jj.unsqueeze(-1) * y[:, 1:, :]
    g_l_ji = torch.cat([g_l_ji, torch.zeros_like(g_l_ji[:, :1])], dim=1)
    p_l = torch.zeros_like(g_l_ji)
    p_l[:, 1:] = g_l_ji[:, :-1].clone()
    h_l_init = y_[:, 0, :].clone()
    h_l = pscan(p_l, y_, h_l_init)
    z[:, :-1, :] = g_l_ii.unsqueeze(-1) * h_l[:, :-1, :] + g_l_ij.unsqueeze(-1) * y[:, 1:, :]
    z[:, n - 1, :] = h_l[:, n - 1, :]
    if transform is True:
        z = torch.einsum('bn,bnd->bnd', diag, z)
    else:
        z = z.reshape(b, m, n // m, d).transpose(1, 2).reshape(b, n, d)
    return z

def inverse_dhhp_trans(transform, x,
                       g_l_ii_conj_trs, g_l_ij_conj_trs, g_l_ji_conj_trs, g_l_jj_conj_trs,
                       g_u_ii_conj_trs, g_u_ij_conj_trs, g_u_ji_conj_trs, g_u_jj_conj_trs,
                       diag_conj_trs):
    # The inverse swaps the roles of the lower and upper Givens factors.
    return dhhp_trans(transform, x,
                      g_u_ii_conj_trs, g_u_ij_conj_trs, g_u_ji_conj_trs, g_u_jj_conj_trs,
                      g_l_ii_conj_trs, g_l_ij_conj_trs, g_l_ji_conj_trs, g_l_jj_conj_trs,
                      diag_conj_trs)
```
Algorithm 3 PyTorch-like pseudocode for Synvolution and Kernelution.
```python
# b: batch size, n: length, d: feature dimension
# enable_kernelution is assumed to be defined in the enclosing scope
def kernel_polynomial_method(seq, K, g, mu):
    Tx_0 = torch.ones_like(seq)  # T_0(x) = 1, shape (b, n)
    cheb_gibbs = Tx_0 * mu[0]
    if K == 0:
        return cheb_gibbs
    Tx_1 = seq  # T_1(x) = x
    cheb_gibbs = cheb_gibbs + Tx_1 * g[1] * mu[1]
    if K == 1:
        return cheb_gibbs
    for k in range(2, K + 1):
        Tx_2 = 2 * seq * Tx_1 - Tx_0  # Chebyshev recurrence
        cheb_gibbs = cheb_gibbs + Tx_2 * g[k] * mu[k]
        Tx_0, Tx_1 = Tx_1, Tx_2
    return cheb_gibbs

def givens_rot_para(alpha, beta, gamma):
    g_ii = torch.exp(-1j * (alpha + beta) / 2) * torch.cos(gamma / 2)
    g_ij = -torch.exp(1j * (alpha - beta) / 2) * torch.sin(gamma / 2)
    g_ji = torch.exp(-1j * (alpha - beta) / 2) * torch.sin(gamma / 2)
    g_jj = torch.exp(1j * (alpha + beta) / 2) * torch.cos(gamma / 2)
    return g_ii, g_ij, g_ji, g_jj

def givens_rot_para_conj_trs(g_ii, g_ij, g_ji, g_jj):
    # conjugate transpose of a Givens rotation block
    return g_jj, -g_ij, -g_ji, g_ii

def kernelution(eigenvalue, K, g, mu, x,
                alpha_l, beta_l, gamma_l,
                alpha_u, beta_u, gamma_u, theta):
    if enable_kernelution is True:
        eigenvalue = kernel_polynomial_method(eigenvalue, K, g, mu)
    g_l_ii, g_l_ij, g_l_ji, g_l_jj = givens_rot_para(alpha_l, beta_l, gamma_l)
    g_u_ii, g_u_ij, g_u_ji, g_u_jj = givens_rot_para(alpha_u, beta_u, gamma_u)
    diag = torch.exp(2j * math.pi * theta)
    g_l_ii_conj_trs, g_l_ij_conj_trs, g_l_ji_conj_trs, g_l_jj_conj_trs = \
        givens_rot_para_conj_trs(g_l_ii, g_l_ij, g_l_ji, g_l_jj)
    g_u_ii_conj_trs, g_u_ij_conj_trs, g_u_ji_conj_trs, g_u_jj_conj_trs = \
        givens_rot_para_conj_trs(g_u_ii, g_u_ij, g_u_ji, g_u_jj)
    diag_conj_trs = diag.conj()
    x = dhhp_trans(True, x, g_l_ii, g_l_ij, g_l_ji, g_l_jj,
                   g_u_ii, g_u_ij, g_u_ji, g_u_jj, diag)
    x = torch.einsum('bn,bnd->bnd', eigenvalue, x)
    x = inverse_dhhp_trans(False, x,
                           g_l_ii_conj_trs, g_l_ij_conj_trs, g_l_ji_conj_trs, g_l_jj_conj_trs,
                           g_u_ii_conj_trs, g_u_ij_conj_trs, g_u_ji_conj_trs, g_u_jj_conj_trs,
                           diag_conj_trs)
    return x
```

Appendix B Synvolution and Other Attention Mechanisms
We compare Synvolution with common attention and attention-like mechanisms in Table 5. Note that we do not assume any relation between the sequence length $N$ and the embedding size $D$ of the input signal $\mathbf{X} \in \mathbb{R}^{N \times D}$. Sparse attention, which reduces computation by attending only to a subset of tokens, includes [Child et al., 2019, Zaheer et al., 2020]. Following the categorization in [Cao, 2021], linear attention can be classified into Fourier-type and Galerkin-type. Fourier-type attention, which computes query-key pairs first, includes [Shen et al., 2021, Cao, 2021, Zhang et al., 2021, Koohpayegani and Pirsiavash, 2024]. Galerkin-type attention, which computes key-value pairs first, includes [Katharopoulos et al., 2020, Cao, 2021, Koohpayegani and Pirsiavash, 2024]. Kernel attention, which leverages kernel methods to approximate the attention operation, includes [Tsai et al., 2019, Choromanski et al., 2021, Chen et al., 2021b, Xiong et al., 2021]. Synthetic dense attention includes [Tay et al., 2021a], and Chord attention includes [Yu et al., 2022a].
While sparse attention theoretically offers lower time complexity, its practical implementation remains challenging. When implemented directly in PyTorch without CUDA optimization [Nickolls et al., 2008] or other GPU frameworks, matrix multiplication between sparse and dense tensors paradoxically consumes more GPU memory than multiplication between two dense tensors of equivalent size.
Linear attention mechanisms, whether Fourier-type or Galerkin-type, achieve lower time complexity compared to scaled dot-product attention. However, their attention matrices remain low-rank, similar to self-attention. For kernel attention mechanisms, their computational characteristics must be analyzed on a case-by-case basis due to implementation variations.
Synthesizer demonstrates potential through synthetic dense attention, though its time and space complexity remain equivalent to those of scaled dot-product attention. Chord attention, derived from the inverse process of Chord factorization [Khalitov et al., 2022], lacks CUDA optimization in its implementation. This leads to higher GPU memory consumption than linear attention but lower than scaled dot-product attention, similar to the challenges faced by sparse attention implementations.
Our proposed method, Synvolution, is an attention-like mechanism that leverages a space-for-time approach. Its attention matrix possesses several advantageous properties: it is learnable, full-rank, dense, and unitary. These characteristics enable Synvolution to achieve effectiveness comparable to self-attention across diverse domains.
Table 5: Overview of common attention and attention-like mechanisms.

| Method | Attention Matrix | Time Complexity | Space Complexity |
|---|---|---|---|
| Scaled Dot-Product Attention | Learnable, Low-Rank, and Dense | $\mathcal{O}(N^2 D + N D^2)$ | $\mathcal{O}(N^2 + N D)$ |
| Sparse Attention | Learnable, High/Full-Rank, and Sparse | $\mathcal{O}(\lvert\mathcal{E}\rvert D + N D^2)$ | $\mathcal{O}(\lvert\mathcal{E}\rvert + N D)$ |
| Fourier-Type Attention | Learnable, Low-Rank, and Dense | $\mathcal{O}(N^2 D + N D^2)$ | $\mathcal{O}(N^2 + N D)$ |
| Galerkin-Type Attention | Learnable, Rank-Dependent, and Dense | $\mathcal{O}(N D^2)$ | $\mathcal{O}(D^2 + N D)$ |
| Kernelized Attention | Learnable, Low-Rank, and Dense | Depends on kernel | Depends on kernel |
| Synthetic Dense Attention | Learnable, Rank-Dependent, and Dense | $\mathcal{O}(N^2 D + N D^2)$ | $\mathcal{O}(N^2 + N D)$ |
| Chord Attention | Learnable, Full-Rank, and Sparse | $\mathcal{O}(N D \log^2 N)$ | $\mathcal{O}(N \log^2 N + N D)$ |
| Synvolution | Learnable, Full-Rank, and Dense | $\mathcal{O}(N D \log N + N D^2)$ | $\mathcal{O}(N D)$ |

Appendix C Kernel Functions in Kernel Polynomial Method
In Table 6, we summarize all currently known kernel functions used in the kernel polynomial method. To illustrate the approximation capabilities of different kernels when the Gibbs phenomenon occurs, we present a comparison in Figure 3, where the target function 𝑓 ( 𝑥 ) is a sign step function. For more details about the kernel polynomial method, see Weiße et al. [2006] and Weiße and Fehske [2008].
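A comparison along the lines of Figure 3 can be reproduced with a short NumPy sketch that expands the sign step function in Chebyshev polynomials and applies the Jackson damping factors from Table 6 (the truncation order and grid sizes below are illustrative choices, not the paper's settings):

```python
import numpy as np

K, M = 50, 2000  # maximum Chebyshev order, quadrature points
theta = np.pi * (np.arange(M) + 0.5) / M
f = np.sign(np.cos(theta))  # target f(x) = sign(x) sampled at x_j = cos(theta_j)

# Chebyshev moments mu_k via Chebyshev-Gauss quadrature
k = np.arange(K + 1)
mu = (2.0 / M) * (np.cos(np.outer(k, theta)) @ f)
mu[0] /= 2.0

# Jackson damping factors g_k (see Table 6)
q = np.pi / (K + 2)
g = ((K + 2 - k) * np.cos(k * q) + np.sin(k * q) / np.tan(q)) / (K + 2)

x = np.linspace(-1.0, 1.0, 4001)
T = np.cos(np.outer(k, np.arccos(x)))  # T_k(x) = cos(k arccos x)
raw = mu @ T          # Dirichlet kernel: visible Gibbs overshoot near x = 0
damped = (g * mu) @ T  # Jackson kernel: overshoot suppressed
```

The truncated expansion `raw` overshoots the step's value of 1 near the discontinuity, while the Jackson-damped reconstruction stays essentially within the target's range, since the Jackson kernel is nonnegative.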
Table 6: Overview of kernel functions.

| Kernel | Gibbs damping factor $g_k$ | Hyperparameters | Remarks |
|---|---|---|---|
| Dirichlet | $1$ | None | least favorable choice |
| Fejér [Fejér, 1904] | $1 - \frac{k}{K+1}$ | None | mainly of academic interest |
| Jackson [Weiße et al., 2006] | $\frac{(K + 2 - k)\cos\left(\frac{k\pi}{K+2}\right) + \sin\left(\frac{k\pi}{K+2}\right)\cot\left(\frac{\pi}{K+2}\right)}{K+2}$ | None | optimal for most applications, but lacks a rigorous proof |
| Lanczos [Lanczos, 1966] | $\operatorname{sinc}^{M}\left(\frac{k\pi}{K+1}\right)$ | $M \in \mathbb{N}$ | $M = 3$ closely matches the Jackson kernel |
| Lorentz [Vijay et al., 2004] | $\frac{\sinh\left[\xi\left(1 - \frac{k}{K+1}\right)\right]}{\sinh(\xi)}$ | $\xi \in \mathbb{R}$ | optimal for Green functions |
| Vekić [Vekić and White, 1993] | $\frac{1}{2} - \frac{1}{2}\tanh\left[\frac{\frac{k}{K+1} - \frac{1}{2}}{\frac{k}{K+1}\left(1 - \frac{k}{K+1}\right)}\right]$ | None | found empirically |
| Wang [Wang, 1994] | $e^{-\left(a \frac{k}{K+1}\right)^{b}}$ | $a, b \in \mathbb{R}$ | found empirically |

Figure 3: An illustration of the Gibbs phenomenon when using the kernel polynomial method with different kernels to approximate a step function.

Appendix D Kernel Polynomial Loss
Here is the complete derivation process for Equation 11.
$$
\begin{aligned}
\int_{-1}^{1} \left| \frac{\mathrm{d}f(x)}{\mathrm{d}x} \right|^{2} \mathrm{d}x
&\approx \int_{-1}^{1} \left| \frac{\mathrm{d}}{\mathrm{d}x} \sum_{k=0}^{K} \mu_{k} T_{k}(x) \right|^{2} \mathrm{d}x \\
&= \int_{-1}^{1} \left| \sum_{k=1}^{K} \mu_{k} T_{k}'(x) \right|^{2} \mathrm{d}x \\
&= \int_{-1}^{1} \left( \sum_{k=1}^{K} \mu_{k} T_{k}'(x) \right) \left( \sum_{m=1}^{K} \overline{\mu_{m}} T_{m}'(x) \right) \mathrm{d}x \\
&= \sum_{k=1}^{K} \sum_{m=1}^{K} \mu_{k} \overline{\mu_{m}} \int_{-1}^{1} T_{k}'(x) T_{m}'(x) \, \mathrm{d}x \\
&= \sum_{k=1}^{K} |\mu_{k}|^{2} \int_{-1}^{1} |T_{k}'(x)|^{2} \, \mathrm{d}x \\
&= \sum_{k=1}^{K} \pi k^{2} |\mu_{k}|^{2}
\end{aligned}
\quad (15)
$$

The cross terms with $k \neq m$ vanish by the orthogonality of the Chebyshev derivatives $T_k'(x)$, leaving only the diagonal terms.

Appendix E Proofs
Proof of Proposition 2
Proof.
First, we note that the DFT, DWHT, DCT, and DST are all discrete unitary transforms, meaning their matrices are unitary. By Assumption 1, any $N \times N$ unitary matrix can be exactly constructed using $L$-DHHP where $1 \le L \le \lceil N/4 \rceil$. Therefore, as specific cases of unitary matrices, the DFT, DWHT, DCT, and DST can all be exactly represented by $L$-DHHP. For their inverses, since the inverse of a unitary matrix is its conjugate transpose, they can also be exactly represented by $L$-DHHP. ∎
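As a sanity check on the premise of this proof, the unitarity of these transforms is easy to verify numerically, e.g. for the normalized DFT matrix:

```python
import numpy as np

N = 8
n = np.arange(N)
# Normalized DFT matrix: F[j, k] = exp(-2*pi*i*j*k/N) / sqrt(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

# F^H F = I, so F is unitary and its inverse is its conjugate transpose
assert np.allclose(F.conj().T @ F, np.eye(N))
```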
Proof of Proposition 3
Proof.

According to the definition of $L$-DHHP, the transform matrix $\mathbf{\Phi}$ can be decomposed as $\mathbf{\Phi} = \mathbf{D} \left( \prod_{l=1}^{L} \mathbf{H}_{\mathrm{l}}^{(l)} \mathbf{H}_{\mathrm{u}}^{(l)} \mathbf{P}^{(l)} \right)$, where $\mathbf{D}$ is a unitary diagonal matrix, $\mathbf{H}_{\mathrm{l}}^{(l)}$ is a lower unitary Hessenberg matrix, $\mathbf{H}_{\mathrm{u}}^{(l)}$ is an upper unitary Hessenberg matrix, and $\mathbf{P}^{(l)}$ is a permutation matrix. The computation of $\mathbf{y} = \mathbf{\Phi}\mathbf{x}$ can be broken down into sequential steps:

1. For each order $l$ from $1$ to $L$:
   - multiplication with $\mathbf{P}^{(l)}$ requires $\mathcal{O}(N)$ operations;
   - multiplication with $\mathbf{H}_{\mathrm{u}}^{(l)}$ requires $\mathcal{O}(N \log N)$ operations;
   - multiplication with $\mathbf{H}_{\mathrm{l}}^{(l)}$ requires $\mathcal{O}(N \log N)$ operations.
2. The multiplication of the diagonal matrix $\mathbf{D}$ with the input signal $\mathbf{x}$ requires $\mathcal{O}(N)$ operations.

As shown in Algorithm 2, these multiplications can be implemented efficiently through Hadamard products, so the total time complexity is $\mathcal{O}(L N \log N)$. ∎
Proof of Proposition 4
Proof.
( ⇐ ) First, we prove that if 𝐃 is unitary, then 𝐿 -DHHP is full-rank. By definition, every lower unitary Hessenberg matrix 𝐇 l ( 𝑙 ) , upper unitary Hessenberg matrix 𝐇 u ( 𝑙 ) , and permutation matrix 𝐏 ( 𝑙 ) is unitary. Since the product of unitary matrices is unitary, and 𝐃 is given to be unitary, the entire 𝐿 -DHHP matrix is unitary. As every unitary matrix is full-rank, 𝐿 -DHHP is full-rank.
( ⇒ ) Now we prove that if $L$-DHHP is full-rank, then $\mathbf{D}$ must be unitary. Let $\mathbf{\Phi} = \mathbf{D} \left( \prod_{l=1}^{L} \mathbf{H}_{\mathrm{l}}^{(l)} \mathbf{H}_{\mathrm{u}}^{(l)} \mathbf{P}^{(l)} \right)$ be the $L$-DHHP matrix. Since $\mathbf{H}_{\mathrm{l}}^{(l)}$, $\mathbf{H}_{\mathrm{u}}^{(l)}$, and $\mathbf{P}^{(l)}$ are all unitary, their product is unitary and thus full-rank. For $\mathbf{\Phi}$ to be full-rank, $\mathbf{D}$ must also be full-rank. As $\mathbf{D}$ is diagonal, it is full-rank if and only if all its diagonal entries are nonzero; since the diagonal entries of $\mathbf{D}$ are constrained to have unit magnitude, this is equivalent to $\mathbf{D}$ being unitary. ∎
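The forward direction (a product of unitary factors is unitary, hence full-rank) and the role of $\mathbf{D}$ can be illustrated numerically. The sketch below substitutes generic random unitary matrices for the unitary Hessenberg factors of $L$-DHHP; this substitution is for illustration only and is not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6

def random_unitary(n):
    # QR decomposition of a random complex Gaussian matrix yields a unitary Q
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    q, _ = np.linalg.qr(a)
    return q

H_l, H_u = random_unitary(N), random_unitary(N)  # stand-ins for Hessenberg factors
P = np.eye(N)[rng.permutation(N)]                # permutation matrix
D_unitary = np.diag(np.exp(2j * np.pi * rng.random(N)))  # unit-modulus diagonal

Phi = D_unitary @ H_l @ H_u @ P
assert np.allclose(Phi.conj().T @ Phi, np.eye(N))  # unitary
assert np.linalg.matrix_rank(Phi) == N             # hence full-rank

# A diagonal with a zero entry destroys full-rankness of the product
D_bad = np.diag([0.0] + [1.0] * (N - 1))
assert np.linalg.matrix_rank(D_bad @ H_l @ H_u @ P) < N
```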
Proof of Theorem 5
Proof.
The complete proof can be found in Trefethen [2019]. ∎
Proof of Theorem 6
Proof.
The complete proof can be found in Trefethen [2019]. ∎
Appendix FAdditional Experimental Details
Baseline Implementations. We use the PyTorch implementations released by the authors or by third parties for all baseline models. The baseline implementations are listed below.
- Transformer: https://github.com/hyunwoongko/transformer
- Linformer: https://github.com/lucidrains/linformer
- Performer: https://github.com/lucidrains/performer-pytorch
- Synthesizer: https://github.com/10-zin/Synthesizer
- Nyströmformer: https://github.com/mlpen/Nystromformer
- FNet: https://github.com/erksch/fnet-pytorch
- cosFormer: https://github.com/OpenNLPLab/cosFormer
- Paramixer: https://github.com/wiedersehne/Paramixer
F.1 Experimental Settings
We provide the complete set of hyperparameters used to train all models and report their results. The optimal hyperparameters were determined using Bayesian optimization under a 16 GB GPU memory constraint. After selecting the best hyperparameters based on minimum validation loss, we evaluated the corresponding model on the test set and reported its performance. For all four datasets and seven baseline models, we employed the AdamW optimizer with cross-entropy loss. The training configurations and hyperparameters for each neural network and dataset are summarized in Tables 7 and 8.
Table 7: The final baseline model hyperparameters used in the experiments. Abbreviations: PE for position embedding, ES for embedding size, HS for hidden size, NB for number of blocks, NH for number of heads, Pool for pooling strategy, BS for batch size, LR for learning rate, WD for weight decay; N/A indicates that the corresponding parameter is not present in the architecture.

| Dataset | Model | PE | ES | HS | NB | NH | Pool | BS | LR | WD | PE Dropout | Value Dropout | FFN Dropout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LongDoc16K | Transformer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0005 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc16K | Linformer | RPE | 128 | 512 | 2 | 2 | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc16K | Performer | RPE | 128 | 512 | 2 | 2 | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc16K | Synthesizer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0005 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc16K | FNet | RPE | 128 | 512 | 2 | N/A | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc16K | cosFormer | RPE | 128 | 512 | 2 | 2 | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc16K | Paramixer | RPE | 128 | 512 | 2 | N/A | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | Transformer | RPE | 64 | 256 | 2 | 1 | MEAN | 2 | 0.0005 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | Linformer | RPE | 128 | 512 | 2 | 2 | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | Performer | RPE | 128 | 512 | 2 | 2 | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | Synthesizer | RPE | 64 | 256 | 2 | 1 | MEAN | 2 | 0.0005 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | FNet | RPE | 128 | 512 | 2 | N/A | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | cosFormer | RPE | 128 | 512 | 2 | 2 | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | Paramixer | RPE | 128 | 512 | 2 | N/A | MEAN | 4 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | Transformer | RPE | 64 | 256 | 2 | 1 | MEAN | 2 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | Linformer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | Performer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | Synthesizer | RPE | 64 | 256 | 2 | 1 | MEAN | 2 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | FNet | RPE | 64 | 256 | 2 | N/A | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | cosFormer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | Paramixer | RPE | 64 | 256 | 2 | N/A | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | Transformer | RPE | 64 | 256 | 2 | 1 | MEAN | 2 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | Linformer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | Performer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | Synthesizer | RPE | 64 | 256 | 2 | 1 | MEAN | 2 | 0.0002 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | FNet | RPE | 64 | 256 | 2 | N/A | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | cosFormer | RPE | 64 | 256 | 2 | 2 | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | Paramixer | RPE | 64 | 256 | 2 | N/A | MEAN | 2 | 0.0001 | 0.0001 | 0.1 | 0.1 | 0.1 |

Table 8: The hyperparameters of Converter for all experiments. Abbreviations: PE for position embedding, ES for embedding size, HS for hidden size, K for the maximum order of KPM, Pool for pooling strategy, BS for batch size, LR for learning rate, and WD for weight decay.

| Dataset | PE | ES | HS | K | $\eta$ | Pool | BS | LR | WD | PE Dropout | Value Dropout | GFFN Dropout | Eigenvalue Dropout | Eigenvector Dropout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ListOps | RPE | 32 | 128 | 2 | 0.001 | MEAN | 128 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Text | RPE | 64 | 256 | 2 | 0.001 | MEAN | 128 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Retrieval | RPE | 64 | 256 | 2 | 0.001 | MEAN | 256 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Image | RPE | 64 | 256 | 2 | 0.01 | MEAN | 128 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Pathfinder | RPE | 64 | 256 | 2 | 0.001 | MEAN | 256 | 0.0002 | 0.0002 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| LongDoc16K | RPE | 128 | 512 | 2 | 0.1 | MEAN | 4 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| LongDoc32K | RPE | 128 | 512 | 2 | 0.1 | MEAN | 4 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Ensembl (B/S) | RPE | 128 | 512 | 2 | 0.1 | MEAN | 32 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Ensembl (M/R) | RPE | 128 | 512 | 2 | 0.1 | MEAN | 32 | 0.001 | 0.001 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |