Abstract
Researchers develop a method to convert pretrained Softmax attention models to linear-complexity Test-Time Training architectures through architectural and representational alignment, achieving fast inference with minimal fine-tuning.
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T^5 (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4×H20 GPUs, SD3.5-T^5 achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
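The key shift-invariance property mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper's code: the function names, the toy values, and the exact form of the normalization (per-channel mean and variance computed across tokens) are assumptions made for illustration, and a raw dot-product score stands in for a linear-attention score with an identity feature map. Softmax attention weights do not change when the same offset is added to every key, raw dot-product scores do, and normalizing keys across tokens removes the offset again.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_weights(q, keys):
    # Attention weights of one query over all keys.
    return softmax([dot(q, k) for k in keys])

def instance_norm_keys(keys, eps=1e-6):
    # Assumed form of key instance normalization: per channel,
    # subtract the mean and divide by the std computed across tokens.
    n, d = len(keys), len(keys[0])
    means = [sum(k[j] for k in keys) / n for j in range(d)]
    variances = [sum((k[j] - means[j]) ** 2 for k in keys) / n for j in range(d)]
    return [[(k[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(d)]
            for k in keys]

def shift_keys(keys, offset):
    # Add the same offset vector to every key.
    return [[x + o for x, o in zip(k, offset)] for k in keys]

# Toy data (hypothetical values).
q = [0.5, -1.0]
keys = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
shifted = shift_keys(keys, [10.0, -4.0])

# Softmax weights are unchanged by the shift (up to rounding).
softmax_gap = max(abs(a - b) for a, b in
                  zip(softmax_weights(q, keys), softmax_weights(q, shifted)))

# Raw dot-product scores all move by q . offset.
raw_gap = max(abs(dot(q, a) - dot(q, b)) for a, b in zip(keys, shifted))

# After normalization the shifted keys match the original normalized keys.
normed_gap = max(abs(dot(q, a) - dot(q, b)) for a, b in
                 zip(instance_norm_keys(keys), instance_norm_keys(shifted)))

print(softmax_gap, raw_gap, normed_gap)
```

The reason is that adding an offset c to every key adds the same constant q·c to all logits of a query, which Softmax cancels during normalization, whereas an unnormalized score keeps that constant. The normalization actually used in SD3.5-T^5 may differ in detail from this sketch.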
Community
Linearizing Vision Transformers with Test-Time Training offers a simple and straightforward way to transform pretrained Transformer models into TTT models, without distillation or complex training procedures.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Linear-Time Global Visual Modeling without Explicit Attention (2026)
- LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs (2026)
- JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search (2026)
- Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation (2026)
- LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel (2026)
- Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction (2026)
- LoCO: Low-rank Compositional Rotation Fine-tuning (2026)
Get this paper in your agent:
hf papers read 2605.02772
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash