MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Kewei Zhang¹*, Ye Huang¹*, Yufan Deng¹, Jincheng Yu², Junsong Chen²,
Huan Ling², Enze Xie², Daquan Zhou¹

¹Peking University   ²NVIDIA

MHLA is a universal, high-efficiency linear attention operator. It applies to image classification, image generation, language modeling, and video generation, matching the quality of FlashAttention while delivering significant speedups over it on long sequences. For more details, please refer to our paper.
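For intuition on where the long-sequence speedup comes from, below is a minimal sketch of the generic (non-causal) linear attention that operators like MHLA build on. The feature map `phi(x) = elu(x) + 1` and the toy shapes are illustrative assumptions, not the MHLA implementation itself; see the sub-projects for the actual operator.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: materializes an (N, N) score matrix, O(N^2 * d).
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: with a feature map phi (elu + 1 is a common
    # choice, assumed here for illustration), attention factorizes as
    # phi(Q) @ (phi(K)^T V), costing O(N * d^2) instead of O(N^2 * d).
    phi_q = F.elu(q) + 1
    phi_k = F.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v  # (d, d) summary of the whole sequence
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer
    return (phi_q @ kv) / (z + eps)

# Toy shapes: batch 2, sequence length 4096, head dim 64.
q, k, v = (torch.randn(2, 4096, 64) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4096, 64])
```

Because the cost grows linearly in sequence length, the latency advantage over FlashAttention widens as sequences get longer; MHLA's contribution is restoring the expressivity that this factorization normally loses, via token-level multi-head.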

This repository is organized into four sub-projects: mhla_dit, mhla_image_classification, mhla_nlp, and mhla_videogen, each corresponding to one of the four tasks presented in our paper. Each sub-project contains its own README.md with detailed instructions.

Updates

  • [2026.01.12] 🔥 Our paper is available on arXiv.
  • [2026.01.12] 🔥 We release the code of MHLA, including training and inference code for image classification, image generation, language modeling, and video generation.

Installation & Usage

Please refer to the README.md files in the following sub-projects for detailed information:

  • mhla_dit (image generation)
  • mhla_image_classification (image classification)
  • mhla_nlp (language modeling)
  • mhla_videogen (video generation)

Performance & Efficiency

On Wan2.1-1.3B

| Method | Quality score | Semantic score | Total score | Latency |
|---|---|---|---|---|
| Wan2.1 1.3B | 85.23 | 75.65 | 83.31 | 139s |
| Full MHLA | 83.93 | 78.40 | 82.83 | 62s |
| Full Linear | 69.96 | 11.38 | 58.24 | 62s |
| MHLA Hybrid 2/3 | 84.87 | 79.59 | 83.82 | 84s |

Full MHLA and Full Linear replace all attention layers with MHLA and linear attention, respectively; MHLA Hybrid 2/3 replaces only 2/3 of the layers with MHLA.
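As a rough sketch of what such a hybrid configuration looks like, the snippet below swaps the attention module in two out of every three transformer blocks. The names `model.blocks`, `.attn`, and `make_linear_attn` are hypothetical placeholders for illustration, not the repository's actual API.

```python
import torch.nn as nn

def hybridize(model: nn.Module, make_linear_attn, keep_every=3, replace=2):
    """Replace the attention module in `replace` of every `keep_every`
    transformer blocks with a linear-attention variant.

    `model.blocks` and the `.attn` attribute are assumed names; adapt
    them to the block/attribute layout of the real model.
    """
    for i, block in enumerate(model.blocks):
        if i % keep_every < replace:  # e.g. layers 0,1 of every 3 -> 2/3 hybrid
            block.attn = make_linear_attn(block.attn)  # build MHLA from old attn
    return model
```

Which layers keep full attention is a design choice; the table above suggests a 2/3 hybrid recovers most of the quality gap to the full-attention baseline while still cutting latency substantially.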

Acknowledgement

Our project builds on several inspiring projects, including timm, DiT, Sana, and flash-linear-attention.

Support Us

If you find this work useful, please consider:

  • Starring the repository
  • Citing our paper
  • Contributing to the codebase

Citation

```bibtex
@misc{mhla,
      title={MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head},
      author={Kewei Zhang and Ye Huang and Yufan Deng and Jincheng Yu and Junsong Chen and Huan Ling and Enze Xie and Daquan Zhou},
      year={2026},
      eprint={2601.07832},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07832},
}
```