---
license: apache-2.0
pipeline_tag: text-generation
---

# Multi-Head Low-Rank Attention (MLRA)

Official pretrained weights for **Multi-Head Low-Rank Attention (MLRA)**, a novel attention mechanism that natively supports 4-way tensor parallelism and significantly reduces the key-value (KV) cache size, enabling efficient long-context inference at scale.

## Resources

- **Paper:** [Multi-Head Low-Rank Attention](https://huggingface.co/papers/2603.02188)
- **Code:** [Official GitHub Repository](https://github.com/SongtaoLiu0823/MLRA)

## Model Description

Long-context inference in large language models is often bottlenecked by loading the KV cache during decoding. Multi-Head Latent Attention (MLA) reduces the total KV cache size, but it suffers from a sharding bottleneck during distributed decoding with tensor parallelism (TP): its single latent state cannot be partitioned across devices, so each device must load the full KV cache.

MLRA addresses this by making the latent states partitionable, which enables efficient 4-way TP decoding. Experimental results show that MLRA achieves state-of-the-art perplexity and downstream task performance while delivering a 2.8× decoding speedup over MLA.
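
To make the sharding argument concrete, here is a back-of-the-envelope sketch of how many KV cache elements each device holds per token and per layer under TP. Every dimension and function name below is a hypothetical chosen for illustration, not the configuration of MLRA or of any released model. The `partitioned_latent_per_device` case only models a latent state that splits evenly across devices; it is not the MLRA formulation itself, which is described in the paper.

```python
"""Rough KV cache sizing under tensor parallelism (TP).

All dimensions are hypothetical and for illustration only.
"""

# Hypothetical dimensions (assumptions, not taken from the MLRA paper).
N_HEADS = 32      # attention heads in a standard multi-head layer
HEAD_DIM = 128    # per-head key/value dimension
LATENT_DIM = 576  # cached latent elements per token, per layer
TP = 4            # tensor-parallel degree


def mha_per_device(n_heads: int, head_dim: int, tp: int) -> int:
    """Multi-head attention: keys and values shard by head across devices."""
    if n_heads % tp != 0:
        raise ValueError("n_heads must be divisible by tp")
    return 2 * (n_heads // tp) * head_dim


def replicated_latent_per_device(latent_dim: int, tp: int) -> int:
    """Single shared latent: every head reads the whole latent,
    so each device keeps a full copy no matter how large tp is."""
    return latent_dim


def partitioned_latent_per_device(latent_dim: int, tp: int) -> int:
    """Idealized partitionable latent: the latent splits evenly across devices."""
    if latent_dim % tp != 0:
        raise ValueError("latent_dim must be divisible by tp")
    return latent_dim // tp


if __name__ == "__main__":
    rows = [
        ("multi-head K/V, sharded by head", mha_per_device(N_HEADS, HEAD_DIM, TP)),
        ("single latent, replicated", replicated_latent_per_device(LATENT_DIM, TP)),
        ("partitioned latent (idealized)", partitioned_latent_per_device(LATENT_DIM, TP)),
    ]
    for name, per_device in rows:
        print(f"{name:34s} {per_device:5d} elements per device, per token, per layer")
```

With these assumed numbers, the replicated latent costs each device the same 576 elements at any TP degree, while the idealized partitioned latent drops to 144 elements per device at TP = 4. That gap is the kind of redundant cache loading the sharding bottleneck refers to.
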

## Citation

If you find this work useful, please cite:

```bibtex
@inproceedings{liu2026multi,
  title     = {Multi-Head Low-Rank Attention},
  author    = {Liu, Songtao and Peng, Hongwu and Zhang, Zhiwei and Chen, Zhengyu and Guo, Yue},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}
```