---
license: apache-2.0
pipeline_tag: text-generation
---
# Multi-Head Low-Rank Attention (MLRA)
Official pretrained weights for **Multi-Head Low-Rank Attention (MLRA)**, a novel attention mechanism that natively supports 4-way tensor parallelism and significantly reduces the key-value (KV) cache size, enabling efficient long-context inference at scale.
## Resources
- **Paper:** [Multi-Head Low-Rank Attention](https://huggingface.co/papers/2603.02188)
- **Code:** [Official GitHub Repository](https://github.com/SongtaoLiu0823/MLRA)
## Model Description
Long-context inference in large language models is often bottlenecked by KV cache loading during the decoding stage. While Multi-Head Latent Attention (MLA) reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP).
MLRA addresses this by enabling partitionable latent states for efficient 4-way TP decoding. Experimental results show that MLRA achieves state-of-the-art perplexity and downstream task performance, while delivering a 2.8$\times$ decoding speedup over MLA.
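To make the sharding bottleneck concrete, the following is a minimal back-of-the-envelope sketch, not code from the official repository. It compares the latent KV cache each device must hold when the latent state has to be replicated on every TP rank versus when it can be partitioned across 4 ranks. The function name and all dimensions (`SEQ_LEN`, `NUM_LAYERS`, `LATENT_DIM`) are hypothetical values chosen for illustration, and the sketch assumes both schemes cache the same total latent width, which may not match the actual model configurations.

```python
def kv_cache_bytes_per_device(seq_len, num_layers, latent_dim, tp_degree,
                              shardable, bytes_per_elem=2):
    """Bytes of latent KV cache one device holds for a single sequence.

    If the latent state cannot be sharded, every TP rank keeps the full
    latent; if it can, each rank keeps only its 1/tp_degree slice.
    """
    per_device_dim = latent_dim // tp_degree if shardable else latent_dim
    return seq_len * num_layers * per_device_dim * bytes_per_elem


# Hypothetical dimensions, NOT taken from the paper or the released weights.
SEQ_LEN = 131072      # tokens in context
NUM_LAYERS = 32       # transformer layers
LATENT_DIM = 512      # cached latent width per token per layer
TP_DEGREE = 4         # 4-way tensor parallelism

# Non-partitionable latent: each rank holds (and loads) the full cache.
replicated = kv_cache_bytes_per_device(
    SEQ_LEN, NUM_LAYERS, LATENT_DIM, TP_DEGREE, shardable=False)

# Partitionable latent: each rank holds one quarter of the cache.
sharded = kv_cache_bytes_per_device(
    SEQ_LEN, NUM_LAYERS, LATENT_DIM, TP_DEGREE, shardable=True)

GIB = 1024 ** 3
print(f"replicated latent: {replicated / GIB:.2f} GiB per device")
print(f"sharded latent:    {sharded / GIB:.2f} GiB per device")
```

Under these assumed sizes the per-device cache shrinks by the TP degree when the latent is partitionable. The reported 2.8$\times$ decoding speedup is an end-to-end measurement from the paper and is not derived from this arithmetic.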
## Citation
If you find this work useful, please cite:
```bibtex
@inproceedings{liu2026multi,
title = {Multi-Head Low-Rank Attention},
author = {Liu, Songtao and Peng, Hongwu and Zhang, Zhiwei and Chen, Zhengyu and Guo, Yue},
booktitle = {International Conference on Learning Representations},
year = {2026}
}
```