Improve model card and add paper link
Hi! I'm Niels, part of the community science team at Hugging Face. I've noticed this repository is missing a detailed model card and metadata.
I've opened this PR to:
- Add `pipeline_tag: text-generation` to the metadata for better discoverability.
- Link the repository to the official paper on Hugging Face ([2603.02188](https://huggingface.co/papers/2603.02188)).
- Add a link to the official GitHub repository.
- Provide a brief summary of the Multi-Head Low-Rank Attention (MLRA) mechanism and the official citation.
README.md CHANGED

````diff
@@ -1,3 +1,28 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
 ---
+
+# Multi-Head Low-Rank Attention (MLRA)
+
+Official pretrained weights for **Multi-Head Low-Rank Attention (MLRA)**, a novel attention mechanism that natively supports 4-way tensor parallelism and significantly reduces the key-value (KV) cache size, enabling efficient long-context inference at scale.
+
+## Resources
+- **Paper:** [Multi-Head Low-Rank Attention](https://huggingface.co/papers/2603.02188)
+- **Code:** [Official GitHub Repository](https://github.com/SongtaoLiu0823/MLRA)
+
+## Model Description
+Long-context inference in large language models is often bottlenecked by KV cache loading during the decoding stage. While Multi-Head Latent Attention (MLA) reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP).
+
+MLRA addresses this by enabling partitionable latent states for efficient 4-way TP decoding. Experimental results show that MLRA achieves state-of-the-art perplexity and downstream task performance, while delivering a 2.8$\times$ decoding speedup over MLA.
+
+## Citation
+If you find this work useful, please cite:
+```bibtex
+@inproceedings{liu2026multi,
+  title     = {Multi-Head Low-Rank Attention},
+  author    = {Liu, Songtao and Peng, Hongwu and Zhang, Zhiwei and Chen, Zhengyu and Guo, Yue},
+  booktitle = {International Conference on Learning Representations},
+  year      = {2026}
+}
+```
````
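For intuition on the two properties the model card highlights (a smaller KV cache and latent states that partition evenly across 4 devices), here is a back-of-the-envelope sketch. All dimensions, the caching scheme, and the even 4-way split are illustrative assumptions for a generic latent-attention design, not MLRA's actual configuration:

```python
# Back-of-the-envelope KV-cache comparison: standard multi-head attention
# caches full K and V tensors per head per layer, while a latent-attention
# scheme (MLA/MLRA-style) caches one compressed latent vector per layer.
# All dimensions below are hypothetical, chosen only for illustration.

def mha_kv_bytes_per_token(n_layers, n_heads, head_dim, dtype_bytes=2):
    """Bytes cached per token: K and V for every head in every layer."""
    return n_layers * n_heads * head_dim * 2 * dtype_bytes

def latent_kv_bytes_per_token(n_layers, latent_dim, dtype_bytes=2):
    """Bytes cached per token: a single latent vector per layer."""
    return n_layers * latent_dim * dtype_bytes

mha = mha_kv_bytes_per_token(n_layers=32, n_heads=32, head_dim=128)
lat = latent_kv_bytes_per_token(n_layers=32, latent_dim=512)
print(f"compression factor: {mha // lat}x")  # 16x with these toy numbers

# A latent state that splits evenly across ranks is what makes 4-way
# tensor-parallel decoding cheap: each device holds latent_dim / 4 and
# never needs the full latent, avoiding the MLA sharding bottleneck
# described in the card.
latent_dim = 512
per_rank_width = latent_dim // 4
print(f"per-rank latent width: {per_rank_width}")  # 128 per device
```

With these toy numbers the latent cache is 16x smaller per token; the real ratio depends entirely on the actual layer count, head count, and latent dimension, which this sketch does not know.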
|