nielsr HF Staff committed on
Commit 12887ea · verified · 1 Parent(s): c1a9885

Improve model card and add paper link

Browse files

Hi! I'm Niels, part of the community science team at Hugging Face. I've noticed this repository is missing a detailed model card and metadata.

I've opened this PR to:
- Add `pipeline_tag: text-generation` to the metadata for better discoverability.
- Link the repository to the official paper on Hugging Face ([2603.02188](https://huggingface.co/papers/2603.02188)).
- Add a link to the official GitHub repository.
- Provide a brief summary of the Multi-Head Low-Rank Attention (MLRA) mechanism and the official citation.

Files changed (1)
  1. README.md +28 -3
README.md CHANGED
@@ -1,3 +1,28 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: text-generation
+ ---
+
+ # Multi-Head Low-Rank Attention (MLRA)
+
+ Official pretrained weights for **Multi-Head Low-Rank Attention (MLRA)**, a novel attention mechanism that natively supports 4-way tensor parallelism and significantly reduces the key-value (KV) cache size, enabling efficient long-context inference at scale.
+
+ ## Resources
+ - **Paper:** [Multi-Head Low-Rank Attention](https://huggingface.co/papers/2603.02188)
+ - **Code:** [Official GitHub Repository](https://github.com/SongtaoLiu0823/MLRA)
+
+ ## Model Description
+ Long-context inference in large language models is often bottlenecked by KV cache loading during the decoding stage. While Multi-Head Latent Attention (MLA) reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP).
+
+ MLRA addresses this by enabling partitionable latent states for efficient 4-way TP decoding. Experimental results show that MLRA achieves state-of-the-art perplexity and downstream task performance, while delivering a 2.8$\times$ decoding speedup over MLA.
+
+ ## Citation
+ If you find this work useful, please cite:
+ ```bibtex
+ @inproceedings{liu2026multi,
+   title     = {Multi-Head Low-Rank Attention},
+   author    = {Liu, Songtao and Peng, Hongwu and Zhang, Zhiwei and Chen, Zhengyu and Guo, Yue},
+   booktitle = {International Conference on Learning Representations},
+   year      = {2026}
+ }
+ ```
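The core idea the model description refers to, caching a small shared latent per token and reconstructing per-head keys and values from it, can be sketched in a few lines. This is an illustrative NumPy sketch of generic low-rank KV caching, not the paper's actual MLRA parameterization; all dimension names, shapes, and projection matrices below are assumptions chosen for the example.

```python
import numpy as np

# Illustrative low-rank KV caching (the general MLA/MLRA-style idea).
# All sizes here are made up for the example, not the paper's config.
rng = np.random.default_rng(0)

d_model, d_latent, n_heads, d_head, seq = 512, 64, 8, 64, 16

x = rng.standard_normal((seq, d_model))  # token hidden states

# Down-project each token into a small shared latent; only this is cached.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
latent_cache = x @ W_down                            # (seq, d_latent)

# Per-head up-projections reconstruct keys/values from the cached latent.
W_uk = rng.standard_normal((n_heads, d_latent, d_head)) / np.sqrt(d_latent)
W_uv = rng.standard_normal((n_heads, d_latent, d_head)) / np.sqrt(d_latent)
k = np.einsum('sl,hld->hsd', latent_cache, W_uk)     # (n_heads, seq, d_head)
v = np.einsum('sl,hld->hsd', latent_cache, W_uv)     # (n_heads, seq, d_head)

# Cache size: d_latent floats per token vs. 2 * n_heads * d_head for a
# standard per-head K/V cache.
latent_floats = latent_cache.size
full_kv_floats = 2 * n_heads * seq * d_head
print(latent_floats, full_kv_floats, full_kv_floats / latent_floats)
```

The compression factor falls out of the shapes: each token caches `d_latent` floats instead of `2 * n_heads * d_head`, and the per-head reconstructions can be sharded across devices because each head's up-projection reads the same shared latent.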