
	---
	license: apache-2.0
	pipeline_tag: text-generation
	---

	# Multi-Head Low-Rank Attention (MLRA)

	Official pretrained weights for Multi-Head Low-Rank Attention (MLRA), a novel attention mechanism that natively supports 4-way tensor parallelism and significantly reduces the key-value (KV) cache size, enabling efficient long-context inference at scale.

	## Resources
	- Paper: [Multi-Head Low-Rank Attention](https://huggingface.co/papers/2603.02188)
	- Code: [Official GitHub Repository](https://github.com/SongtaoLiu0823/MLRA)
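A minimal sketch of fetching the weight files from the Hub, assuming the repository id is `Soughing/MLRA` (taken from this page's location) and that the `huggingface_hub` package is installed. The helper name `download_weights` is hypothetical, and this card does not document the file layout or a loading API, so the sketch only downloads the files; model construction is expected to come from the code in the GitHub repository.

```python
from typing import Optional

# Assumption: repository id inferred from this model card's location on the Hub.
REPO_ID = "Soughing/MLRA"


def download_weights(local_dir: Optional[str] = None) -> str:
    """Download all files of the model repository and return the local path.

    Hypothetical helper, not part of the official codebase.
    """
    # Imported lazily so the module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    # snapshot_download fetches every file in the repo (or reuses the cache)
    # and returns the directory that contains them.
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_weights()` with no argument stores the files in the default Hub cache; passing a directory places them there instead.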

	## Model Description
	Long-context inference in large language models is often bottlenecked by KV cache loading during the decoding stage. While Multi-Head Latent Attention (MLA) reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP).

	MLRA addresses this by enabling partitionable latent states for efficient 4-way TP decoding. Experimental results show that MLRA achieves state-of-the-art perplexity and downstream task performance, while delivering a 2.8× decoding speedup over MLA.
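The sharding argument can be illustrated with back-of-the-envelope arithmetic. The sketch below is not MLRA's actual configuration: the layer count, sequence length, head sizes, and the 576-element latent are illustrative assumptions, as is the simplification that a cached state which cannot be partitioned is stored in full on every TP rank, while a partitionable one is split evenly.

```python
BYTES_PER_ELEM = 2  # assumption: bf16/fp16 cache entries


def kv_cache_bytes_per_device(elems_per_token_per_layer: int,
                              n_layers: int,
                              seq_len: int,
                              tp_degree: int,
                              shardable: bool) -> int:
    """Per-device cache size in bytes for a single sequence."""
    # A shardable cache is split evenly across TP ranks; an unshardable
    # one is replicated, so every rank holds (and loads) the full state.
    if shardable:
        per_rank = elems_per_token_per_layer // tp_degree
    else:
        per_rank = elems_per_token_per_layer
    return per_rank * n_layers * seq_len * BYTES_PER_ELEM


# Illustrative (hypothetical) settings, not taken from the paper.
N_LAYERS, SEQ_LEN, TP = 32, 131072, 4

# Standard multi-head attention: keys and values for 32 heads of size 128,
# split by head across ranks.
mha = kv_cache_bytes_per_device(2 * 32 * 128, N_LAYERS, SEQ_LEN, TP, shardable=True)

# A single 576-element latent per token that cannot be partitioned.
latent_replicated = kv_cache_bytes_per_device(576, N_LAYERS, SEQ_LEN, TP, shardable=False)

# The same 576 elements, assumed to be partitionable into 4 equal shards.
latent_sharded = kv_cache_bytes_per_device(576, N_LAYERS, SEQ_LEN, TP, shardable=True)

for name, size in [("MHA, sharded", mha),
                   ("latent, replicated", latent_replicated),
                   ("latent, sharded", latent_sharded)]:
    print(f"{name:>20}: {size / 2**30:.3f} GiB per device")
```

Under these assumptions the replicated latent gives each device four times as much cache to load per decoding step as the partitioned one, which is the kind of overhead a partitionable latent state is meant to remove.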

	## Citation
	If you find this work useful, please cite:
	```bibtex
	@inproceedings{liu2026multi,
	title = {Multi-Head Low-Rank Attention},
	author = {Liu, Songtao and Peng, Hongwu and Zhang, Zhiwei and Chen, Zhengyu and Guo, Yue},
	booktitle = {International Conference on Learning Representations},
	year = {2026}
	}
	```