---
pipeline_tag: text-to-video
---
# VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA is the first study of Multi-Head Latent Attention (MLA) in video diffusion. By replacing per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, it reduces per-token KV memory by 92.7% at every cached layer. This enables efficient, minute-scale autoregressive video generation with improved throughput.
[[Paper](https://huggingface.co/papers/2605.30351)] [[Project Page](https://videomla.github.io/)] [[GitHub](https://github.com/yesiltepe-hidir/VideoMLA)]
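
As a rough illustration of where a saving of this size can come from, the sketch below compares the per-token cache size of standard multi-head attention (one key and one value vector per head) with a cache that stores a single shared content latent plus a single shared positional key. The dimensions (`num_heads`, `head_dim`, `latent_dim`, `rope_dim`) are hypothetical values chosen only so that the result lands near the reported figure; they are not taken from the paper or the repository.

```python
# Back-of-the-envelope comparison of per-token KV cache size at one layer.
# All dimensions below are ASSUMED for illustration, not VideoMLA's actual config.

def mha_kv_per_token(num_heads: int, head_dim: int) -> int:
    """Standard attention caches one key and one value vector per head."""
    return 2 * num_heads * head_dim


def latent_kv_per_token(latent_dim: int, rope_dim: int) -> int:
    """A latent cache stores one shared content latent and one shared positional key."""
    return latent_dim + rope_dim


def kv_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token cache entries saved by the latent cache."""
    baseline = mha_kv_per_token(num_heads, head_dim)
    compressed = latent_kv_per_token(latent_dim, rope_dim)
    return 1.0 - compressed / baseline


# Hypothetical sizes: 12 heads of width 128, a 160-dim latent, a 64-dim positional key.
saving = kv_reduction(num_heads=12, head_dim=128, latent_dim=160, rope_dim=64)
print(f"per-token KV reduction: {saving:.1%}")
```

Under these assumed sizes, 3,072 cached values per token shrink to 224. The actual ranks and head dimensions should be read from the paper or the repository configs.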
## Inference
To use the model, please follow the setup instructions in the [official repository](https://github.com/yesiltepe-hidir/VideoMLA). You can generate videos using the provided inference script:
```bash
python inference.py \
--config_path configs/stage3_long.yaml \
--checkpoint_path checkpoints/stage3_la6_sink1/model.pt \
--output_folder outputs/ \
--data_path prompts/your_prompts.txt \
--num_output_frames 120 \
--use_ema
```
Key arguments:
- `--num_output_frames`: Sets the length of the video. For example, at 16 fps, 21 ≈ 5 s, 120 ≈ 30 s, and 240 ≈ 60 s.
- `--data_path`: A text file containing prompts (one per line).
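
The frame counts listed for `--num_output_frames` work out to roughly four units per second of video rather than sixteen. One reading that matches all three examples is that the value counts latent frames, decoded by a VAE with 4x temporal compression. That is an assumption, not something the card states, and both `TEMPORAL_STRIDE` and the `(n - 1) * stride + 1` decoding rule below are hypothetical. A minimal sketch of the conversion under that assumption:

```python
# Estimate video duration from --num_output_frames.
# ASSUMPTION: the flag counts latent frames and the VAE expands time by 4x,
# with the first latent frame decoding to a single video frame.

FPS = 16              # playback rate stated in the model card
TEMPORAL_STRIDE = 4   # assumed temporal compression factor (not confirmed)


def decoded_frames(num_output_frames: int) -> int:
    """Number of video frames after decoding, under the assumed stride."""
    return (num_output_frames - 1) * TEMPORAL_STRIDE + 1


def approx_seconds(num_output_frames: int) -> float:
    """Approximate clip duration in seconds at FPS."""
    return decoded_frames(num_output_frames) / FPS


for n in (21, 120, 240):
    print(f"{n:>3} -> {decoded_frames(n):>3} frames, about {approx_seconds(n):.1f} s")
```

Under these assumptions the three documented values come out to about 5.1 s, 29.8 s, and 59.8 s, in line with the approximate durations listed. Check the repository configs before relying on this for other lengths.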
## Citation
```bibtex
@article{yesiltepe2026videomla,
title={VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion},
author={Yesiltepe, Hidir and Hu, Jiazhen and Meral, Tuna Han Salih and Akan, Adil Kaan and Oktay, Kaan and Eldardiry, Hoda and Yanardag, Pinar},
journal={arXiv preprint arXiv:2605.30351},
year={2026}
}
```