---
pipeline_tag: text-to-video
---

# VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA is the first study of Multi-Head Latent Attention (MLA) in video diffusion. By replacing per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, it reduces per-token KV memory by 92.7% at every cached layer. This enables efficient, minute-scale autoregressive video generation with improved throughput.

[[Paper](https://huggingface.co/papers/2605.30351)] [[Project Page](https://videomla.github.io/)] [[GitHub](https://github.com/yesiltepe-hidir/VideoMLA)]

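To make the memory claim concrete, the sketch below compares how many values are cached per token per layer under standard multi-head attention (separate keys and values for every head) with an MLA-style cache (one shared content latent plus one shared positional key). The function names are hypothetical, and all dimensions in the example call are illustrative assumptions, not values taken from the paper or the checkpoint; they were chosen only because they land near the reported 92.7% figure.

```python
def mha_cache_width(num_heads: int, head_dim: int) -> int:
    """Values cached per token per layer with standard MHA: K and V for every head."""
    return 2 * num_heads * head_dim


def mla_cache_width(latent_dim: int, rope_dim: int) -> int:
    """Values cached per token per layer with MLA: shared latent plus shared RoPE key."""
    return latent_dim + rope_dim


def kv_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV memory saved by the MLA-style cache."""
    baseline = mha_cache_width(num_heads, head_dim)
    compressed = mla_cache_width(latent_dim, rope_dim)
    return 1.0 - compressed / baseline


if __name__ == "__main__":
    # Hypothetical sizes for illustration only (not confirmed by the source).
    saved = kv_reduction(num_heads=12, head_dim=128, latent_dim=192, rope_dim=32)
    print(f"per-token KV memory saved: {saved:.1%}")
```

Because the saving depends only on the ratio of the two cache widths, it is the same at every cached layer and does not change with sequence length.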
## Inference

To use the model, follow the setup instructions in the [official repository](https://github.com/yesiltepe-hidir/VideoMLA), then generate videos with the provided inference script:

```bash
python inference.py \
  --config_path configs/stage3_long.yaml \
  --checkpoint_path checkpoints/stage3_la6_sink1/model.pt \
  --output_folder outputs/ \
  --data_path prompts/your_prompts.txt \
  --num_output_frames 120 \
  --use_ema
```

Key arguments:
- `--num_output_frames`: Controls the length of the video (e.g., 21 ≈ 5 s, 120 ≈ 30 s, 240 ≈ 60 s at 16 fps). These durations work out to roughly four video frames per counted frame, which suggests the value counts latent frames rather than decoded video frames.
- `--data_path`: Path to a text file containing prompts, one per line.

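The durations quoted for `--num_output_frames` can be reproduced with a little arithmetic. The sketch below assumes that the count refers to latent frames and that the video VAE compresses time by a factor of 4 with the first frame left uncompressed, so `n` latent frames decode to `4 * (n - 1) + 1` video frames. Both points are assumptions inferred from the quoted numbers, not statements from the repository, and the helper names are hypothetical.

```python
def decoded_frame_count(num_output_frames: int, temporal_stride: int = 4) -> int:
    """Video frames obtained by decoding `num_output_frames` latent frames.

    Assumes the first latent frame maps to one video frame and every later
    latent frame maps to `temporal_stride` video frames (an assumption).
    """
    if num_output_frames < 1:
        raise ValueError("num_output_frames must be at least 1")
    return temporal_stride * (num_output_frames - 1) + 1


def estimated_duration_seconds(num_output_frames: int, fps: int = 16,
                               temporal_stride: int = 4) -> float:
    """Approximate clip length in seconds at the given playback rate."""
    return decoded_frame_count(num_output_frames, temporal_stride) / fps


if __name__ == "__main__":
    for n in (21, 120, 240):
        frames = decoded_frame_count(n)
        seconds = estimated_duration_seconds(n)
        print(f"{n} -> {frames} video frames, about {seconds:.1f} s")
```

Under these assumptions, 21, 120, and 240 map to 81, 477, and 957 video frames, which is consistent with the approximate 5 s, 30 s, and 60 s figures listed above.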
## Citation

```bibtex
@article{yesiltepe2026videomla,
  title={VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion},
  author={Yesiltepe, Hidir and Hu, Jiazhen and Meral, Tuna Han Salih and Akan, Adil Kaan and Oktay, Kaan and Eldardiry, Hoda and Yanardag, Pinar},
  journal={arXiv preprint arXiv:2605.30351},
  year={2026}
}
```