pipeline_tag: text-to-video

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA is the first study of Multi-Head Latent Attention (MLA) in video diffusion. By replacing per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, it reduces per-token KV memory by 92.7% at every cached layer. The smaller cache enables efficient, minute-scale autoregressive video generation with improved throughput.
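
To see where a figure like 92.7% can come from, the sketch below compares the per-token, per-layer cache size of standard multi-head attention (one key and one value per head) with an MLA-style cache (one shared content latent plus one shared positional key). All dimensions are hypothetical, picked only so that the arithmetic lands on the quoted reduction; the actual VideoMLA configuration is not stated in this card, and the function names are illustrative.

```python
def mha_cache_elems(num_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer: one key and one value for each head."""
    return 2 * num_heads * head_dim


def mla_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """Elements cached per token per layer: shared content latent + shared RoPE key."""
    return latent_dim + rope_dim


def kv_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV memory saved by the latent cache."""
    baseline = mha_cache_elems(num_heads, head_dim)
    compressed = mla_cache_elems(latent_dim, rope_dim)
    return 1.0 - compressed / baseline


# Hypothetical dimensions (assumptions, not taken from the VideoMLA release).
NUM_HEADS, HEAD_DIM = 12, 128   # baseline caches 2 * 12 * 128 = 3072 elements
LATENT_DIM, ROPE_DIM = 160, 64  # latent cache holds 160 + 64 = 224 elements

reduction = kv_reduction(NUM_HEADS, HEAD_DIM, LATENT_DIM, ROPE_DIM)
print(f"per-token KV reduction: {reduction:.1%}")  # 1 - 224/3072 -> 92.7%
```

Because the saving is a ratio of per-token sizes, it is independent of sequence length, which is why it applies equally at every cached layer and at any video duration.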

[Paper] [Project Page] [GitHub]

Inference

To use the model, please follow the setup instructions in the official repository. You can generate videos using the provided inference script:

python inference.py \
    --config_path configs/stage3_long.yaml \
    --checkpoint_path checkpoints/stage3_la6_sink1/model.pt \
    --output_folder outputs/ \
    --data_path prompts/your_prompts.txt \
    --num_output_frames 120 \
    --use_ema
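
The file passed to --data_path is plain text with one prompt per line. A minimal sketch for producing such a file from Python follows; the helper name `write_prompts` and the prompt strings are placeholders, and the path simply mirrors the one used in the command above.

```python
from pathlib import Path


def write_prompts(path, prompts):
    """Write one prompt per line and return how many prompts were written.

    Whitespace inside each prompt (including newlines) is collapsed to single
    spaces so every prompt stays on its own line; blank prompts are skipped.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    cleaned = [" ".join(p.split()) for p in prompts if p.strip()]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    # Placeholder prompts; replace with your own.
    count = write_prompts(
        "prompts/your_prompts.txt",
        [
            "A sailboat drifting across a calm bay at sunrise",
            "A red fox running through fresh snow in a pine forest",
        ],
    )
    print(f"wrote {count} prompts")
```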

Key arguments:

--num_output_frames: Controls the length of the video (e.g., 21 ≈ 5 s, 120 ≈ 30 s, 240 ≈ 60 s at 16 fps).
--data_path: A text file containing prompts (one per line).
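
The durations quoted for --num_output_frames do not follow from dividing the frame count by 16, so the counts are presumably latent frames rather than decoded video frames. The sketch below assumes a causal video VAE with 4x temporal compression, in which n latent frames decode to 4(n - 1) + 1 video frames. That decoding rule is an assumption and is not stated in this card, but it reproduces the quoted figures.

```python
FPS = 16             # playback rate quoted in the argument description
TEMPORAL_STRIDE = 4  # assumption: 4x temporal compression in the video VAE


def decoded_frames(num_latent_frames: int, stride: int = TEMPORAL_STRIDE) -> int:
    """Video frames produced by decoding latent frames with a causal VAE.

    The first latent frame maps to one video frame; each later latent frame
    maps to `stride` video frames (assumed behaviour).
    """
    if num_latent_frames < 1:
        raise ValueError("num_latent_frames must be at least 1")
    return stride * (num_latent_frames - 1) + 1


def duration_seconds(num_latent_frames: int, fps: int = FPS,
                     stride: int = TEMPORAL_STRIDE) -> float:
    """Approximate clip length in seconds for a given --num_output_frames."""
    return decoded_frames(num_latent_frames, stride) / fps


for n in (21, 120, 240):
    print(f"{n} latent frames -> {decoded_frames(n)} video frames "
          f"-> {duration_seconds(n):.1f} s")
```

Under these assumptions, 21, 120, and 240 map to 81, 477, and 957 video frames, or roughly 5.1 s, 29.8 s, and 59.8 s.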

Citation

@article{yesiltepe2026videomla,
  title={VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion},
  author={Yesiltepe, Hidir and Hu, Jiazhen and Meral, Tuna Han Salih and Akan, Adil Kaan and Oktay, Kaan and Eldardiry, Hoda and Yanardag, Pinar},
  journal={arXiv preprint arXiv:2605.30351},
  year={2026}
}