Add model card for VideoMLA

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +38 -0
README.md ADDED
@@ -0,0 +1,38 @@
+ ---
+ pipeline_tag: text-to-video
+ ---
+
+ # VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
+
+ VideoMLA is the first study of Multi-Head Latent Attention (MLA) in video diffusion. It replaces per-head keys and values with a shared low-rank content latent and a shared, decoupled 3D-RoPE positional key, which reduces per-token KV cache memory by 92.7% at every cached layer. The smaller cache enables minute-scale autoregressive video generation with higher throughput.
+
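The per-token saving follows from what each layer caches. As a rough sketch, assume standard multi-head attention stores one key and one value vector per head, while the MLA variant stores a single content latent plus a single positional key shared by all heads. The function name and the dimensions used here (`num_heads=12`, `head_dim=128`, `latent_dim=160`, `rope_dim=64`) are illustrative assumptions, picked so the result lands near the stated figure; they are not values taken from the paper or the repository.

```python
def kv_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fractional reduction in cached values per token, per layer."""
    mha_per_token = 2 * num_heads * head_dim  # keys + values for every head
    mla_per_token = latent_dim + rope_dim     # one shared latent + one shared positional key
    return 1.0 - mla_per_token / mha_per_token


# Hypothetical dimensions, for illustration only.
reduction = kv_reduction(num_heads=12, head_dim=128, latent_dim=160, rope_dim=64)
print(f"{reduction:.1%}")  # prints 92.7%
```

Because the cached size no longer depends on the number of heads, the same ratio applies at every layer that keeps a cache.
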
+ [[Paper](https://huggingface.co/papers/2605.30351)] [[Project Page](https://videomla.github.io/)] [[GitHub](https://github.com/yesiltepe-hidir/VideoMLA)]
+
+ ## Inference
+
+ To run the model, follow the setup instructions in the [official repository](https://github.com/yesiltepe-hidir/VideoMLA), then generate videos with the provided inference script:
+
+ ```bash
+ python inference.py \
+   --config_path configs/stage3_long.yaml \
+   --checkpoint_path checkpoints/stage3_la6_sink1/model.pt \
+   --output_folder outputs/ \
+   --data_path prompts/your_prompts.txt \
+   --num_output_frames 120 \
+   --use_ema
+ ```
+
+ Key arguments:
+ - `--num_output_frames`: Controls the length of the video (e.g., 21 ≈ 5s, 120 ≈ 30s, 240 ≈ 60s at 16fps).
+ - `--data_path`: A text file containing prompts (one per line).
+
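The frame counts listed for `--num_output_frames` do not equal duration times 16 directly (21 / 16 is about 1.3 s), so they are presumably counted in latent frames rather than decoded frames. A minimal sketch of the conversion follows, assuming a causal video VAE with 4x temporal compression in which the first latent frame decodes to one frame and each later latent frame to four. This compression factor and the helper name `approx_seconds` are assumptions, not something stated in this card, although they reproduce the three examples.

```python
def approx_seconds(num_output_frames: int, fps: int = 16, temporal_stride: int = 4) -> float:
    """Estimate clip length from a latent frame count (assumed 4x causal VAE)."""
    decoded_frames = 1 + temporal_stride * (num_output_frames - 1)
    return decoded_frames / fps


for n in (21, 120, 240):
    print(n, round(approx_seconds(n), 1))
# 21 5.1
# 120 29.8
# 240 59.8
```

If the repository documents a different frame convention, that documentation takes precedence over this estimate.
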
+ ## Citation
+
+ ```bibtex
+ @article{yesiltepe2026videomla,
+   title={VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion},
+   author={Yesiltepe, Hidir and Hu, Jiazhen and Meral, Tuna Han Salih and Akan, Adil Kaan and Oktay, Kaan and Eldardiry, Hoda and Yanardag, Pinar},
+   journal={arXiv preprint arXiv:2605.30351},
+   year={2026}
+ }
+ ```