# Ming-VideoMAR: Autoregressive Video Generation with Continuous Tokens

<p align="center">
    <img src="./figures/ant-bailing.png" width="100"/>
</p>

<p align="center">πŸ€— <a href="https://huggingface.co/inclusionAI/Ming-VideoMAR">Hugging Face </a>ο½œπŸ“„ <a href="https://www.arxiv.org/abs/2506.14168">Paper (NeurIPS 2025) </a> </p>


## 🌍 Introduction

- 🌐 **Decoder-Only Autoregressive Video Generation with Continuous Tokens:**
 Ming-VideoMAR is a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, combining temporal frame-by-frame generation with spatial masked generation. It identifies temporal causality and spatial bi-directionality as the first principle of video AR models, and proposes the next-frame diffusion loss to integrate masked generation with video generation. 
- πŸ–ΌοΈ **First Zero-Shot Resolution Scaling for Video Generation:** 
 Ming-VideoMAR replicates the sequence-extrapolation capacity of language models in video generation. It supports generating videos at flexible spatial and temporal resolutions far beyond the training resolution, achieved by closing the training-inference gap and adopting 3D rotary embeddings. 
- ⚑ **Extremely High Training Efficiency:**
 Ming-VideoMAR proposes temporal short-to-long curriculum learning and spatial progressive-resolution training. It surpasses the previous state of the art (Cosmos I2V) both quantitatively and qualitatively, while requiring significantly fewer parameters (9.3%), less training data (0.5%), and fewer GPU resources (0.2%).
- ⚑ **Extremely High Inference Efficiency:**
 Ming-VideoMAR is inherently efficient thanks to the combination of a temporal KV cache and spatial parallel generation, significantly surpassing its NTP counterpart.
- πŸ”— **Accumulation-Error Mitigation:**
 Ming-VideoMAR employs a progressive temperature strategy at inference time to mitigate accumulation error.
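The frame-by-frame, masked generation order described above can be sketched as a toy schedule. This is only an illustration of the idea (temporal causality across frames, parallel masked steps within a frame), not the repository's implementation; all names and parameters here are made up.

```python
# Toy sketch of the generation order: frames are produced strictly one
# after another (temporal causality), while the tokens inside each frame
# are filled in over several masked steps in parallel (spatial
# bi-directionality). Not the actual Ming-VideoMAR sampler.
import random

def generation_schedule(num_frames, tokens_per_frame, num_iter, seed=0):
    rng = random.Random(seed)
    schedule = []  # list of (frame_idx, token_ids_decoded_at_this_step)
    for f in range(num_frames):
        remaining = list(range(tokens_per_frame))
        rng.shuffle(remaining)
        # split this frame's tokens across num_iter parallel decoding steps
        for step in range(num_iter):
            lo = step * tokens_per_frame // num_iter
            hi = (step + 1) * tokens_per_frame // num_iter
            schedule.append((f, sorted(remaining[lo:hi])))
    return schedule
```

Each frame only starts once the previous frame is fully decoded, so a per-frame KV cache applies, while the steps within a frame decode many tokens at once.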


<p align="center">
    <img src="./figures/videomar-overall.png" width="800"/>
</p>

## πŸ“Œ Updates

* [2025.10.17] πŸ”₯ **Code and Checkpoint!**  
  We’re thrilled to announce the code and checkpoint release of **Ming-VideoMAR**!  
* [2025.09.19] πŸŽ‰ **Our paper is accepted by NeurIPS 2025.**
* [2025.06.18] πŸ“„ **Technical Report Released!**  
  The full technical report is now available on arXiv:  
  πŸ‘‰ [VideoMAR: Autoregressive Video Generation with Continuous Tokens](https://www.arxiv.org/pdf/2506.14168)



<!-- ## Key Features -->


<!-- <p align="center">
    <img src="./figures/uniaudio-tokenizer.pdf" width="800"/>
<p> -->

## πŸ“Š Evaluation

Ming-VideoMAR achieves state-of-the-art autoregressive image-to-video generation performance with extremely small training and inference costs. 

### Quantitative Comparison
Ming-VideoMAR achieves state-of-the-art performance among token-wise autoregressive video generation models with significantly lower training cost.

<p align="center">
    <img src="./figures/videomar-quantitative.png" width="800"/>
</p>

### Qualitative Comparison
Ming-VideoMAR achieves better quality and finer details than the Cosmos baseline, even at a lower resolution (Ming-VideoMAR: 480Γ—768 vs. Cosmos: 640Γ—1024).

<p align="center">
    <img src="./figures/videomar-qualitative.png" width="800"/>
</p>

### Resolution Extrapolation
Ming-VideoMAR is the first to unlock resolution scaling, flexibly generating at resolutions higher or lower than the training scope.

<p align="center">
    <img src="./figures/videomar-extrapolation.png" width="800"/>
</p>
  
## πŸ“₯ Model Downloads

| Model | Hugging Face | ModelScope |
|-------|--------------|------------|
| **Stage 1 (25Γ—256Γ—256)** | [Download](https://huggingface.co/inclusionAI/Ming-VideoMAR) | [Download](https://www.modelscope.cn/models/inclusionAI/Ming-VideoMAR) |
| **Stage 2 (49Γ—480Γ—768)** | [Download](https://huggingface.co/inclusionAI/Ming-VideoMAR) | [Download](https://www.modelscope.cn/models/inclusionAI/Ming-VideoMAR) |

> πŸ”— Both models are publicly available for research. Visit the respective pages for model details, inference examples, and integration guides.

## πŸš€ Example Usage

### πŸ”§ Installation

Download the code:
```shell
git clone https://github.com/inclusionAI/Ming-VideoMAR.git
cd Ming-VideoMAR
```

A suitable [conda](https://conda.io/) environment named `videomar` can be created and activated with:

```shell
conda env create -f environment.yaml
conda activate videomar
```

### πŸ–ΌοΈ Training
Run the following command to launch VideoMAR training:
```shell
bash train.sh
```

For example, the default stage-2 training script is:
```shell
torchrun --standalone --nnodes 1 --nproc_per_node 8   main_videomar.py    \
--img_size_h 480  --img_size_w 768 --vae_embed_dim 16 --vae_spatial_stride 16 --vae_tempotal_stride 8 --patch_size 1  \
--model videomar --diffloss_d 3 --diffloss_w 1280  --save_last_freq 100  --num_workers 2  --file_type video  \
--epochs 800 --warmup_epochs 200 --batch_size 1 --blr 2.0e-4 --diffusion_batch_mul 4  --ema --ema_rate 0.995  --num_frames 49    \
--online_eval  --eval_freq 100  --eval_bsz 1  --cfg 3.0   --num_iter 32  \
--Cosmos_VAE  --vae_path $Cosmos-Tokenizer-CV8x16x16$ \
--output_dir logs  \
--text_model_path $Qwen2-VL-1.5B-Instruct$ \
--data_path $your_data_path$ \
```

**Note!**  
This model was trained on our internal data, so the original dataloader code is tailored to our internal OSS file system. 
To train this model on your own data, replace the following **Your_DataReader** (Line 219 in main_videomar.py) with your own dataloader code.

```python
######################### Load Dataset #########################
    dataset_train = Your_DataReader(data_path=args.data_path, img_size=[args.img_size_h, args.img_size_w], num_frames=args.num_frames, file_type=args.file_type)   # Replace this with your data reader file
    sampler_train = DistributedSampler(dataset_train, num_replicas=num_tasks, rank=global_rank, shuffle=True)
    data_loader_train = DataLoader(
        dataset_train,
        sampler=sampler_train,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        pin_memory=args.pin_mem,
        drop_last=True,
    )
```
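As a rough starting point, a minimal folder-backed reader matching the call signature above might look like the sketch below. All names and the `.mp4`/`.txt` caption pairing are assumptions, not the authors' internal reader; actual frame decoding is left to your preferred video library (e.g. decord or torchvision).

```python
# Hypothetical map-style dataset compatible with the Your_DataReader call
# above: any object with __len__/__getitem__ works with DistributedSampler
# and DataLoader. Frame decoding is deliberately left unimplemented.
import os

class FolderVideoReader:
    def __init__(self, data_path, img_size, num_frames, file_type="video"):
        self.data_path = data_path
        self.img_size = img_size      # [height, width], e.g. [480, 768]
        self.num_frames = num_frames  # e.g. 49
        self.file_type = file_type
        # assume each clip.mp4 has a sibling clip.txt caption file
        self.samples = sorted(
            f for f in os.listdir(data_path) if f.endswith(".mp4")
        )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video_path = os.path.join(self.data_path, self.samples[idx])
        caption_path = os.path.splitext(video_path)[0] + ".txt"
        with open(caption_path) as f:
            caption = f.read().strip()
        frames = self._load_frames(video_path)
        return frames, caption

    def _load_frames(self, video_path):
        # Placeholder: decode self.num_frames frames and resize them to
        # self.img_size with your video library of choice.
        raise NotImplementedError
```

Whatever reader you write, it must return samples in the (frames, caption) shape the training loop expects; check main_videomar.py for the exact tensor layout.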

### πŸ–ΌοΈ Inference
Run the following command to launch VideoMAR inference:
```shell
bash sample.sh
```

For example, the default stage-2 inference script is:
```shell
CUDA_VISIBLE_DEVICES='0'    torchrun --standalone --nnodes 1 --nproc_per_node 1  main_videomar.py  \
--model videomar --diffloss_d 3 --diffloss_w 1280  --eval_bsz 1  --evaluate  \
--img_size_h 480  --img_size_w 768 --vae_embed_dim 16 --vae_spatial_stride 16 --vae_tempotal_stride 8   \
--i2v  --cond_frame 1  --cfg 3.0  --temperature 1.0  --num_frames 49  --num_sampling_steps 100  --num_iter 64  \
--Cosmos_VAE  --vae_path $Cosmos-Tokenizer-CV8x16x16$ \
--output_dir logs  \
--text_model_path $Qwen2-VL-1.5B-Instruct$ \
--resume ./ckpt/checkpoint-736.pth \
```
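The `--temperature 1.0` flag above fixes a single sampling temperature, while the progressive temperature strategy from the introduction varies the temperature across frames to curb accumulation error. The toy linear schedule below only illustrates the idea; the endpoint values and the direction of change are made-up placeholders, not the paper's settings.

```python
# Toy progressive-temperature schedule (illustration only): interpolate
# linearly from t_start at the first generated frame to t_end at the
# last. The actual schedule used by Ming-VideoMAR is described in the
# paper; 1.0 and 0.7 here are placeholders.
def progressive_temperature(frame_idx, num_frames, t_start=1.0, t_end=0.7):
    if num_frames <= 1:
        return t_start
    alpha = frame_idx / (num_frames - 1)  # 0.0 at first frame, 1.0 at last
    return t_start + alpha * (t_end - t_start)
```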


πŸ“Œ Tips:
- $Cosmos-Tokenizer-CV8x16x16$: download [Cosmos-CV8x16x16](https://huggingface.co/nvidia/Cosmos-0.1-Tokenizer-CV8x16x16/tree/main) and replace the placeholder with your downloaded path.
- $Qwen2-VL-1.5B-Instruct$: download [Qwen2-VL-1.5B](https://huggingface.co/mit-han-lab/Qwen2-VL-1.5B-Instruct/tree/main) and replace the placeholder with your downloaded path.
- VideoMAR checkpoint: download the checkpoint and place it in ./ckpt/.


## ✍️ Citation

If you find our work useful in your research or applications, please consider citing:
```bibtex
@article{yu2025videomar,
  title={VideoMAR: Autoregressive Video Generation with Continuous Tokens},
  author={Yu, Hu and Gong, Biao and Yuan, Hangjie and Zheng, DanDan and Chai, Weilong and Chen, Jingdong and Zheng, Kecheng and Zhao, Feng},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
```