Commit e8995f5 (verified) by NuanBaobao · 1 Parent(s): 3f064d7

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -1,2 +1,11 @@
1
  *.pth filter=lfs diff=lfs merge=lfs -text
2
  *.whl filter=lfs diff=lfs merge=lfs -text
3
+ asset/abs.png filter=lfs diff=lfs merge=lfs -text
4
+ asset/attn_map.png filter=lfs diff=lfs merge=lfs -text
5
+ asset/attn_mask_compare.png filter=lfs diff=lfs merge=lfs -text
6
+ asset/intro_3.png filter=lfs diff=lfs merge=lfs -text
7
+ asset/motivation.png filter=lfs diff=lfs merge=lfs -text
8
+ asset/pipeline.png filter=lfs diff=lfs merge=lfs -text
9
+ asset/progressive.png filter=lfs diff=lfs merge=lfs -text
10
+ asset/results_1.png filter=lfs diff=lfs merge=lfs -text
11
+ asset/results_2.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,264 @@
1
+ <div align="center">
2
+
3
+ <img src="asset/intro_3.png" style="border-radius: 8px">
4
+
5
+ # MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning (ICLR 2026)
6
+
7
+ [Jinhua Zhang](https://scholar.google.com/citations?user=tyYxiXoAAAAJ), [Wei Long](https://scholar.google.com/citations?user=CsVTBJoAAAAJ), [Minghao Han](https://scholar.google.com/citations?hl=en&user=IqrXj74AAAAJ), [Weiyi You](https://scholar.google.com/citations?user=q4uALoAAAAAJ), [Shuhang Gu](https://scholar.google.com/citations?user=-kSTt40AAAAJ)
8
+
9
+ [![arXiv](https://img.shields.io/badge/arXiv-2505.12742-b31b1b.svg)](https://arxiv.org/abs/2505.12742v3)
10
+ [![GitHub Stars](https://img.shields.io/github/stars/LabShuHangGU/MVAR?style=social)](https://github.com/LabShuHangGU/MVAR)
11
+ [![Project Page](https://img.shields.io/badge/Project-Page-green?style=flat&logo=googlechrome&logoColor=white)](https://nuanbaobao.github.io/MVAR)
12
+ [![huggingface](https://img.shields.io/badge/%F0%9F%A4%97%20Weights-CVLUESTC/MVAR-yellow)](https://huggingface.co/CVLUESTC/MVAR)
13
+
14
+ </div>
15
+
16
+ ⭐ If you find this work helpful, please star this repo. Thanks! 🤗
17
+
18
+ ---
19
+
20
+ ## ✨ Key Contributions
21
+
22
+ 1️⃣ **Efficiency Bottleneck:** VAR exhibits scale and spatial redundancy, causing high GPU memory consumption.
23
+
24
+ <p align="center">
25
+ <img src="asset/motivation.png" style="border-radius: 5px" width="80%">
26
+ </p>
27
+
28
+ 2️⃣ **Our Solution:** MVAR generates images **without relying on a KV cache** during inference, significantly reducing the memory footprint.
29
+
30
+ <p align="center">
31
+ <img src="asset/abs.png" style="border-radius: 5px" width="80%">
32
+ </p>
33
+
34
+ ---
35
+
36
+ ## 📑 Contents
37
+
38
+ - [📚 Citation](#citation)
39
+ - [📰 News](#news)
40
+ - [🛠️ Pipeline](#pipeline)
41
+ - [🥇 Results](#results)
42
+ - [🦁 Model Zoo](#model-zoo)
43
+ - [⚙️ Installation](#installation)
44
+ - [🚀 Training & Evaluation](#training--evaluation)
45
+
46
+
47
+ ---
48
+ ## <a name="citation"></a> 📚 Citation
49
+
50
+ Please cite our work if it is helpful for your research:
51
+
52
+ ```bibtex
53
+ @article{zhang2025mvar,
54
+ title={MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning},
55
+ author={Zhang, Jinhua and Long, Wei and Han, Minghao and You, Weiyi and Gu, Shuhang},
56
+ journal={arXiv preprint arXiv:2505.12742},
57
+ year={2025}
58
+ }
59
+ ```
60
+
61
+ ## <a name="news"></a> 📰 News
62
+
63
+ - **2026-02-05:** 🧠 The codebase and weights are now available.
64
+ - **2026-01-25:** 🚀 MVAR has been accepted to **ICLR 2026**.
65
+ - **2025-05-20:** 📄 Our MVAR paper has been published on [arXiv](https://arxiv.org/abs/2505.12742).
66
+
67
+ ---
68
+
69
+ ## <a name="pipeline"></a> 🛠️ Pipeline
70
+
71
+ MVAR introduces the **Scale and Spatial Markovian Assumption**:
72
+ - **Scale Markovian:** Conditions next-scale prediction only on the adjacent preceding scale.
73
+ - **Spatial Markovian:** Restricts each token's attention to a local neighborhood of size $k$ around the corresponding positions on adjacent scales.
74
+
75
+ <p align="center">
76
+ <img src="asset/pipeline.png" style="border-radius: 15px" width="90%">
77
+ </p>
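One way to picture the two assumptions above is as a boolean attention mask between the current scale and the adjacent preceding scale. The following is a minimal sketch in plain Python, not the repository's implementation (the installation steps suggest the real code relies on NATTEN). The function name `neighborhood_mask`, the grid sizes, and the rule for mapping a query position to its "corresponding position" are all assumptions made for illustration.

```python
def neighborhood_mask(q_size, kv_size, k):
    """Return mask[q][kv], True where query token q may attend key token kv.

    Queries lie on a q_size x q_size grid (current scale). Keys lie on a
    kv_size x kv_size grid (the adjacent preceding scale only, which is the
    scale Markovian part). Each query sees a k x k window of keys around its
    corresponding position (the spatial Markovian part).
    """
    k_eff = min(k, kv_size)  # the window cannot be larger than the key grid

    def window(q_idx):
        # Assumed mapping: scale the query coordinate down to the key grid.
        center = (q_idx * kv_size) // q_size
        # Clamp the window start so the window stays inside the key grid.
        start = min(max(center - k_eff // 2, 0), kv_size - k_eff)
        return range(start, start + k_eff)

    mask = [[False] * (kv_size * kv_size) for _ in range(q_size * q_size)]
    for qi in range(q_size):
        for qj in range(q_size):
            for ki in window(qi):
                for kj in window(qj):
                    mask[qi * q_size + qj][ki * kv_size + kj] = True
    return mask


mask = neighborhood_mask(q_size=8, kv_size=6, k=3)
print(sum(mask[0]), "of", len(mask[0]), "keys visible to the first query")
```

With these illustrative sizes, every query attends to 9 of the 36 keys on the preceding scale, and no keys from earlier scales are involved at all.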
78
+
79
+ ---
80
+
81
+ ## <a name="results"></a> 🥇 Results
82
+
83
+ MVAR achieves a **3.0× reduction** in GPU memory footprint compared to VAR.
84
+
85
+ <details>
86
+ <summary>📊 Comparison of Quantitative Results: MVAR vs. VAR (Click to expand)</summary>
87
+ <p align="center">
88
+ <img width="900" src="asset/results_1.png">
89
+ </p>
90
+ </details>
91
+
92
+ <details>
93
+ <summary>📈 ImageNet 256×256 Benchmark (Click to expand)</summary>
94
+ <p align="center">
95
+ <img width="500" src="asset/results_2.png">
96
+ </p>
97
+ </details>
98
+
99
+ <details>
100
+ <summary>🧪 Ablation Study on Markovian Assumptions (Click to expand)</summary>
101
+ <p align="center">
102
+ <img width="500" src="asset/progressive.png">
103
+ </p>
104
+ </details>
105
+
106
+ ---
107
+
108
+ ## <a name="model-zoo"></a> 🦁 MVAR Model Zoo
109
+
110
+ We provide MVAR models through our [Hugging Face repo](https://huggingface.co/CVLUESTC/MVAR).
111
+
112
+ ### 📊 Model Performance & Weights
113
+
114
+ | Model | FID ↓ | IS ↑ | sFID ↓ | Prec. ↑ | Recall ↑ | Params | HF Weights 🤗 |
115
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
116
+ | **MVAR-d16** | 3.01 | 285.17 | 6.26 | 0.85 | 0.51 | 310M | [link](https://huggingface.co/CVLUESTC/MVAR/resolve/main/mvar_d16.pth) |
117
+ | **MVAR-d16**$^{\dag}$ | 3.37 | 295.35 | 6.10 | 0.86 | 0.48 | 310M | [link](https://huggingface.co/CVLUESTC/MVAR/resolve/main/mvar_d20.pth) |
118
+ | **MVAR-d20**$^{\dag}$ | 2.83 | 294.31 | 6.12 | 0.85 | 0.52 | 600M | [link](https://huggingface.co/CVLUESTC/MVAR/resolve/main/mvar_d24.pth) |
119
+ | **MVAR-d24**$^{\dag}$ | 2.15 | 298.85 | 5.62 | 0.84 | 0.56 | 1.0B | [link](https://huggingface.co/CVLUESTC/MVAR/resolve/main/mvar_d30.pth) |
120
+
121
+ > **Note:** $^{\dag}$ indicates models fine-tuned from VAR weights on ImageNet.
122
+
123
+ ---
124
+
125
+ ## <a name="installation"></a> ⚙️ Installation
126
+
127
+ 1. **Create conda environment:**
128
+ ```bash
129
+ conda create -n mvar python=3.11 -y
130
+ conda activate mvar
131
+ ```
132
+
133
+ 2. **Install PyTorch and dependencies:**
134
+ ```bash
135
+ pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
136
+ xformers==0.0.32.post2 \
137
+ --index-url https://download.pytorch.org/whl/cu128
138
+
139
+ pip install accelerate einops tqdm huggingface_hub pytz tensorboard \
140
+ transformers typed-argument-parser thop matplotlib seaborn wheel \
141
+ scipy packaging ninja openxlab lmdb pillow
142
+ ```
143
+
144
+ 3. **Install [Neighborhood Attention](https://natten.org/install/):**
145
+ ```bash
146
+ # Install the prebuilt wheel from the Hugging Face repo (https://huggingface.co/CVLUESTC/MVAR), or build NATTEN by following the install guide linked above
147
+ pip install natten-0.21.1+torch280cu128-cp311-cp311-linux_x86_64.whl
148
+ ```
149
+
150
+ 4. **Prepare [ImageNet](http://image-net.org/) dataset:**
151
+ <details>
152
+ <summary>Click to view expected directory structure</summary>
153
+
154
+ ```
155
+ /path/to/imagenet/:
156
+ train/:
157
+ n01440764/
158
+ ...
159
+ val/:
160
+ n01440764/
161
+ ...
162
+ ```
163
+
164
+ </details>
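Before starting a long caching or training run, it can help to confirm that the dataset folder matches the structure shown above. The helper below is a small sketch under that assumption; `check_imagenet_layout` is a hypothetical name and is not part of the MVAR codebase.

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return a list of problems found under an ImageNet-style root folder."""
    problems = []
    root = Path(root)
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing split directory: {split_dir}")
            continue
        # Each class is expected to be a sub-directory such as n01440764/.
        class_dirs = [p for p in split_dir.iterdir() if p.is_dir()]
        if not class_dirs:
            problems.append(f"no class folders in: {split_dir}")
    return problems
```

An empty list means both `train/` and `val/` exist and contain at least one class folder. The check does not validate image files or class counts.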
165
+
166
+
167
+ ---
168
+
169
+ ## <a name="training--evaluation"></a> 🚀 Training & Evaluation
170
+
171
+ ### 1. Requirements (Pre-trained VAR)
172
+
173
+ Before running MVAR, download the required [VAR](https://huggingface.co/FoundationVision/var/) weights.
174
+
175
+ The `hf` command-line tool, installed with `huggingface_hub`, can download the entire model repository:
176
+
177
+ ```bash
178
+ # Install huggingface_hub if you haven't
179
+ pip install huggingface_hub
180
+ # Download models to local directory
181
+ hf download FoundationVision/var --local-dir ./pretrained/FoundationVision/var
182
+ ```
183
+
184
+ ### 2. Download [MVAR](https://huggingface.co/CVLUESTC/MVAR)
185
+
186
+ ```bash
187
+ # Download models to local directory
188
+ hf download CVLUESTC/MVAR --local-dir ./checkpoints
189
+ ```
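The same two downloads can also be scripted from Python. This is a sketch that assumes `huggingface_hub` is installed and uses its `snapshot_download` function; the `DOWNLOADS` table and `fetch_all` helper are illustrative names, and the repo ids and target directories are copied from the shell commands above.

```python
from pathlib import Path

# Repo ids and local directories taken from the `hf download` commands above.
DOWNLOADS = {
    "FoundationVision/var": Path("pretrained/FoundationVision/var"),
    "CVLUESTC/MVAR": Path("checkpoints"),
}


def fetch_all(downloads=DOWNLOADS):
    """Download every repo in `downloads` and return {repo_id: local_path}."""
    # Imported lazily so this module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    results = {}
    for repo_id, target in downloads.items():
        # snapshot_download fetches all files of the repo into local_dir.
        results[repo_id] = snapshot_download(repo_id=repo_id, local_dir=str(target))
    return results
```

Calling `fetch_all()` requires network access and enough disk space for both repositories.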
190
+
191
+ ### 3. Flash-Attn and Xformers (Optional)
192
+
193
+ Install and compile `flash-attn` and `xformers` for faster attention computation. Our code will automatically use them if installed. See [models/basic_mvar.py#L17-L48](models/basic_mvar.py#L17-L48).
194
+
195
+
196
+
197
+ ### 4. Caching VQ-VAE Latents and Code Index (Optional)
198
+
199
+ Because our data augmentation consists only of center cropping and random flipping, the VQ-VAE latents and code indices can be pre-computed and saved to `CACHED_PATH` to reduce computational overhead during MVAR training:
200
+
201
+ ```bash
202
+ # --train selects the training split; for the validation split, use ${CACHED_PATH}/val_cache_mvar as the cache path
203
+ torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 main_cache.py \
204
+ --img_size 256 --data_path ${IMAGENET_PATH} \
205
+ --cached_path ${CACHED_PATH}/train_cache_mvar --train
206
+ ```
207
+
208
+ ### 5. Training Scripts
209
+
210
+ The commands below train MVAR on ImageNet 256×256. Add `--use_cached=True` to train from the pre-computed latents and code indices:
211
+
212
+ ```bash
213
+ # Example for MVAR-d16
214
+ torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
215
+ --depth=16 --bs=448 --ep=300 --fp16=1 --alng=1e-3 --wpe=0.1 \
216
+ --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME}
217
+
218
+ # Example for MVAR-d16 (Fine-tuning)
219
+ torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
220
+ --depth=16 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
221
+ --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True
222
+
223
+ # Example for MVAR-d20 (Fine-tuning)
224
+ torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
225
+ --depth=20 --bs=192 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
226
+ --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True
227
+
228
+ # Example for MVAR-d24 (Fine-tuning)
229
+ torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
230
+ --depth=24 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
231
+ --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True
232
+ ```
233
+
234
+ ### 6. Sampling & FID Evaluation
235
+
236
+ 6.1. **Generate images:**
237
+ ```bash
238
+ python run_mvar_evaluate.py \
239
+ --cfg 2.7 --top_p 0.99 --top_k 1200 --depth 16 \
240
+ --mvar_ckpt ${MVAR_CKPT}
241
+ ```
242
+
243
+
244
+ *Suggested sampling settings per model ($^{\dag}$ marks models fine-tuned from VAR weights):*
245
+ * **d16:** cfg=2.7, top_p=0.99, top_k=1200
246
+ * **d16**$^{\dag}$**:** cfg=2.0, top_p=0.99, top_k=1200
247
+ * **d20**$^{\dag}$**:** cfg=1.5, top_p=0.96, top_k=900
248
+ * **d24**$^{\dag}$**:** cfg=1.4, top_p=0.96, top_k=900
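To avoid retyping these values, the suggested settings can be kept in a small lookup table that builds the sampling command. The sketch below only assembles the argument list for `run_mvar_evaluate.py` using the flags shown in step 6.1. The `-ft` keys are hypothetical labels for the fine-tuned variants, and the checkpoint path in the usage line is a placeholder.

```python
# Values copied from the suggested settings above.
# The "-ft" suffix is a made-up label for the fine-tuned (dagger) models.
SAMPLING = {
    "d16": {"depth": 16, "cfg": 2.7, "top_p": 0.99, "top_k": 1200},
    "d16-ft": {"depth": 16, "cfg": 2.0, "top_p": 0.99, "top_k": 1200},
    "d20-ft": {"depth": 20, "cfg": 1.5, "top_p": 0.96, "top_k": 900},
    "d24-ft": {"depth": 24, "cfg": 1.4, "top_p": 0.96, "top_k": 900},
}


def build_sampling_command(model, ckpt_path):
    """Return the argv list for sampling with the suggested settings."""
    s = SAMPLING[model]
    return [
        "python", "run_mvar_evaluate.py",
        "--cfg", str(s["cfg"]),
        "--top_p", str(s["top_p"]),
        "--top_k", str(s["top_k"]),
        "--depth", str(s["depth"]),
        "--mvar_ckpt", ckpt_path,
    ]


# Placeholder checkpoint path, for illustration only.
print(" ".join(build_sampling_command("d20-ft", "path/to/checkpoint.pth")))
```

The returned list can be passed directly to `subprocess.run` if the script should be launched from Python.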
249
+
250
+
251
+ 6.2. **Run evaluation:**
252
+ ```bash
253
+ python utils/evaluations/c2i/evaluator.py \
254
+ --ref_batch VIRTUAL_imagenet256_labeled.npz \
255
+ --sample_batch ${SAMPLE_BATCH}
256
+ ```
257
+
258
+ ---
259
+
260
+
261
+ ## 📩 Contact
262
+
263
+ If you have any questions, feel free to reach out at [jinhua.zjh@gmail.com](mailto:jinhua.zjh@gmail.com).
264
+
asset/abs.png ADDED

Git LFS Details

  • SHA256: a512e17a78e669bd1a5e9a8651912ae7995ebf053048d9d96931607214acb859
  • Pointer size: 132 Bytes
  • Size of remote file: 1.53 MB
asset/attn_map.png ADDED

Git LFS Details

  • SHA256: be61ac1556b934149d7ec4963564a31110beba50e91bd467990c4e5d99d0ce0a
  • Pointer size: 131 Bytes
  • Size of remote file: 431 kB
asset/attn_mask_compare.png ADDED

Git LFS Details

  • SHA256: eb844ff5ec0f7872dcf33aa046178e9b2e1fba009b3c5617c19a9b9540799788
  • Pointer size: 131 Bytes
  • Size of remote file: 149 kB
asset/intro_3.png ADDED

Git LFS Details

  • SHA256: 22de1c240f9f9b73b81f785f6a99412637f16324479eb5196240e98d05aa104d
  • Pointer size: 131 Bytes
  • Size of remote file: 570 kB
asset/motivation.png ADDED

Git LFS Details

  • SHA256: 3d09151eea23f2ee1ccf9b276570d6f8c7022b831a18a804fba5351fa2a0207b
  • Pointer size: 131 Bytes
  • Size of remote file: 311 kB
asset/pipeline.png ADDED

Git LFS Details

  • SHA256: 8d124880dd2d9412da94e8444a09566a8c2bd4b2738fd61336e8d0219295c04f
  • Pointer size: 131 Bytes
  • Size of remote file: 498 kB
asset/progressive.png ADDED

Git LFS Details

  • SHA256: abbad7f533a3db00e6d79c65d384d86f06b115955d68ea1f4d5dca977da48ca9
  • Pointer size: 132 Bytes
  • Size of remote file: 1.02 MB
asset/results_1.png ADDED

Git LFS Details

  • SHA256: bb10696f5ca4083cf290e3a6da14c7d6d9de1de628343528c95f2c1cdf735f70
  • Pointer size: 131 Bytes
  • Size of remote file: 322 kB
asset/results_2.png ADDED

Git LFS Details

  • SHA256: 089cec05c980976d322c691c43f83f44ccf6ece91127490f90dc8d350a1181f7
  • Pointer size: 131 Bytes
  • Size of remote file: 367 kB
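
The pointer sizes listed above differ by one byte (131 vs. 132) because a Git LFS pointer is a short text file whose last line records the remote file size in bytes. The sketch below rebuilds a pointer in the Git LFS v1 format to show this. The SHA256 values are copied from the listing, but the exact byte counts are placeholders, since the page only shows rounded sizes.

```python
def lfs_pointer(sha256_hex: str, size_bytes: int) -> bytes:
    """Build the contents of a Git LFS pointer file (spec v1)."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    ).encode("utf-8")


# SHA256 values from the listing above; sizes are illustrative placeholders
# chosen only to have the right number of digits (about 1.53 MB and 431 kB).
ABS_OID = "a512e17a78e669bd1a5e9a8651912ae7995ebf053048d9d96931607214acb859"
ATTN_MAP_OID = "be61ac1556b934149d7ec4963564a31110beba50e91bd467990c4e5d99d0ce0a"

abs_pointer = lfs_pointer(ABS_OID, 1_530_000)      # 7-digit size
attn_pointer = lfs_pointer(ATTN_MAP_OID, 431_000)  # 6-digit size
print(len(abs_pointer), len(attn_pointer))  # prints "132 131"
```

The fixed part of the pointer is 125 bytes, so the total is 125 plus the number of digits in the size: files of 1 MB or more (7 digits) give 132 bytes, and the smaller PNGs (6 digits) give 131 bytes.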