Add comprehensive model card for RobusTok (Image Tokenizer Needs Post-Training)

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +231 -1
README.md CHANGED
@@ -1 +1,231 @@
- Image Tokenizers Needs Post-Training
+ ---
+ pipeline_tag: unconditional-image-generation
+ license: apache-2.0
+ ---
+
+ # Image Tokenizer Needs Post-Training (RobusTok)
+
+ This repository contains **RobusTok**, the image tokenizer and generator introduced in the paper [Image Tokenizer Needs Post-Training](https://huggingface.co/papers/2509.12474).
+
+ Project Page: [https://qiuk2.github.io/works/RobusTok/index.html](https://qiuk2.github.io/works/RobusTok/index.html)
+ GitHub Repository: [https://github.com/qiuk2/RobusTok](https://github.com/qiuk2/RobusTok)
+
+ <div align="center">
+ <img src="https://github.com/qiuk2/RobusTok/raw/main/assets/teaser.png" alt="Teaser" width="95%">
+ </div>
+
+ ## Abstract
+ Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there is a significant discrepancy between the reconstruction and generation distributions: current tokenizers only prioritize the reconstruction task, which happens before generative training, without considering the generation errors that arise during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme comprising both main training and post-training, which improve latent space construction and decoding, respectively. During main training, a latent perturbation strategy is proposed to simulate sampling noise, i.e., the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme that significantly enhances the robustness of the tokenizer, boosting generation quality and convergence speed, along with a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
+
+ ---
+
+ ## TL;DR
+
+ We present RobusTok, a new image tokenizer with a two-stage training scheme:
+
+ **Main training** → constructs a robust latent space via latent perturbation (see the sketch after the highlights below).
+
+ **Post-training** → aligns the tokenizer decoder with the generator's latent distribution.
+
+ ## Key Highlights of Post-Training
+
+ - 🚀 **Better generative quality**: gFID 1.60 → 1.36.
+ - 🔑 **Generalizability**: applicable to both autoregressive & diffusion models.
+ - ⚡ **Efficiency**: strong results with only a ~400M generator.
+
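+ The latent perturbation used in main training replaces a random fraction &sigma; of the encoder's token indices with random codebook entries, mimicking the unexpected tokens a generator emits at sampling time. Below is a minimal, illustrative PyTorch sketch of that idea; the function name, shapes, and &sigma; value are our own choices, not the repository's API.
+
+ ```python
+ import torch
+
+ def perturb_latents(indices: torch.Tensor, codebook_size: int, sigma: float) -> torch.Tensor:
+     """Replace a random fraction `sigma` of token indices with random codebook
+     entries to simulate sampling noise (illustrative sketch only)."""
+     mask = torch.rand(indices.shape) < sigma
+     random_tokens = torch.randint_like(indices, codebook_size)
+     return torch.where(mask, random_tokens, indices)
+
+ # Perturb 10% of a 16x16 token map drawn from a 4096-entry codebook.
+ tokens = torch.randint(4096, (1, 256))
+ noisy = perturb_latents(tokens, codebook_size=4096, sigma=0.1)
+ ```
+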
+ ---
+
+ ## Model Zoo
+ | Generator \ Tokenizer | RobusTok w/o P.T. ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/main-train.pt?download=true)) | RobusTok w/ P.T. ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/post-train.pt?download=true)) |
+ |---|---:|---:|
+ | Base ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/rar_b.bin?download=true)) | gFID = 1.83 | gFID = 1.60 |
+ | Large ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/rar_l.bin?download=true)) | gFID = 1.60 | gFID = 1.36 |
+
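+ The checkpoints above can also be fetched programmatically. A minimal sketch using `huggingface_hub` (filenames follow the links in the table):
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Tokenizer weights: "main-train.pt" (w/o post-training) or "post-train.pt" (w/ post-training).
+ tokenizer_ckpt = hf_hub_download(repo_id="qiuk6/RobusTok", filename="post-train.pt")
+
+ # Generator weights: "rar_b.bin" (Base) or "rar_l.bin" (Large).
+ generator_ckpt = hf_hub_download(repo_id="qiuk6/RobusTok", filename="rar_l.bin")
+ ```
+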
+ ---
+
+ ## Updates
+ - (2025.09.16) Paper released on arXiv.
+ - (2025.09.18) Code and checkpoints released. pFID calculation is in preparation.
+
+ ---
+
+ ## Installation
+
+ Install all dependencies with:
+
+ ```bash
+ conda env create -f environment.yml
+ ```
+
+ ---
+
+ ## Dataset
+
+ We download ImageNet2012 from the official website and organize it as:
+
+ ```
+ ImageNet2012
+ ├── train
+ └── val
+ ```
+
+ If you want to train or finetune on other datasets, organize them in a format that PyTorch's [ImageFolder](https://pytorch.org/vision/main/generated/torchvision.datasets.ImageFolder.html) can recognize, e.g.:
+
+ ```
+ Dataset
+ ├── train
+ │   ├── Class1
+ │   │   ├── 1.png
+ │   │   └── 2.png
+ │   ├── Class2
+ │   │   ├── 1.png
+ │   │   └── 2.png
+ ├── val
+ ```
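+
+ To check that a custom dataset is laid out correctly, you can load it directly with torchvision. A minimal sketch (the transform here is illustrative, not the repository's training pipeline):
+
+ ```python
+ from torchvision import datasets, transforms
+
+ # ImageFolder treats each subdirectory of `train/` as one class.
+ transform = transforms.Compose([
+     transforms.Resize(256),
+     transforms.CenterCrop(256),
+     transforms.ToTensor(),
+ ])
+ train_set = datasets.ImageFolder("Dataset/train", transform=transform)
+ print(len(train_set), train_set.classes)  # sample count and discovered class names
+ ```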
+
+ ---
+
+ ## Main Training for Tokenizer
+
+ Please log in to Wandb first:
+
+ ```bash
+ wandb login
+ ```
+
+ rFID is automatically evaluated and reported on Wandb. The checkpoint with the best rFID on the val set is saved. We provide basic configurations in the `configs` folder.
+
+ Warning❗️: You may want to change the checkpoint-selection metric, since rFID is not closely correlated with gFID; PSNR and SSIM are also good choices.
+
+ ```bash
+ torchrun --nproc_per_node=8 tokenizer/tokenizer_image/main_train.py --config configs/main-train.yaml
+ ```
+
+ Please modify the configuration file as needed for your specific dataset. We list some important options here; a quick sanity check for the `num_latent_code` constraint is sketched below.
+
+ ```yaml
+ vq_ckpt: ckpt_best.pt # resume
+ cloud_save_path: output/exp-xx # output dir
+ data_path: ImageNet2012/train # training set dir
+ val_data_path: ImageNet2012/val # val set dir
+ enc_tuning_method: 'full' # ['full', 'lora', 'frozen']
+ dec_tuning_method: 'full' # ['full', 'lora', 'frozen']
+ codebook_embed_dim: 32 # codebook dim
+ codebook_size: 4096 # codebook size
+ product_quant: 1 # vanilla VQ
+ v_patch_nums: [16,] # latent resolution for RQ ([16,] is equivalent to vanilla VQ)
+ codebook_drop: 0.1 # quantizer dropout rate if RQ is applied
+ semantic_guide: dinov2 # ['none', 'dinov2', 'clip']
+ disc_epoch_start: 56 # epoch at which the discriminator starts
+ disc_type: dinodisc # discriminator type
+ disc_adaptive_weight: true # adaptive weight for discriminator loss
+ ema: true # use EMA to update the model
+ num_latent_code: 256 # number of latent tokens (must equal v_patch_nums[-1] ** 2)
+ ```
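+
+ Since the comment above ties `num_latent_code` to `v_patch_nums`, a small sanity check can catch misconfigured files early. A minimal sketch, assuming the YAML keys match the listing above:
+
+ ```python
+ import yaml
+
+ with open("configs/main-train.yaml") as f:
+     cfg = yaml.safe_load(f)
+
+ # num_latent_code must equal the squared final latent resolution.
+ assert cfg["num_latent_code"] == cfg["v_patch_nums"][-1] ** 2, \
+     "num_latent_code must equal v_patch_nums[-1] ** 2"
+ ```
+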
+ ---
+
+ ## Training Code for Generator
+
+ We follow [RAR](https://github.com/bytedance/1d-tokenizer) in pretokenizing the whole dataset to speed up training. We have uploaded the [pretokenized file](https://huggingface.co/qiuk6/RobustTok/resolve/main/RobustTok-half-pretokenized.jsonl?download=true) so you can train RobusTok-RAR directly.
+
+ ```bash
+ # training code for rar-b
+ accelerate launch scripts/train_rar.py experiment.project="rar" experiment.name="rar_b" experiment.output_dir="rar_b" model.generator.hidden_size=768 model.generator.num_hidden_layers=24 model.generator.num_attention_heads=16 model.generator.intermediate_size=3072 config=configs/generator/rar.yaml dataset.params.pretokenization=/path/to/pretokenized.jsonl model.vq_ckpt=/path/to/RobustTok.pt
+
+ # training code for rar-l
+ accelerate launch scripts/train_rar.py experiment.project="rar" experiment.name="rar_l" experiment.output_dir="rar_l" model.generator.hidden_size=1024 model.generator.num_hidden_layers=24 model.generator.num_attention_heads=16 model.generator.intermediate_size=4096 config=configs/generator/rar.yaml dataset.params.pretokenization=/path/to/pretokenized.jsonl model.vq_ckpt=/path/to/RobustTok.pt
+ ```
+
+ ---
+
+ ## Post-Training for Tokenizer
+
+ For post-training, we need to (1) prepare a paired dataset and (2) post-train the decoder to align it with the generated latent space.
+
+ ### Prepare data
+ You can use our script to generate data with your desired dataset, &sigma;, and sample count:
+ ```bash
+ torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 post_train_data.py config=configs/generator/rar.yaml \
+     experiment.output_dir="/path/to/data-folder" \
+     experiment.generator_checkpoint="rar_b.bin" \
+     model.vq_ckpt=/path/to/RobustTok.pt \
+     model.generator.hidden_size=768 \
+     model.generator.num_hidden_layers=24 \
+     model.generator.num_attention_heads=16 \
+     model.generator.intermediate_size=3072 \
+     model.generator.randomize_temperature=1.02 \
+     model.generator.guidance_scale=6.0 \
+     model.generator.guidance_scale_pow=1.15 \
+     --sigma 0.7 --data-path /path/to/imagenet --num_samples /number/of/generate
+ ```
+
+ ### Post-Training
+
+ ```bash
+ torchrun --nproc_per_node=8 tokenizer/tokenizer_image/xqgan_post_train.py --config configs/post-train.yaml --data-path /path/to/data-folder --pair-set /path/to/imagenet --vq-ckpt /path/to/main-train/ckpt
+ ```
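+
+ Conceptually, this step updates only the tokenizer decoder so that images decoded from *generated* tokens match their paired real images, while the encoder and quantizer stay frozen. A rough, illustrative PyTorch sketch of one such step; the `decode()` call, attribute layout, and loss choice are hypothetical, not the repository's actual API:
+
+ ```python
+ import torch
+
+ # Illustrative only: `tokenizer.decode` is a hypothetical API. Assumes
+ # `optimizer` was built over the decoder parameters only, so the encoder
+ # and quantizer remain untouched.
+ def post_train_step(tokenizer, generated_tokens, paired_images, optimizer, loss_fn):
+     recon = tokenizer.decode(generated_tokens)  # decode generated tokens to pixels
+     loss = loss_fn(recon, paired_images)        # e.g., a reconstruction loss
+     optimizer.zero_grad()
+     loss.backward()
+     optimizer.step()
+     return loss.item()
+ ```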
+
+ ---
+
+ ## Inference Code
+
+ ```bash
+ # Reproducing RAR-B
+ torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 sample_imagenet_rar.py config=configs/generator/rar.yaml \
+     experiment.output_dir="rar_b" \
+     experiment.generator_checkpoint="rar_b.bin" \
+     model.vq_ckpt=/path/to/RobustTok.pt \
+     model.generator.hidden_size=768 \
+     model.generator.num_hidden_layers=24 \
+     model.generator.num_attention_heads=16 \
+     model.generator.intermediate_size=3072 \
+     model.generator.randomize_temperature=1.02 \
+     model.generator.guidance_scale=6.0 \
+     model.generator.guidance_scale_pow=1.15
+ # Run the eval script. The resulting FID should be ~1.83 before post-training and ~1.60 after post-training.
+ python3 evaluator.py VIRTUAL_imagenet256_labeled.npz rar_b.npz
+
+ # Reproducing RAR-L
+ torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 sample_imagenet_rar.py config=configs/generator/rar.yaml \
+     experiment.output_dir="rar_l" \
+     experiment.generator_checkpoint="rar_l.bin" \
+     model.vq_ckpt=/path/to/RobustTok.pt \
+     model.generator.hidden_size=1024 \
+     model.generator.num_hidden_layers=24 \
+     model.generator.num_attention_heads=16 \
+     model.generator.intermediate_size=4096 \
+     model.generator.randomize_temperature=1.04 \
+     model.generator.guidance_scale=6.75 \
+     model.generator.guidance_scale_pow=1.01
+ # Run the eval script. The resulting FID should be ~1.60 before post-training and ~1.36 after post-training.
+ python3 evaluator.py VIRTUAL_imagenet256_labeled.npz rar_l.npz
+ ```
+
+ ---
+
+ ## Visualization
+
+ <div align="center">
+ <img src="https://github.com/qiuk2/RobusTok/raw/main/assets/ft-diff.png" alt="vis" width="95%">
+ <p>
+ Visualization of 256&times;256 image generation before (top) and after (bottom) post-training. Three improvements are observed: (a) OOD mitigation, (b) color fidelity, (c) detail refinement.
+ </p>
+ </div>
+
+ ---
+
+ ## Citation
+
+ If our work assists your research, feel free to give us a star ⭐ or cite us using:
+
+ ```bibtex
+ @misc{qiu2025imagetokenizerneedsposttraining,
+       title={Image Tokenizer Needs Post-Training},
+       author={Kai Qiu and Xiang Li and Hao Chen and Jason Kuen and Xiaohao Xu and Jiuxiang Gu and Yinyi Luo and Bhiksha Raj and Zhe Lin and Marios Savvides},
+       year={2025},
+       eprint={2509.12474},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2509.12474},
+ }
+ ```