---
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/VTP/blob/main/LICENSE
language:
- en
pipeline_tag: image-feature-extraction
library_name: transformers
---
# Towards Scalable Pre-training of Visual Tokenizers for Generation

[Jingfeng Yao](https://github.com/JingfengYao)<sup>1</sup>, [Yuda Song](https://github.com/IDKiro)<sup>2</sup>, Yucong Zhou<sup>2</sup>, [Xinggang Wang](https://xwcv.github.io/)<sup>1,*</sup>

<sup>1</sup>Huazhong University of Science and Technology &nbsp; <sup>2</sup>MiniMax

<sup>*</sup>Corresponding author: xgwang@hust.edu.cn
***Work still in progress.***
[MiniMax](https://www.minimax.io/) | [MiniMax Hailuo 2.3](https://www.minimax.io/news/minimax-hailuo-23) | [HUSTVL](https://github.com/hustvl) | [Model Weights](https://huggingface.co/MiniMaxAI/VTP-Large-f16d64) | [Code](https://github.com/MiniMax-AI/VTP) | [Paper](https://arxiv.org/abs/2512.13687)
## News
- **[2025.12.16]** We have released our [technical report](https://arxiv.org/abs/2512.13687) and [pretrained weights](#get-checkpoints).
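
The snippet below is a minimal sketch of loading the released checkpoint with `transformers` (the model card declares `library_name: transformers` and `pipeline_tag: image-feature-extraction`). The exact model class, preprocessing, and output shape are assumptions, not the confirmed API; see the [official repository](https://github.com/MiniMax-AI/VTP) for the supported usage.

```python
# Hedged sketch: assumes the checkpoint ships custom modeling code loadable
# via AutoModel with trust_remote_code. Verify against the official repo.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "MiniMaxAI/VTP-Large-f16d64",
    trust_remote_code=True,  # assumption: custom code on the Hub
)
model.eval()

# Reading the name "f16d64" as 16x spatial downsampling with 64 latent
# channels (an assumption), a 256x256 image would map to a 16x16x64 latent.
images = torch.randn(1, 3, 256, 256)  # dummy RGB batch; real use needs the
                                      # model's own image preprocessing
with torch.no_grad():
    latents = model(images)  # hypothetical call; output format may differ
```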
## Takeaways
By integrating contrastive, self-supervised, and reconstruction learning, we train a family of visual tokenizers from scratch, aiming to reveal a new kind of scalability that links understanding, generation, and reconstruction (a rough sketch of such a combined objective follows the list below).
- **At the same DiT training FLOPs, scaling VTP yields better generation.**
- **Traditional autoencoders CANNOT be scaled up for diffusion-based generative models.**
- **Understanding is the key driver of this improved learnability scaling.**
- **Parameter, data, and training scalability emerge once representation learning is involved.**
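
As a rough illustration of "integrating contrastive, self-supervised, and reconstruction learning," the sketch below combines the three loss terms on tokenizer latents. The module names, pooling, loss formulations, and weights are all illustrative assumptions, not the paper's exact objective.

```python
# Illustrative sketch (NOT the paper's exact objective): a visual tokenizer
# trained with a weighted sum of reconstruction, contrastive (image-text),
# and self-supervised (teacher feature distillation) losses.
import torch
import torch.nn.functional as F

def tokenizer_loss(encoder, decoder, text_proj, teacher,
                   images, text_emb, w_rec=1.0, w_con=0.5, w_ssl=0.5):
    z = encoder(images)  # latents, e.g. (B, 64, 16, 16) for an f16d64 model

    # Reconstruction: pixel-space L1 between decoded and input images.
    rec = F.l1_loss(decoder(z), images)

    # Contrastive: CLIP-style alignment of pooled latents with text embeddings.
    img_emb = F.normalize(z.mean(dim=(2, 3)), dim=-1)   # (B, D) pooled latents
    txt_emb = F.normalize(text_proj(text_emb), dim=-1)  # (B, D) projected text
    logits = img_emb @ txt_emb.t() / 0.07               # assumed temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    con = (F.cross_entropy(logits, labels) +
           F.cross_entropy(logits.t(), labels)) / 2

    # Self-supervised: distill features from a frozen vision teacher
    # (e.g. a DINO-style encoder; the actual teacher is an assumption).
    with torch.no_grad():
        target = teacher(images)                        # (B, D) teacher features
    ssl = F.mse_loss(z.mean(dim=(2, 3)), target)

    return w_rec * rec + w_con * con + w_ssl * ssl
```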