---
extra_gated_fields:
  Name: text
  Institute: text
  Institutional Email: text
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Model Card for NVG series

<p align="center">
  <h1 align="center">Next Visual Granularity Generation</h1>
  <center>Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy.<br>
  S-Lab, Nanyang Technological University; SenseTime Research<br></center>
  <p align="center">
    <a href="https://arxiv.org/abs/2508.12811"><img alt='arXiv' src="https://img.shields.io/badge/arXiv-2508.12811-b31b1b.svg"></a>
    <a href="https://yikai-wang.github.io/nvg/"><img alt='page' src="https://img.shields.io/badge/Project-Website-orange"></a>
  </p>
</p>

## Model Details

### Model Description

We propose a novel approach to image generation that decomposes an image into a structured sequence, where each element of the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity.

Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels.

We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently achieves better FID scores than the corresponding VAR models (3.30 → 3.03, 2.57 → 2.44, 2.09 → 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Code and models are available via the links below.
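As a rough illustration of the coarse-to-fine idea only (this is not the actual NVG implementation; the function name, grid shape, and token scheme below are hypothetical), the sketch keeps the spatial resolution fixed across all stages while allowing each later stage to use more unique tokens:

```python
import random

def toy_granularity_sequence(size=8, vocab_sizes=(1, 2, 4, 8), seed=0):
    """Toy coarse-to-fine sketch: every stage is a size x size grid of
    token ids, but later stages may draw from a larger vocabulary,
    loosely mimicking finer visual granularity at a fixed resolution."""
    rng = random.Random(seed)
    sequence = []
    for k in vocab_sizes:
        # Re-draw each position from a vocabulary of k token ids;
        # the first stage (k=1) is the all-zero "empty" image.
        stage = [[rng.randrange(k) for _ in range(size)] for _ in range(size)]
        sequence.append(stage)
    return sequence
```

In the real model, each stage would of course be predicted by the network conditioned on the previous stage and the class label, rather than sampled at random as in this toy.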
- **License:** S-Lab License 1.0

### Model Sources

- **Code:** https://github.com/Yikai-Wang/nvg
- **Paper:** https://arxiv.org/abs/2508.12811

## Uses

Usage instructions are illustrated in the GitHub repository linked above.

## Citation

**BibTeX:**

```bibtex
@article{wang2025next,
  title={Next Visual Granularity Generation},
  author={Wang, Yikai and Wang, Zhouxia and Wu, Zhonghua and Tao, Qingyi and Liao, Kang and Loy, Chen Change},
  journal={arXiv preprint arXiv:2508.12811},
  year={2025}
}
```