---
title: VibeToken
emoji: 📦
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.6.0
python_version: '3.12'
app_file: app.py
pinned: false
license: mit
---

# [CVPR 2026] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

<p align="center">
  <img src="assets/teaser.png" alt="VibeToken Teaser" width="100%">
</p>

<p align="center">
  <b>CVPR 2026</b> |
  <a href="#">Paper</a> |
  <a href="#">Project Page</a> |
  <a href="#-checkpoints">Checkpoints</a>
</p>

<p align="center">
  <img src="https://img.shields.io/badge/CVPR-2026-blue" alt="CVPR 2026">
  <img src="https://img.shields.io/badge/arXiv-TODO-b31b1b" alt="arXiv">
  <img src="https://img.shields.io/badge/License-MIT-green" alt="License">
  <a href="https://huggingface.co/mpatel57/VibeToken"><img src="https://img.shields.io/badge/🤗-Model-yellow" alt="HuggingFace"></a>
</p>

---
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to **arbitrary resolutions and aspect ratios**, narrowing the gap to diffusion models at scale. At its core is **VibeToken**, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art efficiency–performance trade-off. Building on VibeToken, we present **VibeToken-Gen**, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.
### 🔥 Highlights

| | |
|---|---|
| 🎯 **1024×1024 in just 64 tokens** | Achieves **3.94 gFID** vs. 5.87 gFID for diffusion-based SOTA (1,024 tokens) |
| ⚡ **Constant 179G FLOPs** | 63× more efficient than LlamaGen (11T FLOPs at 1024×1024) |
| 🌐 **Resolution-agnostic** | Supports arbitrary resolutions and aspect ratios out of the box |
| 🎛️ **Dynamic token count** | User-controllable 32–256 tokens per image |
| 🔍 **Native super-resolution** | Supports image super-resolution out of the box |
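The dynamic token count above can be sketched at the interface level. The `MockVibeTokenizer` class below is a hypothetical placeholder, not the real model (which is a 1D Transformer); this numpy mock only illustrates the shapes involved: an input of any resolution maps to a user-chosen number of 1D tokens, which can then be decoded at any target resolution.

```python
import numpy as np

class MockVibeTokenizer:
    """Shape-level mock of a resolution-agnostic 1D tokenizer.

    Illustration only: the real VibeToken model is a Transformer.
    This mock just demonstrates the interface contract -- encode an
    HxWx3 image into a user-chosen number of 1D tokens, then decode
    those tokens at an arbitrary output resolution.
    """

    def __init__(self, token_dim: int = 16):
        self.token_dim = token_dim

    def encode(self, image: np.ndarray, num_tokens: int) -> np.ndarray:
        # Flatten spatial dims, then subsample down to `num_tokens` slots.
        h, w, c = image.shape
        flat = image.reshape(h * w, c)
        idx = np.linspace(0, h * w - 1, num_tokens).astype(int)
        pooled = flat[idx]  # (num_tokens, c)
        # Expand channel means to the latent token dimension.
        return np.tile(pooled.mean(axis=1, keepdims=True), (1, self.token_dim))

    def decode(self, tokens: np.ndarray, out_hw: tuple) -> np.ndarray:
        # Broadcast token averages back onto the target pixel grid.
        h, w = out_hw
        idx = np.linspace(0, len(tokens) - 1, h * w).astype(int)
        pixels = tokens[idx].mean(axis=1, keepdims=True)
        return np.repeat(pixels, 3, axis=1).reshape(h, w, 3)

tok = MockVibeTokenizer()
img = np.random.rand(1024, 768, 3)       # arbitrary aspect ratio
z = tok.encode(img, num_tokens=64)       # 64 tokens regardless of input size
out = tok.decode(z, out_hw=(512, 512))   # decode at a different resolution
print(z.shape, out.shape)                # (64, 16) (512, 512, 3)
```

The key property mirrored here is that the token count is decoupled from both the input and output resolutions.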
## 📰 News

- **[Feb 2026]** 🎉 VibeToken is accepted at **CVPR 2026**!
- **[Feb 2026]** Training scripts released.
- **[Feb 2026]** Inference code and checkpoints released.
## 🚀 Quick Start

```bash
# 1. Clone and set up the environment
git clone https://github.com/<your-org>/VibeToken.git
cd VibeToken
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt

# 2. Download a checkpoint (see the Checkpoints section below)
mkdir -p checkpoints
wget https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin -O ./checkpoints/VibeToken_LL.bin

# 3. Reconstruct an image
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png
```
## 📦 Checkpoints

All checkpoints are hosted on [Hugging Face](https://huggingface.co/mpatel57/VibeToken).

#### Reconstruction Checkpoints
| Name | Resolution | rFID (256 tokens) | rFID (64 tokens) | Download |
|------|:----------:|:-----------------:|:----------------:|----------|
| VibeToken-LL | 1024×1024 | 3.76 | 4.12 | [VibeToken_LL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin) |
| VibeToken-LL | 256×256 | 5.12 | 0.90 | same as above |
| VibeToken-SL | 1024×1024 | 4.25 | 2.41 | [VibeToken_SL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_SL.bin) |
| VibeToken-SL | 256×256 | 5.44 | 0.40 | same as above |
#### Generation Checkpoints

| Name | Training Resolution(s) | Tokens | Best gFID | Download |
|------|:----------------------:|:------:|:---------:|----------|
| VibeToken-Gen-B | 256×256 | 65 | 7.62 | [VibeTokenGen-b-fixed65_dynamic_1500k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-b-fixed65_dynamic_1500k.pt) |
| VibeToken-Gen-B | 1024×1024 | 65 | 7.37 | same as above |
| VibeToken-Gen-XXL | 256×256 | 65 | 3.62 | [VibeTokenGen-xxl-dynamic-65_750k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-xxl-dynamic-65_750k.pt) |
| VibeToken-Gen-XXL | 1024×1024 | 65 | **3.54** | same as above |
## 🛠️ Setup

```bash
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt
```

> **Tip:** If you don't have `uv`, install it via `pip install uv` or see the [uv docs](https://github.com/astral-sh/uv). Alternatively, use `python -m venv .venv && pip install -r requirements.txt`.
## 🖼️ VibeToken Reconstruction

Download the VibeToken-LL checkpoint (see [Checkpoints](#-checkpoints)), then:
```bash
# Auto mode (recommended): automatically determines optimal patch sizes
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png

# Manual mode: specify patch sizes explicitly
python reconstruct.py \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png \
  --encoder_patch_size 16 \
  --decoder_patch_size 16
```
> **Note:** For best performance, the input image resolution should be a multiple of 32. Images with other resolutions are automatically rescaled to the nearest multiple of 32.
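Since inputs are rescaled to the nearest multiple of 32, you can precompute the resolution the model will actually operate on. The helper below is not part of the repository; it is a minimal sketch of the rounding described above, assuming standard nearest-multiple rounding with a one-tile minimum.

```python
def snap_to_multiple(size: int, base: int = 32) -> int:
    """Round a single dimension to the nearest multiple of `base`
    (clamped to at least one tile)."""
    return max(base, round(size / base) * base)

def snap_resolution(width: int, height: int, base: int = 32) -> tuple[int, int]:
    """Return the (width, height) the tokenizer will actually process."""
    return snap_to_multiple(width, base), snap_to_multiple(height, base)

print(snap_resolution(1000, 750))   # (992, 736): nearest multiples of 32
print(snap_resolution(1024, 1024))  # (1024, 1024): already aligned, unchanged
```

This is useful for predicting the output size of `reconstruct.py` before running it on oddly-sized inputs.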
## 🎨 VibeToken-Gen: ImageNet-1k Generation

Download both the VibeToken-LL and VibeToken-Gen-XXL checkpoints (see [Checkpoints](#-checkpoints)), then:
```bash
python generate.py \
  --gpt-ckpt ./checkpoints/VibeTokenGen-xxl-dynamic-65_750k.pt \
  --gpt-model GPT-XXL --num-output-layer 4 \
  --num-codebooks 8 --codebook-size 32768 \
  --image-size 256 --cfg-scale 4.0 --top-k 500 --temperature 1.0 \
  --class-dropout-prob 0.1 \
  --extra-layers "QKV" \
  --latent-size 65 \
  --config ./configs/vibetoken_ll.yaml \
  --vq-ckpt ./checkpoints/VibeToken_LL.bin \
  --sample-dir ./assets/ \
  --skip-folder-creation \
  --compile \
  --decoder-patch-size 32,32 \
  --target-resolution 1024,1024 \
  --llamagen-target-resolution 256,256 \
  --precision bf16 \
  --global-seed 156464151
```
The `--target-resolution` flag controls the tokenizer's output resolution, while `--llamagen-target-resolution` controls the generator's internal resolution (at most 512×512; for higher output resolutions, the tokenizer handles the upscaling).
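The split between the two resolution flags can be expressed as a small helper. The function below is not part of the repository; it sketches the routing described above, assuming the generator caps its internal resolution at 512×512 and the tokenizer decodes at the full target size (the example command above chooses an even smaller 256×256 internal resolution).

```python
def plan_resolutions(target_w: int, target_h: int, cap: int = 512):
    """Split a target output resolution into the generator's internal
    resolution (longest side clamped to `cap`, aspect ratio preserved)
    and the tokenizer's output resolution.

    Returns (generator_wh, tokenizer_wh): the generator runs at the
    clamped size while the tokenizer decodes at the full target size.
    """
    scale = min(1.0, cap / max(target_w, target_h))
    gen_w = int(target_w * scale)
    gen_h = int(target_h * scale)
    return (gen_w, gen_h), (target_w, target_h)

print(plan_resolutions(1024, 1024))  # ((512, 512), (1024, 1024))
print(plan_resolutions(256, 256))    # ((256, 256), (256, 256))
```

Resolutions at or below the cap pass through unchanged; above it, only the tokenizer's target grows while the generator's cost stays bounded, which is where the constant-FLOPs behavior comes from.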
## 🏋️ Training

To train the VibeToken tokenizer from scratch, see [TRAIN.md](TRAIN.md) for detailed instructions.
## 🙏 Acknowledgement

We gratefully acknowledge the following repositories, which inspired our work and on which we directly build:
[1d-tokenizer](https://github.com/bytedance/1d-tokenizer),
[LlamaGen](https://github.com/FoundationVision/LlamaGen), and
[UniTok](https://github.com/FoundationVision/UniTok).
## 📖 Citation

If you find VibeToken useful in your research, please consider citing:

```bibtex
@inproceedings{vibetoken2026,
  title     = {VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations},
  author    = {Patel, Maitreya and Li, Jingtao and Zhuang, Weiming and Yang, Yezhou and Lyu, Lingjuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
If you have any questions, feel free to open an issue or reach out!