---
title: VibeToken
emoji: 📦
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.6.0
python_version: '3.12'
app_file: app.py
pinned: false
license: mit
---
# [CVPR 2026] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
<p align="center">
<img src="assets/teaser.png" alt="VibeToken Teaser" width="100%">
</p>
<p align="center">
<b>CVPR 2026</b> |
<a href="#">Paper</a> |
<a href="#">Project Page</a> |
<a href="#-checkpoints">Checkpoints</a>
</p>
<p align="center">
<img src="https://img.shields.io/badge/CVPR-2026-blue" alt="CVPR 2026">
<img src="https://img.shields.io/badge/arXiv-TODO-b31b1b" alt="arXiv">
<img src="https://img.shields.io/badge/License-MIT-green" alt="License">
<a href="https://huggingface.co/mpatel57/VibeToken"><img src="https://img.shields.io/badge/🤗-Model-yellow" alt="HuggingFace"></a>
</p>
---
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to **arbitrary resolutions and aspect ratios**, narrowing the gap to diffusion models at scale. At its core is **VibeToken**, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art trade-off between efficiency and performance. Building on VibeToken, we present **VibeToken-Gen**, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.
### 🔥 Highlights
| | |
|---|---|
| 🎯 **1024×1024 in just 64 tokens** | Achieves **3.94 gFID** vs. 5.87 gFID for diffusion-based SOTA (1,024 tokens) |
| ⚡ **Constant 179G FLOPs** | 63× more efficient than LlamaGen (11T FLOPs at 1024×1024) |
| 🌐 **Resolution-agnostic** | Supports arbitrary resolutions and aspect ratios out of the box |
| 🎛️ **Dynamic token count** | User-controllable 32–256 tokens per image |
| 🔍 **Native super-resolution** | Supports image super-resolution out of the box |
## 📰 News
- **[Feb 2026]** 🎉 VibeToken is accepted at **CVPR 2026**!
- **[Feb 2026]** Training scripts released.
- **[Feb 2026]** Inference code and checkpoints released.
## 🚀 Quick Start
```bash
# 1. Clone and setup
git clone https://github.com/<your-org>/VibeToken.git
cd VibeToken
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt
# 2. Download a checkpoint (see Checkpoints section below)
mkdir -p checkpoints
wget https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin -O ./checkpoints/VibeToken_LL.bin
# 3. Reconstruct an image
python reconstruct.py --auto \
--config configs/vibetoken_ll.yaml \
--checkpoint ./checkpoints/VibeToken_LL.bin \
--image ./assets/example_1.png \
--output ./assets/reconstructed.png
```
## 📦 Checkpoints
All checkpoints are hosted on [Hugging Face](https://huggingface.co/mpatel57/VibeToken).
#### Reconstruction Checkpoints
| Name | Resolution | rFID (256 tokens) | rFID (64 tokens) | Download |
|------|:----------:|:-----------------:|:----------------:|----------|
| VibeToken-LL | 1024×1024 | 3.76 | 4.12 | [VibeToken_LL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin) |
| VibeToken-LL | 256×256 | 5.12 | 0.90 | same as above |
| VibeToken-SL | 1024×1024 | 4.25 | 2.41 | [VibeToken_SL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_SL.bin) |
| VibeToken-SL | 256×256 | 5.44 | 0.40 | same as above |
#### Generation Checkpoints
| Name | Training Resolution(s) | Tokens | Best gFID | Download |
|------|:----------------------:|:------:|:---------:|----------|
| VibeToken-Gen-B | 256×256 | 65 | 7.62 | [VibeTokenGen-b-fixed65_dynamic_1500k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-b-fixed65_dynamic_1500k.pt) |
| VibeToken-Gen-B | 1024×1024 | 65 | 7.37 | same as above |
| VibeToken-Gen-XXL | 256×256 | 65 | 3.62 | [VibeTokenGen-xxl-dynamic-65_750k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-xxl-dynamic-65_750k.pt) |
| VibeToken-Gen-XXL | 1024×1024 | 65 | **3.54** | same as above |
## 🛠️ Setup
```bash
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt
```
> **Tip:** If you don't have `uv`, install it via `pip install uv` or see [uv docs](https://github.com/astral-sh/uv). Alternatively, use `python -m venv .venv && pip install -r requirements.txt`.
## 🖼️ VibeToken Reconstruction
Download the VibeToken-LL checkpoint (see [Checkpoints](#-checkpoints)), then:
```bash
# Auto mode (recommended) -- automatically determines optimal patch sizes
python reconstruct.py --auto \
--config configs/vibetoken_ll.yaml \
--checkpoint ./checkpoints/VibeToken_LL.bin \
--image ./assets/example_1.png \
--output ./assets/reconstructed.png
# Manual mode -- specify patch sizes explicitly
python reconstruct.py \
--config configs/vibetoken_ll.yaml \
--checkpoint ./checkpoints/VibeToken_LL.bin \
--image ./assets/example_1.png \
--output ./assets/reconstructed.png \
--encoder_patch_size 16 \
--decoder_patch_size 16
```
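For intuition on the patch-size flags, here is a generic ViT-style relation (an assumption for illustration, not taken from the released code): with square, non-overlapping patches of side `ps`, the encoder sees `(H/ps) × (W/ps)` patches, which VibeToken then compresses into its 1D sequence of 32–256 latent tokens.

```python
def patch_grid(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT-style encoder extracts.

    Generic illustration; the actual patch handling lives in reconstruct.py.
    """
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)

# A 1024x1024 image with --encoder_patch_size 16 yields a 64x64 grid:
print(patch_grid(1024, 1024, 16))  # -> 4096
```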
> **Note:** For best performance, the input image resolution should be a multiple of 32. Images with other resolutions are automatically rescaled to the nearest multiple of 32.
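The rescaling rule in the note above can be sketched as follows (a minimal illustration; the rounding direction is an assumption, and the actual preprocessing lives in `reconstruct.py`):

```python
def snap_to_32(side: int) -> int:
    """Round one side length to the nearest multiple of 32 (minimum 32)."""
    return max(32, round(side / 32) * 32)

def rescale_resolution(width: int, height: int) -> tuple[int, int]:
    """Snap an arbitrary resolution onto the 32-pixel grid the tokenizer expects."""
    return snap_to_32(width), snap_to_32(height)

print(rescale_resolution(1000, 750))  # -> (992, 736)
```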
## 🎨 VibeToken-Gen: ImageNet-1k Generation
Download both the VibeToken-LL and VibeToken-Gen-XXL checkpoints (see [Checkpoints](#-checkpoints)), then:
```bash
python generate.py \
--gpt-ckpt ./checkpoints/VibeTokenGen-xxl-dynamic-65_750k.pt \
--gpt-model GPT-XXL --num-output-layer 4 \
--num-codebooks 8 --codebook-size 32768 \
--image-size 256 --cfg-scale 4.0 --top-k 500 --temperature 1.0 \
--class-dropout-prob 0.1 \
--extra-layers "QKV" \
--latent-size 65 \
--config ./configs/vibetoken_ll.yaml \
--vq-ckpt ./checkpoints/VibeToken_LL.bin \
--sample-dir ./assets/ \
--skip-folder-creation \
--compile \
--decoder-patch-size 32,32 \
--target-resolution 1024,1024 \
--llamagen-target-resolution 256,256 \
--precision bf16 \
--global-seed 156464151
```
The `--target-resolution` flag controls the tokenizer's output resolution, while `--llamagen-target-resolution` controls the generator's internal resolution (capped at 512×512; for higher output resolutions, the tokenizer handles the upscaling).
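As a sanity check on those two flags (a hypothetical helper, not part of the released code): the generator resolution must stay within the 512×512 cap, and the tokenizer covers the remaining upscale factor to the target.

```python
def check_resolutions(target: tuple[int, int], llamagen_target: tuple[int, int],
                      cap: int = 512) -> tuple[float, float]:
    """Validate the two resolution flags and return the tokenizer's upscale factor."""
    (tw, th), (gw, gh) = target, llamagen_target
    if gw > cap or gh > cap:
        raise ValueError("--llamagen-target-resolution is capped at 512x512")
    # The tokenizer decodes from the generator's resolution up to the target.
    return tw / gw, th / gh

# The example command above: target 1024x1024, generator at 256x256.
print(check_resolutions((1024, 1024), (256, 256)))  # -> (4.0, 4.0)
```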
## 🏋️ Training
To train the VibeToken tokenizer from scratch, please refer to [TRAIN.md](TRAIN.md) for detailed instructions.
## 🙏 Acknowledgement
We would like to acknowledge the following repositories that inspired our work and upon which we directly build:
[1d-tokenizer](https://github.com/bytedance/1d-tokenizer),
[LlamaGen](https://github.com/FoundationVision/LlamaGen), and
[UniTok](https://github.com/FoundationVision/UniTok).
## 📝 Citation
If you find VibeToken useful in your research, please consider citing:
```bibtex
@inproceedings{vibetoken2026,
title = {VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations},
author = {Patel, Maitreya and Li, Jingtao and Zhuang, Weiming and Yang, Yezhou and Lyu, Lingjuan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
```
If you have any questions, feel free to open an issue or reach out!