---
title: VibeToken
emoji: 🦀
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.6.0
python_version: '3.12'
app_file: app.py
pinned: false
license: mit
---
# [CVPR 2026] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
CVPR 2026 | Paper | Project Page | Checkpoints
---
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to **arbitrary resolutions and aspect ratios**, narrowing the gap to diffusion models at scale. At its core is **VibeToken**, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32--256 tokens, achieving a state-of-the-art trade-off between efficiency and performance. Building on VibeToken, we present **VibeToken-Gen**, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.
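To illustrate the dynamic-token interface, here is a minimal toy sketch (made-up shapes and a random linear projection, not the actual VibeToken Transformer): the tokenizer emits an ordered 1D latent sequence, and the user simply truncates it to any length in the supported range.

```python
import numpy as np

def tokenize_1d(image, num_tokens, min_tokens=32, max_tokens=256, dim=16):
    """Toy stand-in for a 1D tokenizer: flatten the image, project it to a
    fixed-length ordered latent sequence, then keep the first `num_tokens`
    entries. Shapes and the projection are hypothetical; VibeToken's real
    encoder is a Transformer."""
    assert min_tokens <= num_tokens <= max_tokens
    rng = np.random.default_rng(0)  # deterministic toy projection
    proj = rng.standard_normal((image.size, max_tokens * dim)) / np.sqrt(image.size)
    latents = (image.reshape(-1) @ proj).reshape(max_tokens, dim)
    # Because the sequence is ordered, fewer tokens == a prefix of the full sequence.
    return latents[:num_tokens]
```

The key property this sketch captures is that shorter token budgets are prefixes of longer ones, so one encoding supports every length in the 32--256 range.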
### 🔥 Highlights
| | |
|---|---|
| 🎯 **1024×1024 in just 64 tokens** | Achieves **3.94 gFID** vs. 5.87 gFID for diffusion-based SOTA (1,024 tokens) |
| ⚡ **Constant 179G FLOPs** | 63× more efficient than LlamaGen (11T FLOPs at 1024×1024) |
| 🌐 **Resolution-agnostic** | Supports arbitrary resolutions and aspect ratios out of the box |
| 🎛️ **Dynamic token count** | User-controllable 32--256 tokens per image |
| 🔍 **Native super-resolution** | Supports image super-resolution out of the box |
## 📰 News
- **[Feb 2026]** 🎉 VibeToken is accepted at **CVPR 2026**!
- **[Feb 2026]** Training scripts released.
- **[Feb 2026]** Inference code and checkpoints released.
## 🚀 Quick Start
```bash
# 1. Clone and setup
git clone https://github.com//VibeToken.git
cd VibeToken
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt
# 2. Download a checkpoint (see Checkpoints section below)
mkdir -p checkpoints
wget https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin -O ./checkpoints/VibeToken_LL.bin
# 3. Reconstruct an image
python reconstruct.py --auto \
--config configs/vibetoken_ll.yaml \
--checkpoint ./checkpoints/VibeToken_LL.bin \
--image ./assets/example_1.png \
--output ./assets/reconstructed.png
```
## 📦 Checkpoints
All checkpoints are hosted on [Hugging Face](https://huggingface.co/mpatel57/VibeToken).
#### Reconstruction Checkpoints
| Name | Resolution | rFID (256 tokens) | rFID (64 tokens) | Download |
|------|:----------:|:-----------------:|:----------------:|----------|
| VibeToken-LL | 1024×1024 | 3.76 | 4.12 | [VibeToken_LL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin) |
| VibeToken-LL | 256×256 | 5.12 | 0.90 | same as above |
| VibeToken-SL | 1024×1024 | 4.25 | 2.41 | [VibeToken_SL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_SL.bin) |
| VibeToken-SL | 256×256 | 5.44 | 0.40 | same as above |
#### Generation Checkpoints
| Name | Training Resolution(s) | Tokens | Best gFID | Download |
|------|:----------------------:|:------:|:---------:|----------|
| VibeToken-Gen-B | 256×256 | 65 | 7.62 | [VibeTokenGen-b-fixed65_dynamic_1500k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-b-fixed65_dynamic_1500k.pt) |
| VibeToken-Gen-B | 1024×1024 | 65 | 7.37 | same as above |
| VibeToken-Gen-XXL | 256×256 | 65 | 3.62 | [VibeTokenGen-xxl-dynamic-65_750k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-xxl-dynamic-65_750k.pt) |
| VibeToken-Gen-XXL | 1024×1024 | 65 | **3.54** | same as above |
## 🛠️ Setup
```bash
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt
```
> **Tip:** If you don't have `uv`, install it via `pip install uv` or see [uv docs](https://github.com/astral-sh/uv). Alternatively, use `python -m venv .venv && pip install -r requirements.txt`.
## 🖼️ VibeToken Reconstruction
Download the VibeToken-LL checkpoint (see [Checkpoints](#-checkpoints)), then:
```bash
# Auto mode (recommended) -- automatically determines optimal patch sizes
python reconstruct.py --auto \
--config configs/vibetoken_ll.yaml \
--checkpoint ./checkpoints/VibeToken_LL.bin \
--image ./assets/example_1.png \
--output ./assets/reconstructed.png
# Manual mode -- specify patch sizes explicitly
python reconstruct.py \
--config configs/vibetoken_ll.yaml \
--checkpoint ./checkpoints/VibeToken_LL.bin \
--image ./assets/example_1.png \
--output ./assets/reconstructed.png \
--encoder_patch_size 16 \
--decoder_patch_size 16
```
> **Note:** For best performance, each side of the input image should be a multiple of 32. Images with other resolutions are automatically rescaled so that each side becomes the nearest multiple of 32.
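The automatic rescaling described above amounts to snapping each side to the nearest multiple of 32. A minimal sketch of that rounding (the repo's own preprocessing may differ in details):

```python
def snap_to_multiple(size, base=32):
    """Round a (width, height) pair to the nearest multiple of `base`,
    never going below `base` itself."""
    return tuple(max(base, base * round(s / base)) for s in size)
```

For example, a 1000×760 input would be rescaled to 992×768 before tokenization, while an already-aligned 1024×1024 input is left unchanged.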
## 🎨 VibeToken-Gen: ImageNet-1k Generation
Download both the VibeToken-LL and VibeToken-Gen-XXL checkpoints (see [Checkpoints](#-checkpoints)), then:
```bash
python generate.py \
--gpt-ckpt ./checkpoints/VibeTokenGen-xxl-dynamic-65_750k.pt \
--gpt-model GPT-XXL --num-output-layer 4 \
--num-codebooks 8 --codebook-size 32768 \
--image-size 256 --cfg-scale 4.0 --top-k 500 --temperature 1.0 \
--class-dropout-prob 0.1 \
--extra-layers "QKV" \
--latent-size 65 \
--config ./configs/vibetoken_ll.yaml \
--vq-ckpt ./checkpoints/VibeToken_LL.bin \
--sample-dir ./assets/ \
--skip-folder-creation \
--compile \
--decoder-patch-size 32,32 \
--target-resolution 1024,1024 \
--llamagen-target-resolution 256,256 \
--precision bf16 \
--global-seed 156464151
```
The `--target-resolution` flag controls the tokenizer's output resolution, while `--llamagen-target-resolution` controls the generator's internal resolution (capped at 512×512; for higher output resolutions, the tokenizer handles the upscaling).
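Both flags take `W,H` values. A hypothetical helper (not the repo's actual argument parser) shows how such a value can be parsed and how the generator's 512-per-side cap could be enforced:

```python
def parse_resolution(value, max_side=None):
    """Parse a 'W,H' flag value (e.g. '1024,1024') into an (int, int) tuple,
    optionally enforcing a per-side cap such as the generator's 512 limit."""
    w, h = (int(s) for s in value.split(","))
    if max_side is not None and (w > max_side or h > max_side):
        raise ValueError(f"resolution {w}x{h} exceeds the {max_side} per-side cap")
    return (w, h)
```

With the example command above, `--target-resolution 1024,1024` and `--llamagen-target-resolution 256,256` mean the generator works at 256×256 internally while the tokenizer decodes directly to 1024×1024.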
## 🏋️ Training
To train the VibeToken tokenizer from scratch, please refer to [TRAIN.md](TRAIN.md) for detailed instructions.
## 🙏 Acknowledgement
We would like to acknowledge the following repositories that inspired our work and upon which we directly build:
[1d-tokenizer](https://github.com/bytedance/1d-tokenizer),
[LlamaGen](https://github.com/FoundationVision/LlamaGen), and
[UniTok](https://github.com/FoundationVision/UniTok).
## 📝 Citation
If you find VibeToken useful in your research, please consider citing:
```bibtex
@inproceedings{vibetoken2026,
title = {VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations},
author = {Patel, Maitreya and Li, Jingtao and Zhuang, Weiming and Yang, Yezhou and Lyu, Lingjuan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
```
If you have any questions, feel free to open an issue or reach out!