---
title: VibeToken
emoji: πŸ¦€
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.6.0
python_version: '3.12'
app_file: app.py
pinned: false
license: mit
---

# [CVPR 2026] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

*VibeToken teaser figure*

CVPR 2026  |  Paper  |  Project Page  |  Checkpoints


We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art trade-off between efficiency and reconstruction quality. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.

## πŸ”₯ Highlights

- 🎯 **1024Γ—1024 in just 64 tokens**: achieves 3.94 gFID vs. 5.87 gFID for the diffusion-based SOTA (1,024 tokens)
- ⚑ **Constant 179G FLOPs**: 63Γ— more efficient than LlamaGen (11T FLOPs at 1024Γ—1024)
- 🌐 **Resolution-agnostic**: supports arbitrary resolutions and aspect ratios out of the box
- πŸŽ›οΈ **Dynamic token count**: user-controllable 32–256 tokens per image
- πŸ” **Native super-resolution**: supports image super-resolution out of the box

## πŸ“° News

- **[Feb 2026]** πŸŽ‰ VibeToken is accepted at CVPR 2026!
- **[Feb 2026]** Training scripts released.
- **[Feb 2026]** Inference code and checkpoints released.

## πŸš€ Quick Start

```bash
# 1. Clone and set up the environment
git clone https://github.com/<your-org>/VibeToken.git
cd VibeToken
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt

# 2. Download a checkpoint (see Checkpoints section below)
mkdir -p checkpoints
wget https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin -O ./checkpoints/VibeToken_LL.bin

# 3. Reconstruct an image
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png
```

## πŸ“¦ Checkpoints

All checkpoints are hosted on Hugging Face.

### Reconstruction Checkpoints

| Name | Resolution | rFID (256 tokens) | rFID (64 tokens) | Download |
|------|------------|-------------------|------------------|----------|
| VibeToken-LL | 1024Γ—1024 | 3.76 | 4.12 | VibeToken_LL.bin |
| VibeToken-LL | 256Γ—256 | 5.12 | 0.90 | same as above |
| VibeToken-SL | 1024Γ—1024 | 4.25 | 2.41 | VibeToken_SL.bin |
| VibeToken-SL | 256Γ—256 | 5.44 | 0.40 | same as above |

### Generation Checkpoints

| Name | Training Resolution(s) | Tokens | Best gFID | Download |
|------|------------------------|--------|-----------|----------|
| VibeToken-Gen-B | 256Γ—256 | 65 | 7.62 | VibeTokenGen-b-fixed65_dynamic_1500k.pt |
| VibeToken-Gen-B | 1024Γ—1024 | 65 | 7.37 | same as above |
| VibeToken-Gen-XXL | 256Γ—256 | 65 | 3.62 | VibeTokenGen-xxl-dynamic-65_750k.pt |
| VibeToken-Gen-XXL | 1024Γ—1024 | 65 | 3.54 | same as above |

πŸ› οΈ Setup

uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt

Tip: If you don't have uv, install it via pip install uv or see uv docs. Alternatively, use python -m venv .venv && pip install -r requirements.txt.

πŸ–ΌοΈ VibeToken Reconstruction

Download the VibeToken-LL checkpoint (see Checkpoints), then:

# Auto mode (recommended) -- automatically determines optimal patch sizes
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png

# Manual mode -- specify patch sizes explicitly
python reconstruct.py \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png \
  --encoder_patch_size 16 \
  --decoder_patch_size 16

Note: For best performance, the input image resolution should be a multiple of 32. Images with other resolutions are automatically rescaled to the nearest multiple of 32.
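The rescaling described in the note can be sketched as follows (an illustrative helper with names of our choosing, not the repo's own preprocessing; pair it with any image library's resize):

```python
def nearest_multiple(x: int, m: int = 32) -> int:
    """Round a dimension to the nearest multiple of m (never below m)."""
    return max(m, round(x / m) * m)

# A 1000x750 input would be rescaled to 992x736 before tokenization.
print(nearest_multiple(1000), nearest_multiple(750))
```

Resolutions that are already multiples of 32 (e.g. 256, 1024) pass through unchanged.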

## 🎨 VibeToken-Gen: ImageNet-1k Generation

Download both the VibeToken-LL and VibeToken-Gen-XXL checkpoints (see Checkpoints), then:

```bash
python generate.py \
    --gpt-ckpt ./checkpoints/VibeTokenGen-xxl-dynamic-65_750k.pt \
    --gpt-model GPT-XXL --num-output-layer 4 \
    --num-codebooks 8 --codebook-size 32768 \
    --image-size 256 --cfg-scale 4.0 --top-k 500 --temperature 1.0 \
    --class-dropout-prob 0.1 \
    --extra-layers "QKV" \
    --latent-size 65 \
    --config ./configs/vibetoken_ll.yaml \
    --vq-ckpt ./checkpoints/VibeToken_LL.bin \
    --sample-dir ./assets/ \
    --skip-folder-creation \
    --compile \
    --decoder-patch-size 32,32 \
    --target-resolution 1024,1024 \
    --llamagen-target-resolution 256,256 \
    --precision bf16 \
    --global-seed 156464151
```

The `--target-resolution` flag controls the tokenizer's output resolution, while `--llamagen-target-resolution` controls the generator's internal resolution (max 512Γ—512; for higher target resolutions, the tokenizer handles the upscaling).
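The `--cfg-scale`, `--top-k`, and `--temperature` flags correspond to the standard classifier-free-guidance plus top-k sampling recipe used by AR image generators. A minimal NumPy sketch of one sampling step (illustrative only; the function and variable names are ours, not the repo's):

```python
import numpy as np

def cfg_topk_sample(cond_logits, uncond_logits,
                    cfg_scale=4.0, top_k=500, temperature=1.0, rng=None):
    # Classifier-free guidance: extrapolate the conditional logits
    # away from the unconditional ones by cfg_scale.
    logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)
    logits = logits / temperature
    # Top-k filtering: mask everything below the k-th largest logit.
    kth = np.sort(logits)[-top_k]
    logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits, then sample one token id.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))
```

With `cfg_scale=1.0` the guidance term vanishes and this reduces to plain temperature/top-k sampling.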

πŸ‹οΈ Training

To train the VibeToken tokenizer from scratch, please refer to TRAIN.md for detailed instructions.

πŸ™ Acknowledgement

We would like to acknowledge the following repositories that inspired our work and upon which we directly build: 1d-tokenizer, LlamaGen, and UniTok.

πŸ“ Citation

If you find VibeToken useful in your research, please consider citing:

@inproceedings{vibetoken2026,
  title     = {VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations},
  author    = {Patel, Maitreya and Li, Jingtao and Zhuang, Weiming and Yang, Yezhou and Lyu, Lingjuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

If you have any questions, feel free to open an issue or reach out!