---
title: VibeToken
emoji: πŸ¦€
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.6.0
python_version: '3.12'
app_file: app.py
pinned: false
license: mit
---

# [CVPR 2026] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

*VibeToken teaser figure*

CVPR 2026  |  Paper  |  Project Page  |  Checkpoints


We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art trade-off between efficiency and reconstruction quality. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.

## πŸ”₯ Highlights

- 🎯 **1024Γ—1024 in just 64 tokens**: achieves 3.94 gFID vs. 5.87 gFID for the diffusion-based SOTA (1,024 tokens)
- ⚑ **Constant 179G FLOPs**: 63Γ— more efficient than LlamaGen (11T FLOPs at 1024Γ—1024)
- 🌐 **Resolution-agnostic**: supports arbitrary resolutions and aspect ratios out of the box
- πŸŽ›οΈ **Dynamic token count**: user-controllable 32–256 tokens per image
- πŸ” **Native super-resolution**: supports image super-resolution out of the box

## πŸ“° News

- **[Feb 2026]** πŸŽ‰ VibeToken is accepted at CVPR 2026!
- **[Feb 2026]** Training scripts released.
- **[Feb 2026]** Inference code and checkpoints released.

## πŸš€ Quick Start

```bash
# 1. Clone and set up the environment
git clone https://github.com/<your-org>/VibeToken.git
cd VibeToken
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt

# 2. Download a checkpoint (see Checkpoints section below)
mkdir -p checkpoints
wget https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin -O ./checkpoints/VibeToken_LL.bin

# 3. Reconstruct an image
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png
```

## πŸ“¦ Checkpoints

All checkpoints are hosted on Hugging Face.

### Reconstruction Checkpoints

| Name | Resolution | rFID (256 tokens) | rFID (64 tokens) | Download |
|------|------------|-------------------|------------------|----------|
| VibeToken-LL | 1024Γ—1024 | 3.76 | 4.12 | VibeToken_LL.bin |
| VibeToken-LL | 256Γ—256 | 5.12 | 0.90 | same as above |
| VibeToken-SL | 1024Γ—1024 | 4.25 | 2.41 | VibeToken_SL.bin |
| VibeToken-SL | 256Γ—256 | 5.44 | 0.40 | same as above |

### Generation Checkpoints

| Name | Training Resolution(s) | Tokens | Best gFID | Download |
|------|------------------------|--------|-----------|----------|
| VibeToken-Gen-B | 256Γ—256 | 65 | 7.62 | VibeTokenGen-b-fixed65_dynamic_1500k.pt |
| VibeToken-Gen-B | 1024Γ—1024 | 65 | 7.37 | same as above |
| VibeToken-Gen-XXL | 256Γ—256 | 65 | 3.62 | VibeTokenGen-xxl-dynamic-65_750k.pt |
| VibeToken-Gen-XXL | 1024Γ—1024 | 65 | 3.54 | same as above |

πŸ› οΈ Setup

uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt

Tip: If you don't have uv, install it via pip install uv or see uv docs. Alternatively, use python -m venv .venv && pip install -r requirements.txt.

πŸ–ΌοΈ VibeToken Reconstruction

Download the VibeToken-LL checkpoint (see Checkpoints), then:

# Auto mode (recommended) -- automatically determines optimal patch sizes
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png

# Manual mode -- specify patch sizes explicitly
python reconstruct.py \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png \
  --encoder_patch_size 16 \
  --decoder_patch_size 16

Note: For best performance, the input image resolution should be a multiple of 32. Images with other resolutions are automatically rescaled to the nearest multiple of 32.
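The rescaling described in the note can be sketched as follows (an illustrative helper with names of our choosing, not the repo's own preprocessing; pair it with any image library's resize):

```python
def nearest_multiple(x: int, m: int = 32) -> int:
    """Round a dimension to the nearest multiple of m (never below m)."""
    return max(m, round(x / m) * m)

# A 1000x750 input would be rescaled to 992x736 before tokenization.
print(nearest_multiple(1000), nearest_multiple(750))
```

Resolutions that are already multiples of 32 (e.g. 256, 1024) pass through unchanged.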

## 🎨 VibeToken-Gen: ImageNet-1k Generation

Download both the VibeToken-LL and VibeToken-Gen-XXL checkpoints (see Checkpoints), then:

```bash
python generate.py \
    --gpt-ckpt ./checkpoints/VibeTokenGen-xxl-dynamic-65_750k.pt \
    --gpt-model GPT-XXL --num-output-layer 4 \
    --num-codebooks 8 --codebook-size 32768 \
    --image-size 256 --cfg-scale 4.0 --top-k 500 --temperature 1.0 \
    --class-dropout-prob 0.1 \
    --extra-layers "QKV" \
    --latent-size 65 \
    --config ./configs/vibetoken_ll.yaml \
    --vq-ckpt ./checkpoints/VibeToken_LL.bin \
    --sample-dir ./assets/ \
    --skip-folder-creation \
    --compile \
    --decoder-patch-size 32,32 \
    --target-resolution 1024,1024 \
    --llamagen-target-resolution 256,256 \
    --precision bf16 \
    --global-seed 156464151
```

The `--target-resolution` flag controls the tokenizer's output resolution, while `--llamagen-target-resolution` controls the generator's internal resolution (max 512Γ—512; for higher target resolutions, the tokenizer handles the upscaling).
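The `--cfg-scale`, `--top-k`, and `--temperature` flags correspond to the standard classifier-free-guidance plus top-k sampling recipe used by AR image generators. A minimal NumPy sketch of one sampling step (illustrative only; the function and variable names are ours, not the repo's):

```python
import numpy as np

def cfg_topk_sample(cond_logits, uncond_logits,
                    cfg_scale=4.0, top_k=500, temperature=1.0, rng=None):
    # Classifier-free guidance: extrapolate the conditional logits
    # away from the unconditional ones by cfg_scale.
    logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)
    logits = logits / temperature
    # Top-k filtering: mask everything below the k-th largest logit.
    kth = np.sort(logits)[-top_k]
    logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits, then sample one token id.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))
```

With `cfg_scale=1.0` the guidance term vanishes and this reduces to plain temperature/top-k sampling.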

πŸ‹οΈ Training

To train the VibeToken tokenizer from scratch, please refer to TRAIN.md for detailed instructions.

πŸ™ Acknowledgement

We would like to acknowledge the following repositories that inspired our work and upon which we directly build: 1d-tokenizer, LlamaGen, and UniTok.

πŸ“ Citation

If you find VibeToken useful in your research, please consider citing:

@inproceedings{vibetoken2026,
  title     = {VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations},
  author    = {Patel, Maitreya and Li, Jingtao and Zhuang, Weiming and Yang, Yezhou and Lyu, Lingjuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

If you have any questions, feel free to open an issue or reach out!