---
title: VibeToken
emoji: 📦
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.6.0
python_version: '3.12'
app_file: app.py
pinned: false
license: mit
---

# [CVPR 2026] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

<p align="center">
  <img src="assets/teaser.png" alt="VibeToken Teaser" width="100%">
</p>

<p align="center">
  <b>CVPR 2026</b> |
  <a href="#">Paper</a> |
  <a href="#">Project Page</a> |
  <a href="#-checkpoints">Checkpoints</a>
</p>

<p align="center">
  <img src="https://img.shields.io/badge/CVPR-2026-blue" alt="CVPR 2026">
  <img src="https://img.shields.io/badge/arXiv-TODO-b31b1b" alt="arXiv">
  <img src="https://img.shields.io/badge/License-MIT-green" alt="License">
  <a href="https://huggingface.co/mpatel57/VibeToken"><img src="https://img.shields.io/badge/🤗-Model-yellow" alt="HuggingFace"></a>
</p>

---
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to **arbitrary resolutions and aspect ratios**, narrowing the gap to diffusion models at scale. At its core is **VibeToken**, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art efficiency–performance trade-off. Building on VibeToken, we present **VibeToken-Gen**, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.
### 🔥 Highlights

| | |
|---|---|
| 🎯 **1024×1024 in just 64 tokens** | Achieves **3.94 gFID** vs. 5.87 gFID for diffusion-based SOTA (1,024 tokens) |
| ⚡ **Constant 179G FLOPs** | 63× more efficient than LlamaGen (11T FLOPs at 1024×1024) |
| 🌐 **Resolution-agnostic** | Supports arbitrary resolutions and aspect ratios out of the box |
| 🎛️ **Dynamic token count** | User-controllable 32–256 tokens per image |
| 🔍 **Native super-resolution** | Supports image super-resolution out of the box |
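The dynamic token count above can be sketched at the interface level. The `MockVibeTokenizer` class below is a hypothetical placeholder, not the real model (which is a 1D Transformer); this numpy mock only illustrates the shapes involved: an input of any resolution maps to a user-chosen number of 1D tokens, which can then be decoded at any target resolution.

```python
import numpy as np

class MockVibeTokenizer:
    """Shape-level mock of a resolution-agnostic 1D tokenizer.

    Illustration only: the real VibeToken model is a Transformer.
    This mock just demonstrates the interface contract -- encode an
    HxWx3 image into a user-chosen number of 1D tokens, then decode
    those tokens at an arbitrary output resolution.
    """

    def __init__(self, token_dim: int = 16):
        self.token_dim = token_dim

    def encode(self, image: np.ndarray, num_tokens: int) -> np.ndarray:
        # Flatten spatial dims, then subsample down to `num_tokens` slots.
        h, w, c = image.shape
        flat = image.reshape(h * w, c)
        idx = np.linspace(0, h * w - 1, num_tokens).astype(int)
        pooled = flat[idx]  # (num_tokens, c)
        # Expand channel means to the latent token dimension.
        return np.tile(pooled.mean(axis=1, keepdims=True), (1, self.token_dim))

    def decode(self, tokens: np.ndarray, out_hw: tuple) -> np.ndarray:
        # Broadcast token averages back onto the target pixel grid.
        h, w = out_hw
        idx = np.linspace(0, len(tokens) - 1, h * w).astype(int)
        pixels = tokens[idx].mean(axis=1, keepdims=True)
        return np.repeat(pixels, 3, axis=1).reshape(h, w, 3)

tok = MockVibeTokenizer()
img = np.random.rand(1024, 768, 3)       # arbitrary aspect ratio
z = tok.encode(img, num_tokens=64)       # 64 tokens regardless of input size
out = tok.decode(z, out_hw=(512, 512))   # decode at a different resolution
print(z.shape, out.shape)                # (64, 16) (512, 512, 3)
```

The key property mirrored here is that the token count is decoupled from both the input and output resolutions.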
## 📰 News

- **[Feb 2026]** 🎉 VibeToken is accepted at **CVPR 2026**!
- **[Feb 2026]** Training scripts released.
- **[Feb 2026]** Inference code and checkpoints released.
## 🚀 Quick Start

```bash
# 1. Clone and set up the environment
git clone https://github.com/<your-org>/VibeToken.git
cd VibeToken
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt

# 2. Download a checkpoint (see the Checkpoints section below)
mkdir -p checkpoints
wget https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin -O ./checkpoints/VibeToken_LL.bin

# 3. Reconstruct an image
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png
```
## 📦 Checkpoints

All checkpoints are hosted on [Hugging Face](https://huggingface.co/mpatel57/VibeToken).

#### Reconstruction Checkpoints
| Name | Resolution | rFID (256 tokens) | rFID (64 tokens) | Download |
|------|:----------:|:-----------------:|:----------------:|----------|
| VibeToken-LL | 1024×1024 | 3.76 | 4.12 | [VibeToken_LL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin) |
| VibeToken-LL | 256×256 | 5.12 | 0.90 | same as above |
| VibeToken-SL | 1024×1024 | 4.25 | 2.41 | [VibeToken_SL.bin](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_SL.bin) |
| VibeToken-SL | 256×256 | 5.44 | 0.40 | same as above |
#### Generation Checkpoints

| Name | Training Resolution(s) | Tokens | Best gFID | Download |
|------|:----------------------:|:------:|:---------:|----------|
| VibeToken-Gen-B | 256×256 | 65 | 7.62 | [VibeTokenGen-b-fixed65_dynamic_1500k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-b-fixed65_dynamic_1500k.pt) |
| VibeToken-Gen-B | 1024×1024 | 65 | 7.37 | same as above |
| VibeToken-Gen-XXL | 256×256 | 65 | 3.62 | [VibeTokenGen-xxl-dynamic-65_750k.pt](https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeTokenGen-xxl-dynamic-65_750k.pt) |
| VibeToken-Gen-XXL | 1024×1024 | 65 | **3.54** | same as above |
## 🛠️ Setup

```bash
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt
```

> **Tip:** If you don't have `uv`, install it via `pip install uv` or see the [uv docs](https://github.com/astral-sh/uv). Alternatively, use `python -m venv .venv && pip install -r requirements.txt`.
## 🖼️ VibeToken Reconstruction

Download the VibeToken-LL checkpoint (see [Checkpoints](#-checkpoints)), then:
```bash
# Auto mode (recommended): automatically determines optimal patch sizes
python reconstruct.py --auto \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png

# Manual mode: specify patch sizes explicitly
python reconstruct.py \
  --config configs/vibetoken_ll.yaml \
  --checkpoint ./checkpoints/VibeToken_LL.bin \
  --image ./assets/example_1.png \
  --output ./assets/reconstructed.png \
  --encoder_patch_size 16 \
  --decoder_patch_size 16
```
> **Note:** For best performance, the input image resolution should be a multiple of 32. Images with other resolutions are automatically rescaled to the nearest multiple of 32.
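Since inputs are rescaled to the nearest multiple of 32, you can precompute the resolution the model will actually operate on. The helper below is not part of the repository; it is a minimal sketch of the rounding described above, assuming standard nearest-multiple rounding with a one-tile minimum.

```python
def snap_to_multiple(size: int, base: int = 32) -> int:
    """Round a single dimension to the nearest multiple of `base`
    (clamped to at least one tile)."""
    return max(base, round(size / base) * base)

def snap_resolution(width: int, height: int, base: int = 32) -> tuple[int, int]:
    """Return the (width, height) the tokenizer will actually process."""
    return snap_to_multiple(width, base), snap_to_multiple(height, base)

print(snap_resolution(1000, 750))   # (992, 736): nearest multiples of 32
print(snap_resolution(1024, 1024))  # (1024, 1024): already aligned, unchanged
```

This is useful for predicting the output size of `reconstruct.py` before running it on oddly-sized inputs.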
## 🎨 VibeToken-Gen: ImageNet-1k Generation

Download both the VibeToken-LL and VibeToken-Gen-XXL checkpoints (see [Checkpoints](#-checkpoints)), then:
```bash
python generate.py \
  --gpt-ckpt ./checkpoints/VibeTokenGen-xxl-dynamic-65_750k.pt \
  --gpt-model GPT-XXL --num-output-layer 4 \
  --num-codebooks 8 --codebook-size 32768 \
  --image-size 256 --cfg-scale 4.0 --top-k 500 --temperature 1.0 \
  --class-dropout-prob 0.1 \
  --extra-layers "QKV" \
  --latent-size 65 \
  --config ./configs/vibetoken_ll.yaml \
  --vq-ckpt ./checkpoints/VibeToken_LL.bin \
  --sample-dir ./assets/ \
  --skip-folder-creation \
  --compile \
  --decoder-patch-size 32,32 \
  --target-resolution 1024,1024 \
  --llamagen-target-resolution 256,256 \
  --precision bf16 \
  --global-seed 156464151
```
The `--target-resolution` flag controls the tokenizer's output resolution, while `--llamagen-target-resolution` controls the generator's internal resolution (at most 512×512; for higher output resolutions, the tokenizer handles the upscaling).
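The split between the two resolution flags can be expressed as a small helper. The function below is not part of the repository; it sketches the routing described above, assuming the generator caps its internal resolution at 512×512 and the tokenizer decodes at the full target size (the example command above chooses an even smaller 256×256 internal resolution).

```python
def plan_resolutions(target_w: int, target_h: int, cap: int = 512):
    """Split a target output resolution into the generator's internal
    resolution (longest side clamped to `cap`, aspect ratio preserved)
    and the tokenizer's output resolution.

    Returns (generator_wh, tokenizer_wh): the generator runs at the
    clamped size while the tokenizer decodes at the full target size.
    """
    scale = min(1.0, cap / max(target_w, target_h))
    gen_w = int(target_w * scale)
    gen_h = int(target_h * scale)
    return (gen_w, gen_h), (target_w, target_h)

print(plan_resolutions(1024, 1024))  # ((512, 512), (1024, 1024))
print(plan_resolutions(256, 256))    # ((256, 256), (256, 256))
```

Resolutions at or below the cap pass through unchanged; above it, only the tokenizer's target grows while the generator's cost stays bounded, which is where the constant-FLOPs behavior comes from.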
## 🏋️ Training

To train the VibeToken tokenizer from scratch, see [TRAIN.md](TRAIN.md) for detailed instructions.
## 🙏 Acknowledgement

We gratefully acknowledge the following repositories, which inspired our work and on which we directly build:
[1d-tokenizer](https://github.com/bytedance/1d-tokenizer),
[LlamaGen](https://github.com/FoundationVision/LlamaGen), and
[UniTok](https://github.com/FoundationVision/UniTok).
## 📖 Citation

If you find VibeToken useful in your research, please consider citing:

```bibtex
@inproceedings{vibetoken2026,
  title     = {VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations},
  author    = {Patel, Maitreya and Li, Jingtao and Zhuang, Weiming and Yang, Yezhou and Lyu, Lingjuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
If you have any questions, feel free to open an issue or reach out!