arxiv:2605.26089

Channel-wise Vector Quantization

Published on May 25

Submitted by Wei Song (SII) on May 26

Authors: Wei Song, Yitong Chen

AI-generated summary

Channel-wise Vector Quantization replaces patch-wise tokens with channel-wise tokens in image tokenization, enabling a next-channel prediction framework that generates images by sequentially refining visual details.

Abstract

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
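To make the contrast between patch-wise and channel-wise tokens concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`nearest_code`, `patchwise_vq`, `channelwise_vq`), the feature-map shape, the codebook sizes, and the plain nearest-neighbor lookup under squared Euclidean distance are all assumptions made for illustration. The abstract does not specify the encoder, the codebook layout, or the training procedure.

```python
import numpy as np

def nearest_code(vectors, codebook):
    """Index of the closest codebook row for each input row.

    vectors: (N, D), codebook: (K, D). Uses squared Euclidean distance,
    which is an assumption; the paper's metric is not stated in the abstract.
    """
    dists = (
        (vectors ** 2).sum(axis=1, keepdims=True)
        - 2.0 * vectors @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)

def patchwise_vq(feat, codebook):
    """Conventional VQ: one token per spatial position.

    feat: (C, H, W); codebook: (K, C). Returns (H, W) tokens and the
    quantized feature map.
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W).T           # (H*W, C): one vector per patch
    idx = nearest_code(vectors, codebook)
    recon = codebook[idx].T.reshape(C, H, W)
    return idx.reshape(H, W), recon

def channelwise_vq(feat, codebook):
    """Channel-wise VQ (illustrative): one token per channel.

    feat: (C, H, W); codebook: (K, H*W). Each channel's full spatial map is
    treated as a single vector, so the image becomes a 1D sequence of C tokens.
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W)             # (C, H*W): one vector per channel
    idx = nearest_code(vectors, codebook)
    recon = codebook[idx].reshape(C, H, W)
    return idx, recon

# Hypothetical sizes: 8 channels, a 4x4 feature map, 32 codes per codebook.
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))
patch_tokens, _ = patchwise_vq(feat, rng.normal(size=(32, 8)))
channel_tokens, _ = channelwise_vq(feat, rng.normal(size=(32, 16)))
print(patch_tokens.shape, channel_tokens.shape)  # (4, 4) (8,)
```

In this sketch the patch-wise path yields a 2D grid of tokens indexed by position, while the channel-wise path yields a 1D sequence indexed by channel. That 1D sequence is the kind of stream a next-channel autoregressive model such as CAR would predict one token at a time.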

Community

avahal

about 1 hour ago

the channel-wise quantization idea is a surprisingly natural shift, turning image tokens into a 1d stream that lines up with how features in different channels carry information. my main question is what happens if you drop the nested channel dropout and let the model learn the coarse-to-fine order from data alone—do cross-channel dependencies get captured, or does the performance hinge on that training trick? btw, the arxivlens breakdown helped me parse the method details and the channel-wise stream, nicely clarifying how the 1d token pipeline fits with a standard transformer decoder: https://arxivlens.com/PaperView/Details/channel-wise-vector-quantization-8641-51a546ab
