| <p align="center"> |
| <img src="assets/hunyuan_logo.png" width="400"/> <br> |
| </p> |
|
|
| <div align="center"> |
|
|
| # ✨ ViQ Weights ✨ |
| ### Text-Aligned Visual Quantized Representations at Any Resolution |
|
|
| <p align="center"> |
| <a href="https://yuxumin.github.io/">Xumin Yu</a><sup>1,*</sup>  |
| Zuyan Liu<sup>1,2,*</sup>  |
| Zhenyu Yang<sup>1,2,*</sup>  |
| Yuhao Dong<sup>3</sup> |
| <br> |
| Shengsheng Qian<sup>4</sup>  |
| Jiwen Lu<sup>2</sup>  |
| <a href="https://ancientmooner.github.io/">Han Hu</a><sup>1</sup>  |
| <a href="https://raoyongming.github.io/">Yongming Rao</a><sup>1,†</sup> |
| </p> |
| |
| [Hugging Face: XuminYu/ViQ_weights](https://huggingface.co/XuminYu/ViQ_weights) |
| |
| [GitHub: yuxumin/ViQ](https://github.com/yuxumin/ViQ) |
|
|
| </div> |
|
|
| --- |
|
|
| This repository hosts the **pretrained model weights** for **ViQ**. For the inference / training / weight-conversion **code**, see the main repo: **https://github.com/yuxumin/ViQ**. |
|
|
| ViQ is trained in two stages, and this repository provides weights for **both stages**: |
|
|
| | Folder | Stage | What it is | |
| | --- | --- | --- | |
| | [`anyres_vit/`](anyres_vit) | **Stage 1** | Text-aligned, any-resolution **continuous** SigLIP2 ViT encoders | |
| | [`ViQ/`](ViQ) | **Stage 2** | **Discrete** ViQ tokenizers (multiple FSQ codebook sizes) | |
|
|
| ## 📦 `anyres_vit/`: Stage 1 (Any-Resolution ViT) |
| |
| This folder holds the text-aligned, any-resolution ViT encoders produced by **Stage 1** pre-training. Two backbone sizes are released: |
| |
| | Size | Backbone | File | |
| | --- | --- | --- | |
| | **400M** | SigLIP2-SO400M | `anyres_vit/so400m/siglip2_so400m_anyres_s4.pth` | |
| | **1B** | SigLIP2-g | `anyres_vit/giant1b/siglip2_g_anyres_s4.pth` | |
| |
| ## 🔢 `ViQ/`: Stage 2 (Discrete Tokenizers) |
| |
| This folder holds the discretized ViQ tokenizers produced by **Stage 2**, released in several FSQ **codebook sizes**. Each `converted_<size>/` folder stores weights in the ViQ inference format, and each codebook size equals the product of its FSQ levels: |
|
|
| | Folder | Codebook size | FSQ levels | |
| | --- | --- | --- | |
| | `ViQ/converted_2k/` | 2304 | `[8, 8, 4, 3, 3]` | |
| | `ViQ/converted_4k/` | 4096 | `[8, 8, 4, 4, 4]` | |
| | `ViQ/converted_8k/` | 8192 | `[8, 8, 8, 4, 4]` | |
| | `ViQ/converted_16k/` | 15360 | `[8, 8, 8, 6, 5]` | |
| | `ViQ/converted_64k/` | 64000 | `[8, 8, 8, 5, 5, 5]` | |
|
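As a quick sanity check on the table above: in finite scalar quantization each latent dimension is rounded to one of a fixed number of levels, so the number of distinct codes is the product of the per-dimension levels. The sketch below recomputes the sizes from the table. The names `FSQ_LEVELS` and `codebook_size` are illustrative and are not part of the ViQ codebase.

```python
from math import prod

# FSQ levels per released folder, copied from the table above.
FSQ_LEVELS = {
    "2k": [8, 8, 4, 3, 3],
    "4k": [8, 8, 4, 4, 4],
    "8k": [8, 8, 8, 4, 4],
    "16k": [8, 8, 8, 6, 5],
    "64k": [8, 8, 8, 5, 5, 5],
}


def codebook_size(levels):
    """Number of distinct FSQ codes: one level choice per dimension, multiplied."""
    return prod(levels)


for name, levels in FSQ_LEVELS.items():
    print(f"converted_{name}: {codebook_size(levels)} codes")
```

Note that the folder suffixes are nominal: `16k` corresponds to 15360 codes and `64k` to 64000 codes.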
|
| Each folder contains: |
|
|
| ``` |
| converted_<size>/ |
| ├── model_viq_fsq_<size>.pth   # ViQ encoder + Position-Aware FSQ head |
| ├── embedder.pth               # discrete codes -> MLLM features |
| └── index_drawer.pth           # discrete codes -> VAE latent / reconstruction |
| ``` |
|
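Before running inference, it can help to confirm that a downloaded folder is complete. The following is a minimal sketch, assuming the weights were downloaded into a local directory that mirrors the repository layout and that `<size>` in the checkpoint file name matches the folder suffix (for example `16k`). The helper names `expected_files` and `missing_files` are hypothetical and not part of the ViQ codebase.

```python
from pathlib import Path


def expected_files(size):
    """File names listed above for one converted_<size>/ folder."""
    # Assumption: <size> in the file name equals the folder suffix, e.g. "16k".
    return [f"model_viq_fsq_{size}.pth", "embedder.pth", "index_drawer.pth"]


def missing_files(weights_root, size):
    """Return the expected files that are absent under <weights_root>/ViQ/converted_<size>/."""
    folder = Path(weights_root) / "ViQ" / f"converted_{size}"
    return [name for name in expected_files(size) if not (folder / name).is_file()]


# Example: an empty list means the 16k tokenizer folder is complete.
print(missing_files("ViQ_weights", "16k"))
```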
|
| ## 🚀 Usage |
|
|
| Clone the [code repo](https://github.com/yuxumin/ViQ), then point inference at the downloaded weights: |
|
|
| ```python |
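| from ViQ import load_viq |
| |
| # '16k' picks a codebook size; the released sizes are 2k, 4k, 8k, 16k and 64k |
| vq = load_viq('16k') |
| # `images` is supplied by the caller; see the code repo for the expected input format |
| indices, sizes = vq.forward_indices(images)           # encode -> discrete codes |
| feats = vq.embedder(indices)                          # codes -> MLLM features |
| _, vae_latent, recon_np = vq.drawer(indices, sizes)   # codes -> reconstructed image |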
| ``` |
|
|
| Download the weights with: |
|
|
| ```bash |
| huggingface-cli download XuminYu/ViQ_weights --local-dir ViQ_weights |
| ``` |
|
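If only one tokenizer is needed, the same download can be scripted from Python with `huggingface_hub.snapshot_download` and restricted to a subset of files. This is a sketch under stated assumptions: the folder names come from the tables above, while the helpers `allow_patterns_for` and `download_weights` are hypothetical conveniences and not part of the ViQ codebase.

```python
def allow_patterns_for(size=None, include_stage1=False):
    """Build glob patterns selecting part of the weights repo (None means everything)."""
    patterns = []
    if size is not None:
        # Stage 2 tokenizer for one codebook size, e.g. "16k".
        patterns.append(f"ViQ/converted_{size}/*")
    if include_stage1:
        # Stage 1 any-resolution ViT encoders.
        patterns.append("anyres_vit/*")
    return patterns or None


def download_weights(local_dir="ViQ_weights", size=None, include_stage1=False):
    """Download the selected files and return the local directory path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="XuminYu/ViQ_weights",
        local_dir=local_dir,
        allow_patterns=allow_patterns_for(size, include_stage1),
    )


# Example (requires network access): fetch only the 16k tokenizer.
# download_weights(size="16k")
```

Calling `download_weights()` with no arguments fetches the whole repository, matching the `huggingface-cli` command above.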
|
| ## 📖 Citation |
|
|
| ```bibtex |
| @article{yu2026viq, |
| title = {ViQ: Text-Aligned Visual Quantized Representations at Any Resolution}, |
| author = {Yu, Xumin and Liu, Zuyan and Yang, Zhenyu and Dong, Yuhao and Qian, Shengsheng and Lu, Jiwen and Hu, Han and Rao, Yongming}, |
| journal = {arXiv preprint arXiv:xxxx.xxxxx}, |
| year = {2026} |
| } |
| ``` |