<p align="center">
<img src="assets/hunyuan_logo.png" width="400"/> <br>
</p>
<div align="center">
# ✨ ViQ Weights ✨
### Text-Aligned Visual Quantized Representations at Any Resolution
<p align="center">
<a href="https://yuxumin.github.io/">Xumin Yu</a><sup>1,*</sup> 
Zuyan Liu<sup>1,2,*</sup> 
Zhenyu Yang<sup>1,2,*</sup> 
Yuhao Dong<sup>3</sup>
<br>
Shengsheng Qian<sup>4</sup> 
Jiwen Lu<sup>2</sup> 
<a href="https://ancientmooner.github.io/">Han Hu</a><sup>1</sup> 
<a href="https://raoyongming.github.io/">Yongming Rao</a><sup>1,†</sup>
</p>
[Weights on Hugging Face](https://huggingface.co/XuminYu/ViQ_weights)
[Code on GitHub](https://github.com/yuxumin/ViQ)
</div>
---
This repository hosts the **pretrained model weights** for **ViQ**. For the inference, training, and weight-conversion **code**, see the main repo: **https://github.com/yuxumin/ViQ**.
ViQ is trained in two stages, and this repository provides weights for **both stages**:
| Folder | Stage | What it is |
| --- | --- | --- |
| [`anyres_vit/`](anyres_vit) | **Stage 1** | Text-aligned, any-resolution **continuous** SigLIP2 ViT encoders |
| [`ViQ/`](ViQ) | **Stage 2** | **Discrete** ViQ tokenizers (multiple FSQ codebook sizes) |
## 📦 `anyres_vit/`: Stage 1 (Any-Resolution ViT)
This folder holds the text-aligned, any-resolution ViT encoders produced by **Stage 1** pre-training. Two backbone sizes are released:
| Size | Backbone | File |
| --- | --- | --- |
| **400M** | SigLIP2-SO400M | `anyres_vit/so400m/siglip2_so400m_anyres_s4.pth` |
| **1B** | SigLIP2-g | `anyres_vit/giant1b/siglip2_g_anyres_s4.pth` |
## 🔢 `ViQ/`: Stage 2 (Discrete Tokenizers)
This folder holds the discrete ViQ tokenizers produced by **Stage 2**, released in several FSQ **codebook sizes**. Each `converted_<size>/` folder contains weights in the ViQ inference format:
| Folder | Codebook size | FSQ levels |
| --- | --- | --- |
| `ViQ/converted_2k/` | 2304 | `[8, 8, 4, 3, 3]` |
| `ViQ/converted_4k/` | 4096 | `[8, 8, 4, 4, 4]` |
| `ViQ/converted_8k/` | 8192 | `[8, 8, 8, 4, 4]` |
| `ViQ/converted_16k/` | 15360 | `[8, 8, 8, 6, 5]` |
| `ViQ/converted_64k/` | 64000 | `[8, 8, 8, 5, 5, 5]` |
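
Each codebook size in the table is the product of its FSQ levels, because finite scalar quantization assigns one code to every combination of per-dimension levels. The sketch below checks that arithmetic. The names `FSQ_LEVELS` and `codebook_size` are illustrative and are not part of the ViQ code base.

```python
import math

# FSQ levels per released tokenizer, copied from the table above.
FSQ_LEVELS = {
    "2k": [8, 8, 4, 3, 3],
    "4k": [8, 8, 4, 4, 4],
    "8k": [8, 8, 8, 4, 4],
    "16k": [8, 8, 8, 6, 5],
    "64k": [8, 8, 8, 5, 5, 5],
}


def codebook_size(levels):
    """Number of distinct codes: one per combination of per-dimension levels."""
    return math.prod(levels)


for name, levels in FSQ_LEVELS.items():
    size = codebook_size(levels)
    # log2(size) is the information carried by one token, in bits.
    print(f"{name}: {size} codes, about {math.log2(size):.2f} bits per token")
```

The folder names are rounded labels: `converted_16k/` holds 15360 codes rather than 16384, and `converted_64k/` holds 64000.
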
Each folder contains:
```
converted_<size>/
├── model_viq_fsq_<size>.pth   # ViQ encoder + Position-Aware FSQ head
├── embedder.pth               # discrete codes -> MLLM features
└── index_drawer.pth           # discrete codes -> VAE latent / reconstruction
```
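
After downloading, it can be useful to confirm that a tokenizer folder is complete before loading it. The following is a minimal sketch, assuming the weights sit in a local directory named `ViQ_weights` and that the `<size>` placeholder expands to the same string (for example `16k`) in both the folder name and the model file name. The helpers `expected_files` and `missing_files` are hypothetical and not part of the ViQ code base.

```python
from pathlib import Path


def expected_files(size):
    """File names listed above for a converted_<size>/ folder.

    Assumes <size> in the file name matches <size> in the folder name.
    """
    return [f"model_viq_fsq_{size}.pth", "embedder.pth", "index_drawer.pth"]


def missing_files(weights_root, size):
    """Return the expected files that are absent from ViQ/converted_<size>/."""
    folder = Path(weights_root) / "ViQ" / f"converted_{size}"
    return [name for name in expected_files(size) if not (folder / name).is_file()]


if __name__ == "__main__":
    # "ViQ_weights" is an assumed local download directory.
    missing = missing_files("ViQ_weights", "16k")
    print("all files present" if not missing else f"missing: {missing}")
```
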
## Usage
Clone the [code repo](https://github.com/yuxumin/ViQ), then point inference at the downloaded weights:
```python
from ViQ import load_viq
vq = load_viq('16k')
indices, sizes = vq.forward_indices(images) # encode -> discrete codes
feats = vq.embedder(indices) # codes -> MLLM features
_, vae_latent, recon_np = vq.drawer(indices, sizes) # codes -> reconstructed image
```
Download the weights with:
```bash
huggingface-cli download XuminYu/ViQ_weights --local-dir ViQ_weights
```
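
The command above fetches every checkpoint in the repository. To fetch a single tokenizer folder from Python instead, a sketch along these lines may help. It uses `snapshot_download` from the `huggingface_hub` package with an `allow_patterns` filter. The helper names `patterns_for` and `download_tokenizer` are illustrative, and the folder layout is assumed to match the tables above.

```python
def patterns_for(size):
    """Glob patterns selecting one tokenizer folder, e.g. ViQ/converted_16k/."""
    return [f"ViQ/converted_{size}/*"]


def download_tokenizer(size, local_dir="ViQ_weights"):
    """Download only ViQ/converted_<size>/ and return the local directory path."""
    # Imported lazily so patterns_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="XuminYu/ViQ_weights",
        local_dir=local_dir,
        allow_patterns=patterns_for(size),
    )


# Example (requires network access and `pip install huggingface_hub`):
# download_tokenizer("16k")
```
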
## Citation
```bibtex
@article{yu2026viq,
title = {ViQ: Text-Aligned Visual Quantized Representations at Any Resolution},
author = {Yu, Xumin and Liu, Zuyan and Yang, Zhenyu and Dong, Yuhao and Qian, Shengsheng and Lu, Jiwen and Hu, Han and Rao, Yongming},
journal = {arXiv preprint arXiv:xxxx.xxxxx},
year = {2026}
}
```