	<p align="center">
	<img src="assets/hunyuan_logo.png" width="400"/> <br>
	</p>

	<div align="center">

	# ✨ ViQ Weights ✨
	### Text-Aligned Visual Quantized Representations at Any Resolution

	<p align="center">
	<a href="https://yuxumin.github.io/">Xumin Yu</a><sup>1,*</sup>&emsp;
	Zuyan Liu<sup>1,2,*</sup>&emsp;
	Zhenyu Yang<sup>1,2,*</sup>&emsp;
	Yuhao Dong<sup>3</sup>
	<br>
	Shengsheng Qian<sup>4</sup>&emsp;
	Jiwen Lu<sup>2</sup>&emsp;
	<a href="https://ancientmooner.github.io/">Han Hu</a><sup>1</sup>&emsp;
	<a href="https://raoyongming.github.io/">Yongming Rao</a><sup>1,†</sup>
	</p>

	[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-ViQ__weights-ffc107?color=ffc107&logoColor=white)](https://huggingface.co/XuminYu/ViQ_weights)

	[![GitHub](https://img.shields.io/badge/GitHub-ViQ-181717?logo=github)](https://github.com/yuxumin/ViQ)

	</div>

	---

	This repository hosts the pretrained model weights for ViQ. For the inference / training / weight-conversion code, see the main repo: https://github.com/yuxumin/ViQ.

	ViQ is trained in two stages, and this repository provides weights for both stages:

	| Folder | Stage | What it is |
	| --- | --- | --- |
	| [`anyres_vit/`](anyres_vit) | Stage 1 | Text-aligned, any-resolution continuous SigLIP2 ViT encoders |
	| [`ViQ/`](ViQ) | Stage 2 | Discrete ViQ tokenizers (multiple FSQ codebook sizes) |

	## 📦 `anyres_vit/` — Stage 1 (Any-Resolution ViT)

	The text-aligned, any-resolution ViT encoders produced after Stage 1 pre-training. Two backbone sizes are released:

	| Size | Backbone | File |
	| --- | --- | --- |
	| 400M | SigLIP2-SO400M | `anyres_vit/so400m/siglip2_so400m_anyres_s4.pth` |
	| 1B | SigLIP2-g | `anyres_vit/giant1b/siglip2_g_anyres_s4.pth` |

	## 🔢 `ViQ/` — Stage 2 (Discrete Tokenizers)

	The discretized ViQ tokenizers produced after Stage 2, released in several FSQ codebook sizes. Each `converted_<size>/` folder contains the ViQ-inference-format weights:

	| Folder | Codebook size | FSQ levels |
	| --- | --- | --- |
	| `ViQ/converted_2k/` | 2304 | `[8, 8, 4, 3, 3]` |
	| `ViQ/converted_4k/` | 4096 | `[8, 8, 4, 4, 4]` |
	| `ViQ/converted_8k/` | 8192 | `[8, 8, 8, 4, 4]` |
	| `ViQ/converted_16k/` | 15360 | `[8, 8, 8, 6, 5]` |
	| `ViQ/converted_64k/` | 64000 | `[8, 8, 8, 5, 5, 5]` |
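
The codebook sizes in the table follow directly from the FSQ levels: each latent dimension is quantized to a fixed number of levels, so the number of distinct codes is the product of the levels. This also explains why the folder names are rounded labels (for example, `16k` holds 15360 codes). The sketch below reproduces that arithmetic; `FSQ_LEVELS` and `codebook_size` are illustrative names and are not part of the ViQ code.

```python
from math import prod

# FSQ levels per released folder, copied from the table above.
FSQ_LEVELS = {
    "2k": [8, 8, 4, 3, 3],
    "4k": [8, 8, 4, 4, 4],
    "8k": [8, 8, 8, 4, 4],
    "16k": [8, 8, 8, 6, 5],
    "64k": [8, 8, 8, 5, 5, 5],
}


def codebook_size(levels):
    """Number of distinct FSQ codes: the product of the per-dimension levels."""
    return prod(levels)


for name, levels in FSQ_LEVELS.items():
    print(f"converted_{name}: {codebook_size(levels)} codes")
```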

	Each folder contains:

	```
	converted_<size>/
	├── model_viq_fsq_<size>.pth # ViQ encoder + Position-Aware FSQ head
	├── embedder.pth # discrete codes -> MLLM features
	└── index_drawer.pth # discrete codes -> VAE latent / reconstruction
	```
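
After downloading, it can be useful to confirm that a tokenizer folder is complete before loading it. The sketch below checks for the three files listed above. It assumes that `<size>` in `model_viq_fsq_<size>.pth` is the same suffix as in the folder name (for example `16k`), and that the weights live under a local root such as `ViQ_weights/`; `expected_files` and `missing_files` are hypothetical helpers, not functions from the ViQ repo.

```python
from pathlib import Path


def expected_files(size):
    """File names listed for a converted_<size>/ folder (assumes a matching suffix)."""
    return [f"model_viq_fsq_{size}.pth", "embedder.pth", "index_drawer.pth"]


def missing_files(weights_root, size):
    """Return the expected files that are absent from <weights_root>/ViQ/converted_<size>/."""
    folder = Path(weights_root) / "ViQ" / f"converted_{size}"
    return [name for name in expected_files(size) if not (folder / name).is_file()]
```

An empty list from `missing_files("ViQ_weights", "16k")` means all three files are present in that folder.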

	## 🚀 Usage

	Clone the [code repo](https://github.com/yuxumin/ViQ) and download the weights (see the command below), then point inference at the downloaded weights:

	```python
	from ViQ import load_viq

	vq = load_viq('16k')
	indices, sizes = vq.forward_indices(images) # encode -> discrete codes
	feats = vq.embedder(indices) # codes -> MLLM features
	_, vae_latent, recon_np = vq.drawer(indices, sizes) # codes -> reconstructed image
	```

	Download the weights with:

	```bash
	huggingface-cli download XuminYu/ViQ_weights --local-dir ViQ_weights
	```
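
The command above fetches every file in the repository. If only one tokenizer is needed, a partial download from Python is one option. The sketch below uses `snapshot_download` from the `huggingface_hub` package with an `allow_patterns` filter; the helper names and the glob pattern are assumptions based on the folder layout described above.

```python
REPO_ID = "XuminYu/ViQ_weights"


def patterns_for_tokenizer(size):
    """Glob pattern matching a single ViQ/converted_<size>/ folder."""
    return [f"ViQ/converted_{size}/*"]


def download_tokenizer(size, local_dir="ViQ_weights"):
    """Download only one tokenizer folder and return the local snapshot path."""
    # Imported here so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for_tokenizer(size),
        local_dir=local_dir,
    )
```

Calling `download_tokenizer("16k")` should place the files under `ViQ_weights/ViQ/converted_16k/`, the same relative layout the full CLI download produces.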

	## 📚 Citation

	```bibtex
	@article{yu2026viq,
	title = {ViQ: Text-Aligned Visual Quantized Representations at Any Resolution},
	author = {Yu, Xumin and Liu, Zuyan and Yang, Zhenyu and Dong, Yuhao and Qian, Shengsheng and Lu, Jiwen and Hu, Han and Rao, Yongming},
	journal = {arXiv preprint arXiv:xxxx.xxxxx},
	year = {2026}
	}
	```