# ✨ ViQ Weights ✨

### Text-Aligned Visual Quantized Representations at Any Resolution

Xumin Yu1,*  Zuyan Liu1,2,*  Zhenyu Yang1,2,*  Yuhao Dong3
Shengsheng Qian4  Jiwen Lu2  Han Hu1  Yongming Rao1,†

[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-ViQ__weights-ffc107?color=ffc107&logoColor=white)](https://huggingface.co/XuminYu/ViQ_weights)    [![GitHub](https://img.shields.io/badge/GitHub-ViQ-181717?logo=github)](https://github.com/yuxumin/ViQ)

---

This repository hosts the **pretrained model weights** for **ViQ**. For the inference / training / weight-conversion **code**, see the main repo: **https://github.com/yuxumin/ViQ**.

ViQ is trained in two stages, and this repository provides weights for **both stages**:

| Folder | Stage | What it is |
| --- | --- | --- |
| [`anyres_vit/`](anyres_vit) | **Stage 1** | Text-aligned, any-resolution **continuous** SigLIP2 ViT encoders |
| [`ViQ/`](ViQ) | **Stage 2** | **Discrete** ViQ tokenizers (multiple FSQ codebook sizes) |

## 📦 `anyres_vit/` — Stage 1 (Any-Resolution ViT)

The text-aligned, any-resolution ViT encoders produced after **Stage 1** pre-training. Two backbone sizes are released:

| Size | Backbone | File |
| --- | --- | --- |
| **400M** | SigLIP2-SO400M | `anyres_vit/so400m/siglip2_so400m_anyres_s4.pth` |
| **1B** | SigLIP2-g | `anyres_vit/giant1b/siglip2_g_anyres_s4.pth` |

## 🔢 `ViQ/` — Stage 2 (Discrete Tokenizers)

The discretized ViQ tokenizers produced after **Stage 2**, released in several FSQ **codebook sizes**.

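
For orientation: in finite scalar quantization (FSQ), each latent dimension is rounded to one of a fixed number of levels, so the number of distinct codes is the product of the per-dimension levels. The sketch below checks that relationship against the levels and codebook sizes listed for the released tokenizers in this section. The names `FSQ_LEVELS` and `fsq_codebook_size` are illustrative only and are not part of the ViQ code.

```python
from math import prod

# FSQ levels per released tokenizer, copied from the folder table in this README.
# The dictionary and function names are illustrative, not part of the ViQ API.
FSQ_LEVELS = {
    "2k": [8, 8, 4, 3, 3],
    "4k": [8, 8, 4, 4, 4],
    "8k": [8, 8, 8, 4, 4],
    "16k": [8, 8, 8, 6, 5],
    "64k": [8, 8, 8, 5, 5, 5],
}


def fsq_codebook_size(levels):
    """Return the number of distinct codes: one level choice per dimension."""
    return prod(levels)


for name, levels in FSQ_LEVELS.items():
    print(f"{name}: levels={levels} -> codebook size {fsq_codebook_size(levels)}")
```

The folder suffixes are therefore nominal: `16k` corresponds to 15360 codes and `64k` to 64000.
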
Each `converted_/` folder contains the ViQ-inference-format weights:

| Folder | Codebook size | FSQ levels |
| --- | --- | --- |
| `ViQ/converted_2k/` | 2304 | `[8, 8, 4, 3, 3]` |
| `ViQ/converted_4k/` | 4096 | `[8, 8, 4, 4, 4]` |
| `ViQ/converted_8k/` | 8192 | `[8, 8, 8, 4, 4]` |
| `ViQ/converted_16k/` | 15360 | `[8, 8, 8, 6, 5]` |
| `ViQ/converted_64k/` | 64000 | `[8, 8, 8, 5, 5, 5]` |

Each folder contains:

```
converted_/
├── model_viq_fsq_.pth   # ViQ encoder + Position-Aware FSQ head
├── embedder.pth         # discrete codes -> MLLM features
└── index_drawer.pth     # discrete codes -> VAE latent / reconstruction
```

## 🚀 Usage

Clone the [code repo](https://github.com/yuxumin/ViQ), then point inference at the downloaded weights:

```python
from ViQ import load_viq

vq = load_viq('16k')
indices, sizes = vq.forward_indices(images)          # encode -> discrete codes
feats = vq.embedder(indices)                         # codes -> MLLM features
_, vae_latent, recon_np = vq.drawer(indices, sizes)  # codes -> reconstructed image
```

Download the weights with:

```bash
huggingface-cli download XuminYu/ViQ_weights --local-dir ViQ_weights
```

## 📚 Citation

```bibtex
@article{yu2026viq,
  title   = {ViQ: Text-Aligned Visual Quantized Representations at Any Resolution},
  author  = {Yu, Xumin and Liu, Zuyan and Yang, Zhenyu and Dong, Yuhao and Qian, Shengsheng and Lu, Jiwen and Hu, Han and Rao, Yongming},
  journal = {arXiv preprint arXiv:xxxx.xxxxx},
  year    = {2026}
}
```