Improve model card: Add pipeline tag, paper/code links, and detailed information
#8 by nielsr (HF Staff) - opened

README.md CHANGED
---
license: mit
pipeline_tag: image-to-image
---

# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This repository contains the official implementation of **SimVQ**, a novel approach presented in the paper "[Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038)".

Vector Quantization (VQ) is crucial for discretizing continuous representations in unsupervised learning, but it often suffers from representation collapse, leading to low codebook utilization and limited scalability. SimVQ identifies the root cause as disjoint codebook optimization and reparameterizes the code vectors through a learnable linear transformation layer applied to a latent basis. This simple yet effective approach optimizes the *entire linear space* spanned by the codebook instead of just the nearest *individual code vectors*, thereby preventing collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures.

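The idea above can be sketched in a few lines of PyTorch. This is a minimal illustration written for this card, not the authors' implementation (see the GitHub repository for that): the class and parameter names are ours, and training losses (e.g. the commitment loss) are omitted. The key point is that the codebook entries themselves are a frozen random basis; only the single linear layer `proj` is trained, so every gradient step moves the whole code space at once.

```python
import torch
import torch.nn as nn


class SimVQSketch(nn.Module):
    """Minimal SimVQ-style quantizer: codes = proj(basis), only proj is learnable."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        # Frozen latent basis: registered as a buffer, so it is never optimized.
        self.register_buffer("basis", torch.randn(codebook_size, dim))
        # The one learnable linear layer that reparameterizes every code vector.
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) encoder outputs.
        codebook = self.proj(self.basis)      # (K, dim) reparameterized codes
        dists = torch.cdist(z, codebook)      # (batch, K) pairwise distances
        idx = dists.argmin(dim=-1)            # nearest-code indices
        z_q = codebook[idx]                   # quantized vectors
        # Straight-through estimator: pass encoder gradients through unchanged.
        z_q = z + (z_q - z).detach()
        return z_q, idx


vq = SimVQSketch(codebook_size=8, dim=4)
z = torch.randn(2, 4)
z_q, idx = vq(z)
```

Because `basis` is a buffer rather than a parameter, `vq.parameters()` contains only `proj.weight`, which is exactly the "one linear layer" of the title.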

The official code, detailed implementation, and further information can be found in the [GitHub repository](https://github.com/youngsheen/SimVQ).

## Quantitative Comparison

SimVQ demonstrates strong performance in both image and audio reconstruction tasks, as shown in the tables below.

**Table 1.** Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID ↓ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | Checkpoint |
|:------:|:-------------:|:--------------------:|:------:|:-------:|:------:|:------:|:----------:|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
| SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
| SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |

**Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets (values reported as clean/other).

| Method | Bandwidth | Codebook Utilization | UTMOS ↑ | PESQ ↑ | STOI ↑ | V/UV F1 ↑ | Checkpoint |
|:------:|:---------:|:--------------------:|:-------:|:------:|:------:|:---------:|:----------:|
| Encodec | 3.0 kbps | -/- | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0 kbps | -/- | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0 kbps | -/- | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9 kbps | 100%/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05 kbps | 27%/- | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9 kbps | 100.0%/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
| SimVQ (ours) | 0.975 kbps | 99.4%/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
| SimVQ (ours) | 1.2 kbps | 99.4%/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
| SimVQ (ours) | 1.35 kbps | 95.6%/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |

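As a sanity check on the SimVQ rows of Table 2, the reported bandwidths are consistent with a single token stream at about 75 tokens per second, since bandwidth = frame rate × log2(codebook size). Note the 75 Hz frame rate and the codebook sizes (4,096 / 8,192 / 65,536 / 262,144, matching the 4k/8k/65k/262k checkpoint names) are our inference from the numbers, not stated in this card:

```python
import math

FRAME_RATE = 75  # tokens per second; inferred from the table, not stated in the card

for codebook_size, reported_kbps in [(4096, 0.9), (8192, 0.975),
                                     (65536, 1.2), (262144, 1.35)]:
    # bits per token = log2(codebook size); kbps = tokens/s * bits/token / 1000
    kbps = FRAME_RATE * math.log2(codebook_size) / 1000
    assert abs(kbps - reported_kbps) < 1e-9
    print(f"{codebook_size:>7} codes -> {kbps} kbps")
```

Under this assumption each quadrupling of the codebook buys 2 extra bits per token, which is why the bandwidth grows only slowly with codebook size.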
## Reconstruction Visualization

**Figure 2.** Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images, while (b) shows the reconstructions.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
</p>

**Figure 3.** Visualization of the SimVQ tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original audio, while (b) shows the reconstructions.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
</p>

## Citation

If you find this work useful, please consider citing the paper:

```bibtex
@misc{zhu2024simvq,
      title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
      author={Yongxin Zhu and Bocheng Li and Weifeng Lin and Hang Zhang and Lijun Yan and Xin Li and Linli Xu and Lidong Bing},
      year={2024},
      eprint={2411.02038},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.02038},
}
```