Improve model card: Add audio pipeline tag and comprehensive details
#2
by nielsr (HF Staff) - opened

README.md CHANGED
---
license: mit
pipeline_tag: audio-to-audio
---

# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

<h5 align="center">

[Paper](https://huggingface.co/papers/2411.02038)
[Code](https://github.com/youngsheen/SimVQ)

</h5>

## Introduction

Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning, but it suffers from representation collapse, which causes low codebook utilization and limits scalability. Existing solutions often rely on complex optimization tricks or reduce the latent dimensionality, which compromises model capacity and does not fully solve the problem.

This paper introduces **SimVQ**, a novel approach that reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the *entire linear space* rather than only the nearest *individual code vectors*. Although multiplying two linear matrices is mathematically equivalent to applying a single linear layer, this simple reparameterization effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures.

For more details on the method and implementation, please refer to the official resources:
- Paper: [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038)
- Code: [https://github.com/youngsheen/SimVQ](https://github.com/youngsheen/SimVQ)
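The point that the extra linear layer changes the optimization but not the expressivity can be checked directly. A minimal sketch (matrix sizes are illustrative, not from the paper):

```python
import torch

torch.manual_seed(0)
C = torch.randn(16, 8)  # latent basis of 16 code vectors (illustrative size)
W = torch.randn(8, 8)   # the single learnable linear layer

# Expressively, C @ W is just another 16x8 codebook matrix, so the model
# family is unchanged. The difference is in optimization: a gradient step
# on W moves *every* reparameterized code vector at once, instead of only
# the few codes selected as nearest neighbors.
effective_codebook = C @ W
print(effective_codebook.shape)  # torch.Size([16, 8])
```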
## Algorithm for SimVQ

The core idea of SimVQ is to reparameterize code vectors through a learnable linear transformation layer over a latent basis. You can find the core code here: [https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33)

<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png">
</p>

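Schematically, the quantizer can be sketched as below. This is a simplified illustration, not the official implementation; the class name, argument names, and sizes are all made up for the example:

```python
import torch
import torch.nn as nn

class SimVQSketch(nn.Module):
    """Sketch of SimVQ-style quantization: a frozen latent basis
    reparameterized by one learnable linear layer."""

    def __init__(self, num_codes: int = 8192, dim: int = 128):
        super().__init__()
        # Frozen basis: individual code vectors are never updated directly.
        self.basis = nn.Parameter(torch.randn(num_codes, dim), requires_grad=False)
        # The one learnable linear layer; optimizing it transforms the
        # entire space spanned by the basis instead of single codes.
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous encoder outputs.
        codebook = self.proj(self.basis)   # (num_codes, dim) effective codebook
        dists = torch.cdist(z, codebook)   # (batch, num_codes) pairwise distances
        idx = dists.argmin(dim=-1)         # nearest-code indices
        z_q = codebook[idx]                # quantized vectors
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx
```

Usage is the same as any VQ bottleneck: pass encoder features through `forward` and feed the quantized output to the decoder.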
## Quantitative Comparison

**Table 1.** Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|:------:|:-------------:|:--------------------:|:----:|:-----:|:----:|:----:|:----------:|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
| SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
| SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |

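The "Codebook Utilization" column reports the fraction of codebook entries actually selected when tokenizing the evaluation set; under representation collapse most codes are never hit. A minimal way to compute it from token indices (this helper is illustrative, not a function from the repo):

```python
def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size

# A collapsed tokenizer reuses only a handful of codes:
print(codebook_utilization([3, 3, 7, 3, 7], codebook_size=1024))  # 0.001953125
```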
**Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets (values reported as clean/other).

| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|:------:|:---------:|:--------------------:|:-----:|:----:|:----:|:-------:|:----------:|
| Encodec | 3.0 kbps | -/- | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0 kbps | -/- | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0 kbps | -/- | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9 kbps | 100%/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05 kbps | 27%/- | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9 kbps | 100.0%/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
| SimVQ (ours) | 0.975 kbps | 99.4%/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
| SimVQ (ours) | 1.2 kbps | 99.4%/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
| SimVQ (ours) | 1.35 kbps | 95.6%/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |

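The SimVQ bandwidth figures in Table 2 are consistent with transmitting one code index per frame at 75 frames per second. That frame rate is an assumption inferred here from the numbers themselves (e.g. 24 kHz audio with 320x temporal downsampling), not something the table states:

```python
import math

frame_rate = 75  # frames per second (assumed, see note above)
# Codebook sizes matching the simvq_4k/8k/65k/262k checkpoints:
for codebook_size in (4096, 8192, 65536, 262144):
    bits_per_frame = math.log2(codebook_size)  # e.g. log2(65536) = 16 bits
    kbps = frame_rate * bits_per_frame / 1000
    print(f"{codebook_size}: {kbps} kbps")     # 0.9, 0.975, 1.2, 1.35 kbps
```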
## Implementations

For detailed instructions on installation, training, and evaluation scripts, please refer to the [GitHub repository](https://github.com/youngsheen/SimVQ).

### Installation
- **Dependencies**: `pip install -r requirements.txt`
- **Extra dependencies for audio evaluation**: `pip install -r requirements_audio.txt`

### Training Scripts
Example scripts are provided for:
* Image Tokenizer Training (`configs/imagenet_simvq_128_B.yaml`)
* Audio Tokenizer Training (`configs/libritts_24khz.yaml`)

**Note:** Some users have reported NaN losses when training SimVQ on audio data. The issue appears to occur sporadically, but we have found that learning-rate warmup helps mitigate it.

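The warmup mentioned above can be added with a standard PyTorch scheduler. A minimal sketch, where the model, base learning rate, and 1,000-step warmup length are illustrative choices rather than values from the repo:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual tokenizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 1000  # illustrative value
# Scale the learning rate linearly from ~0 up to its base value,
# then hold it constant once warmup_steps is reached.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(3):   # training-loop body elided
    optimizer.step()    # ...would follow loss.backward()
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # ~4e-7 after 3 scheduler steps
```

Keeping the learning rate tiny for the first steps gives the quantizer time to settle before full-size updates, which is why warmup tends to suppress early NaNs.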
### Evaluation Scripts
Example scripts are provided for:
* Image Tokenizer Evaluation
* Audio Tokenizer Evaluation

## Reconstruction Visualization

**Figure 2.** Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) the reconstructions.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
</p>

**Figure 3.** Visualization of the SimVQ tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original audio spectrograms and (b) the reconstructed spectrograms.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
</p>

## Acknowledgement

The codebase of SimVQ is adapted from [Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) and [WavTokenizer](https://github.com/jishengpeng/WavTokenizer). Thanks for their wonderful work.

## Citation

If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@misc{zhu2024addressing,
      title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
      author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
      year={2024},
      eprint={2411.02038},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.02038},
}
```