Improve model card: Add pipeline tag, paper and code links, and comprehensive details
#7
by
nielsr
HF Staff
- opened
README.md
CHANGED
|
@@ -1,3 +1,91 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
pipeline_tag: audio-to-audio
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
+
# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
|
| 7 |
+
|
| 8 |
+
This repository contains the official implementation of **SimVQ**, a novel approach to Vector Quantization (VQ) that effectively addresses representation collapse. SimVQ reparameterizes code vectors through a learnable linear transformation layer, optimizing the *entire linear space* to prevent collapse and improve codebook utilization.
|
| 9 |
+
|
| 10 |
+
The model was presented in the paper:
|
| 11 |
+
[Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038)
|
| 12 |
+
|
| 13 |
+
The code for SimVQ is available on GitHub:
|
| 14 |
+
[https://github.com/youngsheen/SimVQ](https://github.com/youngsheen/SimVQ)
|
| 15 |
+
|
| 16 |
+
## Abstract
|
| 17 |
+
|
| 18 |
+
Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, which compromises model capacity and fails to fully solve the problem. We identify the root cause as disjoint codebook optimization, where only a few code vectors are updated via gradient descent. To fix this, we propose **SimVQ**, which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the *entire linear space* rather than nearest *individual code vectors*. Although the multiplication of two linear matrices is equivalent to applying a single linear layer, this simple approach effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures. The code is available at https://github.com/youngsheen/SimVQ.
|
| 19 |
+
|
| 20 |
+
## Algorithm for SimVQ
|
| 21 |
+
|
| 22 |
+
The core code is here: [https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33)
|
| 23 |
+
|
| 24 |
+
<p align="center">
|
| 25 |
+
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png">
|
| 26 |
+
</p>
|
| 27 |
+
|
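The reparameterization described in the abstract can be illustrated with a small NumPy sketch. This is not the repository's implementation: the names `C`, `W`, `quantize`, and `grad_W` are hypothetical, the latent basis is assumed to stay frozen, and `W` is initialized to the identity purely for simplicity. The sketch takes one gradient step on `W` alone and counts how many entries of the effective codebook `C @ W` move, compared with how many codes were actually selected.

```python
import numpy as np

rng = np.random.default_rng(0)

num_codes, dim = 16, 4
# Latent basis C: randomly initialized and kept frozen (an assumption of this sketch).
C = rng.normal(size=(num_codes, dim))
# Learnable linear layer W, initialized to the identity here for simplicity.
W = np.eye(dim)

def quantize(z, C, W):
    """Return nearest-code indices and quantized vectors for encoder outputs z."""
    codebook = C @ W  # effective codebook, shape (num_codes, dim)
    # Squared Euclidean distance between every input and every code.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def grad_W(z, C, W, idx):
    """Gradient of the batch-mean of ||C[idx] @ W - z||^2 with respect to W."""
    residual = C[idx] @ W - z  # shape (batch, dim)
    return 2.0 * C[idx].T @ residual / len(z)

z = rng.normal(size=(8, dim))  # stand-in for encoder outputs
idx, z_q = quantize(z, C, W)

before = C @ W
W_new = W - 0.1 * grad_W(z, C, W, idx)  # one plain gradient step on W only
after = C @ W_new

# Compare how many codes moved with how many were selected by the batch.
moved = int((np.abs(after - before).sum(axis=1) > 1e-12).sum())
selected = len(np.unique(idx))
print(f"selected codes: {selected}, codes moved by the update: {moved}")
```

Because every effective code is a row of `C @ W`, a change to `W` shifts all of them, including codes that no input selected. With a directly parameterized codebook, only the selected rows would receive a gradient.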
| 28 |
+
## Quantitative Comparison
|
| 29 |
+
|
| 30 |
+
**Table 1.** Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.
|
| 31 |
+
| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|
| 32 |
+
|:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
|
| 33 |
+
|VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | -|
|
| 34 |
+
|VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | -|
|
| 35 |
+
|VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
|
| 36 |
+
|FSQ | 64,000 | 100.0% | 2.80 | 0.13| 23.63 | 75.8 | - |
|
| 37 |
+
|LFQ | 65,536 | 100.0% | 2.88 | 0.13| 23.60 | 77.2 | - |
|
| 38 |
+
|VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
|
| 39 |
+
|SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
|
| 40 |
+
|SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
|
| 41 |
+
|SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
|
| 42 |
+
|SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |
|
| 43 |
+
|
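The model card does not define the Codebook Utilization column. A common reading, assumed here, is the fraction of codebook entries selected at least once over the evaluation set. The helper `codebook_utilization` below is a hypothetical illustration of that reading, not code from the repository.

```python
def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that appear at least once in `indices`."""
    used = set(indices)
    return len(used) / codebook_size

# Toy example: six tokens from a codebook of 8 entries, using 3 distinct codes.
tokens = [0, 3, 3, 5, 0, 5]
utilization = codebook_utilization(tokens, 8)
print(f"{utilization:.1%}")
```

Under this reading, 100.0% means every code was chosen by at least one token, while the 1.4% in the first VQGAN row would mean that only about 900 of 65,536 codes were ever used.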
| 44 |
+
**Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets. Paired values are reported as test-clean/test-other.
|
| 45 |
+
|
| 46 |
+
| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|
| 47 |
+
|:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
|
| 48 |
+
|Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
|
| 49 |
+
|Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
|
| 50 |
+
|SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
|
| 51 |
+
|WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
|
| 52 |
+
|WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
|
| 53 |
+
|SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
|
| 54 |
+
|SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
|
| 55 |
+
|SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
|
| 56 |
+
|SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |
|
| 57 |
+
|
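The Bandwidth values of the SimVQ rows in Table 2 are consistent with a single codebook emitting a fixed number of tokens per second, each token costing $\log_2$(codebook size) bits. The sketch below assumes 75 tokens per second and reads the checkpoint suffixes `4k`, `8k`, `65k`, and `262k` as 4,096, 8,192, 65,536, and 262,144 entries. Both are assumptions rather than values stated in this card, and `bitrate_kbps` is a hypothetical helper.

```python
import math

def bitrate_kbps(codebook_size, tokens_per_second):
    """Bitrate of a single-codebook tokenizer in kilobits per second."""
    bits_per_token = math.log2(codebook_size)
    return bits_per_token * tokens_per_second / 1000

# Assumed token rate; not stated in the model card.
TOKENS_PER_SECOND = 75

rates = {size: bitrate_kbps(size, TOKENS_PER_SECOND)
         for size in (4096, 8192, 65536, 262144)}
for size, rate in rates.items():
    print(f"{size:>7} codes: {rate:.3f} kbps")
```

Under these assumptions the four rates come out as 0.9, 0.975, 1.2, and 1.35 kbps, matching the table: each doubling of the codebook adds one bit per token.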
| 58 |
+
## Reconstruction Visualization
|
| 59 |
+
|
| 60 |
+
**Figure 2.** Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) the reconstructed images.
|
| 61 |
+
<p align="center">
|
| 62 |
+
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
|
| 63 |
+
</p>
|
| 64 |
+
|
| 65 |
+
**Figure 3.** Visualization of the Open-MAGVIT2 tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original images and (b) the reconstructed images.
|
| 66 |
+
<p align="center">
|
| 67 |
+
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
|
| 68 |
+
</p>
|
| 69 |
+
|
| 70 |
+
## Implementations
|
| 71 |
+
|
| 72 |
+
For detailed instructions on installation, dataset preparation, and training/evaluation scripts, please refer to the [GitHub repository](https://github.com/youngsheen/SimVQ) and its [Implementation section](https://github.com/youngsheen/SimVQ#implementations).
|
| 73 |
+
|
| 74 |
+
## Acknowledgement
|
| 75 |
+
|
| 76 |
+
The codebase of SimVQ is adapted from [Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) and [WavTokenizer](https://github.com/jishengpeng/WavTokenizer). We thank the authors for their wonderful work.
|
| 77 |
+
|
| 78 |
+
## Citation
|
| 79 |
+
|
| 80 |
+
If you find our work helpful or inspiring, please feel free to cite it.
|
| 81 |
+
|
| 82 |
+
```bibtex
|
| 83 |
+
@misc{zhu2024addressing,
|
| 84 |
+
title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
|
| 85 |
+
author={Yongxin Zhu and Linli Xu and Lidong Bing},
|
| 86 |
+
year={2024},
|
| 87 |
+
eprint={2411.02038},
|
| 88 |
+
archivePrefix={arXiv},
|
| 89 |
+
primaryClass={cs.LG}
|
| 90 |
+
}
|
| 91 |
+
```
|