Improve model card for SimVQ
#1
by
nielsr
HF Staff
- opened
README.md
CHANGED
|
@@ -1,3 +1,76 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
pipeline_tag: image-to-image
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
+
# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
|
| 7 |
+
|
| 8 |
+
This repository contains the SimVQ model introduced in [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).
|
| 9 |
+
|
| 10 |
+
SimVQ proposes a novel approach to overcome representation collapse in Vector Quantization (VQ) models, a common issue leading to low codebook utilization and limited scalability. Unlike existing solutions that rely on complex optimizations or reduced latent dimensionality, SimVQ reparameterizes code vectors through a learnable linear transformation layer over a latent basis. This simple yet effective method optimizes the entire linear space rather than individual code vectors, significantly improving codebook usage and generalizing across different modalities and architectures.
|
| 11 |
+
|
| 12 |
+
Code: https://github.com/youngsheen/SimVQ
|
| 13 |
+
|
| 14 |
+
## Algorithm for SimVQ
|
| 15 |
+
|
| 16 |
+
The core code of SimVQ's quantization mechanism can be found in the [GitHub repository](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33).
|
| 17 |
+
|
| 18 |
+
<p align="center">
|
| 19 |
+
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png">
|
| 20 |
+
</p>
|
| 21 |
+
|
| 22 |
+
## Quantitative Comparison
|
| 23 |
+
|
| 24 |
+
**Table 1.** Reconstruction performance of different tokenizers on $128 \times 128$ ImageNet 50k validation set.
|
| 25 |
+
| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|
| 26 |
+
|:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
|
| 27 |
+
|VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | -|
|
| 28 |
+
|VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | -|
|
| 29 |
+
|VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
|
| 30 |
+
|FSQ | 64,000 | 100.0% | 2.80 | 0.13| 23.63 | 75.8 | - |
|
| 31 |
+
|LFQ | 65,536 | 100.0% | 2.88 | 0.13| 23.60 | 77.2 | - |
|
| 32 |
+
|VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
|
| 33 |
+
|SimVQ (ours) | 1024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
|
| 34 |
+
|SimVQ (ours) | 8192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
|
| 35 |
+
|SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
|
| 36 |
+
|SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |
|
| 37 |
+
|
| 38 |
+
**Table 2.** Reconstruction performance of different tokenizers on LibriTTS test clean/other set.
|
| 39 |
+
|
| 40 |
+
| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|
| 41 |
+
|:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
|
| 42 |
+
|Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
|
| 43 |
+
|Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
|
| 44 |
+
|SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
|
| 45 |
+
|WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
|
| 46 |
+
|WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
|
| 47 |
+
|SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
|
| 48 |
+
|SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
|
| 49 |
+
|SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
|
| 50 |
+
|SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |
|
| 51 |
+
|
| 52 |
+
## Reconstruction Visualization
|
| 53 |
+
|
| 54 |
+
**Figure 2.** Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) indicates the original images while (b) specifies the reconstruction images.
|
| 55 |
+
<p align="center">
|
| 56 |
+
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
|
| 57 |
+
</p>
|
| 58 |
+
|
| 59 |
+
**Figure 3.** Visualization of the Open-MAGVIT2 tokenizer trained at LibriTTS (`libritts_24khz` version). (a) indicates the original images while (b) specifies the reconstruction images.
|
| 60 |
+
<p align="center">
|
| 61 |
+
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
|
| 62 |
+
</p>
|
| 63 |
+
|
| 64 |
+
## Citation
|
| 65 |
+
If you find our work helpful or inspiring, please feel free to cite it.
|
| 66 |
+
```bibtex
|
| 67 |
+
@misc{luo2024semievol,
|
| 68 |
+
title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
|
| 69 |
+
author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
|
| 70 |
+
year={2024},
|
| 71 |
+
eprint={2411.02038},
|
| 72 |
+
archivePrefix={arXiv},
|
| 73 |
+
primaryClass={cs.CL},
|
| 74 |
+
url={https://arxiv.org/abs/2411.02038},
|
| 75 |
+
}
|
| 76 |
+
```
|