Improve model card: Add pipeline tag, paper, code, abstract, quantitative results, sample usage, and visualizations
#4
by
nielsr
HF Staff
- opened
README.md
CHANGED
@@ -1,3 +1,150 @@
----
-license: mit
-
---
license: mit
pipeline_tag: image-to-image
---

# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This repository contains the official implementation of **SimVQ**, the method presented in the paper [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).

Code: [https://github.com/youngsheen/SimVQ](https://github.com/youngsheen/SimVQ)

## Introduction
Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but often suffers from representation collapse, leading to low codebook utilization and limited scalability. SimVQ addresses this by reparameterizing code vectors through a learnable linear transformation layer over a latent basis. This simple yet effective approach optimizes the *entire linear space* rather than only the *individual code vectors* selected by nearest-neighbor search, which prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures.
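
The reparameterization described above can be illustrated with a small NumPy sketch of the forward pass only. This is not the repository's implementation: the function names, the toy sizes, and the choice to treat the latent basis as fixed while only the matrix `W` would be trained are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simvq_codebook(basis, W):
    """Effective codebook: each code vector is a latent basis row times W."""
    return basis @ W

def quantize(z, codebook):
    """Nearest-neighbour lookup: closest code index and vector for each row of z."""
    # Squared Euclidean distances, shape (num_inputs, num_codes).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

num_codes, dim = 16, 4                       # toy sizes (assumption)
basis = rng.normal(size=(num_codes, dim))    # latent basis, kept fixed in this sketch
W = np.eye(dim)                              # the single linear layer
z = rng.normal(size=(8, dim))                # stand-in for encoder outputs

codebook = simvq_codebook(basis, W)
idx, z_q = quantize(z, codebook)

# Perturbing W moves every code vector, including codes that no input selected.
W_new = W + 0.1 * rng.normal(size=(dim, dim))
moved = np.any(simvq_codebook(basis, W_new) != codebook, axis=1)
print(f"{len(set(idx.tolist()))} codes selected, {int(moved.sum())} codes moved")
```

Because all code vectors share `W`, an update to `W` shifts the whole codebook at once, whereas updating code vectors directly only touches the rows picked by the nearest-neighbour search.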

## Algorithm for SimVQ

You can find the core code here: [https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33)

<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png" alt="SimVQ Algorithm">
</p>

## Quantitative Comparison

**Table 1.** Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|:------:|:-------------:|:--------------------:|:----:|:-----:|:----:|:----:|:----------:|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
| SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
| SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |
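
For readers unfamiliar with the Codebook Utilization column, a plausible reading is the share of codebook entries selected at least once over the evaluation set. The sketch below computes that quantity; the function name is hypothetical, and the exact counting used by the repository's evaluation script is not shown here.

```python
def codebook_utilization(indices, codebook_size):
    """Percentage of codebook entries that appear at least once in `indices`."""
    used = set(indices)                      # distinct code indices that were selected
    return 100.0 * len(used) / codebook_size

# Toy example: 4 tokens drawn from a codebook of 4 entries; entry 2 is never used.
print(codebook_utilization([0, 1, 1, 3], codebook_size=4))
```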

**Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets.

| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|:------:|:---------:|:--------------------:|:-----:|:----:|:----:|:-------:|:----------:|
| Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
| SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
| SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
| SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.93/0.90 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |
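
The SimVQ bandwidths in Table 2 are consistent with a single codebook emitting a fixed number of tokens per second, each token costing log2(codebook size) bits. The sketch below reproduces the four values under two assumptions that are not stated in this card: a rate of 75 tokens per second, and codebook sizes read off the checkpoint directory names (4k = 4,096 and so on).

```python
import math

def bandwidth_kbps(codebook_size, tokens_per_second=75.0):
    """Bitrate of a single-codebook tokenizer in kbps (token rate is an assumption)."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000.0

# Codebook sizes inferred from the checkpoint directory names (assumption).
for name, size in [("simvq_4k", 4096), ("simvq_8k", 8192),
                   ("simvq_65k", 65536), ("simvq_262k", 262144)]:
    print(f"{name}: {bandwidth_kbps(size):.3f} kbps")
```

Under these assumptions the results are 0.900, 0.975, 1.200 and 1.350 kbps, matching the Bandwidth column for the SimVQ rows.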

## Sample Usage

### Installation

* **Dependencies**: `pip install -r requirements.txt`
* **Extra dependencies for audio evaluation**: `pip install -r requirements_audio.txt`

### Datasets
The datasets should be structured as follows:

```
imagenet
├── train/
    ├── n01440764
        ├── n01440764_10026.JPEG
        ├── n01440764_10027.JPEG
        └── ...
    ├── n01443537
    └── ...
└── val/
    └── ...
```

```
LibriTTS
├── train-clean-100/
    ├── 103/
        ├── 1241/
            ├── 103_1241_000000_000001.wav
            └── ...
    ├── 1034
    └── ...
├── train-clean-360/
    └── ...
├── train-other-500/
    └── ...
├── dev-other/
    └── ...
├── dev-clean/
    └── ...
├── test-other/
    └── ...
└── test-clean/
    └── ...
```
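
To sanity-check a LibriTTS copy against the layout above, one option is to enumerate the `.wav` files under a split and write them to a text file, one path per line. This is a minimal sketch with hypothetical helper names; whether its output matches what the repository's `generate_manifest.py` produces is an assumption, so prefer that script for actual training.

```python
from pathlib import Path

def list_wavs(root, split):
    """Sorted .wav paths under <root>/<split>/<speaker>/<chapter>/."""
    split_dir = Path(root) / split
    return sorted(str(p) for p in split_dir.glob("*/*/*.wav"))

def write_manifest(root, split, out_path):
    """Write one audio path per line (hypothetical format) and return the file count."""
    paths = list_wavs(root, split)
    Path(out_path).write_text("".join(p + "\n" for p in paths))
    return len(paths)

# Example (paths are placeholders):
# write_manifest("dataset/libritts/LibriTTS", "test-clean", "test_clean_manifest.txt")
```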

### Training Scripts
* **Image Tokenizer Training**
```bash
XDG_CACHE_HOME="dataset/ILSVRC2012" python main.py fit --config configs/imagenet_simvq_128_B.yaml
```

* **Audio Tokenizer Training**
You can generate the manifest `.txt` files with `generate_manifest.py`.
```bash
DATA_ROOT="/data3/yongxinzhu/libritts/LibriTTS" CUDA_VISIBLE_DEVICES=4,5,6,7 python main.py fit --config configs/libritts_24khz.yaml
```
**Note:** Some users have reported NaN issues when training SimVQ on audio data. The problem appears to occur randomly, but we have found that learning rate warmup helps mitigate it.
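
As an illustration of the warmup mentioned in the note, the sketch below ramps the learning rate linearly from near zero to its base value over the first steps. The function name, the linear shape, and the example values are assumptions; the training configs in the repository may implement warmup differently.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: scale base_lr by (step + 1) / warmup_steps, then hold it."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps

# Example with hypothetical values: base learning rate 1e-4, 1000 warmup steps.
for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step, base_lr=1e-4, warmup_steps=1000))
```

In a PyTorch training loop, the same ramp could be expressed as a multiplicative factor passed to `torch.optim.lr_scheduler.LambdaLR`.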

### Evaluation Scripts
* **Image Tokenizer Evaluation**
```bash
XDG_CACHE_HOME="dataset/ILSVRC2012" python evaluation.py --config_file vq_log/simvq_262k/size128/config.yaml --ckpt_path vq_log/simvq_262k/epoch=49-step=250250.ckpt
```

* **Audio Tokenizer Evaluation**
```bash
DATA_ROOT="dataset/libritts" python evaluation_speech.py --config_file vq_audio_log/simvq_262k/1second/config.yaml --ckpt_path vq_audio_log/simvq_262k/epoch=49-step=138600.ckpt
```

## Reconstruction Visualization

**Figure 2.** Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) shows the reconstructed images.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png" alt="Image Reconstruction">
</p>

**Figure 3.** Visualization of the Open-MAGVIT2 tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original audio spectrograms and (b) shows the reconstructed audio spectrograms.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png" alt="Audio Reconstruction">
</p>

## Acknowledgement
The codebase of SimVQ is adapted from [Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) and [WavTokenizer](https://github.com/jishengpeng/WavTokenizer). Thanks for their wonderful work.

## Citation
If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@misc{zhu2024simvq,
      title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
      author={Yongxin Zhu and Dan Su and Liqiang He and Linli Xu and Lidong Bing},
      year={2024},
      eprint={2411.02038},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```