Enhance model card with description, links, pipeline tag, and implementation details

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +113 -3
README.md CHANGED
@@ -1,3 +1,113 @@
- ---
- license: mit
- ---
---
license: mit
pipeline_tag: audio-to-audio
---

# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This repository contains the **SimVQ** model, a simple approach to addressing representation collapse in Vector Quantized (VQ) models, presented in the paper [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).

SimVQ reparameterizes the code vectors through a learnable linear transformation layer over a latent basis, so that training optimizes the *entire linear space* rather than only the nearest *individual code vectors*. This simple yet effective change prevents representation collapse and yields near-full codebook utilization even at large codebook sizes. Extensive experiments on both image and audio tasks show that SimVQ is easy to implement and generalizes well across modalities and architectures.

For the official implementation, training scripts, and further details, please visit the GitHub repository: [https://github.com/youngsheen/SimVQ](https://github.com/youngsheen/SimVQ).
## Algorithm for SimVQ

The core quantizer code can be found at [https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33).

<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png">
</p>

**Note:** Optimizing both the codebook $C$ and the linear layer $W$ works as well.

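To make the idea concrete, here is a minimal NumPy sketch of the lookup step (not the repository's PyTorch implementation; the shapes, initialization, and identity init for $W$ are illustrative assumptions). The quantizer searches the nearest row of the reparameterized codebook $CW$ rather than of $C$ itself, so gradients with respect to $W$ move every code vector at once:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 8192, 8                       # codebook size and code dim (illustrative)
C = rng.normal(size=(K, d))          # frozen latent basis, never updated directly
W = np.eye(d)                        # the one learnable linear layer

def simvq_quantize(z, C, W):
    """Nearest-neighbor lookup against the reparameterized codebook C @ W.

    In the real model a straight-through estimator passes gradients back
    to the encoder; this sketch only shows the forward lookup.
    """
    codebook = C @ W                                   # (K, d)
    dist2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist2.argmin(axis=1)                         # nearest code per input
    return codebook[idx], idx

z = rng.normal(size=(4, d))          # a batch of encoder outputs
q, idx = simvq_quantize(z, C, W)
print(q.shape, idx.shape)            # (4, 8) (4,)
```

Because the gradient flows through $W$ (and optionally $C$), all $K$ rows of the effective codebook are updated jointly, which is what avoids dead codes.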
## Quantitative Comparison

**Table 1.** Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|:------:|:-------------:|:--------------------:|:----:|:-----:|:----:|:----:|:----------:|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
| SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
| SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |

**Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test clean/other sets.

| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|:------:|:---------:|:--------------------:|:-----:|:----:|:----:|:-------:|:----------:|
| EnCodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
| SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
| SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
| SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |

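The SimVQ bandwidths in Table 2 are consistent with a single codebook emitting one token per frame at 75 frames per second (the frame rate is an assumption inferred from the numbers, not stated in the table): bandwidth = frame rate × log2(codebook size).

```python
import math

frame_rate = 75  # tokens per second; assumed, but consistent with Table 2
for K in (4096, 8192, 65536, 262144):
    kbps = frame_rate * math.log2(K) / 1000
    print(f"{K}: {kbps} kbps")
# 4096: 0.9 kbps, 8192: 0.975 kbps, 65536: 1.2 kbps, 262144: 1.35 kbps
```

The codebook sizes 4k through 262k thus map exactly onto the 0.9 to 1.35 kbps rows.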
## Implementation

### Installation

- **Dependencies**: `pip install -r requirements.txt`
- **Extra dependencies for audio evaluation**: `pip install -r requirements_audio.txt`

For detailed instructions on preparing the datasets, please refer to the [GitHub repository](https://github.com/youngsheen/SimVQ).

### Training Scripts

* Image Tokenizer Training
```bash
XDG_CACHE_HOME="dataset/ILSVRC2012" python main.py fit --config configs/imagenet_simvq_128_B.yaml
```

* Audio Tokenizer Training

You can generate the manifest `.txt` files with `generate_manifest.py`.
```bash
DATA_ROOT="/data3/yongxinzhu/libritts/LibriTTS" CUDA_VISIBLE_DEVICES=4,5,6,7 python main.py fit --config configs/libritts_24khz.yaml
```

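If you need to adapt the manifest step to your own data layout, the idea can be sketched as below. This is a hypothetical stand-in, not the repository's `generate_manifest.py`, and the one-audio-path-per-line format is an assumption; check the script in the repository for the authoritative version.

```python
import pathlib

def write_manifest(data_root, split, out_path):
    """Collect .wav files under data_root/split and write one absolute
    path per line (format assumed; see generate_manifest.py upstream)."""
    root = pathlib.Path(data_root) / split
    wavs = sorted(str(p) for p in root.rglob("*.wav"))
    pathlib.Path(out_path).write_text("\n".join(wavs) + "\n")
    return len(wavs)

# e.g. write_manifest("LibriTTS", "train-clean-100", "train.txt")
```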
**Note:** Some users have reported NaN losses when training SimVQ on audio data. This appears to occur sporadically, but we have found that learning rate warmup helps mitigate the problem.

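A minimal linear warmup schedule of the kind that helps here can be sketched as follows; the base learning rate and warmup length are illustrative values, not the repository's config:

```python
def warmup_lr(step, base_lr=1e-4, warmup_steps=5000):
    """Scale the learning rate linearly from ~0 to base_lr over warmup_steps,
    then hold it constant; values are illustrative."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

print(warmup_lr(0))      # 2e-08: tiny first step avoids early instability
print(warmup_lr(4999))   # 0.0001: fully warmed up
```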
### Evaluation Scripts

* Image Tokenizer Evaluation
```bash
XDG_CACHE_HOME="dataset/ILSVRC2012" python evaluation.py --config_file vq_log/simvq_262k/size128/config.yaml --ckpt_path vq_log/simvq_262k/epoch=49-step=250250.ckpt
```

* Audio Tokenizer Evaluation
```bash
DATA_ROOT="dataset/libritts" python evaluation_speech.py --config_file vq_audio_log/simvq_262k/1second/config.yaml --ckpt_path vq_audio_log/simvq_262k/epoch=49-step=138600.ckpt
```

## Reconstruction Visualization

**Figure 2.** Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) the reconstructions.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
</p>

**Figure 3.** Visualization of the SimVQ tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original samples and (b) the reconstructions.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
</p>

## Acknowledgement

The codebase of SimVQ is adapted from [Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) and [WavTokenizer](https://github.com/jishengpeng/WavTokenizer). Thanks for their wonderful work.

## Citation

```bibtex
@misc{zhu2024addressing,
  title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
  author={Yongxin Zhu and Shiying Su and Xingyu Li and Linli Xu and Lidong Bing},
  year={2024},
  eprint={2411.02038},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```