nielsr (HF Staff) committed
Commit 66d66ac · verified
1 Parent(s): 5df9004

Improve model card: Add pipeline tag, paper, code, abstract, quantitative results, sample usage, and visualizations


This PR expands the model card for the SimVQ model with details from the paper and its official GitHub repository.

Key changes include:
* Adding `pipeline_tag: image-to-image` to the metadata to improve discoverability for image-related tasks on the Hugging Face Hub.
* Including direct links to the paper and the GitHub repository for easy access to research and code.
* Providing an introduction and algorithm overview, based on the paper's abstract, to explain the model's innovation.
* Adding quantitative comparison tables that showcase SimVQ's performance on both image (ImageNet) and audio (LibriTTS) tasks, along with links to checkpoints.
* Incorporating sample usage instructions, including installation, training, and evaluation scripts, directly from the GitHub README.
* Adding reconstruction visualizations for both image and audio to visually demonstrate the model's capabilities.
* Including acknowledgement and citation sections for proper attribution.

These updates aim to make the model card more informative, accessible, and aligned with Hugging Face Hub best practices.

Files changed (1)
  1. README.md +150 -3
README.md CHANGED
@@ -1,3 +1,150 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ pipeline_tag: image-to-image
4
+ ---
5
+
6
+ # SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
7
+
8
+ This repository contains the official implementation for **SimVQ**, a novel method presented in the paper [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).
9
+
10
+ Code: [https://github.com/youngsheen/SimVQ](https://github.com/youngsheen/SimVQ)
11
+
12
+ ## Introduction
13
+ Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning, but it often suffers from representation collapse, which leads to low codebook utilization and limits scalability. SimVQ addresses this by reparameterizing the code vectors through a learnable linear transformation layer over a latent basis. This optimizes the *entire linear space* spanned by the codebook rather than only the *individual code vectors* selected by nearest-neighbor search, which prevents collapse. Experiments on image and audio tasks show that SimVQ improves codebook usage, is easy to implement, and generalizes across modalities and architectures.
14
+
15
+ ## Algorithm for SimVQ
16
+
17
+ You can find the core code here: [https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33)
18
+
19
+ <p align="center">
20
+ <img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png" alt="SimVQ Algorithm">
21
+ </p>
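
The reparameterization described in the Introduction can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation (see the linked `quantize.py` for that). The toy sizes, the identity initialization of `W`, the fixed `basis`, the step size, and the helper names `matmul` and `nearest` are all assumptions made for illustration. The sketch takes one gradient step on the linear layer and counts how many effective code vectors move, compared with how many were actually selected by nearest-neighbor search.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def nearest(z, codebook):
    """Index of the code vector closest to z in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((zi - ci) ** 2 for zi, ci in zip(z, codebook[k])))

random.seed(0)
K, d, N = 16, 4, 8  # codebook size, code dimension, batch size (toy values)
basis = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]  # latent basis, kept fixed here (assumption)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # linear layer, identity start (assumption)
z = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]      # stand-in encoder outputs

codebook = matmul(basis, W)                 # effective code vectors: basis @ W
idx = [nearest(zn, codebook) for zn in z]   # nearest-neighbor assignment

# Gradient of mean ||basis[idx] @ W - z||^2 with respect to W: (2/N) * B^T (B W - Z)
B = [basis[k] for k in idx]
residual = [[q - t for q, t in zip(codebook[k], zn)] for k, zn in zip(idx, z)]
Bt = [list(col) for col in zip(*B)]
grad_W = [[2.0 * g / N for g in row] for row in matmul(Bt, residual)]
W_new = [[w - 0.1 * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grad_W)]

# Every row of basis @ W depends on W, so unselected codes move as well.
new_codebook = matmul(basis, W_new)
moved = [any(abs(a - b) > 1e-12 for a, b in zip(r0, r1))
         for r0, r1 in zip(codebook, new_codebook)]
print(f"codes selected: {len(set(idx))}/{K}, codes moved by one step on W: {sum(moved)}/{K}")
```

In a vanilla VQ update only the selected rows would receive gradients, which is the behavior the Introduction contrasts SimVQ against.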
22
+
23
+ ## Quantitative Comparison
24
+
25
+ **Table 1.** Reconstruction performance of different tokenizers on $128 \times 128$ ImageNet 50k validation set.
26
+ | Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
27
+ |:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
28
+ |VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | -|
29
+ |VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | -|
30
+ |VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
31
+ |FSQ | 64,000 | 100.0% | 2.80 | 0.13| 23.63 | 75.8 | - |
32
+ |LFQ | 65,536 | 100.0% | 2.88 | 0.13| 23.60 | 77.2 | - |
33
+ |VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
34
+ |SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
35
+ |SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
36
+ |SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
37
+ |SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |
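
For readers unfamiliar with the "Codebook Utilization" column, the following sketch shows one common way to compute such a figure, assuming it is defined as the fraction of codebook entries selected at least once over the evaluation set. The function name and the toy indices are hypothetical, and the paper's exact protocol may differ.

```python
def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that appear at least once in `indices`."""
    used = set(indices)
    if any(i < 0 or i >= codebook_size for i in used):
        raise ValueError("token index outside the codebook")
    return len(used) / codebook_size

# Toy example: 8 tokens drawn from a codebook with 8 entries.
toy_indices = [0, 3, 3, 7, 1, 0, 7, 7]
utilization = codebook_utilization(toy_indices, codebook_size=8)
print(f"{utilization:.1%}")  # 4 of 8 entries used -> 50.0%
```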
38
+
39
+ **Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets.
40
+
41
+ | Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
42
+ |:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
43
+ |Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
44
+ |Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
45
+ |SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
46
+ |WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
47
+ |WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
48
+ |SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
49
+ |SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
50
+ |SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
51
+ |SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.93/0.90 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |
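
The SimVQ bandwidth values in Table 2 are consistent with a single-codebook tokenizer whose bitrate is tokens per second times bits per token. The sketch below reproduces them under two assumptions that are not stated in the model card: a rate of 75 tokens per second, and codebook sizes of 4,096, 8,192, 65,536, and 262,144 inferred from the checkpoint names (`simvq_4k` through `simvq_262k`).

```python
import math

def bandwidth_kbps(tokens_per_second, codebook_size):
    """Bitrate of a single-codebook tokenizer in kilobits per second."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000

TOKENS_PER_SECOND = 75  # assumed token rate, not taken from the model card
for size in (4096, 8192, 65536, 262144):  # sizes inferred from checkpoint names
    print(f"{size:>7} codes: {bandwidth_kbps(TOKENS_PER_SECOND, size):.3f} kbps")
```

Under these assumptions the four sizes give 0.900, 0.975, 1.200, and 1.350 kbps, matching the SimVQ rows of the table.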
52
+
53
+ ## Sample Usage
54
+
55
+ ### Installation
56
+
57
+ * **Dependencies**: `pip install -r requirements.txt`
58
+ * **Extra dependencies for audio evaluation**: `pip install -r requirements_audio.txt`
59
+
60
+ ### Datasets
61
+ The datasets should be structured as follows:
62
+
63
+ ```
64
+ imagenet
65
+ └── train/
66
+     ├── n01440764
67
+         ├── n01440764_10026.JPEG
68
+         ├── n01440764_10027.JPEG
69
+         ├── ...
70
+     ├── n01443537
71
+         ├── ...
72
+ └── val/
73
+     ├── ...
74
+ ```
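
Before starting a long training run it can help to confirm that a split follows the layout above. The sketch below is a hypothetical helper, not part of the repository: it maps each class folder to its number of `.JPEG` files. It builds a tiny temporary tree so it runs without the real dataset.

```python
import tempfile
from pathlib import Path

def summarize_split(split_dir):
    """Map each class folder in a split directory to its number of .JPEG files."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() == ".jpeg"
            )
    return counts

# Build a tiny stand-in tree that mirrors imagenet/train/<class>/<image>.JPEG.
with tempfile.TemporaryDirectory() as root:
    class_dir = Path(root) / "imagenet" / "train" / "n01440764"
    class_dir.mkdir(parents=True)
    for name in ("n01440764_10026.JPEG", "n01440764_10027.JPEG"):
        (class_dir / name).touch()
    demo_counts = summarize_split(Path(root) / "imagenet" / "train")

print(demo_counts)  # {'n01440764': 2}
```

Pointing `summarize_split` at a real `train/` or `val/` directory and checking for empty class folders is one quick sanity check.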
75
+
76
+ ```
77
+ LibriTTS
78
+ └── train-clean-100/
79
+     ├── 103/
80
+         ├── 1241/
81
+             ├── 103_1241_000000_000001.wav
82
+             ├── ...
83
+     ├── 1034
84
+         ├── ...
85
+ └── train-clean-360/
86
+     ├── ...
87
+ └── train-other-500/
88
+     ├── ...
89
+ └── dev-other/
90
+     ├── ...
91
+ └── dev-clean/
92
+     ├── ...
93
+ └── test-other/
94
+     ├── ...
95
+ └── test-clean/
96
+     ├── ...
97
+ ```
98
+
99
+ ### Training Scripts
100
+ * **Image Tokenizer Training**
101
+ ```bash
102
+ XDG_CACHE_HOME="dataset/ILSVRC2012" python main.py fit --config configs/imagenet_simvq_128_B.yaml
103
+ ```
104
+
105
+ * **Audio Tokenizer Training**
106
+ You can generate the manifest `.txt` files with `generate_manifest.py`.
107
+ ```bash
108
+ DATA_ROOT="/data3/yongxinzhu/libritts/LibriTTS" CUDA_VISIBLE_DEVICES=4,5,6,7 python main.py fit --config configs/libritts_24khz.yaml
109
+ ```
110
+ **Note:** Some users have reported encountering NaN issues when training SimVQ on audio data. This appears to be a random occurrence, but we have found that using learning rate warmup can help mitigate the problem.
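
The note above does not say which warmup schedule was used. As a minimal sketch of the general idea, the function below ramps the learning rate linearly over the first optimizer steps and then holds it constant. The function name, `base_lr`, and `warmup_steps` values are hypothetical and are not taken from the repository's configs.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate to base_lr over the first warmup_steps steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    # step is 0-indexed, so the final warmup step reaches base_lr exactly.
    return base_lr * (step + 1) / warmup_steps

BASE_LR = 1e-4       # illustrative value only
WARMUP_STEPS = 1000  # illustrative value only
for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step, BASE_LR, WARMUP_STEPS))
```

Starting with small steps limits the size of early updates, which is the usual motivation for warmup when training is unstable at the beginning.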
111
+
112
+ ### Evaluation Scripts
113
+ * **Image Tokenizer Evaluation**
114
+ ```bash
115
+ XDG_CACHE_HOME="dataset/ILSVRC2012" python evaluation.py --config_file vq_log/simvq_262k/size128/config.yaml --ckpt_path vq_log/simvq_262k/epoch=49-step=250250.ckpt
116
+ ```
117
+
118
+ * **Audio Tokenizer Evaluation**
119
+ ```bash
120
+ DATA_ROOT="dataset/libritts" python evaluation_speech.py --config_file vq_audio_log/simvq_262k/1second/config.yaml --ckpt_path vq_audio_log/simvq_262k/epoch=49-step=138600.ckpt
121
+ ```
122
+
123
+ ## Reconstruction Visualization
124
+
125
+ **Figure 2.** Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) shows the reconstructed images.
126
+ <p align="center">
127
+ <img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png" alt="Image Reconstruction">
128
+ </p>
129
+
130
+ **Figure 3.** Visualization of the SimVQ tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original audio spectrograms and (b) shows the reconstructed audio spectrograms.
131
+ <p align="center">
132
+ <img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png" alt="Audio Reconstruction">
133
+ </p>
134
+
135
+ ## Acknowledgement
136
+ The codebase of SimVQ is adapted from [Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) and [WavTokenizer](https://github.com/jishengpeng/WavTokenizer). Thanks for their wonderful work.
137
+
138
+ ## Citation
139
+ If you find our work helpful or inspiring, please feel free to cite it.
140
+
141
+ ```bibtex
142
+ @misc{zhu2024simvq,
143
+ title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
144
+ author={Yongxin Zhu and Dan Su and Liqiang He and Linli Xu and Lidong Bing},
145
+ year={2024},
146
+ eprint={2411.02038},
147
+ archivePrefix={arXiv},
148
+ primaryClass={cs.LG}
149
+ }
150
+ ```