Improve model card for SimVQ

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +76 -3
README.md CHANGED
@@ -1,3 +1,76 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ pipeline_tag: image-to-image
4
+ ---
5
+
6
+ # SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
7
+
8
+ This repository contains the SimVQ model introduced in [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).
9
+
10
+ SimVQ addresses representation collapse in Vector Quantization (VQ) models, a common failure that leads to low codebook utilization and limits scalability. Unlike existing solutions that rely on complex optimization schemes or reduced latent dimensionality, SimVQ reparameterizes the code vectors through a learnable linear transformation layer over a latent basis. Because this optimizes the entire linear space spanned by the codebook rather than individual code vectors, it improves codebook usage (100% utilization at a codebook size of 65,536 in Table 1 below) and generalizes across modalities and architectures.
11
+
12
+ Code: https://github.com/youngsheen/SimVQ
13
+
14
+ ## Algorithm for SimVQ
15
+
16
+ The core code of SimVQ's quantization mechanism can be found in the [GitHub repository](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33).
17
+
18
+ <p align="center">
19
+ <img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png">
20
+ </p>
21
+
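The reparameterization described in the model card can be sketched in a few lines of plain Python. This is only an illustrative forward pass, not the repository's implementation: the names `simvq_quantize`, `basis`, and `weight` are hypothetical, the basis is assumed to stay fixed while only the linear map would be trained, and gradients and losses are omitted.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def simvq_quantize(z, basis, weight):
    """Map each row of z to its nearest reparameterized code.

    basis  : K x d latent basis (assumed fixed in this sketch)
    weight : d x d matrix standing in for the learnable linear layer
    """
    # Every effective code is a linear image of a basis row, so updating
    # `weight` moves all K codes at once instead of only the selected ones.
    codebook = matmul(basis, weight)
    indices = []
    for vec in z:
        dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    quantized = [codebook[i] for i in indices]
    return quantized, indices


rng = random.Random(0)
K, d = 8, 4  # toy sizes, chosen only for illustration
basis = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# With an identity weight the codes equal the basis rows,
# so basis rows 3 and 5 are quantized to themselves.
quantized, indices = simvq_quantize([basis[3], basis[5]], basis, identity)
```

The design point the sketch tries to convey is that the nearest-neighbour lookup is unchanged from ordinary VQ; only the way the codebook is produced differs.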
22
+ ## Quantitative Comparison
23
+
24
+ **Table 1.** Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.
25
+ | Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
26
+ |:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
27
+ |VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | -|
28
+ |VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | -|
29
+ |VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
30
+ |FSQ | 64,000 | 100.0% | 2.80 | 0.13| 23.63 | 75.8 | - |
31
+ |LFQ | 65,536 | 100.0% | 2.88 | 0.13| 23.60 | 77.2 | - |
32
+ |VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
33
+ |SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
34
+ |SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
35
+ |SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
36
+ |SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |
37
+
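The "Codebook Utilization" column in Table 1 can be read as the share of codebook entries that are selected at least once over the evaluation set. The card does not spell out the definition, so the helper below (a hypothetical `codebook_utilization`) is a sketch under that assumption.

```python
def codebook_utilization(indices, codebook_size):
    """Return the percentage of codes that appear at least once in `indices`."""
    used = set(indices)  # distinct code ids chosen by the quantizer
    return 100.0 * len(used) / codebook_size


# Toy example: 4 distinct codes (0, 2, 5, 7) out of a codebook of 8.
usage = codebook_utilization([0, 2, 2, 5, 7, 7, 7], codebook_size=8)
```

Under this reading, the 1.4% reported for VQGAN in the first row means that only a small fraction of the 65,536 entries was ever selected.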
38
+ **Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets.
39
+
40
+ | Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
41
+ |:------:|:-------------:|:----:|:----:|:---------------------:|:----:|:----:|:----:|
42
+ |Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
43
+ |Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
44
+ |SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
45
+ |WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
46
+ |WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
47
+ |SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
48
+ |SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
49
+ |SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
50
+ |SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |
51
+
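The SimVQ bandwidths in Table 2 are consistent with a single codebook emitting a fixed number of tokens per second, where each token costs log2(codebook size) bits. The sketch below assumes 75 tokens per second and power-of-two codebook sizes matching the checkpoint names (4k = 4,096 and so on). Neither figure is stated in the card; they are chosen because they reproduce the reported numbers.

```python
import math


def bandwidth_kbps(codebook_size, tokens_per_second):
    """Bitrate of a single-codebook tokenizer in kbps."""
    bits_per_token = math.log2(codebook_size)
    return bits_per_token * tokens_per_second / 1000.0


FRAME_RATE = 75  # assumed tokens per second, not stated in the card

# Assumed codebook sizes, inferred from the checkpoint directory names.
sizes = {"simvq_4k": 4096, "simvq_8k": 8192, "simvq_65k": 65536, "simvq_262k": 262144}
rates = {name: bandwidth_kbps(size, FRAME_RATE) for name, size in sizes.items()}
```

Under these assumptions the four checkpoints come out at 0.9, 0.975, 1.2, and 1.35 kbps, which matches the Bandwidth column.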
52
+ ## Reconstruction Visualization
53
+
54
+ **Figure 2.** Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) the reconstructed images.
55
+ <p align="center">
56
+ <img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
57
+ </p>
58
+
59
+ **Figure 3.** Visualization of the Open-MAGVIT2 tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the originals and (b) the reconstructions.
60
+ <p align="center">
61
+ <img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
62
+ </p>
63
+
64
+ ## Citation
65
+ If you find our work helpful or inspiring, please feel free to cite it.
66
+ ```bibtex
67
+ @misc{zhu2024simvq,
68
+ title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
69
+ author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
70
+ year={2024},
71
+ eprint={2411.02038},
72
+ archivePrefix={arXiv},
73
+ primaryClass={cs.CL},
74
+ url={https://arxiv.org/abs/2411.02038},
75
+ }
76
+ ```