---
license: mit
pipeline_tag: image-to-image
---

# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This repository contains the SimVQ model introduced in [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).

SimVQ addresses representation collapse in vector quantization (VQ) models, a common failure mode that leads to low codebook utilization and limits scalability. Unlike existing remedies that rely on complex optimization schemes or a reduced latent dimensionality, SimVQ reparameterizes the code vectors through a learnable linear transformation layer over a latent basis. Training therefore optimizes the entire linear space spanned by the codebook rather than individual code vectors, which raises codebook utilization (between roughly 95% and 100% in the tables below) and carries over across modalities and architectures.

Code: https://github.com/youngsheen/SimVQ

## Algorithm for SimVQ

The core of SimVQ's quantization mechanism is implemented in [`taming/modules/vqvae/quantize.py`](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33) in the GitHub repository.

<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png">
</p>

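The reparameterization idea can be illustrated with a minimal NumPy sketch of the forward lookup. This is not the repository's implementation: the names `simvq_quantize`, `basis`, and `W` are hypothetical, the basis is assumed to stay fixed while only `W` would be trained, and the bias term, straight-through gradient, and losses are omitted.

```python
import numpy as np


def simvq_quantize(z, basis, W):
    """Quantize inputs z (N, d) against a reparameterized codebook.

    basis: (K, d) latent basis (assumed fixed in this sketch).
    W:     (d, d) weight of the linear layer (the part that would be learned).
    """
    # Every code vector is a linear transform of a basis vector, so a change
    # to W moves all K codes at once, not only the ones that were selected.
    codebook = basis @ W  # (K, d)

    # Squared Euclidean distance between each input and each code vector.
    d2 = (
        (z ** 2).sum(axis=1, keepdims=True)
        - 2.0 * z @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = d2.argmin(axis=1)  # index of the nearest code per input
    return codebook[indices], indices


rng = np.random.default_rng(0)
K, d = 16, 4                      # toy sizes chosen for illustration
basis = rng.normal(size=(K, d))
W = np.eye(d)                     # identity: codebook equals the basis
z = rng.normal(size=(8, d))       # stand-in for encoder outputs

z_q, idx = simvq_quantize(z, basis, W)
print(z_q.shape, idx.shape)
```

In a trainable version, gradients with respect to `W` would flow through `basis @ W`, which is why the whole codebook is updated on every step in this formulation.
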
## Quantitative Comparison

**Table 1.** Reconstruction performance of different tokenizers on the ImageNet 50k validation set at $128 \times 128$ resolution.

| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|:------:|:-------------:|:--------------------:|:----:|:-----:|:----:|:----:|:----------:|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
| SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
| SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |

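For readers unfamiliar with the codebook utilization column, a minimal sketch follows. It assumes utilization means the share of codebook entries that are selected at least once over the evaluation set; the function name `codebook_utilization` is illustrative and does not come from the repository.

```python
def codebook_utilization(indices, codebook_size):
    """Return the fraction of codebook entries selected at least once."""
    used = set(indices)  # distinct code indices that appeared
    return len(used) / codebook_size


# Toy example: only codes 0, 2 and 5 out of 8 are ever selected.
indices = [0, 0, 5, 2, 5, 0]
print(f"{codebook_utilization(indices, 8):.1%}")  # 37.5%
```

Under this definition, a collapsed codebook shows up as a value far below 100% even when the codebook itself is large.
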
**Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets.

| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|:------:|:---------:|:--------------------:|:-----:|:----:|:----:|:-------:|:----------:|
| Encodec | 3.0 kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0 kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0 kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9 kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05 kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9 kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
| SimVQ (ours) | 0.975 kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
| SimVQ (ours) | 1.2 kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
| SimVQ (ours) | 1.35 kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |

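The bandwidth values of the SimVQ rows in Table 2 are consistent with a single codebook emitting a fixed number of tokens per second. The sketch below rests on two assumptions that this card does not state: a token rate of 75 tokens per second, and codebook sizes of 4,096, 8,192, 65,536 and 262,144 inferred from the checkpoint directory names (`simvq_4k`, `simvq_8k`, `simvq_65k`, `simvq_262k`).

```python
import math


def bandwidth_kbps(codebook_size, tokens_per_second=75):
    """Bitrate of a single-codebook tokenizer in kbps.

    tokens_per_second=75 is an assumption, not a value stated in this card.
    """
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000


# Codebook sizes inferred from the checkpoint directory names (assumption).
for size in (4096, 8192, 65536, 262144):
    print(f"{size:>7} codes -> {bandwidth_kbps(size)} kbps")
```

Under these assumptions the four sizes give 0.9, 0.975, 1.2 and 1.35 kbps, matching the bandwidth column, so each quadrupling of the codebook costs only two extra bits per token.
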
## Reconstruction Visualization

**Figure 2.** Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) shows the reconstructed images.

<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
</p>

**Figure 3.** Visualization of the Open-MAGVIT2 tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original samples and (b) shows the reconstructions.

<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
</p>

## Citation

If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@misc{luo2024semievol,
      title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
      author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
      year={2024},
      eprint={2411.02038},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02038},
}
```