license: mit
pipeline_tag: image-to-image
SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
This repository contains the SimVQ model introduced in Addressing Representation Collapse in Vector Quantized Models with One Linear Layer.
SimVQ addresses representation collapse in Vector Quantization (VQ) models, a common issue that leads to low codebook utilization and limits scalability. Unlike existing solutions that rely on complex optimizations or reduced latent dimensionality, SimVQ reparameterizes the code vectors through a learnable linear transformation layer over a latent basis. This optimizes the entire linear space rather than individual code vectors, which improves codebook usage (100% utilization at codebook sizes up to 262,144 in Table 1) and generalizes across modalities and architectures.
Code: https://github.com/youngsheen/SimVQ
Algorithm for SimVQ
The core code of SimVQ's quantization mechanism can be found in the GitHub repository.
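
As a rough illustration of the reparameterization described above, the following NumPy sketch builds the effective codebook as `basis @ W` and performs a nearest-neighbor lookup. It is not the repository's implementation: the function name `simvq_quantize`, the toy sizes, and the choice to treat the basis as fixed while only `W` would be trained are assumptions made for this example, and training details (losses, gradient estimators) are omitted.

```python
import numpy as np

def simvq_quantize(z, basis, W):
    """Quantize z against the reparameterized codebook basis @ W.

    z:     (n, d) encoder outputs
    basis: (K, d) latent basis (assumed fixed in this sketch)
    W:     (d, d) linear transformation (the trainable part in this sketch)
    """
    codebook = basis @ W                      # (K, d) effective code vectors
    # Squared Euclidean distance between every input and every code vector.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    indices = d2.argmin(axis=1)               # nearest code per input
    return codebook[indices], indices

rng = np.random.default_rng(0)
K, d, n = 16, 4, 8                            # hypothetical toy sizes
basis = rng.normal(size=(K, d))
W = np.eye(d)                                 # identity: reduces to a plain VQ lookup
z = rng.normal(size=(n, d))
z_q, idx = simvq_quantize(z, basis, W)
```

Because every code vector is a row of `basis @ W`, any update to `W` moves all `K` codes together, which is the sense in which the whole linear space is optimized rather than individual code vectors.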
Quantitative Comparison
Table 1. Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.
| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|---|---|---|---|---|---|---|---|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | huggingface |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | huggingface |
| SimVQ (ours) | 65,536 | 100.0% | 2.24 | 0.12 | 24.15 | 78.4 | huggingface |
| SimVQ (ours) | 262,144 | 100.0% | 1.99 | 0.11 | 24.68 | 80.3 | huggingface |
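
The Codebook Utilization column can be reproduced in spirit with a few lines of NumPy. This sketch assumes utilization means the fraction of codebook entries selected at least once over the evaluation set, which is the usual definition but is not spelled out in this card; `codebook_utilization` and the toy indices are hypothetical.

```python
import numpy as np

def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that are selected at least once."""
    used = np.unique(np.asarray(indices).ravel())
    return used.size / codebook_size

# Hypothetical token indices from a toy run with an 8-entry codebook.
toy_indices = [[0, 3, 3, 7], [1, 0, 7, 7]]
util = codebook_utilization(toy_indices, codebook_size=8)
print(f"{util:.1%}")  # 50.0%
```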
Table 2. Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets.
| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|---|---|---|---|---|---|---|---|
| Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | huggingface |
| SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | huggingface |
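
For readers relating the Bandwidth column to codebook size, the bitrate of a single-codebook tokenizer is the token rate times the bits per token. The sketch below is only a worked example under assumptions that this card does not state: a single quantizer, 75 tokens per second, and codebook sizes of 4,096 to 262,144. Under those assumptions the arithmetic lands on the same bandwidth values as the SimVQ rows.

```python
import math

def bandwidth_kbps(tokens_per_second, codebook_size, num_quantizers=1):
    """Bitrate in kbps: tokens/s * log2(codebook size) * number of quantizers."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token * num_quantizers / 1000.0

# Assumed (not stated in this card): 75 tokens/s, one quantizer.
for size in (4096, 8192, 65536, 262144):
    print(size, bandwidth_kbps(75, size))
```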
Reconstruction Visualization
Figure 2. Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (imagenet_simvq_128_Base version). (a) shows the original images and (b) the reconstructed images.
Figure 3. Visualization of the Open-MAGVIT2 tokenizer trained on LibriTTS (libritts_24khz version). (a) shows the originals and (b) the reconstructions.
Citation
If you find our work helpful or inspiring, please feel free to cite it.
@misc{luo2024semievol,
title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
year={2024},
eprint={2411.02038},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.02038},
}