	---
	license: mit
	pipeline_tag: image-to-image
	---

	# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

	This repository contains the SimVQ model introduced in [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).

	SimVQ addresses representation collapse in Vector Quantization (VQ) models, a common failure mode that leads to low codebook utilization and limits scalability. Unlike existing solutions that rely on complex optimization schemes or reduced latent dimensionality, SimVQ reparameterizes the code vectors through a learnable linear transformation layer over a latent basis. This simple change optimizes the entire linear space spanned by the codebook rather than individual code vectors, which significantly improves codebook usage and generalizes across modalities and architectures.
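
To make the reparameterization concrete, here is a minimal pure-Python sketch of the lookup step. It assumes a frozen, randomly initialized basis and a square `weight` matrix standing in for the learnable linear layer. The function names (`matmul`, `nearest_code`, `simvq_quantize`), the toy sizes, and the identity initialization are illustrative choices, not the repository's implementation.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def nearest_code(z, codebook):
    """Return the index of the codebook row closest to z in squared L2 distance."""
    def sq_dist(code):
        return sum((zi - ci) ** 2 for zi, ci in zip(z, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def simvq_quantize(latents, basis, weight):
    """Quantize each latent against the reparameterized codebook basis @ weight."""
    codebook = matmul(basis, weight)  # K x d effective code vectors
    indices = [nearest_code(z, codebook) for z in latents]
    return indices, [codebook[k] for k in indices]


rng = random.Random(0)
dim, num_codes = 4, 8  # toy sizes, chosen only for illustration

# Latent basis: assumed to stay fixed during training.
basis = [[rng.gauss(0, dim ** -0.5) for _ in range(dim)] for _ in range(num_codes)]
# Linear layer: the trainable part; identity initialization is an assumption.
weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
# Stand-in for encoder outputs.
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

indices, quantized = simvq_quantize(latents, basis, weight)
print(indices)
```

In a trained model the latents would come from the encoder and `weight` from the optimizer; the linked repository contains the actual PyTorch module.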

	Code: https://github.com/youngsheen/SimVQ

	## Algorithm for SimVQ

	The core code of SimVQ's quantization mechanism can be found in the [GitHub repository](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33).

	<p align="center">
	<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png">
	</p>
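
As a rough illustration of why a shared linear layer counters collapse, the sketch below takes one gradient step on the squared error between an encoder output and its assigned code, with respect to the shared matrix only. The toy basis, learning rate, and helper names are assumptions for illustration, and the loss is a simplified stand-in for the full training objective.

```python
def row_times_matrix(row, matrix):
    """Compute row (length d) times matrix (d x d); returns a length-d list."""
    return [sum(r * m for r, m in zip(row, col)) for col in zip(*matrix)]


def sgd_step_on_weight(z, basis_row, weight, lr):
    """One gradient step on ||z - basis_row @ weight||^2 with respect to weight."""
    q = row_times_matrix(basis_row, weight)
    residual = [zi - qi for zi, qi in zip(z, q)]
    # Gradient: d/dW ||z - bW||^2 = -2 * outer(b, z - bW), so we add 2*lr*outer.
    return [[w + lr * 2.0 * b * r for w, r in zip(w_row, residual)]
            for b, w_row in zip(basis_row, weight)]


basis = [[1.0, 0.0], [0.5, 0.5], [0.2, -1.0]]  # fixed latent basis (toy values)
weight = [[1.0, 0.0], [0.0, 1.0]]              # shared linear layer
z = [2.0, 1.0]                                  # encoder output assigned to code 0

before = [row_times_matrix(b, weight) for b in basis]
weight = sgd_step_on_weight(z, basis[0], weight, lr=0.1)
after = [row_times_matrix(b, weight) for b in basis]

moved = [old != new for old, new in zip(before, after)]
print(moved)  # [True, True, True]
```

Only code 0 was selected here, yet all three code vectors move, because each is a linear function of the same `weight`. With a plain VQ codebook, the same step would update code 0 alone.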

	## Quantitative Comparison

	Table 1. Reconstruction performance of different tokenizers on $128 \times 128$ ImageNet 50k validation set.

	| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
	|:------:|:-------------:|:--------------------:|:----:|:-----:|:----:|:----:|:----------:|
	| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
	| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
	| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
	| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
	| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
	| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
	| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
	| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
	| SimVQ (ours) | 65,536 | 100.0% | 2.24 | 0.12 | 24.15 | 78.4 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
	| SimVQ (ours) | 262,144 | 100.0% | 1.99 | 0.11 | 24.68 | 80.3 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |
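
The Codebook Utilization column can be read as the share of codebook entries that are selected at least once during evaluation. A minimal sketch under that assumed definition follows; the function name and the toy token indices are hypothetical.

```python
def codebook_utilization(indices, codebook_size):
    """Percentage of codebook entries that appear at least once in indices."""
    if codebook_size <= 0:
        raise ValueError("codebook_size must be positive")
    used = {k for k in indices if 0 <= k < codebook_size}
    return 100.0 * len(used) / codebook_size


# Toy example: 8 tokens drawn from a codebook of size 8, hitting 4 distinct codes.
token_indices = [0, 3, 3, 7, 1, 0, 7, 7]
utilization = codebook_utilization(token_indices, codebook_size=8)
print(f"{utilization:.1f}%")  # 50.0%
```

Under this reading, a collapsed codebook shows up as a small set of distinct indices no matter how many tokens are evaluated.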

	Table 2. Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets.

	| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
	|:------:|:---------:|:--------------------:|:-----:|:----:|:----:|:-------:|:----------:|
	| Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
	| Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
	| SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
	| WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
	| WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
	| SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
	| SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
	| SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
	| SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |
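
The SimVQ bandwidths in Table 2 are consistent with a single codebook emitting a fixed number of tokens per second, each costing $\log_2$(codebook size) bits. The sketch below reproduces them under two assumptions that are inferred from the table and the checkpoint names rather than stated in this card: a rate of 75 tokens per second, and codebook sizes of 4,096, 8,192, 65,536, and 262,144 for the `simvq_4k`, `simvq_8k`, `simvq_65k`, and `simvq_262k` checkpoints.

```python
import math

TOKENS_PER_SECOND = 75  # assumption inferred from the table, not stated in the card


def bandwidth_kbps(codebook_size, tokens_per_second=TOKENS_PER_SECOND):
    """Bitrate of a single-codebook tokenizer in kilobits per second."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000


# Codebook sizes assumed from the checkpoint directory names.
assumed_sizes = {"simvq_4k": 4096, "simvq_8k": 8192,
                 "simvq_65k": 65536, "simvq_262k": 262144}
bandwidths = {name: bandwidth_kbps(size) for name, size in assumed_sizes.items()}

for name, kbps in bandwidths.items():
    print(f"{name}: {kbps:.3f} kbps")
```

Each fourfold or larger increase in codebook size adds only a few bits per token, which is why the largest codebook costs 1.35kbps against 0.9kbps for the smallest.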

	## Reconstruction Visualization

	Figure 2. Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) shows the reconstructed images.
	<p align="center">
	<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png">
	</p>

	Figure 3. Visualization of the Open-MAGVIT2 tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the originals and (b) shows the reconstructions.
	<p align="center">
	<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png">
	</p>

	## Citation
	If you find our work helpful or inspiring, please feel free to cite it.
	```bibtex
	@misc{zhu2024addressing,
	title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
	author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
	year={2024},
	eprint={2411.02038},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2411.02038},
	}
	```