---
license: mit
pipeline_tag: image-to-image
---

# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This repository contains the official implementation of SimVQ, a novel method introduced in the paper [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://arxiv.org/abs/2411.02038).

SimVQ addresses the critical problem of representation collapse in Vector Quantization (VQ) models. It proposes a simple yet effective solution by reparameterizing code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than just individual code vectors. This approach effectively prevents collapse, significantly improves codebook utilization, and generalizes well across various modalities and architectures, as demonstrated on both image and audio tasks.

## Algorithm for SimVQ

You can find the core code in [`taming/modules/vqvae/quantize.py#L28-L33`](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33).

*Figure 1. The SimVQ algorithm.*
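For intuition, here is a minimal PyTorch sketch of the idea. The class and argument names (`SimVQSketch`, `codebook_size`, `embed_dim`) are illustrative rather than the repository's API, and training losses are omitted; the linked `quantize.py` is the authoritative implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimVQSketch(nn.Module):
    """Illustrative sketch of SimVQ-style quantization (not the repo code)."""

    def __init__(self, codebook_size: int, embed_dim: int):
        super().__init__()
        # Frozen latent basis: the individual code vectors are never updated.
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.codebook.weight.requires_grad = False
        # One learnable linear layer reparameterizes the whole codebook, so
        # gradients optimize the entire linear space instead of only the few
        # code vectors that happen to be selected.
        self.proj = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_tokens, embed_dim) continuous encoder outputs.
        codes = self.proj(self.codebook.weight)            # (K, D) effective codebook
        flat = z.reshape(-1, z.size(-1))                   # (B*N, D)
        indices = torch.cdist(flat, codes).argmin(dim=-1)  # nearest code per token
        z_q = F.embedding(indices, codes).view_as(z)       # quantized latents
        # Straight-through estimator: pass decoder gradients back to the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[:-1])
```

Because only `proj` receives gradients, all code vectors move together under a single linear map, which is what keeps the full codebook in use instead of letting most entries go dead.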

## Quantitative Comparison

Table 1. Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID↓ | LPIPS↓ | PSNR↑ | SSIM↑ | Checkpoint |
|---|---|---|---|---|---|---|---|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | huggingface |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | huggingface |
| SimVQ (ours) | 65,536 | 100.0% | 2.24 | 0.12 | 24.15 | 78.4 | huggingface |
| SimVQ (ours) | 262,144 | 100.0% | 1.99 | 0.11 | 24.68 | 80.3 | huggingface |
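The codebook-utilization column in both tables is the fraction of distinct codebook entries actually selected when encoding the evaluation set. A hypothetical helper (not part of the repo) for computing it from quantizer indices:

```python
import torch

def codebook_utilization(indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries used at least once (hypothetical helper)."""
    return torch.unique(indices).numel() / codebook_size
```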

Table 2. Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets (each cell reports clean/other).

| Method | Bandwidth | Codebook Utilization | UTMOS↑ | PESQ↑ | STOI↑ | V/UV F1↑ | Checkpoint |
|---|---|---|---|---|---|---|---|
| Encodec | 3.0 kbps | -/- | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0 kbps | -/- | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0 kbps | -/- | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9 kbps | 100%/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05 kbps | 27%/- | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9 kbps | 100.0%/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | huggingface |
| SimVQ (ours) | 0.975 kbps | 99.4%/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.2 kbps | 99.4%/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.35 kbps | 95.6%/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | huggingface |
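Bandwidth here is token rate times bits per token ($\log_2$ of the codebook size). A sketch of the arithmetic, under assumed numbers (a WavTokenizer-style 75 tokens/s rate and a 4,096-entry codebook, which may not match the exact configurations above):

```python
import math

token_rate = 75                              # tokens per second (assumed)
codebook_size = 4096                         # 12 bits per token (assumed)
kbps = token_rate * math.log2(codebook_size) / 1000
print(kbps)                                  # 0.9
```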

## Reconstruction Visualization

Figure 2. Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (imagenet_simvq_128_Base version). (a) shows the original images and (b) the reconstructions.


Figure 3. Visualization of the SimVQ tokenizer trained on LibriTTS (libritts_24khz version). (a) shows the original audio and (b) the reconstruction.


## Acknowledgement

The codebase of SimVQ is adapted from Open-MAGVIT2 and WavTokenizer. Thanks for their wonderful work.

## Citation

If you find our work useful, please consider citing our paper:

```bibtex
@misc{zhu2024addressing,
      title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
      author={Yongxin Zhu and Linli Xu},
      year={2024},
      eprint={2411.02038},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```