metadata

license: mit
pipeline_tag: image-to-image

SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This repository contains the official implementation for SimVQ, a novel method presented in the paper Addressing Representation Collapse in Vector Quantized Models with One Linear Layer.

Code: https://github.com/youngsheen/SimVQ

Introduction

Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning, but it often suffers from representation collapse, which leads to low codebook utilization and limits scalability. SimVQ addresses this by reparameterizing code vectors through a learnable linear transformation layer over a latent basis. This simple approach optimizes the entire linear space spanned by the codebook rather than only the individual code vectors nearest to the encoder outputs, which prevents collapse. Experiments on image and audio tasks show that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures.

Algorithm for SimVQ

You can find the core code here: https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33

SimVQ Algorithm
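
As a rough illustration of the idea (not the repository's implementation, which lives in the linked quantize.py), the sketch below builds the effective codebook by passing a latent basis through a linear map and then performs an ordinary nearest-neighbour lookup. It is written in plain Python so that it runs without dependencies. The names `simvq_quantize`, `basis`, and `weight` are hypothetical, and keeping the basis fixed while only the linear map is trained is an assumption of this sketch.

```python
import random


def matvec_rows(rows, weight):
    """Apply a linear map to every row, i.e. compute rows @ weight."""
    dim_out = len(weight[0])
    return [
        [sum(r[k] * weight[k][j] for k in range(len(r))) for j in range(dim_out)]
        for r in rows
    ]


def nearest_code(z, codebook):
    """Index of the code vector closest to z in squared Euclidean distance."""
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)


def simvq_quantize(latents, basis, weight):
    """Quantize latents against the reparameterized codebook basis @ weight."""
    # Every code vector depends on the same weight matrix.
    codebook = matvec_rows(basis, weight)
    indices = [nearest_code(z, codebook) for z in latents]
    quantized = [codebook[i] for i in indices]
    return indices, quantized


rng = random.Random(0)
dim, num_codes = 4, 8
# Latent basis: random and, in this sketch, assumed to stay fixed.
basis = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
# Linear map: identity initialization is an assumption made for illustration.
weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

indices, quantized = simvq_quantize(latents, basis, weight)
print(indices)
```

Because every code vector is a function of the same `weight`, a gradient step on `weight` moves all code vectors together, which is the intuition behind optimizing the whole linear space rather than only the selected codes.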

Quantitative Comparison

Table 1. Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

Method	Codebook Size	Codebook Utilization	rFID	LPIPS	PSNR	SSIM	Checkpoint
VQGAN	65,536	1.4%	3.74	0.17	22.20	70.6	-
VQGAN	65,536	4.5%	3.23	0.15	22.89	72.3	-
VQGAN-FC	65,536	100.0%	2.63	0.13	23.79	77.5	-
FSQ	64,000	100.0%	2.80	0.13	23.63	75.8	-
LFQ	65,536	100.0%	2.88	0.13	23.60	77.2	-
VQGAN-LC	65,536	100.0%	2.40	0.13	23.98	77.3	-
SimVQ (ours)	1,024	100.0%	3.67	0.16	22.34	70.8	huggingface
SimVQ (ours)	8,192	100.0%	2.98	0.14	23.23	74.7	huggingface
SimVQ (ours)	65,536	100.0%	2.24	0.12	24.15	78.4	huggingface
SimVQ (ours)	262,144	100.0%	1.99	0.11	24.68	80.3	huggingface
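
The card does not define the Codebook Utilization column. A common reading, assumed here, is the fraction of codebook entries selected at least once over the evaluation set. A minimal sketch under that assumption (the function name is hypothetical):

```python
def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    used = set(indices)
    return len(used) / codebook_size


# Toy example: token indices produced for an evaluation set.
indices = [0, 3, 3, 7, 0, 1]
print(f"{codebook_utilization(indices, 8):.1%}")  # 4 of 8 codes used -> 50.0%
```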

Table 2. Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets.

Method	Bandwidth	Codebook Utilization	UTMOS	PESQ	STOI	V/UV F1	Checkpoint
Encodec	3.0kbps	-/-%	2.31/2.09	2.05/2.05	0.90/0.88	0.92/0.89	-
Vocos	3.0kbps	-/-%	3.53/3.06	2.40/2.19	0.92/0.90	0.94/0.91	-
SpeechTokenizer	3.0kbps	-/-%	3.56/3.02	1.93/1.74	0.88/0.84	0.93/0.89	-
WavTokenizer	0.9kbps	100/100%	3.74/3.43	2.01/2.26	0.89/0.89	0.92/0.92	-
WavTokenizer	1.05kbps	27/-%	4.00/-	2.36/-	0.81/-	0.94/-	-
SimVQ (ours)	0.9kbps	100.0/100.0%	4.00/3.51	2.33/2.08	0.91/0.88	0.94/0.91	huggingface
SimVQ (ours)	0.975kbps	99.4/99.4%	4.03/3.52	2.42/2.15	0.92/0.88	0.94/0.92	huggingface
SimVQ (ours)	1.2kbps	99.4/99.0%	4.03/3.52	2.54/2.26	0.93/0.90	0.94/0.92	huggingface
SimVQ (ours)	1.35kbps	95.6/94.7%	4.03/3.53	2.61/2.31	0.93/0.90	0.93/0.90	huggingface
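
For a single-codebook tokenizer, bandwidth is usually the token rate multiplied by the bits per token, $\log_2$ of the codebook size. The sketch below applies that formula. The token rate of 75 tokens per second and the codebook sizes are assumptions, not values stated in this card; they are chosen because they reproduce the SimVQ bandwidths listed in Table 2.

```python
import math


def bitrate_kbps(tokens_per_second, codebook_size):
    """Bandwidth of a single-codebook tokenizer in kilobits per second."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000


# Assumed: 75 tokens/s and power-of-two codebook sizes (not stated in the card).
for size in (4096, 8192, 65536, 262144):
    print(size, bitrate_kbps(75, size))
```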

Sample Usage

Installation

Dependencies: pip install -r requirements.txt
Extra dependencies for audio evaluation: pip install -r requirements_audio.txt

Datasets

The datasets should be structured as follows:

imagenet
├── train/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   ├── ...
│   ├── n01443537/
│   ├── ...
└── val/
    ├── ...

LibriTTS
├── train-clean-100/
│   ├── 103/
│   │   ├── 1241/
│   │   │   ├── 103_1241_000000_000001.wav
│   │   │   ├── ...
│   ├── 1034/
│   ├── ...
├── train-clean-360/
│   ├── ...
├── train-other-500/
│   ├── ...
├── dev-other/
│   ├── ...
├── dev-clean/
│   ├── ...
├── test-other/
│   ├── ...
└── test-clean/
    ├── ...
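
Before launching a long training run, it can help to confirm that a dataset directory matches the layout above. The helper below is a hypothetical convenience and not part of the repository: it counts files with a given extension under each top-level split directory.

```python
from pathlib import Path


def count_files(root, suffix):
    """Map each top-level split directory to its number of files with `suffix`."""
    root = Path(root)
    counts = {}
    for split in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[split.name] = sum(
            1
            for f in split.rglob("*")
            if f.is_file() and f.suffix.lower() == suffix.lower()
        )
    return counts
```

For example, `count_files("dataset/libritts", ".wav")` should list every split shown in the tree with a non-zero count, and `count_files("dataset/ILSVRC2012", ".JPEG")` should do the same for `train` and `val`, assuming those are the directories the data was placed in.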

Training Scripts

Image Tokenizer Training

XDG_CACHE_HOME="dataset/ILSVRC2012" python main.py fit --config configs/imagenet_simvq_128_B.yaml

Audio Tokenizer Training

You can generate the manifest .txt with generate_manifest.py.
```
DATA_ROOT="/data3/yongxinzhu/libritts/LibriTTS" CUDA_VISIBLE_DEVICES=4,5,6,7 python main.py fit --config configs/libritts_24khz.yaml
```
Note: Some users have reported NaN issues when training SimVQ on audio data. This appears to occur randomly, but we have found that learning rate warmup helps mitigate the problem.
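
The card does not say which warmup schedule was used. As one possible form, the sketch below ramps the learning rate linearly over the first steps and then holds it constant. The function name, the base learning rate, and the number of warmup steps are hypothetical values chosen for illustration, not settings from the repository's configs.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate up to base_lr, then hold it constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


# Hypothetical values: base_lr=1e-4 and 1000 warmup steps.
for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step, base_lr=1e-4, warmup_steps=1000))
```

In a PyTorch training loop, a function of this shape (returning the multiplier `warmup_lr(step, 1.0, warmup_steps)`) could be passed to `torch.optim.lr_scheduler.LambdaLR`.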

Evaluation Scripts

Image Tokenizer Evaluation

XDG_CACHE_HOME="dataset/ILSVRC2012" python evaluation.py --config_file vq_log/simvq_262k/size128/config.yaml --ckpt_path vq_log/simvq_262k/epoch=49-step=250250.ckpt
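
Of the image metrics in Table 1, PSNR is the simplest to reproduce by hand. The sketch below uses the standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, on flat pixel sequences. It is a generic illustration: evaluation.py may use a different value range or averaging scheme, so numbers from this function are not guaranteed to match the table.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by one intensity level.
print(psnr([10, 20, 30], [11, 21, 31]))
```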

Audio Tokenizer Evaluation

DATA_ROOT="dataset/libritts" python evaluation_speech.py --config_file vq_audio_log/simvq_262k/1second/config.yaml --ckpt_path vq_audio_log/simvq_262k/epoch=49-step=138600.ckpt

Reconstruction Visualization

Figure 2. Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (imagenet_simvq_128_Base version). (a) shows the original images and (b) shows the reconstructed images.

Image Reconstruction

Figure 3. Visualization of the SimVQ tokenizer trained on LibriTTS (libritts_24khz version). (a) shows the original audio spectrograms and (b) shows the reconstructed audio spectrograms.

Audio Reconstruction

Acknowledgement

The codebase of SimVQ is adapted from Open-MAGVIT2 and WavTokenizer. Thanks for their wonderful work.

Citation

If you find our work helpful or inspiring, please feel free to cite it.

@misc{zhu2024simvq,
      title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
      author={Yongxin Zhu and Dan Su and Liqiang He and Linli Xu and Lidong Bing},
      year={2024},
      eprint={2411.02038},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}