license: mit
pipeline_tag: audio-to-audio

SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer


This repository contains the SimVQ model, presented in the paper Addressing Representation Collapse in Vector Quantized Models with One Linear Layer.

SimVQ tackles the problem of representation collapse in Vector Quantization (VQ) models by reparameterizing code vectors through a learnable linear transformation layer. This method optimizes the entire linear space of the codebook, leading to improved codebook utilization and robust performance across various modalities, including image and audio tasks.

Code is available at https://github.com/youngsheen/SimVQ.

Abstract

Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, which compromises model capacity and fails to fully solve the problem. We identify the root cause as disjoint codebook optimization, where only a few code vectors are updated via gradient descent. To fix this, we propose SimVQ, which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than nearest individual code vectors. Although the multiplication of two linear matrices is equivalent to applying a single linear layer, this simple approach effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures. The code is available at https://github.com/youngsheen/SimVQ.

Algorithm for SimVQ

The core code is available at https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33

SimVQ Algorithm

Note: Optimizing both the codebook C and the linear layer W can work as well.
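
The idea can be illustrated with a small NumPy sketch. This is not the repository's implementation: it assumes a frozen latent basis `C`, an identity-initialized linear layer `W`, and a plain squared-error objective, and it leaves out the straight-through estimator and the encoder/decoder. All names and sizes here are hypothetical. It shows that a single gradient step on `W` moves every code in the effective codebook `C @ W`, even though only a few codes were selected by the nearest-neighbour lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 16, 4, 8                  # codebook size, latent dim, batch size (toy values)

C = rng.normal(size=(K, d))         # latent basis, kept frozen in this sketch
W = np.eye(d)                       # learnable linear layer (identity init is an assumption)
z = rng.normal(size=(n, d))         # stand-in for encoder outputs


def quantize(z, C, W):
    """Nearest-neighbour lookup against the reparameterized codebook C @ W."""
    codes = C @ W                                               # (K, d)
    dist = ((z[:, None, :] - codes[None, :, :]) ** 2).sum(-1)   # (n, K)
    idx = dist.argmin(axis=1)
    return codes[idx], idx


q, idx = quantize(z, C, W)
loss_before = ((q - z) ** 2).sum(axis=1).mean()

# Gradient of mean ||C[idx] @ W - z||^2 with respect to W, assignments held fixed.
grad_W = 2.0 * C[idx].T @ (q - z) / n
W_new = W - 0.01 * grad_W

loss_after = ((C[idx] @ W_new - z) ** 2).sum(axis=1).mean()

# Rows of the effective codebook that changed after the update to W.
moved = np.any(C @ W_new != C @ W, axis=1)

print(f"codes selected by the batch: {len(set(idx.tolist()))} of {K}")
print(f"codes moved by one step on W: {int(moved.sum())} of {K}")
```

With a conventional codebook, the same step would update only the selected rows; here the shared matrix `W` couples all of them.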

Quantitative Comparison

Table 1. Reconstruction performance of different tokenizers on $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID | LPIPS | PSNR | SSIM | Checkpoint |
|---|---|---|---|---|---|---|---|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | huggingface |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | huggingface |
| SimVQ (ours) | 65,536 | 100.0% | 2.24 | 0.12 | 24.15 | 78.4 | huggingface |
| SimVQ (ours) | 262,144 | 100.0% | 1.99 | 0.11 | 24.68 | 80.3 | huggingface |
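
For readers unfamiliar with the Codebook Utilization column, the sketch below shows one common way to compute such a number. It assumes utilization means the fraction of codebook entries selected at least once over the evaluation set; the exact definition used by the evaluation scripts may differ, and the function name is hypothetical.

```python
def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that appear at least once in `indices`."""
    used = set(indices)
    if any(i < 0 or i >= codebook_size for i in used):
        raise ValueError("index outside codebook range")
    return len(used) / codebook_size


# Toy example: 6 tokens from a codebook of 8 entries, 4 distinct codes used.
print(codebook_utilization([0, 3, 3, 5, 0, 7], 8))  # 0.5
```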

Table 2. Reconstruction performance of different tokenizers on LibriTTS test clean/other set.

| Method | Bandwidth | Codebook Utilization | UTMOS | PESQ | STOI | V/UV F1 | Checkpoint |
|---|---|---|---|---|---|---|---|
| Encodec | 3.0kbps | -/-% | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0kbps | -/-% | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0kbps | -/-% | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9kbps | 100/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05kbps | 27/-% | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9kbps | 100.0/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | huggingface |
| SimVQ (ours) | 0.975kbps | 99.4/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.2kbps | 99.4/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.35kbps | 95.6/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | huggingface |
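
As a reading aid for the Bandwidth column: the bitrate of a discrete tokenizer is generally the token rate times the number of bits per token. The sketch below applies that general formula. The token rate of 75 tokens per second and the codebook size of 4096 are hypothetical inputs chosen for illustration, not values stated in this README.

```python
import math


def bandwidth_kbps(tokens_per_second, codebook_size, num_quantizers=1):
    """Bitrate in kbps: tokens/s x quantizers x log2(codebook size) / 1000."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * num_quantizers * bits_per_token / 1000


# Hypothetical single-codebook tokenizer: 75 tokens/s, 4096 codes (12 bits each).
print(bandwidth_kbps(75, 4096))  # 0.9
```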

Implementations

Installation

  • Dependencies: pip install -r requirements.txt
  • Extra dependencies for audio evaluation: pip install -r requirements_audio.txt
  • Datasets
imagenet
└── train/
    ├── n01440764
        ├── n01440764_10026.JPEG
        ├── n01440764_10027.JPEG
        ├── ...
    ├── n01443537
    ├── ...
└── val/
    ├── ...
LibriTTS
└── train-clean-100/
    ├── 103/
        ├── 1241/
            ├── 103_1241_000000_000001.wav
            ├── ...
    ├── 1034
    ├── ...
└── train-clean-360/
    ├── ...
└── train-other-500/
    ├── ...
└── dev-other/
    ├── ...
└── dev-clean/
    ├── ...
└── test-other/
    ├── ...
└── test-clean/
    ├── ...

Training Scripts

  • Image Tokenizer Training
XDG_CACHE_HOME="dataset/ILSVRC2012" python main.py fit --config configs/imagenet_simvq_128_B.yaml
  • Audio Tokenizer Training

You can generate the manifest .txt files with generate_manifest.py.

DATA_ROOT="/data3/yongxinzhu/libritts/LibriTTS" CUDA_VISIBLE_DEVICES=4,5,6,7 python main.py fit --config configs/libritts_24khz.yaml

Note: Some users have reported encountering NaN issues when training SimVQ on audio data. This appears to be a random occurrence, but we have found that using learning rate warmup can help mitigate the problem.
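
A linear warmup schedule of the kind mentioned in the note can be sketched as follows. This is a generic illustration, not the schedule from the repository's configs; the function name, `base_lr`, and `warmup_steps` values are hypothetical.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Ramp linearly up to base_lr over warmup_steps, then hold it constant."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


# Print the learning rate at a few steps of a hypothetical 1000-step warmup.
for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step, base_lr=1e-4, warmup_steps=1000))
```

Starting from a small learning rate keeps the first updates small while the quantizer and encoder are still far from a stable configuration.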

Evaluation Scripts

  • Image Tokenizer Evaluation
XDG_CACHE_HOME="dataset/ILSVRC2012" python evaluation.py --config_file vq_log/simvq_262k/size128/config.yaml --ckpt_path vq_log/simvq_262k/epoch=49-step=250250.ckpt
  • Audio Tokenizer Evaluation
DATA_ROOT="dataset/libritts" python evaluation_speech.py --config_file vq_audio_log/simvq_262k/1second/config.yaml --ckpt_path vq_audio_log/simvq_262k/epoch=49-step=138600.ckpt

Reconstruction Visualization

Figure 2. Visualization of the Open-MAGVIT2 tokenizer trained at $128 \times 128$ resolution (imagenet_simvq_128_Base version). (a) shows the original images and (b) shows the reconstructed images.

Image Reconstruction Visualization

Figure 3. Visualization of the Open-MAGVIT2 tokenizer trained on LibriTTS (libritts_24khz version). (a) shows the original images and (b) shows the reconstructed images.

Audio Reconstruction Visualization

Acknowledgement

The codebase of SimVQ is adapted from Open-MAGVIT2 and WavTokenizer. Thanks for their wonderful work.