Enhance model card for SimVQ with paper, code, pipeline tag, and key results
This PR significantly enhances the model card for **SimVQ** by adding essential information and improving its discoverability on the Hugging Face Hub.
Key updates include:
- **Metadata**: Added `pipeline_tag: image-to-image` to categorize the model for relevant tasks. The existing `license: mit` is retained.
- **Model Description**: Incorporated a summary of the model from the paper's abstract, explaining its purpose and approach.
- **Links**: Provided direct links to the scientific paper ([Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038)) and the official GitHub repository (https://github.com/youngsheen/SimVQ).
- **Key Results**: Integrated crucial visual and quantitative data from the GitHub README, including the SimVQ algorithm diagram, detailed quantitative comparison tables for image and audio reconstruction, and visualization examples of reconstructed images and audio. All relative image paths have been updated to absolute raw GitHub URLs.
- **Acknowledgements**: Included the acknowledgements section from the original repository.
- **Citation**: Added a BibTeX entry for easy academic citation.
This update ensures the model card is informative, well-linked, and accurately represented for the community.
---
license: mit
pipeline_tag: image-to-image
---

# SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This repository contains the official implementation of **SimVQ**, a method introduced in the paper [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038).

SimVQ addresses representation collapse in vector-quantized (VQ) models. Rather than updating individual code vectors, it reparameterizes them through a learnable linear transformation layer applied to a latent basis, so that training optimizes the *entire linear space* the codes live in. This prevents collapse, substantially improves codebook utilization, and generalizes across modalities and architectures, as demonstrated on both image and audio tasks.

- **Paper**: [Addressing Representation Collapse in Vector Quantized Models with One Linear Layer](https://huggingface.co/papers/2411.02038)
- **Code**: [https://github.com/youngsheen/SimVQ](https://github.com/youngsheen/SimVQ)

## Algorithm for SimVQ

The core quantization code is in [`taming/modules/vqvae/quantize.py#L28-L33`](https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33).

<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/Algorithm.png" alt="SimVQ Algorithm">
</p>

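To make the reparameterization concrete, here is a minimal NumPy sketch of the idea (illustrative only; the function and variable names are made up, the official code is PyTorch, and the straight-through gradient trick used during training is omitted). The effective codebook is `basis @ W`, so nearest-neighbour quantization happens in the space produced by the single linear layer:

```python
import numpy as np

def simvq_quantize(z, basis, W):
    """Sketch of SimVQ-style quantization.

    `basis` is the frozen, randomly initialised codebook and `W` is the
    single learnable linear layer: the effective code vectors are
    `basis @ W`, so training W moves the whole linear space spanned by
    the basis rather than individual code vectors.
    """
    codes = basis @ W                                        # (K, d) effective codebook
    # nearest-neighbour assignment under squared Euclidean distance
    d2 = ((z[:, None, :] - codes[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d2.argmin(axis=1)
    return codes[idx], idx

rng = np.random.default_rng(0)
basis = rng.standard_normal((1024, 8))               # frozen latent basis, K=1024
W = np.eye(8) + 0.1 * rng.standard_normal((8, 8))    # the one learnable layer
z = rng.standard_normal((4, 8))                      # a batch of encoder outputs
z_q, idx = simvq_quantize(z, basis, W)               # z_q: (4, 8), idx: (4,)
```

Because only `W` receives gradients, every code vector moves coherently with the basis, which is what keeps all entries of even very large codebooks in play.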
## Quantitative Comparison

**Table 1.** Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID ↓ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | Checkpoint |
|:------:|:-------------:|:--------------------:|:------:|:-------:|:------:|:------:|:----------:|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_1k) |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_8k) |
| SimVQ (ours) | 65,536 | 100.0% | **2.24** | **0.12** | **24.15** | **78.4** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_65k) |
| SimVQ (ours) | 262,144 | 100.0% | **1.99** | **0.11** | **24.68** | **80.3** | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_log/simvq_262k) |
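The "Codebook Utilization" column is the fraction of codebook entries actually selected when tokenizing the validation set; a collapsed model routes most inputs through a handful of codes, driving the number far below 100%. A minimal sketch of the metric (toy indices and the helper name are made up for illustration):

```python
import numpy as np

def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries hit at least once.

    Representation collapse shows up as most inputs mapping to a few
    codes, which pushes this value far below 100%.
    """
    return np.unique(indices).size / codebook_size

toy_idx = np.array([3, 3, 7, 1, 3, 7])    # toy token indices
util = codebook_utilization(toy_idx, 8)   # 3 of 8 codes used -> 0.375
```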

**Table 2.** Reconstruction performance of different tokenizers on the LibriTTS test-clean / test-other sets (values reported as clean/other).

| Method | Bandwidth | Codebook Utilization | UTMOS ↑ | PESQ ↑ | STOI ↑ | V/UV F1 ↑ | Checkpoint |
|:------:|:---------:|:--------------------:|:-------:|:------:|:------:|:---------:|:----------:|
| Encodec | 3.0 kbps | -/- | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0 kbps | -/- | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0 kbps | -/- | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9 kbps | 100%/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05 kbps | 27%/- | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9 kbps | 100.0%/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_4k) |
| SimVQ (ours) | 0.975 kbps | 99.4%/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_8k) |
| SimVQ (ours) | 1.2 kbps | 99.4%/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_65k) |
| SimVQ (ours) | 1.35 kbps | 95.6%/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | [huggingface](https://huggingface.co/zyx123/SimVQ/tree/main/vq_audio_log/simvq_262k) |
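The bandwidth column follows from bandwidth = token rate × log₂(codebook size). Assuming the checkpoint names `simvq_4k` … `simvq_262k` denote codebook sizes 4,096 … 262,144 (an inference from the names, not stated in the card), all four SimVQ rows are consistent with a rate of 75 tokens per second:

```python
import math

# Inferred token rate: the only rate that fits all four SimVQ rows below.
token_rate_hz = 75

# (assumed codebook size, bandwidth in kbps from Table 2)
rows = [(4096, 0.9), (8192, 0.975), (65536, 1.2), (262144, 1.35)]
for codebook_size, kbps in rows:
    bits_per_token = math.log2(codebook_size)   # e.g. 12 bits for 4,096 codes
    assert token_rate_hz * bits_per_token / 1000 == kbps
```

This is why doubling the codebook size adds only one bit per token: bandwidth grows logarithmically while reconstruction quality improves across the board.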

## Reconstruction Visualization

**Figure 2.** Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (`imagenet_simvq_128_Base` version). (a) shows the original images and (b) the reconstructions.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_image.png" alt="Image Reconstruction Visualization">
</p>

**Figure 3.** Visualization of the SimVQ tokenizer trained on LibriTTS (`libritts_24khz` version). (a) shows the original audio and (b) the reconstructions.
<p align="center">
<img src="https://github.com/youngsheen/SimVQ/raw/main/assets/case_audio.png" alt="Audio Reconstruction Visualization">
</p>

## Acknowledgement

The codebase of SimVQ is adapted from [Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) and [WavTokenizer](https://github.com/jishengpeng/WavTokenizer). Thanks for their wonderful work.

## Citation

If you find our work useful, please consider citing our paper:

```bibtex
@misc{zhu2024addressing,
      title={Addressing Representation Collapse in Vector Quantized Models with One Linear Layer},
      author={Yongxin Zhu and Linli Xu},
      year={2024},
      eprint={2411.02038},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```