Add pipeline tag and improve model card metadata

Hi! I'm Niels from the Hugging Face community team. I've updated the model card to include the `pipeline_tag: image-to-image` metadata, which helps users find the model through the Hub's task filters. I've also removed the `references` field from the metadata, grouped the paper, project page, and GitHub links under a Resources section, and added installation, generation, and citation sections.
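
For anyone who wants to check the new metadata locally, here is a minimal sketch of reading the front matter from the updated card. It assumes only the simple `key: value` and `- item` YAML subset used in this README; `read_front_matter` is a hypothetical helper written for this illustration, not a Hub or `huggingface_hub` API, and it is not how the Hub itself parses model cards.

```python
def read_front_matter(text: str) -> dict:
    """Parse a leading '---' block made of `key: value` lines and `- item` lists.

    Illustrative only: handles the flat YAML subset used in this model card,
    not YAML in general.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file

    meta = {}
    current_list_key = None  # key whose list items are currently being read
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # closing delimiter of the front matter
        if not stripped:
            continue
        if stripped.startswith("- "):
            # List item: attach it to the most recent key that opened a list.
            if current_list_key is not None:
                meta[current_list_key].append(stripped[2:].strip())
            continue
        key, sep, value = stripped.partition(":")
        if not sep:
            continue  # not a key/value line; skip it
        key, value = key.strip(), value.strip()
        if value:
            meta[key] = value
            current_list_key = None
        else:
            meta[key] = []  # a key with no inline value starts a list
            current_list_key = key
    return meta


# Front matter as it appears in the updated README.md of this PR.
README = """---
language: en
license: mit
pipeline_tag: image-to-image
tags:
- diffusion
- autoencoder
- feature-space
- svg
---
# SVG: Latent Diffusion Model without Variational Autoencoder
"""

meta = read_front_matter(README)
print(meta["pipeline_tag"])
print(meta["tags"])
```

The `pipeline_tag` value is the field the Hub's task filter relies on, so a card without it would not be listed under that task.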

Files changed (1)

README.md +40 -15

@@ -1,39 +1,64 @@
 ---
 language: en
 license: mit
 tags:
 - diffusion
 - autoencoder
 - feature-space
 - svg
-references:
-- https://arxiv.org/abs/2510.15301
 ---
 # SVG: Latent Diffusion Model without Variational Autoencoder
-## Model Description
-SVG is a latent diffusion model framework that replaces the traditional VAE latent space with semantically structured features from self-supervised vision models (e.g., DINOv3). This design improves generative capability and downstream transferability while maintaining efficiency comparable to standard VAE-based latent diffusion models.
-Key features:
-- Replaces low-dimensional VAE latent space with high-dimensional semantic feature space.
-- Includes a lightweight residual encoder for refining fine-grained details.
-- Enables strong generation and perception performance.
-## How to Use
-For code, and instructions, see the GitHub repository:
-[https://github.com/shiml20/SVG](https://github.com/shiml20/SVG)
-Official project page:
-[https://howlin-wang.github.io/svg/](https://howlin-wang.github.io/svg/)
-Arxiv paper:
-[https://arxiv.org/abs/2510.15301](https://arxiv.org/abs/2510.15301)

 ---
 language: en
 license: mit
+pipeline_tag: image-to-image
 tags:
 - diffusion
 - autoencoder
 - feature-space
 - svg
 ---
 # SVG: Latent Diffusion Model without Variational Autoencoder
+SVG is a novel latent diffusion model framework that replaces the traditional Variational Autoencoder (VAE) latent space with semantically structured features from self-supervised vision models (e.g., DINOv3). This design improves generative capability and downstream transferability while maintaining efficiency comparable to standard VAE-based models.
+## Resources
+- **Paper:** [Latent Diffusion Model without Variational Autoencoder](https://huggingface.co/papers/2510.15301)
+- **Project Page:** [https://howlin-wang.github.io/svg/](https://howlin-wang.github.io/svg/)
+- **GitHub Repository:** [https://github.com/shiml20/SVG](https://github.com/shiml20/SVG)
+## Model Description
+SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning.
+**Key features:**
+- Replaces low-dimensional VAE latent space with high-dimensional semantic feature space.
+- Includes a lightweight residual encoder for refining fine-grained details.
+- Enables accelerated diffusion training and supports few-step sampling.
+- Improves generative quality while preserving semantic and discriminative capabilities.
+## Usage
+For full instructions on training and evaluation, please refer to the official [GitHub repository](https://github.com/shiml20/SVG).
+### Installation
+```bash
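+# Clone the repository first so that requirements.txt is available
+git clone https://github.com/shiml20/SVG
+cd SVG
+conda create -n svg python=3.10 -y
+conda activate svg
+pip install -r requirements.txt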
+```
+### Generation
+To generate images using a trained model:
+```bash
+# Update ckpt_path in sample_svg.py with your checkpoint
+python sample_svg.py
+```
+## Citation
+If you find this work useful for your research, please cite:
+```bibtex
+@misc{shi2025latentdiffusionmodelvariational,
+      title={Latent Diffusion Model without Variational Autoencoder},
+      author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
+      year={2025},
+      eprint={2510.15301},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2510.15301},
+}
+```