---
language: en
license: mit
pipeline_tag: image-to-image
tags:
- diffusion
- autoencoder
- feature-space
- svg
---

# SVG: Latent Diffusion Model without Variational Autoencoder

SVG is a latent diffusion framework that replaces the traditional Variational Autoencoder (VAE) latent space with semantically structured features from self-supervised vision models (e.g., DINOv3). This design improves generative capability and downstream transferability while keeping efficiency comparable to standard VAE-based models.

## Resources

- **Paper:** [Latent Diffusion Model without Variational Autoencoder](https://huggingface.co/papers/2510.15301)
- **Project Page:** [https://howlin-wang.github.io/svg/](https://howlin-wang.github.io/svg/)
- **GitHub Repository:** [https://github.com/shiml20/SVG](https://github.com/shiml20/SVG)

## Model Description

SVG builds a feature space with clear semantic discriminability from frozen DINO features, while a lightweight residual branch captures the fine-grained details needed for high-fidelity reconstruction. Diffusion models are then trained directly on this semantically structured latent space, which makes learning more efficient.

**Key features:**
- Replaces the low-dimensional VAE latent space with a high-dimensional semantic feature space.
- Includes a lightweight residual encoder that refines fine-grained details.
- Enables accelerated diffusion training and supports few-step sampling.
- Improves generative quality while preserving semantic and discriminative capabilities.

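To make the contrast between a low-dimensional VAE latent and a high-dimensional semantic latent concrete, the following is a minimal shape-arithmetic sketch. Every number in it is an illustrative assumption rather than a value taken from this model card: the VAE is assumed to downsample by 8 into 4 channels, the frozen encoder is assumed to use 16x16 patches with 384-dimensional features, and the residual branch is assumed to add 8 channels that are concatenated channel-wise. The helper names `grid_tokens` and `latent_shape` are hypothetical.

```python
def grid_tokens(image_size: int, downsample: int) -> int:
    """Number of spatial positions after downsampling each side by `downsample`."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by downsample")
    side = image_size // downsample
    return side * side


def latent_shape(image_size: int, downsample: int, channels: int) -> tuple:
    """Return (spatial positions, channels per position) for a latent grid."""
    return grid_tokens(image_size, downsample), channels


# All values below are assumptions for illustration only.
IMAGE_SIZE = 256
SEMANTIC_DIM = 384  # assumed width of the frozen semantic encoder
RESIDUAL_DIM = 8    # assumed width of the lightweight residual branch

# Assumed VAE latent: 8x downsampling, 4 channels.
vae = latent_shape(IMAGE_SIZE, downsample=8, channels=4)

# Assumed SVG-style latent: 16x16 patches, semantic and residual
# features concatenated along the channel axis.
svg = latent_shape(IMAGE_SIZE, downsample=16, channels=SEMANTIC_DIM + RESIDUAL_DIM)

print("VAE latent (positions, channels):", vae)  # (1024, 4)
print("SVG latent (positions, channels):", svg)  # (256, 392)
```

Under these assumptions the semantic latent has fewer spatial positions but far more channels per position, which is the sense in which the feature space is high-dimensional. The actual dimensions depend on the encoder and residual configuration in the official repository.
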
## Usage

For full instructions on training and evaluation, please refer to the official [GitHub repository](https://github.com/shiml20/SVG).

### Installation
```bash
# Clone the repository so that requirements.txt is available locally
git clone https://github.com/shiml20/SVG.git
cd SVG
conda create -n svg python=3.10 -y
conda activate svg
pip install -r requirements.txt
```

### Generation
To generate images using a trained model:
```bash
# Update ckpt_path in sample_svg.py with your checkpoint
python sample_svg.py
```

## Citation

If you find this work useful for your research, please cite:

```bibtex
@misc{shi2025latentdiffusionmodelvariational,
      title={Latent Diffusion Model without Variational Autoencoder},
      author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2510.15301},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.15301},
}
```