---
license: apache-2.0
tags:
- text-to-image
- diffusion
- latent-diffusion
- visual-foundation-model
- representation-learning
- dino
- svg
pipeline_tag: text-to-image
library_name: pytorch
language:
- en
---
<h1 align="center">SVG-T2I<br><sub><sup>Scaling up Text-to-Image Latent Diffusion Models without Variational Autoencoders</sup></sub></h1>
<div align="center">
<a href="https://arxiv.org/abs/2512.11749" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-SVG--T2I-red?logo=arxiv" height="25" />
</a>
<a href="https://github.com/KlingTeam/SVG-T2I" target="_blank">
<img alt="Github" src="https://img.shields.io/badge/โ๏ธ_Github-Code-white.svg" height="25" />
</a>
<a href="https://huggingface.co/KlingTeam/SVG-T2I" target="_blank">
<img alt="HF Model: SVG-T2I" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Model-SVG--T2I-ffc107?color=ffc107&logoColor=white" height="25" />
</a>
<a href="https://cloud.tsinghua.edu.cn/f/7f6ee030f273427cba4b/" target="_blank">
<img alt="PDF" src="https://img.shields.io/badge/๐_PDF-Paper-red.svg" height="25" />
</a>
<a href="LICENSE" target="_blank">
<img alt="License" src="https://img.shields.io/badge/License-MIT-blue.svg" height="25" />
</a>
<br>
<a href="https://arxiv.org/abs/2510.15301" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-SVG-red?logo=arxiv" height="25" />
</a>
<a href="https://github.com/shiml20/SVG" target="_blank">
<img alt="Github" src="https://img.shields.io/badge/โ๏ธ_Github-Code-white.svg" height="25" />
</a>
<a href="https://huggingface.co/howlin/SVG" target="_blank">
<img alt="HF Model: SVG" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Model-SVG-ffc107?color=ffc107&logoColor=white" height="25" />
</a>
<br>
_**[Minglei Shi](https://github.com/shiml20)<sup>1\*</sup>, [Haolin Wang](https://howlin-wang.github.io)<sup>1\*</sup>, [Borui Zhang](https://boruizhang.site/)<sup>1</sup>, [Wenzhao Zheng](https://wzzheng.net)<sup>1</sup>, [Bohan Zeng](https://scholar.google.com/citations?user=MHo_d3YAAAAJ&hl=en)<sup>2</sup>**_
_**[Ziyang Yuan](https://scholar.google.ru/citations?user=fWxWEzsAAAAJ&hl=en)<sup>2†</sup>, [Xiaoshi Wu](https://scholar.google.com/citations?user=cnOAMbUAAAAJ&hl=en)<sup>2</sup>, [Yuanxing Zhang](https://scholar.google.com/citations?user=COdftTMAAAAJ&hl=en)<sup>2</sup>, [Huan Yang](https://hyang0511.github.io/)<sup>2</sup>**_
_**[Xintao Wang](https://xinntao.github.io/)<sup>2</sup>, [Pengfei Wan](https://magicwpf.github.io/)<sup>2</sup>, [Kun Gai](https://scholar.google.com/citations?user=PXO4ygEAAAAJ&hl=zh-CN)<sup>2</sup>, [Jie Zhou](https://scholar.google.com/citations?user=6a79aPwAAAAJ&hl=en)<sup>1</sup>, [Jiwen Lu](https://ivg.au.tsinghua.edu.cn/Jiwen_Lu/)<sup>1†</sup>**_
<sup>1</sup>Tsinghua University <sup>2</sup>KlingTeam, Kuaishou Technology
<br>
<small>\* Equal contribution † Corresponding author</small>
</div>
---
> **Important Note:** This repository implements SVG-T2I, a text-to-image diffusion framework that performs visual generation directly in Visual Foundation Model (VFM) representation space, rather than in pixel space or a VAE latent space.
---
## 📰 News
- **[2025-12-13]** 📢✨ We are excited to announce the official release of **SVG-T2I**, including pre-trained checkpoints as well as complete training and inference code.
## 🖼️ Gallery
<div align="center">
<img src="assets/viz_t2i_1.png" width="80%" alt="Teaser Image"/>
<br>
<em>High-fidelity samples generated by SVG-T2I.</em>
</div>
---
## 🧠 Overview
Visual generation grounded in Visual Foundation Model (VFM) representations offers a promising unified approach to visual understanding and generation. However, large-scale text-to-image diffusion models operating directly in VFM feature space remain underexplored.
To address this, SVG-T2I extends the SVG framework to enable high-quality text-to-image synthesis directly in the VFM domain using a standard diffusion pipeline. The model achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench, demonstrating the strong generative capability of VFM representations.
We fully open-source the autoencoder and generation models, along with their training, inference, and evaluation pipelines, to support future research in representation-driven visual generation.
### Why SVG-T2I?
- **✨ Direct Use of VFM Representations:**
SVG-T2I performs generation **directly in the feature space of Visual Foundation Models (e.g., DINOv3)**, rather than merely aligning a separate latent space to it. This preserves the **rich semantic structure** learned through large-scale self-supervised visual representation learning.
- **🌐 Unified Representation for Understanding and Generation:**
By **sharing the same VFM representation space** across **visual understanding, perception, and generation**, SVG-T2I unlocks strong potential for **downstream tasks** such as **image editing, retrieval, reasoning, and multimodal alignment**.
- **🧩 Fully Open-Sourced Pipeline:**
We **fully open-source** the **entire training and inference pipeline**, including the **SVG autoencoder, diffusion model, evaluation code, and pretrained checkpoints**, to facilitate **reproducibility and future research** in representation-driven visual generation.
---
## 🔑 Key Components
| Component | Description |
| :--- | :--- |
| **1. SVG Autoencoder** | A novel latent codec consisting of a **frozen VFM encoder (DINOv3/DINOv2/SigLIP2/MAE)**, an optional residual reconstruction branch, and a trainable convolutional decoder (see the structural sketch below). <br>✅ No Quantization <br>✅ No KL-loss <br>✅ No Gaussian Assumption |
| **2. Latent Diffusion** | A **Single-stream Diffusion Transformer** trained directly on representation space. Supports progressive training (256โ512โ1024) and is optimized on large-scale text-image pairs. |
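For intuition, here is a minimal structural sketch of the autoencoder idea in the table above. It is **not** the released implementation: the decoder layout, the latent width `latent_dim`, and the assumed `(B, latent_dim, H/16, W/16)` encoder output shape are illustrative placeholders only.

```python
# Minimal structural sketch of an SVG-style autoencoder (illustrative, NOT the released code).
# A frozen VFM encoder produces the latent; a small trainable convolutional decoder maps it
# back to pixels: no quantization, no KL loss, no Gaussian assumption.
import torch
import torch.nn as nn

class SVGStyleAutoencoder(nn.Module):
    def __init__(self, vfm_encoder: nn.Module, latent_dim: int = 384):
        super().__init__()
        self.encoder = vfm_encoder.eval()  # frozen VFM (e.g., DINOv3)
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # Illustrative decoder: upsample 1/16-resolution patch features back to full resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, kernel_size=4, stride=4),
            nn.GELU(),
            nn.ConvTranspose2d(256, 64, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(64, 3, kernel_size=2, stride=2),  # 4 * 2 * 2 = 16x upsampling
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(x)  # assumed latent shape: (B, latent_dim, H/16, W/16)
        return self.decoder(z)   # reconstruction; trained with e.g. pixel/perceptual losses
```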
---
## 🔮 Model Zoo
### **SVG Autoencoder**
| Model | Notes | Resol. | Encoder (Params) | Download URL |
| ----- | ----- | ------ | ---------------- | ------------ |
| Autoencoder-P | Stage1 (Low-resol) | 256 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage2 (Middle-resol) | 512 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage3 (High-resol) (🌟 **Default**) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage1 (Low-resol) | 256 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage2 (Middle-resol) | 512 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage3 (High-resol) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
### **SVG-T2I DiT**
| Notes | Resol. | Parameter | Text Encoder | Representation Encoder | Download URL |
| ----- | ---------- | --------- | ------------ | -------------------- | ------------- |
|Stage1 (Low-resol)| 256 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
|Stage2 (Middle-resol)| 512 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
|Stage3 (High-resol)| 1024 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
|Stage4 (SFT) (🌟 **Default**)| 1024 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
> **Model naming:** `TxxxM` in a checkpoint name indicates the total number of images the model has cumulatively seen during training.
---
## 🛠️ Installation
### 1\. Environment Setup
```bash
conda create -n svg_t2i python=3.10 -y
conda activate svg_t2i
pip install -r requirements.txt
```
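As an optional sanity check (illustrative, not part of the official setup), you can verify that PyTorch and CUDA are visible inside the new environment:

```python
# Optional sanity check for the freshly created environment (illustrative).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```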
### 2\. Download DINOv3
SVG-T2I relies on DINOv3 as the frozen encoder.
```bash
# Download DINOv3 pretrained weights and update your config paths
git clone https://github.com/facebookresearch/dinov3.git
```
### 3\. Download Pre-trained Models
You can download **all stage-wise pretrained models and checkpoints** from our official **Hugging Face repository**, including the **SVG autoencoder** and **SVG-T2I diffusion models** used for training and evaluation:
```bash
https://huggingface.co/KlingTeam/SVG-T2I
```
These pretrained weights are released to support **academic research, benchmarking, and a wide range of downstream applications**, and can be freely used for **experimentation, analysis, and further development**.
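If you prefer a scripted download, the sketch below uses `snapshot_download` from the `huggingface_hub` package (our suggestion; any Hub download method works). The `pre-trained` target directory matches the layout the inference scripts expect (see the Inference section), but check the repository's file listing for the exact structure.

```python
# Sketch: fetch all SVG-T2I checkpoints from the Hub (assumes `pip install huggingface_hub`).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="KlingTeam/SVG-T2I",  # official model repository
    local_dir="pre-trained",      # directory name expected by the inference scripts below
)
print(f"Checkpoints downloaded to: {local_dir}")
```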
---
## 📦 Data Preparation
### 1\. Autoencoder Training Data
Any large-scale image dataset works (e.g., ImageNet-1K). Update `autoencoder/pure/configs/*.yaml` accordingly.
**For ImageNet-1K:**
```yaml
data:
  target: "utils.data_module_allinone.DataModuleFromConfig"
  params:
    batch_size: 64
    wrap: true
    num_workers: 16
    train:
      target: ldm.data.imagenet.ImageNetTrain
      params:
        data_root: <Your ImageNet Path>
        size: 256
    validation:
      target: ldm.data.imagenet.ImageNetValidation
      params:
        data_root: <Your ImageNet Path>
        size: 256
```
**For a custom dataset:**
We also support a custom **JSONL** format; an example file is provided at `configs/example.jsonl` (the `prompt` field is only used for the generation task).
**Example JSONL Format:**
```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```
```yaml
data:
  target: utils.data_module_allinone.DataModuleFromConfigJson
  params:
    batch_size: 3  # batch size per GPU
    wrap: true
    train_resol: 256
    json_path: configs/example.jsonl
```
### 2\. Text-to-Image Training Data
**Example JSONL Format:**
```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```
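As a convenience, here is a small sketch that writes metadata in the JSONL format shown above; the output path and the sample entries are hypothetical placeholders.

```python
# Sketch: write (image path, caption) pairs in the JSONL format shown above.
# The output path and sample entries are hypothetical placeholders.
import json

samples = [
    ("test/man.jpg", "A man"),
    ("test/dog.jpg", "A dog running on a beach"),  # hypothetical entry
]

with open("configs/my_dataset.jsonl", "w", encoding="utf-8") as f:
    for path, prompt in samples:
        f.write(json.dumps({"path": path, "prompt": prompt}) + "\n")
```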
---
## 🚀 Training
SVG-T2I training is divided into two distinct stages.
### Stage 1: Train SVG Autoencoder
Navigate to the `autoencoder` directory and launch training:
```bash
cd autoencoder
bash run_train.sh <GPU NUM> configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
# example
bash run_train.sh 1 configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
```
* **Output:** Results will be saved in `autoencoder/logs`.
* **Note:** You can modify training hyperparameters and output paths directly inside `run_train.sh` or the configuration YAML file.
### Stage 2: Train SVG-DiT (Diffusion)
Navigate to `svg_t2i`. We provide scripts for both single-node and multi-node training.
**Single Node Example:**
```bash
cd svg_t2i
bash scripts/run_train_1gpus_forTest.sh <RANK ID>
# example
bash scripts/run_train_1gpus_forTest.sh 0
```
**Multi-Node Example:**
```bash
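# (Assumed) run one command on each node; the argument is that node's rank (0-3).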
bash scripts/run_train_mnodes.sh 0
bash scripts/run_train_mnodes.sh 1
bash scripts/run_train_mnodes.sh 2
bash scripts/run_train_mnodes.sh 3
```
* **Output:** Results will be saved in `svg_t2i/results`.
* **Note:** You can adjust learning rates, batch sizes, number of GPUs, and save directories directly in the training scripts.
---
## 🎨 Inference & Image Generation
Generate images using a pretrained **SVG-DiT** model.
> After downloading the pretrained checkpoints, you will obtain a `pre-trained/` directory.
> Please place this directory under the `svg_t2i/` folder before running inference.
```bash
cd svg_t2i
bash scripts/sample.sh
```
* **Output:** Results will be saved in `svg_t2i/samples`.
* **Note:** You can modify sampling parameters, prompt settings, and output directories directly inside `sample.sh`.
## 📚 Citation
If you find this work helpful, please cite our papers:
```bibtex
@misc{svgt2i2025,
title={SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder},
author={Minglei Shi and Haolin Wang and Borui Zhang and Wenzhao Zheng and Bohan Zeng and Ziyang Yuan and Xiaoshi Wu and Yuanxing Zhang and Huan Yang and Xintao Wang and Pengfei Wan and Kun Gai and Jie Zhou and Jiwen Lu},
year={2025},
eprint={2512.11749},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.11749},
}
@misc{svg2025,
title={Latent Diffusion Model without Variational Autoencoder},
author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
year={2025},
eprint={2510.15301},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.15301},
}
```
---
## 💡 Acknowledgments
SVG-T2I stands on the shoulders of the open-source community:
* **[SVG](https://github.com/shiml20/SVG)**: Base pipeline and core idea.
* **[Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-T2X)**: DiT architecture and base training code.
* **[DINOv3](https://github.com/facebookresearch/dinov3)**: State-of-the-art semantic representation encoder.
For any questions, please open a [GitHub Issue](https://github.com/KlingTeam/SVG-T2I/issues).