---
license: apache-2.0
tags:
  - text-to-image
  - diffusion
  - latent-diffusion
  - visual-foundation-model
  - representation-learning
  - dino
  - svg
pipeline_tag: text-to-image
library_name: pytorch
language:
  - en
---

<h1 align="center">SVG-T2I<br><sub><sup>Scaling up Text-to-Image Latent Diffusion Models without Variational Autoencoders</sup></sub></h1>


<div align="center">


<a href="https://arxiv.org/abs/2512.11749" target="_blank">
    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-SVG--T2I-red?logo=arxiv" height="25" />
</a>
<a href="https://github.com/KlingTeam/SVG-T2I" target="_blank">
    <img alt="Github" src="https://img.shields.io/badge/⚒️_Github-Code-white.svg" height="25" />
</a>
<a href="https://huggingface.co/KlingTeam/SVG-T2I" target="_blank">
    <img alt="HF Model: SVG-T2I" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Model-SVG--T2I-ffc107?color=ffc107&logoColor=white" height="25" />
</a>
<a href="https://cloud.tsinghua.edu.cn/f/7f6ee030f273427cba4b/" target="_blank">
    <img alt="PDF" src="https://img.shields.io/badge/📄_PDF-Paper-red.svg" height="25" />
</a>
<a href="LICENSE" target="_blank">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-blue.svg" height="25" />
</a>
<br>
<a href="https://arxiv.org/abs/2510.15301" target="_blank">
    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-SVG-red?logo=arxiv" height="25" />
</a>
<a href="https://github.com/shiml20/SVG" target="_blank">
    <img alt="Github" src="https://img.shields.io/badge/⚒️_Github-Code-white.svg" height="25" />
</a>
<a href="https://huggingface.co/howlin/SVG" target="_blank">
    <img alt="HF Model: SVG" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Model-SVG-ffc107?color=ffc107&logoColor=white" height="25" />
</a>
<br>

_**[Minglei Shi](https://github.com/shiml20)<sup>1\*</sup>, [Haolin Wang](https://howlin-wang.github.io)<sup>1\*</sup>, [Borui Zhang](https://boruizhang.site/)<sup>1</sup>, [Wenzhao Zheng](https://wzzheng.net)<sup>1</sup>, [Bohan Zeng](https://scholar.google.com/citations?user=MHo_d3YAAAAJ&hl=en)<sup>2</sup>**_
_**[Ziyang Yuan](https://scholar.google.ru/citations?user=fWxWEzsAAAAJ&hl=en)<sup>2†</sup>, [Xiaoshi Wu](https://scholar.google.com/citations?user=cnOAMbUAAAAJ&hl=en)<sup>2</sup>, [Yuanxing Zhang](https://scholar.google.com/citations?user=COdftTMAAAAJ&hl=en)<sup>2</sup>, [Huan Yang](https://hyang0511.github.io/)<sup>2</sup>**_
_**[Xintao Wang](https://xinntao.github.io/)<sup>2</sup>, [Pengfei Wan](https://magicwpf.github.io/)<sup>2</sup>, [Kun Gai](https://scholar.google.com/citations?user=PXO4ygEAAAAJ&hl=zh-CN)<sup>2</sup>, [Jie Zhou](https://scholar.google.com/citations?user=6a79aPwAAAAJ&hl=en)<sup>1</sup>, [Jiwen Lu](https://ivg.au.tsinghua.edu.cn/Jiwen_Lu/)<sup>1†</sup>**_

<sup>1</sup>Tsinghua University &nbsp;&nbsp; <sup>2</sup>KlingTeam, Kuaishou Technology
<br>
<small>\* Equal contribution &nbsp;&nbsp; † Corresponding author</small>

</div>

---

> **Important Note:** This repository implements SVG-T2I, a text-to-image diffusion framework that performs visual generation directly in Visual Foundation Model (VFM) representation space, rather than in pixel space or a VAE latent space.
>

---


## 📰 News
- **[2025-12-13]** 📢✨ We are excited to announce the official release of **SVG-T2I**, including pre-trained checkpoints as well as complete training and inference code.


## 🖼️ Gallery

<div align="center">
  <img src="assets/viz_t2i_1.png" width="80%" alt="Teaser Image"/>
  <br>
  <em>High-fidelity samples generated by SVG-T2I.</em>
</div>

---

## 🧠 Overview

Visual generation grounded in Visual Foundation Model (VFM) representations offers a promising unified approach to visual understanding and generation. However, large-scale text-to-image diffusion models operating directly in VFM feature space remain underexplored.

To address this, SVG-T2I extends the SVG framework to enable high-quality text-to-image synthesis directly in the VFM domain using a standard diffusion pipeline. The model achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench, demonstrating the strong generative capability of VFM representations.

We fully open-source the autoencoder and generation models, along with their training, inference, and evaluation pipelines, to support future research in representation-driven visual generation.

### Why SVG-T2I?

- **✨ Direct Use of VFM Representations:**  
  SVG-T2I performs generation **directly in the feature space of Visual Foundation Models (e.g., DINOv3)**, rather than merely aligning to it. This preserves the **rich semantic structure** learned through large-scale self-supervised visual representation learning.

- **🔗 Unified Representation for Understanding and Generation:**  
  By **sharing the same VFM representation space** across **visual understanding, perception, and generation**, SVG-T2I unlocks strong potential for **downstream tasks** such as **image editing, retrieval, reasoning, and multimodal alignment**.

- **🧩 Fully Open-Sourced Pipeline:**  
  We **fully open-source** the **entire training and inference pipeline**, including the **SVG autoencoder, diffusion model, evaluation code, and pretrained checkpoints**, to facilitate **reproducibility and future research** in representation-driven visual generation.


---

## 🌟 Key Components

| Component | Description |
| :--- | :--- |
| **1. SVG Autoencoder** | A novel latent codec consisting of a **frozen VFM (DINOv3/DINOv2/SIGLIP2/MAE)** encoder, an optional residual reconstruction branch, and a trainable convolutional decoder. <br>❌ No quantization <br>❌ No KL loss <br>❌ No Gaussian assumption |
| **2. Latent Diffusion** | A **single-stream Diffusion Transformer** trained directly in the representation space. Supports progressive training (256→512→1024) and is optimized on large-scale text-image pairs. |
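
As a concrete illustration of the autoencoder layout, here is a minimal PyTorch sketch: a frozen /16 patch-style encoder (a tiny stand-in convolution here, **not** the actual DINOv3) feeds a trainable convolutional decoder, and the latent is deterministic, with no sampling and no KL term. All module names and sizes are illustrative assumptions, not the project's code.

```python
import torch
import torch.nn as nn

class SVGStyleAutoencoder(nn.Module):
    """Sketch of the SVG autoencoder idea: frozen representation
    encoder + trainable decoder; no quantization, no KL loss."""

    def __init__(self, latent_dim=32):
        super().__init__()
        # Stand-in for the frozen VFM encoder (e.g. DINOv3 at /16):
        # a single 16x16-stride conv gives the same 16x downsampling.
        self.encoder = nn.Conv2d(3, latent_dim, kernel_size=16, stride=16)
        for p in self.encoder.parameters():
            p.requires_grad = False  # the VFM stays frozen
        # Trainable decoder mapping features back to pixels.
        self.decoder = nn.ConvTranspose2d(latent_dim, 3, kernel_size=16, stride=16)

    def forward(self, x):
        with torch.no_grad():
            z = self.encoder(x)  # deterministic latent: no Gaussian sampling
        return self.decoder(z), z

ae = SVGStyleAutoencoder()
x = torch.randn(1, 3, 256, 256)
recon, z = ae(x)
print(z.shape)      # torch.Size([1, 32, 16, 16])
print(recon.shape)  # torch.Size([1, 3, 256, 256])
```

Only the decoder (and, in the diffusion stage, the DiT) receives gradients; the representation space itself is fixed by the VFM.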

---


## 🎮 Model Zoo


### **SVG Autoencoder**

| Model | Notes | Resol. | Encoder (Params) | Download URL |
| ----- | ----- | ------ | ---------------- | ------------ |
| Autoencoder-P | Stage1 (Low-resol) | 256  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage2 (Middle-resol) | 512  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage3 (High-resol) (😄 **Default**) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage1 (Low-resol) | 256  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage2 (Middle-resol) | 512  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage3 (High-resol) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |


### **SVG-T2I DiT**

| Notes | Resol. | Parameter | Text Encoder | Representation Encoder | Download URL  |
| ----- | ---------- | --------- | ------------ | -------------------- | ------------- |
|Stage1 (Low-resol)| 256        | 2.6B      | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b)  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
|Stage2 (Middle-resol)| 512        | 2.6B      | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b)  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
|Stage3 (High-resol)| 1024       | 2.6B      | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b)  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
|Stage4 (SFT) (😄 **Default**)| 1024       | 2.6B      | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b)  | [DINOv3s16p](https://github.com/facebookresearch/dinov3?tab=readme-ov-file) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |

> Model naming: `TxxxM` indicates the total number of images the model has cumulatively seen during training.

---


## 🛠️ Installation

### 1\. Environment Setup

```bash
conda create -n svg_t2i python=3.10 -y
conda activate svg_t2i
pip install -r requirements.txt
```

### 2\. Download DINOv3

SVG-T2I relies on DINOv3 as the frozen encoder.

```bash
# Download DINOv3 pretrained weights and update your config paths
git clone https://github.com/facebookresearch/dinov3.git
```

### 3\. Download Pre-trained Models

You can download **all stage-wise pretrained models and checkpoints** from our official **Hugging Face repository**, including the **SVG autoencoder** and **SVG-T2I diffusion models** used for training and evaluation:

```bash
https://huggingface.co/KlingTeam/SVG-T2I
```

These pretrained weights are released to support **academic research, benchmarking, and a wide range of downstream applications**, and can be freely used for **experimentation, analysis, and further development**.

-----

## 📦 Data Preparation

### 1\. Autoencoder Training Data

Any large-scale image dataset works (e.g., ImageNet-1K). Update `autoencoder/pure/configs/*.yaml`:

For ImageNet-1K

```yaml
data:
  target: "utils.data_module_allinone.DataModuleFromConfig"
  params:
    batch_size: 64
    wrap: true
    num_workers: 16
    train:
      target: ldm.data.imagenet.ImageNetTrain
      params:
        data_root: Your ImageNet Path
        size: 256
    validation:
      target: ldm.data.imagenet.ImageNetValidation
      params:
        data_root: Your ImageNet Path
        size: 256
```
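
These configs follow the common `target:`/`params:` instantiation idiom used in latent-diffusion-style codebases: the dotted path names a class to import, and `params` become its keyword arguments. A minimal sketch of that mechanism is below; the helper name is illustrative (not the project's actual function), and a stdlib class stands in for classes like `utils.data_module_allinone.DataModuleFromConfig`.

```python
import importlib

def instantiate_from_target(target, params=None):
    """Resolve a dotted `target` path to a class and construct it
    with `params` as keyword arguments (illustrative helper)."""
    module_name, cls_name = target.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**(params or {}))

# Stdlib stand-in for a config entry such as
#   target: fractions.Fraction
#   params: {numerator: 1, denominator: 2}
obj = instantiate_from_target("fractions.Fraction",
                              {"numerator": 1, "denominator": 2})
print(obj)  # 1/2
```

This is why updating `data_root` (or any other key under `params`) in the YAML is enough to repoint the dataset: the values are passed straight into the dataset class constructor.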

For a customized dataset

We support a custom **JSONL** format; see the example file at `configs/example.jsonl` (the `prompt` field is only used for the generation task).

**Example JSONL Format:**

```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```


```yaml
data:
  target: utils.data_module_allinone.DataModuleFromConfigJson
  params:
    batch_size: 3 # batch size per GPU
    wrap: true
    train_resol: 256
    json_path: configs/example.jsonl

```


### 2\. Text-to-Image Training Data

**Example JSONL Format:**

```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```
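
Each manifest line is a standalone JSON object. For reference, a minimal sketch of parsing this format into `(path, prompt)` records; the helper name is hypothetical and not part of the codebase:

```python
import json

def load_jsonl_manifest(lines):
    """Parse JSONL manifest lines into (path, prompt) tuples.
    `prompt` is optional for reconstruction-only training and
    defaults to an empty string. (Illustrative helper.)"""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and the literal "..." placeholder
        item = json.loads(line)
        records.append((item["path"], item.get("prompt", "")))
    return records

sample = ['{"path": "test/man.jpg", "prompt": "A man"}']
print(load_jsonl_manifest(sample))  # [('test/man.jpg', 'A man')]
```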

-----

## 🚀 Training

SVG-T2I training is divided into two distinct stages.

### Stage 1: Train SVG Autoencoder

Navigate to the `autoencoder` directory and launch training:

```bash
cd autoencoder
bash run_train.sh <GPU NUM> configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
# example 
bash run_train.sh 1 configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
```

* **Output:** Results will be saved in `autoencoder/logs`.
* **Note:** You can modify training hyperparameters and output paths directly inside `run_train.sh` or the configuration YAML file.

### Stage 2: Train SVG-DiT (Diffusion)

Navigate to `svg_t2i`. We provide scripts for both single-node and multi-node training.

**Single Node Example:**

```bash
cd svg_t2i
bash scripts/run_train_1gpus_forTest.sh <RANK ID>
# example
bash scripts/run_train_1gpus_forTest.sh 0
```

**Multi-Node Example:**

```bash
bash scripts/run_train_mnodes.sh 0 
bash scripts/run_train_mnodes.sh 1
bash scripts/run_train_mnodes.sh 2
bash scripts/run_train_mnodes.sh 3
```

* **Output:** Results will be saved in `svg_t2i/results`.
* **Note:** You can adjust learning rates, batch sizes, number of GPUs, and save directories directly in the training scripts.

---

## 🎨 Inference & Image Generation

Generate images using a pretrained **SVG-DiT** model.

> After downloading the pretrained checkpoints, you will obtain a `pre-trained/` directory.  
> Please place this directory under the `svg_t2i/` folder before running inference.

```bash
cd svg_t2i
bash scripts/sample.sh
```

* **Output:** Results will be saved in `svg_t2i/samples`.
* **Note:** You can modify sampling parameters, prompt settings, and output directories directly inside `sample.sh`.





## ๐Ÿ“ Citation

If you find this work helpful, please cite our papers:

```bibtex
@misc{svgt2i2025,
      title={SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder}, 
      author={Minglei Shi and Haolin Wang and Borui Zhang and Wenzhao Zheng and Bohan Zeng and Ziyang Yuan and Xiaoshi Wu and Yuanxing Zhang and Huan Yang and Xintao Wang and Pengfei Wan and Kun Gai and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2512.11749},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.11749}, 
}

@misc{svg2025,
      title={Latent Diffusion Model without Variational Autoencoder}, 
      author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2510.15301},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.15301}, 
}
```

-----

## 💡 Acknowledgments

SVG-T2I builds upon the giants of the open-source community:

  * **[SVG](https://github.com/shiml20/SVG)**: Base pipeline and core idea.
  * **[Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-T2X)**: DiT architecture and base training code.
  * **[DINOv3](https://github.com/facebookresearch/dinov3)**: State-of-the-art semantic representation encoder.

For any questions, please open a [GitHub Issue](https://github.com/KlingTeam/SVG-T2I/issues).