<div align="center">

<!-- <img src="assets/logo.png" width="400"/> -->

# SVG-T2I: Scaling up Text-to-Image Latent Diffusion Model <br> Without Variational Autoencoder

[![arXiv](https://img.shields.io/badge/arXiv-25xx.xxxxx-b31b1b.svg)](https://arxiv.org/abs/xxxx.xxxxx)
[![Code](https://img.shields.io/badge/GitHub-SVG--T2I-black)](https://github.com/KlingTeam/SVG-T2I)
[![Model Weights](https://img.shields.io/badge/Model-SVG--T2I-yellow)](https://huggingface.co/KlingTeam/SVG-T2I)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

[![arXiv](https://img.shields.io/badge/arXiv-SVG-b31b1b.svg)](https://arxiv.org/abs/2510.15301)
[![Code](https://img.shields.io/badge/GitHub-SVG-black)](https://github.com/shiml20/SVG)
[![Model Weights](https://img.shields.io/badge/Model-SVG-yellow)](https://huggingface.co/howlin/SVG)

_**[Minglei Shi](https://github.com/shiml20)<sup>1*</sup>, [Haolin Wang](https://howlin-wang.github.io)<sup>1*</sup>, [Borui Zhang](https://boruizhang.site/)<sup>1</sup>, [Wenzhao Zheng](https://wzzheng.net)<sup>1</sup>, [Bohan Zeng](https://scholar.google.com/citations?user=MHo_d3YAAAAJ&hl=en)<sup>2</sup>**_
_**[Ziyang Yuan](https://scholar.google.ru/citations?user=fWxWEzsAAAAJ&hl=en)<sup>2†</sup>, [Xiaoshi Wu](https://scholar.google.com/citations?user=cnOAMbUAAAAJ&hl=en)<sup>2</sup>, [Yuanxing Zhang](https://scholar.google.com/citations?user=COdftTMAAAAJ&hl=en)<sup>2</sup>, [Huan Yang](https://hyang0511.github.io/)<sup>2</sup>**_
_**[Xintao Wang](https://xinntao.github.io/)<sup>2</sup>, [Pengfei Wan](https://magicwpf.github.io/)<sup>2</sup>, [Kun Gai](https://scholar.google.com/citations?user=PXO4ygEAAAAJ&hl=zh-CN)<sup>2</sup>, [Jie Zhou](https://scholar.google.com/citations?user=6a79aPwAAAAJ&hl=en)<sup>1</sup>, [Jiwen Lu](https://ivg.au.tsinghua.edu.cn/Jiwen_Lu/)<sup>1†</sup>**_

<br>

<sup>1</sup>Tsinghua University &nbsp;&nbsp; <sup>2</sup>KlingTeam, Kuaishou Technology
<br>
<small>* Equal contribution &nbsp;&nbsp; † Corresponding author</small>

</div>

---

> **Important Note:** This repository implements SVG-T2I, a text-to-image diffusion framework that performs visual generation directly in the representation space of a Visual Foundation Model (VFM), rather than in pixel space or a VAE latent space.

---

## 📰 News

- **[2025-12-13]** 📢✨ We are excited to announce the official release of **SVG-T2I**, including pre-trained checkpoints as well as complete training and inference code.

## 🖼️ Gallery

<div align="center">
<img src="assets/viz_t2i_1.png" width="80%" alt="Teaser Image"/>
<br>
<em>High-fidelity samples generated by SVG-T2I.</em>
</div>

---

## 🧠 Overview

Visual generation grounded in Visual Foundation Model (VFM) representations offers a promising unified approach to visual understanding and generation. However, large-scale text-to-image diffusion models operating directly in VFM feature space remain underexplored.

To address this, SVG-T2I extends the SVG framework to enable high-quality text-to-image synthesis directly in the VFM domain using a standard diffusion pipeline. The model achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench, demonstrating the strong generative capability of VFM representations.

We fully open-source the autoencoder and generation models, along with their training, inference, and evaluation pipelines, to support future research in representation-driven visual generation.

### Why SVG-T2I?

- **✨ Direct Use of VFM Representations:**
  SVG-T2I performs generation **directly in the feature space of Visual Foundation Models (e.g., DINOv3)**, rather than merely aligning to that space. This preserves the **rich semantic structure** learned through large-scale self-supervised visual representation learning.

- **🔗 Unified Representation for Understanding and Generation:**
  By **sharing the same VFM representation space** across **visual understanding, perception, and generation**, SVG-T2I unlocks strong potential for **downstream tasks** such as **image editing, retrieval, reasoning, and multimodal alignment**.

- **🧩 Fully Open-Sourced Pipeline:**
  We **fully open-source** the **entire training and inference pipeline**, including the **SVG autoencoder, diffusion model, evaluation code, and pretrained checkpoints**, to facilitate **reproducibility and future research** in representation-driven visual generation.

---

## 🌟 Key Components

| Component | Description |
| :--- | :--- |
| **1. SVG Autoencoder** | A novel latent codec consisting of a **frozen VFM encoder (DINOv3/DINOv2/SigLIP2/MAE)**, an optional residual reconstruction branch, and a trainable convolutional decoder. <br>❌ No quantization <br>❌ No KL loss <br>❌ No Gaussian assumption |
| **2. Latent Diffusion** | A **single-stream Diffusion Transformer** trained directly in the representation space. Supports progressive training (256→512→1024) and is optimized on large-scale text-image pairs. |
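
To make the autoencoder design concrete, here is a minimal PyTorch sketch of the data flow described above. All class and module names are hypothetical stand-ins for illustration; the actual implementation lives under `autoencoder/`.

```python
import torch
import torch.nn as nn

class SVGAutoencoderSketch(nn.Module):
    """Illustrative sketch only; not the repo's actual module."""

    def __init__(self, vfm: nn.Module, decoder: nn.Module):
        super().__init__()
        self.vfm = vfm.eval()          # frozen VFM encoder (e.g., DINOv3)
        for p in self.vfm.parameters():
            p.requires_grad_(False)    # no gradients flow into the encoder
        self.decoder = decoder         # trainable convolutional decoder

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # The VFM features themselves are the latent code:
        # no quantization, no KL loss, no Gaussian assumption.
        return self.vfm(images)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(images))

# Toy usage with stand-in modules (shapes only, not the real networks):
vfm = nn.Conv2d(3, 384, kernel_size=16, stride=16)           # fake /16 patch encoder
dec = nn.ConvTranspose2d(384, 3, kernel_size=16, stride=16)  # fake decoder
ae = SVGAutoencoderSketch(vfm, dec)
print(ae(torch.randn(1, 3, 256, 256)).shape)                 # torch.Size([1, 3, 256, 256])
```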

---

## 🎮 Model Zoo

### **SVG Autoencoder**

| Model | Notes | Resolution | Encoder (Params) | Download URL |
| ----- | ----- | ---------- | ---------------- | ------------ |
| Autoencoder-P | Stage 1 (low res) | 256 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage 2 (middle res) | 512 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage 3 (high res) (😄 **Default**) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage 1 (low res) | 256 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage 2 (middle res) | 512 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage 3 (high res) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |

### **SVG-T2I DiT**

| Notes | Resolution | Params | Text Encoder | Representation Encoder | Download URL |
| ----- | ---------- | ------ | ------------ | ---------------------- | ------------ |
| Stage 1 (low res) | 256 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Stage 2 (middle res) | 512 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Stage 3 (high res) | 1024 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Stage 4 (SFT) (😄 **Default**) | 1024 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |

---

## 🛠️ Installation

### 1\. Environment Setup

```bash
conda create -n svg_t2i python=3.10 -y
conda activate svg_t2i
pip install -r requirements.txt
```

### 2\. Download DINOv3

SVG-T2I relies on DINOv3 as the frozen encoder.

```bash
# Clone the DINOv3 repository; then download its pretrained weights
# and update the encoder paths in your configs
git clone https://github.com/facebookresearch/dinov3.git
```
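
With the clone in place, the backbone can be loaded locally via `torch.hub`. The entrypoint name (`dinov3_vits16`) and the `weights=` keyword below follow the DINOv3 README and should be treated as assumptions to verify against your clone; the checkpoint path is a placeholder.

```python
import torch

# Load the ViT-S/16 backbone from the local clone; the entrypoint name and
# the weights= argument are assumptions, so verify them in the dinov3 README.
backbone = torch.hub.load(
    "dinov3",                                       # path to the clone above
    "dinov3_vits16",
    source="local",
    weights="/path/to/dinov3_vits16_pretrain.pth",  # placeholder checkpoint path
)
backbone.eval()
```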

### 3\. Download Pre-trained Models

You can download **all stage-wise pretrained models and checkpoints** from our official **Hugging Face repository**, including the **SVG autoencoder** and the **SVG-T2I diffusion models** used for training and evaluation:

```bash
https://huggingface.co/KlingTeam/SVG-T2I
```

These pretrained weights are released to support **academic research, benchmarking, and a wide range of downstream applications**, and can be freely used for **experimentation, analysis, and further development**.
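
If you prefer a scripted download, the sketch below uses `huggingface_hub`; `local_dir="pre-trained"` is chosen to match the directory layout the inference section below expects, so adjust it to your setup.

```python
from huggingface_hub import snapshot_download

# Fetch the full SVG-T2I repository (autoencoder + DiT checkpoints).
snapshot_download(
    repo_id="KlingTeam/SVG-T2I",
    local_dir="pre-trained",  # the inference scripts look for pre-trained/ under svg_t2i/
)
```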

---

## 📦 Data Preparation

### 1\. Autoencoder Training Data

Any large-scale image dataset works (e.g., ImageNet-1K). Update `autoencoder/pure/configs/*.yaml`:

**For ImageNet-1K:**

```yaml
data:
  target: "utils.data_module_allinone.DataModuleFromConfig"
  params:
    batch_size: 64
    wrap: true
    num_workers: 16
    train:
      target: ldm.data.imagenet.ImageNetTrain
      params:
        data_root: <Your ImageNet Path>
        size: 256
    validation:
      target: ldm.data.imagenet.ImageNetValidation
      params:
        data_root: <Your ImageNet Path>
        size: 256
```
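
After editing, a quick sanity check that the YAML nesting survived can save a failed run; the config filename below is a placeholder.

```python
from omegaconf import OmegaConf

# Load the edited config and print the resolved data block; mis-indented
# YAML shows up immediately as missing or misplaced keys.
cfg = OmegaConf.load("autoencoder/pure/configs/your_config.yaml")  # placeholder path
print(OmegaConf.to_yaml(cfg.data))
assert cfg.data.params.train.params.size == 256
```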

**For a customized dataset:**

We also support a customized **JSONL** format; an example file is provided at `configs/example.jsonl`. (The `prompt` field is only used for the generation task.)

**Example JSONL Format:**

```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```

Then point the data module at your JSONL file:

```yaml
data:
  target: utils.data_module_allinone.DataModuleFromConfigJson
  params:
    batch_size: 3 # batch size per GPU
    wrap: true
    train_resol: 256
    json_path: configs/example.jsonl
```

### 2\. Text-to-Image Training Data

Text-to-image training uses the same JSONL format; a sketch for generating such a file follows the example below.

**Example JSONL Format:**

```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```
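
A file in this format is easy to generate with a short script. The helper below is a hypothetical sketch; the directory layout, caption source, and output path are all placeholders.

```python
import json
from pathlib import Path

def write_jsonl(image_dir: str, captions: dict[str, str], out_path: str) -> None:
    """Write one {"path": ..., "prompt": ...} record per image."""
    with open(out_path, "w", encoding="utf-8") as f:
        for img in sorted(Path(image_dir).rglob("*.jpg")):
            record = {
                "path": str(img),                      # e.g. "test/man.jpg"
                "prompt": captions.get(img.name, ""),  # empty prompt is fine for the autoencoder
            }
            f.write(json.dumps(record) + "\n")

write_jsonl("test", {"man.jpg": "A man"}, "my_data.jsonl")
```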

---

## 🚀 Training

SVG-T2I training is divided into two stages.

### Stage 1: Train the SVG Autoencoder

Navigate to the `autoencoder` directory and launch training:

```bash
cd autoencoder
bash run_train.sh <GPU NUM> configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
# example
bash run_train.sh 1 configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
```

* **Output:** Results will be saved in `autoencoder/logs`.
* **Note:** You can modify training hyperparameters and output paths directly inside `run_train.sh` or the configuration YAML file.

### Stage 2: Train SVG-DiT (Diffusion)

Navigate to `svg_t2i`. We provide scripts for both single-node and multi-node training.

**Single-Node Example:**

```bash
cd svg_t2i
bash scripts/run_train_1gpus_forTest.sh <RANK ID>
# example
bash scripts/run_train_1gpus_forTest.sh 0
```

**Multi-Node Example:**

```bash
# Launch one command per node, passing each node's rank (here, 4 nodes)
bash scripts/run_train_mnodes.sh 0
bash scripts/run_train_mnodes.sh 1
bash scripts/run_train_mnodes.sh 2
bash scripts/run_train_mnodes.sh 3
```

* **Output:** Results will be saved in `svg_t2i/results`.
* **Note:** You can adjust learning rates, batch sizes, the number of GPUs, and save directories directly in the training scripts.

---

## 🎨 Inference & Image Generation

Generate images using a pretrained **SVG-DiT** model.

> After downloading the pretrained checkpoints, you will obtain a `pre-trained/` directory.
> Please place this directory under the `svg_t2i/` folder before running inference.

```bash
cd svg_t2i
bash scripts/sample.sh
```

* **Output:** Results will be saved in `svg_t2i/samples`.
* **Note:** You can modify sampling parameters, prompt settings, and output directories directly inside `sample.sh`.

## 📝 Citation

If you find this work helpful, please cite our papers:

```bibtex
@misc{svg_t2i2025,
  title={SVG-T2I: Scaling up Text-to-Image Latent Diffusion Model Without Variational Autoencoder},
  author={Minglei Shi and Haolin Wang and Borui Zhang and Wenzhao Zheng and Bohan Zeng and
          Ziyang Yuan and Xiaoshi Wu and Yuanxing Zhang and Huan Yang and Xintao Wang and
          Pengfei Wan and Kun Gai and Jie Zhou and Jiwen Lu},
  year={2025},
  eprint={xxxx.xxxxx},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{shi2025latentdiffusionmodelvariational,
  title={Latent Diffusion Model without Variational Autoencoder},
  author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
  year={2025},
  eprint={2510.15301},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

---

## 💡 Acknowledgments

SVG-T2I builds on the work of the open-source community:

* **[SVG](https://github.com/shiml20/SVG)**: Base pipeline and core idea.
* **[Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-T2X)**: DiT architecture and base training code.
* **[DINOv3](https://github.com/facebookresearch/dinov3)**: State-of-the-art semantic representation encoder.

For any questions, please open a [GitHub Issue](https://github.com/KlingTeam/SVG-T2I/issues).