<div align="center">

<!-- <img src="assets/logo.png" width="400"/> -->

# SVG-T2I: Scaling up Text-to-Image Latent Diffusion Model <br> Without Variational Autoencoder

[![arXiv](https://img.shields.io/badge/arXiv-25xx.xxxxx-b31b1b.svg)](https://arxiv.org/abs/xxxx.xxxxx)
[![Code](https://img.shields.io/badge/GitHub-SVG--T2I-black)](https://github.com/KlingTeam/SVG-T2I)
[![Model Weights](https://img.shields.io/badge/Model-SVG--T2I-yellow)](https://huggingface.co/KlingTeam/SVG-T2I)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

[![arXiv](https://img.shields.io/badge/arXiv-SVG-b31b1b.svg)](https://arxiv.org/abs/2510.15301)
[![Code](https://img.shields.io/badge/GitHub-SVG-black)](https://github.com/shiml20/SVG)
[![Model Weights](https://img.shields.io/badge/Model-SVG-yellow)](https://huggingface.co/howlin/SVG)

_**[Minglei Shi](https://github.com/shiml20)<sup>1*</sup>, [Haolin Wang](https://howlin-wang.github.io)<sup>1*</sup>, [Borui Zhang](https://boruizhang.site/)<sup>1</sup>, [Wenzhao Zheng](https://wzzheng.net)<sup>1</sup>, [Bohan Zeng](https://scholar.google.com/citations?user=MHo_d3YAAAAJ&hl=en)<sup>2</sup>**_
_**[Ziyang Yuan](https://scholar.google.ru/citations?user=fWxWEzsAAAAJ&hl=en)<sup>2†</sup>, [Xiaoshi Wu](https://scholar.google.com/citations?user=cnOAMbUAAAAJ&hl=en)<sup>2</sup>, [Yuanxing Zhang](https://scholar.google.com/citations?user=COdftTMAAAAJ&hl=en)<sup>2</sup>, [Huan Yang](https://hyang0511.github.io/)<sup>2</sup>**_
_**[Xintao Wang](https://xinntao.github.io/)<sup>2</sup>, [Pengfei Wan](https://magicwpf.github.io/)<sup>2</sup>, [Kun Gai](https://scholar.google.com/citations?user=PXO4ygEAAAAJ&hl=zh-CN)<sup>2</sup>, [Jie Zhou](https://scholar.google.com/citations?user=6a79aPwAAAAJ&hl=en)<sup>1</sup>, [Jiwen Lu](https://ivg.au.tsinghua.edu.cn/Jiwen_Lu/)<sup>1†</sup>**_

<br>

<sup>1</sup>Tsinghua University &nbsp;&nbsp; <sup>2</sup>KlingTeam, Kuaishou Technology
<br>
<small>* Equal contribution &nbsp;&nbsp; † Corresponding author</small>

</div>

---

> **Important Note:** This repository implements SVG-T2I, a text-to-image diffusion framework that performs visual generation directly in the representation space of a Visual Foundation Model (VFM), rather than in pixel space or a VAE latent space.

---

## 📰 News

- **[2025-12-13]** 📢✨ We are excited to announce the official release of **SVG-T2I**, including pre-trained checkpoints as well as complete training and inference code.

## 🖼️ Gallery

<div align="center">
<img src="assets/viz_t2i_1.png" width="80%" alt="Teaser Image"/>
<br>
<em>High-fidelity samples generated by SVG-T2I.</em>
</div>

---

## 🧠 Overview

Visual generation grounded in Visual Foundation Model (VFM) representations offers a promising unified approach to visual understanding and generation. However, large-scale text-to-image diffusion models operating directly in VFM feature space remain underexplored.

To address this, SVG-T2I extends the SVG framework to enable high-quality text-to-image synthesis directly in the VFM domain using a standard diffusion pipeline. The model achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench, demonstrating the strong generative capability of VFM representations.

We fully open-source the autoencoder and generation models, along with their training, inference, and evaluation pipelines, to support future research in representation-driven visual generation.

### Why SVG-T2I?

- **✨ Direct Use of VFM Representations:**
  SVG-T2I performs generation **directly in the feature space of Visual Foundation Models (e.g., DINOv3)**, rather than merely aligning to that space. This preserves the **rich semantic structure** learned through large-scale self-supervised visual representation learning.

- **🔗 Unified Representation for Understanding and Generation:**
  By **sharing the same VFM representation space** across **visual understanding, perception, and generation**, SVG-T2I unlocks strong potential for **downstream tasks** such as **image editing, retrieval, reasoning, and multimodal alignment**.

- **🧩 Fully Open-Sourced Pipeline:**
  We **fully open-source** the **entire training and inference pipeline**, including the **SVG autoencoder, diffusion model, evaluation code, and pretrained checkpoints**, to facilitate **reproducibility and future research** in representation-driven visual generation.

---

## 🌟 Key Components

| Component | Description |
| :--- | :--- |
| **1. SVG Autoencoder** | A novel latent codec consisting of a **frozen VFM encoder (DINOv3/DINOv2/SigLIP2/MAE)**, an optional residual reconstruction branch, and a trainable convolutional decoder. <br>❌ No quantization <br>❌ No KL loss <br>❌ No Gaussian assumption |
| **2. Latent Diffusion** | A **single-stream Diffusion Transformer** trained directly in the representation space. Supports progressive training (256→512→1024) and is optimized on large-scale text-image pairs. |
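
To make the autoencoder design concrete, here is a minimal PyTorch sketch of the data flow described above. All class and module names are hypothetical stand-ins for illustration; the actual implementation lives under `autoencoder/`.

```python
import torch
import torch.nn as nn

class SVGAutoencoderSketch(nn.Module):
    """Illustrative sketch only; not the repo's actual module."""

    def __init__(self, vfm: nn.Module, decoder: nn.Module):
        super().__init__()
        self.vfm = vfm.eval()          # frozen VFM encoder (e.g., DINOv3)
        for p in self.vfm.parameters():
            p.requires_grad_(False)    # no gradients flow into the encoder
        self.decoder = decoder         # trainable convolutional decoder

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # The VFM features themselves are the latent code:
        # no quantization, no KL loss, no Gaussian assumption.
        return self.vfm(images)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(images))

# Toy usage with stand-in modules (shapes only, not the real networks):
vfm = nn.Conv2d(3, 384, kernel_size=16, stride=16)           # fake /16 patch encoder
dec = nn.ConvTranspose2d(384, 3, kernel_size=16, stride=16)  # fake decoder
ae = SVGAutoencoderSketch(vfm, dec)
print(ae(torch.randn(1, 3, 256, 256)).shape)                 # torch.Size([1, 3, 256, 256])
```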

---

## 🎮 Model Zoo

### **SVG Autoencoder**

| Model | Notes | Resolution | Encoder (Params) | Download URL |
| ----- | ----- | ---------- | ---------------- | ------------ |
| Autoencoder-P | Stage 1 (low res) | 256 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage 2 (middle res) | 512 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-P | Stage 3 (high res) (😄 **Default**) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) (29M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage 1 (low res) | 256 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage 2 (middle res) | 512 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Autoencoder-R | Stage 3 (high res) | 1024 | [DINOv3s16p](https://github.com/facebookresearch/dinov3) + [Residual ViT](https://huggingface.co/KlingTeam/SVG-T2I) (29M + 22M) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |

### **SVG-T2I DiT**

| Notes | Resolution | Params | Text Encoder | Representation Encoder | Download URL |
| ----- | ---------- | ------ | ------------ | ---------------------- | ------------ |
| Stage 1 (low res) | 256 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Stage 2 (middle res) | 512 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Stage 3 (high res) | 1024 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |
| Stage 4 (SFT) (😄 **Default**) | 1024 | 2.6B | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | [DINOv3s16p](https://github.com/facebookresearch/dinov3) | [Hugging Face](https://huggingface.co/KlingTeam/SVG-T2I) |

---

## 🛠️ Installation

### 1\. Environment Setup

```bash
conda create -n svg_t2i python=3.10 -y
conda activate svg_t2i
pip install -r requirements.txt
```

### 2\. Download DINOv3

SVG-T2I relies on DINOv3 as the frozen encoder.

```bash
# Clone the DINOv3 repository; then download its pretrained weights
# and update the encoder paths in your configs
git clone https://github.com/facebookresearch/dinov3.git
```
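
With the clone in place, the backbone can be loaded locally via `torch.hub`. The entrypoint name (`dinov3_vits16`) and the `weights=` keyword below follow the DINOv3 README and should be treated as assumptions to verify against your clone; the checkpoint path is a placeholder.

```python
import torch

# Load the ViT-S/16 backbone from the local clone; the entrypoint name and
# the weights= argument are assumptions, so verify them in the dinov3 README.
backbone = torch.hub.load(
    "dinov3",                                       # path to the clone above
    "dinov3_vits16",
    source="local",
    weights="/path/to/dinov3_vits16_pretrain.pth",  # placeholder checkpoint path
)
backbone.eval()
```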

### 3\. Download Pre-trained Models

You can download **all stage-wise pretrained models and checkpoints** from our official **Hugging Face repository**, including the **SVG autoencoder** and the **SVG-T2I diffusion models** used for training and evaluation:

```bash
https://huggingface.co/KlingTeam/SVG-T2I
```

These pretrained weights are released to support **academic research, benchmarking, and a wide range of downstream applications**, and can be freely used for **experimentation, analysis, and further development**.
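
If you prefer a scripted download, the sketch below uses `huggingface_hub`; `local_dir="pre-trained"` is chosen to match the directory layout the inference section below expects, so adjust it to your setup.

```python
from huggingface_hub import snapshot_download

# Fetch the full SVG-T2I repository (autoencoder + DiT checkpoints).
snapshot_download(
    repo_id="KlingTeam/SVG-T2I",
    local_dir="pre-trained",  # the inference scripts look for pre-trained/ under svg_t2i/
)
```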

---

## 📦 Data Preparation

### 1\. Autoencoder Training Data

Any large-scale image dataset works (e.g., ImageNet-1K). Update `autoencoder/pure/configs/*.yaml`:

**For ImageNet-1K:**

```yaml
data:
  target: "utils.data_module_allinone.DataModuleFromConfig"
  params:
    batch_size: 64
    wrap: true
    num_workers: 16
    train:
      target: ldm.data.imagenet.ImageNetTrain
      params:
        data_root: <Your ImageNet Path>
        size: 256
    validation:
      target: ldm.data.imagenet.ImageNetValidation
      params:
        data_root: <Your ImageNet Path>
        size: 256
```
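
After editing, a quick sanity check that the YAML nesting survived can save a failed run; the config filename below is a placeholder.

```python
from omegaconf import OmegaConf

# Load the edited config and print the resolved data block; mis-indented
# YAML shows up immediately as missing or misplaced keys.
cfg = OmegaConf.load("autoencoder/pure/configs/your_config.yaml")  # placeholder path
print(OmegaConf.to_yaml(cfg.data))
assert cfg.data.params.train.params.size == 256
```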

**For a customized dataset:**

We also support a customized **JSONL** format; an example file is provided at `configs/example.jsonl`. (The `prompt` field is only used for the generation task.)

**Example JSONL Format:**

```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```

Then point the data module at your JSONL file:

```yaml
data:
  target: utils.data_module_allinone.DataModuleFromConfigJson
  params:
    batch_size: 3 # batch size per GPU
    wrap: true
    train_resol: 256
    json_path: configs/example.jsonl
```

### 2\. Text-to-Image Training Data

Text-to-image training uses the same JSONL format; a sketch for generating such a file follows the example below.

**Example JSONL Format:**

```json
{"path": "test/man.jpg", "prompt": "A man"}
{"path": "test/man.jpg", "prompt": "A man"}
...
```
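
A file in this format is easy to generate with a short script. The helper below is a hypothetical sketch; the directory layout, caption source, and output path are all placeholders.

```python
import json
from pathlib import Path

def write_jsonl(image_dir: str, captions: dict[str, str], out_path: str) -> None:
    """Write one {"path": ..., "prompt": ...} record per image."""
    with open(out_path, "w", encoding="utf-8") as f:
        for img in sorted(Path(image_dir).rglob("*.jpg")):
            record = {
                "path": str(img),                      # e.g. "test/man.jpg"
                "prompt": captions.get(img.name, ""),  # empty prompt is fine for the autoencoder
            }
            f.write(json.dumps(record) + "\n")

write_jsonl("test", {"man.jpg": "A man"}, "my_data.jsonl")
```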

---

## 🚀 Training

SVG-T2I training is divided into two stages.

### Stage 1: Train the SVG Autoencoder

Navigate to the `autoencoder` directory and launch training:

```bash
cd autoencoder
bash run_train.sh <GPU NUM> configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
# example
bash run_train.sh 1 configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
```

* **Output:** Results will be saved in `autoencoder/logs`.
* **Note:** You can modify training hyperparameters and output paths directly inside `run_train.sh` or the configuration YAML file.

### Stage 2: Train SVG-DiT (Diffusion)

Navigate to `svg_t2i`. We provide scripts for both single-node and multi-node training.

**Single-Node Example:**

```bash
cd svg_t2i
bash scripts/run_train_1gpus_forTest.sh <RANK ID>
# example
bash scripts/run_train_1gpus_forTest.sh 0
```

**Multi-Node Example:**

```bash
# Launch one command per node, passing each node's rank (here, 4 nodes)
bash scripts/run_train_mnodes.sh 0
bash scripts/run_train_mnodes.sh 1
bash scripts/run_train_mnodes.sh 2
bash scripts/run_train_mnodes.sh 3
```

* **Output:** Results will be saved in `svg_t2i/results`.
* **Note:** You can adjust learning rates, batch sizes, the number of GPUs, and save directories directly in the training scripts.

---

## 🎨 Inference & Image Generation

Generate images using a pretrained **SVG-DiT** model.

> After downloading the pretrained checkpoints, you will obtain a `pre-trained/` directory.
> Please place this directory under the `svg_t2i/` folder before running inference.

```bash
cd svg_t2i
bash scripts/sample.sh
```

* **Output:** Results will be saved in `svg_t2i/samples`.
* **Note:** You can modify sampling parameters, prompt settings, and output directories directly inside `sample.sh`.

## 📝 Citation

If you find this work helpful, please cite our papers:

```bibtex
@misc{svg_t2i2025,
  title={SVG-T2I: Scaling up Text-to-Image Latent Diffusion Model Without Variational Autoencoder},
  author={Minglei Shi and Haolin Wang and Borui Zhang and Wenzhao Zheng and Bohan Zeng and
          Ziyang Yuan and Xiaoshi Wu and Yuanxing Zhang and Huan Yang and Xintao Wang and
          Pengfei Wan and Kun Gai and Jie Zhou and Jiwen Lu},
  year={2025},
  eprint={xxxx.xxxxx},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{shi2025latentdiffusionmodelvariational,
  title={Latent Diffusion Model without Variational Autoencoder},
  author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
  year={2025},
  eprint={2510.15301},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

---

## 💡 Acknowledgments

SVG-T2I builds on the work of the open-source community:

* **[SVG](https://github.com/shiml20/SVG)**: Base pipeline and core idea.
* **[Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-T2X)**: DiT architecture and base training code.
* **[DINOv3](https://github.com/facebookresearch/dinov3)**: State-of-the-art semantic representation encoder.

For any questions, please open a [GitHub Issue](https://github.com/KlingTeam/SVG-T2I/issues).