gitcat404 committed on
Commit
52e5940
·
verified ·
1 Parent(s): c2c1a57

Create README.md

  1. README.md +227 -0
README.md ADDED
@@ -0,0 +1,227 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-VL-7B-Instruct
+ base_model_relation: finetune
+ language:
+ - en
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - svg
+ - text-to-svg
+ - vision-language-model
+ - code-generation
+ - introspective
+ - generator-critic
+ - vlm
+ - qwen2.5-vl
+ - cvpr2026
+ datasets:
+ - gitcat404/IntroSVG-train
+ ---
+
+ # IntroSVG-Qwen2.5-VL-7B
+
+ <div align="center">
+
+ **Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework**
+
+ *Accepted by CVPR 2026* 🎉
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2603.09312-B31B1B?style=flat&logo=arxiv&logoColor=white)](https://arxiv.org/pdf/2603.09312)
+ [![GitHub](https://img.shields.io/badge/GitHub-IntroSVG-black?style=flat&logo=github)](https://github.com/gitcat-404/IntroSVG)
+ [![Dataset](https://img.shields.io/badge/Dataset-IntroSVG--train-yellow?style=flat&logo=huggingface)](https://huggingface.co/datasets/gitcat404/IntroSVG-train)
+
+ </div>
+
+ ---
+
+ ## Model Summary
+
+ **IntroSVG-Qwen2.5-VL-7B** is an end-to-end vision-language model that generates high-quality **SVG (Scalable Vector Graphics) code** directly from natural language descriptions. The model is fine-tuned from **Qwen2.5-VL-7B-Instruct** through a multi-stage training pipeline that combines supervised fine-tuning (SFT), curriculum learning, chain-of-thought (CoT) reasoning, and direct preference optimization (DPO).
+
+ The defining feature of IntroSVG is its **introspective generator–critic framework**: a single unified model alternates between two roles, *generator* (producing SVG code) and *critic* (evaluating a rendered image of its own output), enabling an iterative *generate → evaluate → refine* loop at inference time.
+
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | **Base model** | [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) |
+ | **Parameters** | ~7B |
+ | **Architecture** | Vision-Language Model (VLM) |
+ | **Modalities (input)** | Text prompts and rendered SVG images (during the critique stage) |
+ | **Modality (output)** | SVG source code |
+ | **Training data** | SVG-1M (custom corpus, ~1M samples) |
+ | **Training paradigm** | SFT → DPO with curriculum learning and CoT |
+ | **License** | Apache 2.0 |
+
+ ## Method Overview
+
+ The model is built through three core stages:
+
+ ### 1. Data Construction
+ A mixed corpus is synthesized using an early-checkpoint model and a teacher VLM, comprising three subsets:
+ - **Direct generation** ($\mathcal{D}_G^{\text{direct}}$) — text-to-SVG pairs
+ - **Correction** ($\mathcal{D}_G^{\text{correction}}$) — flawed SVGs paired with refinements
+ - **Critique** ($\mathcal{D}_C$) — rendered SVGs paired with critique feedback
+
+ ### 2. Supervised Fine-Tuning (SFT)
+ A unified VLM is trained on the mixed dataset, simultaneously acquiring:
+ - **SVG generation capability**
+ - **SVG critique capability**
+
+ ### 3. Direct Preference Optimization (DPO)
+ A teacher VLM scores generated SVG candidates to form preference pairs, which are then used to further optimize the generator policy $M_{\text{Policy}}$ via the DPO loss.
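
For readers unfamiliar with the DPO objective, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. It is a minimal illustration, not the training code used for this model: the function name, argument names, and `beta=0.1` are assumptions, and the reference model is assumed to be a frozen copy of the SFT checkpoint, as in the usual DPO setup.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (hypothetical helper, not from the repo).

    Each argument is the summed log-probability of a full response
    (the preferred or the rejected SVG) under the policy or the
    frozen reference model. `beta` is an assumed default.
    """
    # How much more the policy favours each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in two branches to stay numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals `log(2)` when the policy and reference agree, and shrinks as the policy assigns relatively more probability to the preferred SVG than to the rejected one.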
+
+ ### Introspective Inference Loop
+ At inference time, the same model performs a closed-loop introspective process:
+ 1. **Generate** an initial SVG from the prompt.
+ 2. Switch to the **critic role**: the SVG is rasterized (with `cairosvg`) and the model evaluates the rendered image.
+ 3. Assign a **quality score** based on the critique.
+ 4. If unsatisfactory, use the critique to guide the **next round of correction**.
+
+ This loop allows the model to refine its outputs iteratively without any external evaluator.
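
The control flow of the four steps above can be sketched as follows. This is an illustrative outline only; the official implementation is `inference_loop.py` in the repository. The callables `generate` and `critique`, the score scale, `score_threshold`, and `max_rounds` are all hypothetical placeholders for whatever the real script uses.

```python
from typing import Callable, Optional, Tuple


def introspective_loop(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    critique: Callable[[str, str], Tuple[float, str]],
    score_threshold: float = 8.0,  # assumed acceptance threshold
    max_rounds: int = 3,           # assumed refinement budget
) -> Tuple[str, float]:
    """Generate -> critique -> refine until the score is acceptable.

    generate(prompt, previous_svg, feedback) returns SVG code; on the
    first round previous_svg and feedback are None.
    critique(prompt, svg) renders the SVG and returns (score, feedback).
    Both would be backed by the same model in its two roles.
    """
    best_svg, best_score = "", float("-inf")
    previous_svg: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        svg = generate(prompt, previous_svg, feedback)
        score, feedback = critique(prompt, svg)
        # Keep the best candidate seen so far, not just the latest one.
        if score > best_score:
            best_svg, best_score = svg, score
        if score >= score_threshold:
            break
        previous_svg = svg  # feed the flawed SVG into the correction round
    return best_svg, best_score
```

Returning the best-scoring candidate rather than the last one is a design choice of this sketch: a correction round is not guaranteed to improve on the previous attempt.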
+
+ ## Intended Use
+
+ ### Primary use cases
+ - **Text-to-SVG generation** for icons, simple illustrations, logos, diagrams, and UI elements
+ - **Programmatic vector graphics design** as a creative co-pilot
+ - **Research** on vision-language reasoning, code generation, and self-refinement methods
+
+ ### Out-of-scope use
+ - The model is not intended for generating photorealistic raster images.
+ - It is not optimized for generating extremely complex artwork or production-ready brand assets without human review.
+ - It should not be used to produce misleading, infringing, or otherwise harmful imagery.
+
+ ## How to Use
+
+ ### Installation
+
+ ```bash
+ # 1. Clone the repository
+ git clone https://github.com/gitcat-404/IntroSVG.git
+ cd IntroSVG
+
+ # 2. Create environment
+ conda create -n introsvg python=3.10 -y
+ conda activate introsvg
+
+ # 3. System dependency for cairosvg (Linux)
+ sudo apt update
+ sudo apt install libcairo2 libcairo2-dev
+
+ # 4. Python dependencies
+ pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 \
+     --index-url https://download.pytorch.org/whl/cu124
+ pip install -r requirements.txt
+ ```
+
+ ### Download model weights
+
+ ```bash
+ pip install -U huggingface_hub  # the `hf` CLI ships with recent huggingface_hub releases
+ hf download gitcat404/IntroSVG-Qwen2.5-VL-7B \
+     --local-dir Models/IntroSVG-Qwen2.5-VL-7B
+ ```
+
+ ### Inference (recommended: lmdeploy server)
+
+ We recommend serving the model with [lmdeploy](https://github.com/InternLM/lmdeploy) for accelerated inference. Example with 4 GPUs:
+
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server \
+     "Models/IntroSVG-Qwen2.5-VL-7B" \
+     --tp 4 \
+     --server-port 23333
+ ```
+
+ Then run the introspective inference loop on a CSV of prompts:
+
+ ```bash
+ python inference_loop.py \
+     --MODEL_NAME Models/IntroSVG-Qwen2.5-VL-7B \
+     --CSV_FILE example/test.csv \
+     --OUTPUT_DIR your_output_folder
+ ```
+
+ An example prompt file is provided at `example/test.csv` in the GitHub repository — each row contains one text prompt for SVG generation.
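
Once the lmdeploy server is running, it can also be queried directly for a single-pass generation, since `lmdeploy serve api_server` exposes an OpenAI-compatible chat endpoint. The sketch below uses only the standard library; the helper names are hypothetical, the port matches the `--server-port 23333` command shown earlier, and the served model name is an assumption (check what `GET /v1/models` reports on your server).

```python
import json
import urllib.request


def build_chat_request(prompt, model_name, max_tokens=2048):
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def request_svg(prompt, model_name, base_url="http://localhost:23333"):
    """POST one prompt to the running server and return the raw reply text."""
    payload = json.dumps(build_chat_request(prompt, model_name)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `request_svg("A minimalist red apple with a green leaf.", "Models/IntroSVG-Qwen2.5-VL-7B")` would return the model's reply as a string, assuming the server registered the model under that name. This performs generation only; it does not run the critique and correction rounds.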
+
+ ### Quick start with `transformers`
+
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "gitcat404/IntroSVG-Qwen2.5-VL-7B",
+     torch_dtype="auto",
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("gitcat404/IntroSVG-Qwen2.5-VL-7B")
+
+ prompt = "A minimalist red apple with a green leaf."
+ messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], return_tensors="pt").to(model.device)
+
+ output_ids = model.generate(**inputs, max_new_tokens=2048)
+ svg_code = processor.batch_decode(
+     output_ids[:, inputs.input_ids.shape[1]:],
+     skip_special_tokens=True,
+ )[0]
+ print(svg_code)
+ ```
+
+ > 💡 To unlock the full **introspective refinement loop** (generate → render → critique → correct), please use `inference_loop.py` from the official repository — it handles SVG rendering and feeds the rendered image back to the model in its critic role.
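
If you only need to inspect a single-pass result, a small post-processing step can pull the SVG out of the decoded text and rasterize it. The sketch below is an assumption-laden illustration, not part of the official tooling: it assumes the reply may contain extra text around one `<svg>...</svg>` element, and the helper names and the 512-pixel output size are made up for this example. Rendering uses `cairosvg.svg2png`, which needs the Cairo system library installed.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Non-greedy match of the first <svg ...>...</svg> element in the text.
SVG_PATTERN = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)


def extract_svg(model_output: str) -> Optional[str]:
    """Return the first well-formed SVG element in the text, else None."""
    match = SVG_PATTERN.search(model_output)
    if match is None:
        return None
    svg = match.group(0)
    try:
        ET.fromstring(svg)  # reject markup that is not well-formed XML
    except ET.ParseError:
        return None
    return svg


def render_png(svg: str, png_path: str, size: int = 512) -> None:
    """Rasterize SVG code to a PNG file (requires cairosvg and Cairo)."""
    import cairosvg  # imported lazily so extract_svg works without it

    cairosvg.svg2png(
        bytestring=svg.encode("utf-8"),
        write_to=png_path,
        output_width=size,
        output_height=size,
    )
```

With the quick-start snippet, `extract_svg(svg_code)` yields the SVG string (or `None` if the output is malformed), which can then be passed to `render_png`.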
+
+ ## Training
+
+ All experiments were conducted on **8 × NVIDIA A800 GPUs**, using the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) training pipeline.
+
+ Required artifacts:
+ - Base model: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
+ - Training data: [SVG-1M-Json](https://huggingface.co/datasets/gitcat-404/SVG-1M-Json)
+
+ Place the data under `LLaMA-Factory/data/` and launch training with:
+
+ ```bash
+ sh train_sft.sh
+ ```
+
+ For DPO and the full multi-stage recipe, please refer to the scripts and configs in the [official repository](https://github.com/gitcat-404/IntroSVG).
+
+ ## Limitations
+
+ - **Visual complexity ceiling.** Highly intricate scenes, dense compositions, or fine-grained textures remain difficult to express in SVG and may produce simplified outputs.
+ - **Text rendering.** Text inside SVGs can be imperfect (font substitution, kerning artifacts).
+ - **Latency.** The introspective loop trades inference time for quality; single-pass generation is faster but less polished.
+ - **Language coverage.** Training prompts are predominantly English; performance on other languages may degrade.
+ - **Rendering dependency.** The critic stage requires a working `cairosvg` / Cairo installation to rasterize intermediate SVGs.
+
+ ## Citation
+
+ If you find IntroSVG useful in your research, please cite our paper:
+
+ ```bibtex
+ @inproceedings{introsvg2026,
+   title = {IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation
+            via an Introspective Generator--Critic Framework},
+   author = {Anonymous Authors},
+   booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
+                and Pattern Recognition (CVPR)},
+   year = {2026}
+ }
+ ```
+
+ ## Acknowledgements
+
+ This work builds on the excellent open-source ecosystem around:
+ - [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) — base vision-language model
+ - [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) — training framework
+ - [lmdeploy](https://github.com/InternLM/lmdeploy) — inference acceleration
+ - [cairosvg](https://cairosvg.org/) — SVG rasterization
+
+ ## License
+
+ This model is released under the **Apache 2.0** license. Please ensure your use of the model also complies with the license terms of the underlying [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) base model.