	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-VL-7B-Instruct
	base_model_relation: finetune
	language:
	- en
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- svg
	- text-to-svg
	- vision-language-model
	- code-generation
	- introspective
	- generator-critic
	- vlm
	- qwen2.5-vl
	- cvpr2026
	datasets:
	- gitcat404/IntroSVG-train
	---

	# IntroSVG-Qwen2.5-VL-7B

	<div align="center">

	Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework

	Accepted by CVPR 2026 🎉

	[![arXiv](https://img.shields.io/badge/arXiv-2603.09312-B31B1B?style=flat&logo=arxiv&logoColor=white)](https://arxiv.org/pdf/2603.09312)
	[![GitHub](https://img.shields.io/badge/GitHub-IntroSVG-black?style=flat&logo=github)](https://github.com/gitcat-404/IntroSVG)
	[![Dataset](https://img.shields.io/badge/Dataset-IntroSVG--train-yellow?style=flat&logo=huggingface)](https://huggingface.co/datasets/gitcat404/IntroSVG-train)

	</div>

	---

	## Model Summary

	IntroSVG-Qwen2.5-VL-7B is an end-to-end vision-language model that generates high-quality SVG (Scalable Vector Graphics) code directly from natural language descriptions. The model is fine-tuned from Qwen2.5-VL-7B-Instruct through a multi-stage training pipeline that combines supervised fine-tuning (SFT), curriculum learning, chain-of-thought (CoT) reasoning, and direct preference optimization (DPO).

	The defining feature of IntroSVG is its introspective generator–critic framework: a single unified model alternates between two roles, generator (producing SVG code) and critic (evaluating a rendered image of its own output), which enables an iterative generate → evaluate → refine loop at inference time.

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) |
	| Parameters | ~7B |
	| Architecture | Vision-Language Model (VLM) |
	| Modalities (input) | Text prompts and rendered SVG images (during the critique stage) |
	| Modality (output) | SVG source code |
	| Training data | SVG-1M (custom corpus, ~1M samples) |
	| Training paradigm | SFT → DPO with curriculum learning and CoT |
	| License | Apache 2.0 |

	## Method Overview

	The model is built through three core stages:

	### 1. Data Construction
	A mixed corpus is synthesized using an early-checkpoint model and a teacher VLM, comprising three subsets:
	- Direct generation ($\mathcal{D}_G^{\text{direct}}$) — text-to-SVG pairs
	- Correction ($\mathcal{D}_G^{\text{correction}}$) — flawed SVGs paired with refinements
	- Critique ($\mathcal{D}_C$) — rendered SVGs paired with critique feedback
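
As a rough illustration of how the three subsets differ, the sketch below models each one as a small record type and shuffles them into one training pool. All class and field names here are hypothetical, chosen only for this example; the released dataset may use a different schema.

```python
import random
from dataclasses import dataclass


@dataclass
class DirectSample:
    """One item of D_G^direct: a text prompt and its target SVG (hypothetical schema)."""
    prompt: str
    svg: str


@dataclass
class CorrectionSample:
    """One item of D_G^correction: a flawed SVG and its refinement (hypothetical schema)."""
    prompt: str
    flawed_svg: str
    refined_svg: str


@dataclass
class CritiqueSample:
    """One item of D_C: a rendered SVG image and the critique of it (hypothetical schema)."""
    prompt: str
    rendered_image_path: str
    critique: str


def mix(*subsets, seed=0):
    """Concatenate the subsets and shuffle them deterministically into one corpus."""
    pool = [sample for subset in subsets for sample in subset]
    random.Random(seed).shuffle(pool)
    return pool
```

Mixing the record types in one pool mirrors the idea that a single model sees generation, correction, and critique examples during the same training run.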

	### 2. Supervised Fine-Tuning (SFT)
	A unified VLM is trained on the mixed dataset, simultaneously acquiring:
	- SVG generation capability
	- SVG critique capability

	### 3. Direct Preference Optimization (DPO)
	A teacher VLM scores generated preference pairs, which are used to further optimize the generator policy $M_{\text{Policy}}$ via the DPO loss.
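
For reference, the standard DPO objective for one preference pair can be sketched as below. This is the generic formulation, not code from the IntroSVG repository: the reference policy (commonly the SFT checkpoint) and the value `beta=0.1` are assumptions, and the inputs are summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") SVG under each model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                                - (log pi(y_l|x) - log pi_ref(y_l|x))]))
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and falls as the policy assigns relatively more probability to the chosen SVG than the reference does.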

	### Introspective Inference Loop
	At inference time, the same model performs a closed-loop introspective process:
	1. Generate an initial SVG from the prompt.
	2. Switch to the critic role: render the SVG and evaluate it.
	3. Assign a quality score based on the critique.
	4. If unsatisfactory, use the critique to guide the next round of correction.

	This loop allows the model to refine its outputs iteratively without any external evaluator.
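
The four steps above can be sketched as a small control loop. This is a minimal illustration, not the logic of `inference_loop.py`: the `generate`, `render`, and `critique` callables, the `Critique` record, the score scale, `threshold`, and `max_rounds` are all hypothetical names and values introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Critique:
    score: float    # quality score from the critic role (scale is an assumption)
    feedback: str   # textual critique that guides the next correction


def introspective_loop(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    render: Callable[[str], bytes],
    critique: Callable[[str, bytes], Critique],
    threshold: float = 8.0,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Return the best-scoring SVG seen across rounds and its score."""
    best_svg, best_score = "", float("-inf")
    svg = generate(prompt, None, None)               # step 1: initial SVG
    for round_idx in range(max_rounds):
        image = render(svg)                          # step 2: rasterize the SVG
        result = critique(prompt, image)             # steps 2-3: critique and score
        if result.score > best_score:
            best_svg, best_score = svg, result.score
        if result.score >= threshold or round_idx == max_rounds - 1:
            break                                    # good enough, or out of budget
        svg = generate(prompt, svg, result.feedback) # step 4: guided correction
    return best_svg, best_score
```

In practice `generate` and `critique` would both call the same model with different role prompts, and `render` would wrap an SVG rasterizer such as cairosvg. Keeping the best-scoring candidate guards against a correction round that makes the output worse.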

	## Intended Use

	### Primary use cases
	- Text-to-SVG generation for icons, simple illustrations, logos, diagrams, and UI elements
	- Programmatic vector graphics design as a creative co-pilot
	- Research on vision-language reasoning, code generation, and self-refinement methods

	### Out-of-scope use
	- The model is not intended for generating photorealistic raster images.
	- It is not optimized for generating extremely complex artwork or production-ready brand assets without human review.
	- It should not be used to produce misleading, infringing, or otherwise harmful imagery.

	## How to Use

	### Installation

	```bash
	# 1. Clone the repository
	git clone https://github.com/gitcat-404/IntroSVG.git
	cd IntroSVG

	# 2. Create environment
	conda create -n introsvg python=3.10 -y
	conda activate introsvg

	# 3. System dependency for cairosvg (Linux)
	sudo apt update
	sudo apt install libcairo2 libcairo2-dev

	# 4. Python dependencies
	pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 \
	    --index-url https://download.pytorch.org/whl/cu124
	pip install -r requirements.txt
	```

	### Download model weights

	```bash
	pip install huggingface_hub
	hf download gitcat404/IntroSVG-Qwen2.5-VL-7B \
	    --local-dir Models/IntroSVG-Qwen2.5-VL-7B
	```

	### Inference (recommended: lmdeploy server)

	We recommend serving the model with [lmdeploy](https://github.com/InternLM/lmdeploy) for accelerated inference. Example with 4 GPUs:

	```bash
	CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server \
	    "Models/IntroSVG-Qwen2.5-VL-7B" \
	    --tp 4 \
	    --server-port 23333
	```

	Then run the introspective inference loop on a CSV of prompts:

	```bash
	python inference_loop.py \
	    --MODEL_NAME Models/IntroSVG-Qwen2.5-VL-7B \
	    --CSV_FILE example/test.csv \
	    --OUTPUT_DIR your_output_folder
	```

	An example prompt file is provided at `example/test.csv` in the GitHub repository — each row contains one text prompt for SVG generation.
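
Once the lmdeploy server is running, it can also be queried directly for a single-pass generation through its OpenAI-compatible chat endpoint. The sketch below uses only the standard library. The helper names `build_chat_request` and `generate_svg` are hypothetical, the port matches the `--server-port 23333` used above, and the served model name is an assumption (it can be checked at `/v1/models`).

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:23333", max_tokens=2048):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # assumed to match the name reported by GET /v1/models
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_svg(prompt, model):
    """Send one prompt to the running server and return the raw text of the reply."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `generate_svg("A minimalist red apple with a green leaf.", "Models/IntroSVG-Qwen2.5-VL-7B")` would return the model's reply as a string, without any critique or correction round.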

	### Quick start with `transformers`

	```python
	from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

	# Load the fine-tuned weights; dtype and device placement are chosen automatically.
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    "gitcat404/IntroSVG-Qwen2.5-VL-7B",
	    torch_dtype="auto",
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained("gitcat404/IntroSVG-Qwen2.5-VL-7B")

	# Single-pass generation: a text-only prompt with no critique round.
	prompt = "A minimalist red apple with a green leaf."
	messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]

	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = processor(text=[text], return_tensors="pt").to(model.device)

	output_ids = model.generate(**inputs, max_new_tokens=2048)
	# Slice off the prompt tokens so that only the newly generated text is decoded.
	svg_code = processor.batch_decode(
	    output_ids[:, inputs.input_ids.shape[1]:],
	    skip_special_tokens=True,
	)[0]
	print(svg_code)
	```

	> 💡 To unlock the full introspective refinement loop (generate → render → critique → correct), please use `inference_loop.py` from the official repository — it handles SVG rendering and feeds the rendered image back to the model in its critic role.
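
For single-pass use, the generated text usually needs to be cleaned and rasterized before it can be inspected. The sketch below is one possible post-processing step, not part of the official repository: `extract_svg` and `render_png` are hypothetical helpers, and it is an assumption that the model may wrap the SVG in extra text. Rendering relies on `cairosvg.svg2png`, which needs the Cairo system library installed.

```python
import re
import xml.etree.ElementTree as ET

# Match the first <svg ...>...</svg> element, across line breaks.
SVG_PATTERN = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)


def extract_svg(model_output):
    """Return the first well-formed SVG element in the text, or None."""
    match = SVG_PATTERN.search(model_output)
    if match is None:
        return None
    svg = match.group(0)
    try:
        ET.fromstring(svg)  # reject markup that is not well-formed XML
    except ET.ParseError:
        return None
    return svg


def render_png(svg, path, size=512):
    """Rasterize an SVG string to a square PNG file using cairosvg."""
    import cairosvg  # imported lazily so extract_svg works without Cairo installed

    cairosvg.svg2png(
        bytestring=svg.encode("utf-8"),
        write_to=path,
        output_width=size,
        output_height=size,
    )
```

A typical use would be `svg = extract_svg(svg_code)` followed by `render_png(svg, "apple.png")` when `svg` is not `None`.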

	## Training

	All experiments were conducted on 8 × NVIDIA A800 GPUs, using the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) training pipeline.

	Required artifacts:
	- Base model: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
	- Training data: [SVG-1M-Json](https://huggingface.co/datasets/gitcat-404/SVG-1M-Json)

	Place the data under `LLaMA-Factory/data/` and launch training with:

	```bash
	sh train_sft.sh
	```

	For DPO and the full multi-stage recipe, please refer to the scripts and configs in the [official repository](https://github.com/gitcat-404/IntroSVG).

	## Limitations

	- Visual complexity ceiling. Highly intricate scenes, dense compositions, or fine-grained textures remain difficult to express in SVG and may produce simplified outputs.
	- Text rendering. Text inside SVGs can render imperfectly (font substitution, kerning artifacts).
	- Latency. The introspective loop trades inference time for quality; single-pass generation is faster but less polished.
	- Language coverage. Training prompts are predominantly English; performance on other languages may degrade.
	- Rendering dependency. The critic stage requires a working `cairosvg` / Cairo installation to rasterize intermediate SVGs.

	## Citation

	If you find IntroSVG useful in your research, please cite our paper:

	```bibtex
	@article{wang2026introsvg,
	  title   = {IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation
	             via an Introspective Generator-Critic Framework},
	  author  = {Wang, Feiyu and Yang, Jiayuan and Zhao, Zhiyuan and Zhang, Da and
	             Li, Bingyu and Liu, Peng and Gao, Junyu},
	  journal = {arXiv preprint arXiv:2603.09312},
	  year    = {2026}
	}
	```

	## Acknowledgements

	This work builds on the excellent open-source ecosystem around:
	- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) — base vision-language model
	- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) — training framework
	- [lmdeploy](https://github.com/InternLM/lmdeploy) — inference acceleration
	- [cairosvg](https://cairosvg.org/) — SVG rasterization

	## License

	This model is released under the Apache 2.0 license. Please ensure your use of the model also complies with the license terms of the underlying [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) base model.