gitcat404 committed on Apr 19

Commit 52e5940 (verified) · 1 Parent(s): c2c1a57

Create README.md

Files changed (1)

README.md +227 -0

README.md ADDED

	@@ -0,0 +1,227 @@

+---
+license: apache-2.0
+base_model: Qwen/Qwen2.5-VL-7B-Instruct
+base_model_relation: finetune
+language:
+- en
+pipeline_tag: image-text-to-text
+library_name: transformers
+tags:
+- svg
+- text-to-svg
+- vision-language-model
+- code-generation
+- introspective
+- generator-critic
+- vlm
+- qwen2.5-vl
+- cvpr2026
+datasets:
+- gitcat404/IntroSVG-train
+---
+# IntroSVG-Qwen2.5-VL-7B
+<div align="center">
+**Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework**
+*Accepted by CVPR 2026* 🎉
+[![arXiv](https://img.shields.io/badge/arXiv-2603.09312-B31B1B?style=flat&logo=arxiv&logoColor=white)](https://arxiv.org/pdf/2603.09312)
+[![GitHub](https://img.shields.io/badge/GitHub-IntroSVG-black?style=flat&logo=github)](https://github.com/gitcat-404/IntroSVG)
+[![Dataset](https://img.shields.io/badge/Dataset-IntroSVG--train-yellow?style=flat&logo=huggingface)](https://huggingface.co/datasets/gitcat404/IntroSVG-train)
+</div>
+---
+## Model Summary
+**IntroSVG-Qwen2.5-VL-7B** is an end-to-end vision-language model that generates high-quality **SVG (Scalable Vector Graphics) code** directly from natural language descriptions. The model is fine-tuned from **Qwen2.5-VL-7B-Instruct** through a multi-stage training pipeline that combines supervised fine-tuning (SFT), curriculum learning, chain-of-thought (CoT) reasoning, and direct preference optimization (DPO).
+The defining feature of IntroSVG is its **introspective generator–critic framework**: a single unified model alternates between two roles — *generator* (producing SVG code) and *critic* (rendering and evaluating its own output) — enabling an iterative *generate → evaluate → refine* loop at inference time.
+## Model Details
+| Property | Value |
+|---|---|
+| **Base model** | [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) |
+| **Parameters** | ~7B |
+| **Architecture** | Vision-Language Model (VLM) |
+| **Modalities (input)** | Text prompts and rendered SVG images (during the critique stage) |
+| **Modality (output)** | SVG source code |
+| **Training data** | SVG-1M (custom corpus, ~1M samples) |
+| **Training paradigm** | SFT → DPO with curriculum learning and CoT |
+| **License** | Apache 2.0 |
+## Method Overview
+The model is built through three core stages:
+### 1. Data Construction
+A mixed corpus comprising three subsets is synthesized using an early-checkpoint model and a teacher VLM:
+- **Direct generation** ($\mathcal{D}_G^{\text{direct}}$) — text-to-SVG pairs
+- **Correction** ($\mathcal{D}_G^{\text{correction}}$) — flawed SVGs paired with refinements
+- **Critique** ($\mathcal{D}_C$) — rendered SVGs paired with critique feedback
+### 2. Supervised Fine-Tuning (SFT)
+A unified VLM is trained on the mixed dataset, simultaneously acquiring:
+- **SVG generation capability**
+- **SVG critique capability**
+### 3. Direct Preference Optimization (DPO)
+A teacher VLM scores generated preference pairs, which are used to further optimize the generator policy $M_{\text{Policy}}$ via the DPO loss.
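For readers unfamiliar with the objective, the sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. It is a generic illustration, not the training code from the repository: the function name `dpo_pair_loss` and the default `beta=0.1` are assumptions, and the model card does not state which value or variant was used.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are log-probabilities of a full response given the prompt.
    `beta` is a hypothetical default; the model card does not specify it.
    """
    # Implicit reward margin: how much more the policy favours the chosen
    # response over the rejected one, relative to the frozen reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals `log 2`; the loss falls as the policy shifts probability toward the chosen SVG.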
+### Introspective Inference Loop
+At inference time, the same model performs a closed-loop introspective process:
+1. **Generate** an initial SVG from the prompt.
+2. Switch to the **critic role**: render the SVG and evaluate it.
+3. Assign a **quality score** based on the critique.
+4. If unsatisfactory, use the critique to guide the **next round of correction**.
+This loop allows the model to refine its outputs iteratively without any external evaluator.
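The four steps above can be sketched as a small control loop. This is only an outline under stated assumptions: `generate` and `critique` are hypothetical callables standing in for the model's two roles, and `max_rounds` and `accept_score` are illustrative parameters. The stopping rule and score scale actually used by `inference_loop.py` are not described in this model card.

```python
def introspective_loop(prompt, generate, critique, max_rounds=3, accept_score=8.0):
    """Generate, critique, and refine an SVG; return the best (svg, score) seen.

    generate(prompt, previous_svg, feedback) -> svg string   (hypothetical)
    critique(prompt, svg) -> (feedback string, numeric score) (hypothetical)
    """
    # Step 1: initial generation, with no previous attempt or feedback.
    svg = generate(prompt, None, None)
    best_svg, best_score = svg, float("-inf")
    for round_idx in range(max_rounds):
        # Steps 2 and 3: the critic role evaluates the current SVG and scores it.
        feedback, score = critique(prompt, svg)
        if score > best_score:
            best_svg, best_score = svg, score
        # Stop once the score is acceptable or the round budget is spent.
        if score >= accept_score or round_idx == max_rounds - 1:
            break
        # Step 4: feed the critique back to the generator for a correction.
        svg = generate(prompt, svg, feedback)
    return best_svg, best_score
```

Tracking the best-scoring candidate, rather than returning the last one, guards against a correction round that makes the SVG worse.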
+## Intended Use
+### Primary use cases
+- **Text-to-SVG generation** for icons, simple illustrations, logos, diagrams, and UI elements
+- **Programmatic vector graphics design** as a creative co-pilot
+- **Research** on vision-language reasoning, code generation, and self-refinement methods
+### Out-of-scope use
+- The model is not intended for generating photorealistic raster images.
+- It is not optimized for generating extremely complex artwork or production-ready brand assets without human review.
+- It should not be used to produce misleading, infringing, or otherwise harmful imagery.
+## How to Use
+### Installation
+```bash
+# 1. Clone the repository
+git clone https://github.com/gitcat-404/IntroSVG.git
+cd IntroSVG
+# 2. Create environment
+conda create -n introsvg python=3.10 -y
+conda activate introsvg
+# 3. System dependency for cairosvg (Linux)
+sudo apt update
+sudo apt install libcairo2 libcairo2-dev
+# 4. Python dependencies
+pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 \
+    --index-url https://download.pytorch.org/whl/cu124
+pip install -r requirements.txt
+```
+### Download model weights
+```bash
+pip install -U huggingface_hub  # the `hf` CLI ships with recent releases
+hf download gitcat404/IntroSVG-Qwen2.5-VL-7B \
+    --local-dir Models/IntroSVG-Qwen2.5-VL-7B
+```
+### Inference (recommended: lmdeploy server)
+We recommend serving the model with [lmdeploy](https://github.com/InternLM/lmdeploy) for accelerated inference. Example with 4 GPUs:
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server \
+    "Models/IntroSVG-Qwen2.5-VL-7B" \
+    --tp 4 \
+    --server-port 23333
+```
+Then run the introspective inference loop on a CSV of prompts:
+```bash
+python inference_loop.py \
+    --MODEL_NAME Models/IntroSVG-Qwen2.5-VL-7B \
+    --CSV_FILE example/test.csv \
+    --OUTPUT_DIR your_output_folder
+```
+An example prompt file is provided at `example/test.csv` in the GitHub repository — each row contains one text prompt for SVG generation.
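For a quick single-pass check against the running server without the full loop, a request can be sent to lmdeploy's OpenAI-compatible chat endpoint. The sketch below assumes the server from the command above is listening on port 23333. The helper names are hypothetical, `model_name` must match the name the server reports at `/v1/models`, and the plain-text prompt here may differ from the prompt template that `inference_loop.py` uses.

```python
import json
import urllib.request


def build_chat_request(prompt, model_name, max_tokens=2048):
    """Build an OpenAI-style chat completion payload for one text prompt."""
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def request_svg(prompt, model_name, base_url="http://localhost:23333"):
    """POST the prompt to the chat completions endpoint and return the reply text."""
    payload = json.dumps(build_chat_request(prompt, model_name)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the text in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Calling `request_svg("A minimalist red apple with a green leaf.", model_name)` returns the raw model reply, which is expected to contain the SVG source.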
+### Quick start with `transformers`
+```python
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+# Load the fine-tuned weights; device_map="auto" needs the accelerate package.
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "gitcat404/IntroSVG-Qwen2.5-VL-7B",
+    torch_dtype="auto",
+    device_map="auto",
+)
+processor = AutoProcessor.from_pretrained("gitcat404/IntroSVG-Qwen2.5-VL-7B")
+# Single-pass generation is text-only, so the message carries no image.
+prompt = "A minimalist red apple with a green leaf."
+messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
+# Render the chat template to a string, then tokenize it.
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=[text], return_tensors="pt").to(model.device)
+output_ids = model.generate(**inputs, max_new_tokens=2048)
+# Drop the prompt tokens so that only the newly generated SVG text is decoded.
+svg_code = processor.batch_decode(
+    output_ids[:, inputs.input_ids.shape[1]:],
+    skip_special_tokens=True,
+)[0]
+print(svg_code)
+> 💡 To unlock the full **introspective refinement loop** (generate → render → critique → correct), please use `inference_loop.py` from the official repository — it handles SVG rendering and feeds the rendered image back to the model in its critic role.
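The quick-start snippet prints whatever text the model produced, which may include surrounding prose or a truncated document if `max_new_tokens` was reached. A minimal post-processing sketch, using only the standard library, is shown here. The helper name `extract_svg` is hypothetical and this is not the parsing logic of the official repository.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from the first <svg ...> tag to the first closing </svg>.
_SVG_RE = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)


def extract_svg(model_output):
    """Return the first well-formed <svg>...</svg> block, or None if there is none."""
    match = _SVG_RE.search(model_output)
    if match is None:
        return None  # no complete SVG element, e.g. generation was cut off
    svg = match.group(0)
    try:
        ET.fromstring(svg)  # reject markup that is not well-formed XML
    except ET.ParseError:
        return None
    return svg
```

A `None` result is a cheap signal to retry generation before handing the string to a rasterizer such as `cairosvg`.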
+## Training
+All experiments were conducted on **8 × NVIDIA A800 GPUs**, using the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) training pipeline.
+Required artifacts:
+- Base model: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
+- Training data: [SVG-1M-Json](https://huggingface.co/datasets/gitcat-404/SVG-1M-Json)
+Place the data under `LLaMA-Factory/data/` and launch training with:
+```bash
+sh train_sft.sh
+```
+For DPO and the full multi-stage recipe, please refer to the scripts and configs in the [official repository](https://github.com/gitcat-404/IntroSVG).
+## Limitations
+- **Visual complexity ceiling.** Highly intricate scenes, dense compositions, or fine-grained textures remain difficult to express in SVG and may produce simplified outputs.
+- **Text rendering inside SVGs** can be imperfect (font substitution, kerning artifacts).
+- **Latency.** The introspective loop trades inference time for quality; single-pass generation is faster but less polished.
+- **Language coverage.** Training prompts are predominantly English; performance on other languages may degrade.
+- **Rendering dependency.** The critic stage requires a working `cairosvg` / Cairo installation to rasterize intermediate SVGs.
+## Citation
+If you find IntroSVG useful in your research, please cite our paper:
+```bibtex
+@inproceedings{introsvg2026,
+  title     = {IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation
+               via an Introspective Generator--Critic Framework},
+  author    = {Anonymous Authors},
+  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
+               and Pattern Recognition (CVPR)},
+  year      = {2026}
+}
+```
+## Acknowledgements
+This work builds on the excellent open-source ecosystem around:
+- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) — base vision-language model
+- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) — training framework
+- [lmdeploy](https://github.com/InternLM/lmdeploy) — inference acceleration
+- [cairosvg](https://cairosvg.org/) — SVG rasterization
+## License
+This model is released under the **Apache 2.0** license. Please ensure your use of the model also complies with the license terms of the underlying [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) base model.