slide2svg-vl-2b

A vision-language model fine-tuned to convert presentation slides (images) into structured SVG format.

Model Description

This model takes an image of a presentation slide and generates corresponding SVG markup, preserving:

  • Text content and positioning
  • Font styling (family, size, weight, color)
  • Image placements
  • Layout structure

Training Details

Base Model:     Qwen3-VL-2B-Instruct
Dataset:        ahazimeh/slide2svg (~39,500 samples)
Method:         LoRA (r=64, alpha=128)
Epochs:         2
Batch Size:     16 effective (2 per step × 8 gradient accumulation)
Learning Rate:  1e-4
Context Length: 12,288 tokens
Precision:      16-bit
Training Time:  ~5.5 hours
Hardware:       NVIDIA RTX 5090 (32GB)

Evaluation Results

Metrics computed on 20 evaluation samples:

Valid XML Structure:   100% (20/20)
Has <svg> tag:         100% (20/20)
Has <foreignObject>:   100% (20/20)
Final Training Loss:   0.091
Final Validation Loss: 0.091
Avg Output Length:     91% of reference
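The structural checks above can be reproduced with a short helper using only Python's standard library (a sketch; the function name is illustrative, not part of the model's tooling):

```python
import xml.etree.ElementTree as ET

def check_svg_output(svg_text: str) -> dict:
    # Structural checks mirroring the evaluation metrics above.
    # Note: ET.fromstring needs bytes when the markup carries an
    # <?xml ... encoding="utf-8"?> declaration.
    checks = {
        "valid_xml": False,
        "has_svg_tag": "<svg" in svg_text,
        "has_foreignObject": "<foreignObject" in svg_text,
    }
    try:
        ET.fromstring(svg_text.encode("utf-8"))
        checks["valid_xml"] = True
    except ET.ParseError:
        pass
    return checks
```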

Usage

With Unsloth

from unsloth import FastVisionModel
from PIL import Image

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "Bombek1/slide2svg-vl-2b",
    load_in_4bit=True,  # or False for 16-bit
)
FastVisionModel.for_inference(model)

# Prepare input
image = Image.open("your_slide.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Convert this presentation slide to SVG format."}
        ]
    }
]

# Generate
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
svg_output = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(svg_output)
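Because the generated markup wraps the SVG in an <html> element (see Output Format below), you may want to extract just the <svg>...</svg> element before saving it as a standalone .svg file. A minimal sketch (the helper name is illustrative):

```python
import re

def extract_svg(model_output: str) -> str:
    # Pull the bare <svg>...</svg> element out of the model's
    # <html>-wrapped output so it can be saved as a .svg file.
    match = re.search(r"<svg\b.*</svg>", model_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <svg> element found in model output")
    return match.group(0)

# e.g., continuing from the generation code above:
# with open("slide.svg", "w", encoding="utf-8") as f:
#     f.write(extract_svg(svg_output))
```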

Output Format

The model generates SVG with the following structure:

<?xml version="1.0" encoding="utf-8"?>
<html>
<svg xmlns="http://www.w3.org/2000/svg" width="1024" height="768" fill="white">
  <image x="0.0%" y="0.0%" width="100.0%" href="background.png" />
  <g id="images">
    <image x="14.3%" y="28.5%" width="35.8%" href="image_0.png" />
  </g>
  <g id="text">
    <foreignObject x="5.4%" y="8.1%" width="32.0%" height="12.0%" overflow="visible">
      <div xmlns="http://www.w3.org/1999/xhtml" style="font-family: Inter; font-size: 74px; font-weight: bold; color: #000000;">
        <div>Your Text Here</div>
      </div>
    </foreignObject>
  </g>
</svg>
</html>
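This structure can be consumed with standard XML tooling, e.g. to recover the recognized text and its positions. A sketch based on the element and attribute names in the example above:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def extract_text_boxes(svg_markup: str) -> list:
    # Collect each <foreignObject>'s position (still percentage strings)
    # together with its concatenated text content.
    root = ET.fromstring(svg_markup.encode("utf-8"))
    boxes = []
    for fo in root.iter(SVG_NS + "foreignObject"):
        text = " ".join(t.strip() for t in fo.itertext() if t.strip())
        boxes.append({"x": fo.get("x"), "y": fo.get("y"), "text": text})
    return boxes
```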

Limitations

  • Outputs average ~91% of the length of reference SVGs and may omit minor elements
  • Default output resolution is 1024×768 (positions use percentages, so they scale)
  • Best results on slides similar to the training data (presentation-style layouts)
  • Complex diagrams or charts may not be fully captured
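Since positions are percentages, rescaling to another canvas size is a simple string transform. A regex-based sketch that assumes the attribute format shown under Output Format (the function name is illustrative):

```python
import re

def percent_to_px(svg_markup: str, width: int = 1024, height: int = 768) -> str:
    # Rewrite percentage x/y/width/height attributes into absolute pixel
    # values for a chosen canvas size.
    def repl(match):
        attr, pct = match.group(1), float(match.group(2))
        base = width if attr in ("x", "width") else height
        return '{}="{:.1f}"'.format(attr, base * pct / 100)
    return re.sub(r'\b(x|y|width|height)="([\d.]+)%"', repl, svg_markup)
```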

License

Apache 2.0 (inherited from Qwen3-VL base model)
