Related paper: SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion (arXiv:2503.11576)
A fine-tuned Vision-Language Model for structured food and drink extraction from images. Given an input image, the model outputs a structured JSON containing food classification, image title, and extracted food/drink items.
| Attribute | Value |
|---|---|
| Base Model | SmolVLM2-500M-Video-Instruct |
| Training Method | Supervised Fine-Tuning (SFT) |
| Training Strategy | Vision Encoder Frozen, LLM & Cross-Modal Connector Trainable |
| Total Parameters | 507M |
| Trainable Parameters | 421M (83%) |
| Frozen Parameters | 86M (17%) |
| Precision | bfloat16 |
This model is designed for classifying whether an image contains food and, when it does, producing a short image title plus lists of the visible food and drink items. Example output for a photo of macarons:

```json
{
  "is_food": 1,
  "image_title": "macaron assortment",
  "food_items": ["yellow macaron", "white macaron", "green macaron"],
  "drink_items": []
}
```
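Downstream code can sanity-check that a parsed reply actually matches this schema before using it. A minimal sketch; the `validate_output` helper is illustrative and not part of the released code:

```python
# Expected keys and value types for the model's JSON output.
EXPECTED_KEYS = {"is_food": int, "image_title": str, "food_items": list, "drink_items": list}

def validate_output(payload: dict) -> bool:
    """Return True if the dict matches the documented output schema."""
    if set(payload) != set(EXPECTED_KEYS):
        return False
    return all(isinstance(payload[k], t) for k, t in EXPECTED_KEYS.items())

example = {
    "is_food": 1,
    "image_title": "macaron assortment",
    "food_items": ["yellow macaron", "white macaron", "green macaron"],
    "drink_items": [],
}
ok = validate_output(example)
```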
````python
from transformers import pipeline
import torch

# Load the fine-tuned model
pipe = pipeline(
    "image-text-to-text",
    model="CreatorJarvis/FoodExtract-Vision-SmolVLM2-500M-fine-tune",
    dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare input message
message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": your_image},  # replace with a loaded PIL.Image object
        {"type": "text", "text": """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
Only return valid JSON in the following form:
```json
{
 'is_food': 0,
 'image_title': '',
 'food_items': [],
 'drink_items': []
}
```"""},
    ],
}]

# Run inference
output = pipe(text=message, max_new_tokens=256)
print(output[0]["generated_text"])
````
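The pipeline returns free-form text, so the JSON still has to be pulled out and parsed. A minimal sketch of that post-processing step; the fenced-reply shape and the `parse_food_json` helper are assumptions, not part of the model card:

```python
import json
import re

def parse_food_json(generated_text: str) -> dict:
    """Extract the first JSON object from the model's reply.

    The model may wrap its answer in a ```json fence or emit bare JSON,
    so we grab the outermost braces and parse them.
    """
    match = re.search(r"\{.*\}", generated_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

# Example with a reply shaped like the training target
reply = '```json\n{"is_food": 1, "image_title": "macaron assortment", "food_items": ["yellow macaron"], "drink_items": []}\n```'
result = parse_food_json(reply)
```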
| Split | Samples | Description |
|---|---|---|
| Train | 1,208 | 80% of total dataset |
| Validation | 302 | 20% of total dataset |
| Total | 1,510 | Food images (1k) + Non-food images (500) |
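The split above is a plain 80/20 partition of the 1,510 images. A quick sketch of how such a split can be produced; the shuffling approach and seed are assumptions, since the card does not say how the author split the data:

```python
import random

total = 1510
indices = list(range(total))
random.Random(42).shuffle(indices)  # seed is an assumption, not from the card

split = int(0.8 * total)  # 1208 training samples
train_idx, val_idx = indices[:split], indices[split:]
```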
Dataset Source: mrdbourke/FoodExtract-1k-Vision
| Hyperparameter | Value |
|---|---|
| Epochs | 4 |
| Batch Size (per device) | 4 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-4 |
| LR Scheduler | Constant |
| Warmup Ratio | 0.03 |
| Optimizer | AdamW (fused) |
| Max Grad Norm | 1.0 |
| Precision | bf16 |
| Gradient Checkpointing | ✓ |
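The effective batch size in the table follows from the per-device batch size times the gradient-accumulation steps (assuming single-GPU training, which the card does not state); with 1,208 training samples that also fixes the number of optimizer steps. A quick check of the arithmetic:

```python
import math

per_device_batch = 4
grad_accum_steps = 4
num_devices = 1  # assumption: single-GPU training

effective_batch = per_device_batch * grad_accum_steps * num_devices  # 16
steps_per_epoch = math.ceil(1208 / effective_batch)  # optimizer steps per epoch
total_steps = steps_per_epoch * 4                    # over 4 epochs
```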
The Vision Encoder was frozen during training to preserve its pretrained visual representations and to reduce the number of trainable parameters (421M of 507M remain trainable), which lowers memory use and speeds up fine-tuning.
This approach is inspired by the SmolDocling paper.
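In PyTorch, freezing a submodule amounts to disabling gradients on its parameters. The sketch below uses a toy module, since the actual SmolVLM2 submodule attribute names depend on the `transformers` implementation and are not given in the card:

```python
import torch.nn as nn

# Toy stand-in for a VLM; real SmolVLM2 attribute names differ, so treat
# this as a sketch of the freezing recipe, not the actual model API.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(64, 64)   # frozen
        self.connector = nn.Linear(64, 32)        # trainable
        self.language_model = nn.Linear(32, 32)   # trainable

model = ToyVLM()

# Freeze the vision encoder; connector and language model stay trainable.
for p in model.vision_encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```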
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.0842 | 0.0759 |
| 2 | 0.0816 | 0.0757 |
| 3 | 0.0237 | 0.0751 |
| 4 | 0.0172 | 0.0807 |
Final Training Loss: 0.0518
Try the model on Hugging Face Spaces:
The demo compares outputs from the base model vs. the fine-tuned model side-by-side.
| Library | Version |
|---|---|
| TRL | 0.27.1 |
| Transformers | 4.57.6 |
| PyTorch | 2.9.0+cu126 |
| Datasets | 4.0.0 |
| Tokenizers | 0.22.2 |
If you use this model, please cite:
```bibtex
@misc{foodextract-vision-2025,
  title        = {FoodExtract-Vision: Fine-tuned SmolVLM2 for Structured Food Extraction},
  author       = {Jarvis Zhang},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/CreatorJarvis/FoodExtract-Vision-SmolVLM2-500M-fine-tune}}
}
```