FoodExtract-Vision-SmolVLM2-500M

A fine-tuned Vision-Language Model for structured food and drink extraction from images. Given an input image, the model outputs a structured JSON object containing a food/not-food classification, an image title, and lists of the extracted food and drink items.

Model Description

Attribute             Value
Base Model            SmolVLM2-500M-Video-Instruct
Training Method       Supervised Fine-Tuning (SFT)
Training Strategy     Vision Encoder Frozen, LLM & Cross-Modal Connector Trainable
Total Parameters      507M
Trainable Parameters  421M (83%)
Frozen Parameters     86M (17%)
Precision             bfloat16

Intended Use

This model is designed for:

  • 🍕 Food/Drink Classification: Determine if an image contains food or drinks
  • 📝 Structured Data Extraction: Extract food and drink items into JSON format
  • 🏷️ Image Captioning: Generate food-related titles for images

Output Format

{
  "is_food": 1,
  "image_title": "macaron assortment",
  "food_items": ["yellow macaron", "white macaron", "green macaron"],
  "drink_items": []
}
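
The schema above can be checked programmatically before the output is passed downstream. The following is a minimal sketch rather than part of the released model code: the function name `validate_food_output` is hypothetical, and the type rules (`is_food` as 0 or 1, item fields as lists of strings) are inferred from the example above.

```python
import json

EXPECTED_KEYS = {"is_food", "image_title", "food_items", "drink_items"}


def validate_food_output(raw: str) -> dict:
    """Parse a JSON string and check it against the four-field schema shown above."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    # Reject booleans explicitly, since True == 1 in Python.
    if isinstance(data["is_food"], bool) or data["is_food"] not in (0, 1):
        raise ValueError("is_food must be 0 or 1")
    if not isinstance(data["image_title"], str):
        raise ValueError("image_title must be a string")
    for key in ("food_items", "drink_items"):
        items = data[key]
        if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
            raise ValueError(f"{key} must be a list of strings")
    return data
```

A failed check raises `ValueError`, which makes it easy to count malformed generations when evaluating on a held-out set.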

Quick start

from transformers import pipeline
from PIL import Image
import torch

# Load the fine-tuned model
pipe = pipeline(
    "image-text-to-text",
    model="CreatorJarvis/FoodExtract-Vision-SmolVLM2-500M-fine-tune",
    dtype=torch.bfloat16,
    device_map="auto"
)

# Load the input image (placeholder path; replace with your own file)
your_image = Image.open("path/to/image.jpg")

# Prepare input message
message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": your_image},  # PIL.Image object
        {"type": "text", "text": """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.

Only return valid JSON in the following form:

```json
{
  'is_food': 0,
  'image_title': '',
  'food_items': [],
  'drink_items': []
}
```"""}
    ]
}]

# Run generation (a typical pipeline call; max_new_tokens is an illustrative value)
output = pipe(text=message, max_new_tokens=256, return_full_text=False)
print(output[0]["generated_text"])
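
Because the prompt asks for JSON inside a fenced block and its template uses single quotes, the raw generated string may need cleanup before it can be used as a dictionary. The sketch below is one possible way to handle that; `parse_model_output` is a hypothetical helper, and the assumption that the model may echo the fence or the single-quote style is not confirmed by the model card.

```python
import ast
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this snippet easy to embed
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_model_output(generated_text: str) -> dict:
    """Extract a dict from generated text that may be wrapped in a code fence."""
    match = FENCE_RE.search(generated_text)
    payload = match.group(1) if match else generated_text.strip()
    try:
        result = json.loads(payload)
    except json.JSONDecodeError:
        # Fallback for single-quoted, Python-literal style output.
        # ast.literal_eval only evaluates literals, so it does not execute code.
        result = ast.literal_eval(payload)
    if not isinstance(result, dict):
        raise ValueError("model output is not a dictionary")
    return result
```

Strict JSON is tried first so that well-formed output takes the fast path; the literal fallback only runs when `json.loads` rejects the payload.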

Training Details

Dataset

Split       Samples  Description
Train       1,208    80% of total dataset
Validation  302      20% of total dataset
Total       1,510    Food images (1k) + Non-food images (500)
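
The split sizes follow from an 80/20 partition of the 1,510 samples. The model card does not say how the split was produced, so the following is only an illustrative sketch of a reproducible index split; the function name `split_indices` and the seed value are assumptions.

```python
import random


def split_indices(n_samples: int, val_fraction: float = 0.2, seed: int = 42):
    """Shuffle indices 0..n_samples-1 and split them into train and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global random state is untouched
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1510)
print(len(train_idx), len(val_idx))
```

With 1,510 samples and a validation fraction of 0.2, this yields 1,208 training and 302 validation indices, matching the table.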

Dataset Source: mrdbourke/FoodExtract-1k-Vision

Training Configuration

Hyperparameter               Value
Epochs                       4
Batch Size (per device)      4
Gradient Accumulation Steps  4
Effective Batch Size         16
Learning Rate                2e-4
LR Scheduler                 Constant
Warmup Ratio                 0.03
Optimizer                    AdamW (fused)
Max Grad Norm                1.0
Precision                    bf16
Gradient Checkpointing
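
The effective batch size is the per-device batch size multiplied by the gradient accumulation steps and the number of devices. The reported value of 16 is consistent with a single device (4 × 4 × 1). The short calculation below works through that arithmetic and estimates the number of optimizer steps; the single-device setting and the rounding-up of the last partial accumulation window are assumptions, not details stated in the configuration.

```python
import math

train_samples = 1208
per_device_batch_size = 4
grad_accum_steps = 4
num_devices = 1  # assumption: single-GPU training
epochs = 4

effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices
# Assumption: a final partial accumulation window still triggers an optimizer step.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

Under these assumptions the run takes 76 optimizer steps per epoch and 304 in total.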

Training Strategy

The Vision Encoder was frozen during training to:

  • Preserve pre-trained visual representations
  • Reduce trainable parameters and memory usage
  • Improve training stability on small datasets
  • Mitigate overfitting

This approach is inspired by the SmolDocling paper.
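
Freezing a sub-module usually comes down to disabling gradients for its parameters. The sketch below shows the idea with stand-in objects so it runs without downloading a model. The helper `freeze_by_prefix` and the parameter names (such as `model.vision_model`) are hypothetical; check the real names with `model.named_parameters()` before relying on a prefix.

```python
from types import SimpleNamespace


def freeze_by_prefix(named_parameters, prefixes):
    """Set requires_grad=False on parameters whose name starts with any prefix.

    Returns (frozen_count, untouched_count), counted in parameter tensors.
    """
    frozen = untouched = 0
    for name, param in named_parameters:
        if name.startswith(tuple(prefixes)):
            param.requires_grad = False
            frozen += 1
        else:
            untouched += 1
    return frozen, untouched


# Stand-in objects so the sketch runs without a model.
# With PyTorch, pass model.named_parameters() instead.
toy_params = [
    ("model.vision_model.encoder.weight", SimpleNamespace(requires_grad=True)),
    ("model.connector.proj.weight", SimpleNamespace(requires_grad=True)),
    ("model.text_model.layers.0.weight", SimpleNamespace(requires_grad=True)),
]
print(freeze_by_prefix(toy_params, ["model.vision_model"]))
```

Only the entry under the vision prefix is frozen here; the connector and language-model entries stay trainable, mirroring the strategy described above.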

Training Results

Epoch  Training Loss  Validation Loss
1      0.0842         0.0759
2      0.0816         0.0757
3      0.0237         0.0751
4      0.0172         0.0807

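Final Training Loss: 0.0518 (this is close to the mean of the four per-epoch training losses, about 0.0517, so it is likely the run-wide average reported by the trainer rather than the epoch-4 value of 0.0172)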
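
The per-epoch numbers can be inspected with a few lines of code. The sketch below copies the values from the table, finds the epoch with the lowest validation loss, and computes the mean of the per-epoch training losses. It is an illustration of how to read the table, not the author's checkpoint-selection procedure, which the card does not describe.

```python
# Per-epoch losses copied from the table above: (epoch, training loss, validation loss)
history = [
    (1, 0.0842, 0.0759),
    (2, 0.0816, 0.0757),
    (3, 0.0237, 0.0751),
    (4, 0.0172, 0.0807),
]

best_epoch, _, best_val = min(history, key=lambda row: row[2])
mean_train_loss = sum(row[1] for row in history) / len(history)

print(f"best epoch by validation loss: {best_epoch} ({best_val})")
print(f"mean of per-epoch training losses: {mean_train_loss:.4f}")
```

Validation loss is lowest at epoch 3 and rises at epoch 4 while training loss keeps falling, which is consistent with the onset of overfitting on a small dataset.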

Experiment Tracking

Visualize in Weights & Biases

Demo

Try the model on Hugging Face Spaces:

🚀 FoodExtract-Vision Demo

The demo compares outputs from the base model vs. the fine-tuned model side-by-side.

Limitations

  • Trained on a relatively small dataset (1.5k images)
  • May struggle with complex multi-item food scenes
  • Occasional repetitive generation patterns
  • Best performance on single-dish food images
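
One inexpensive way to soften the effect of repetitive generations is to de-duplicate the extracted lists after parsing. The helper below is a hypothetical post-processing sketch, not a feature of the model or the demo; it treats items as equal when they match case-insensitively after trimming whitespace.

```python
def dedupe_items(items):
    """Remove repeated entries (case-insensitive), keeping first occurrences in order."""
    seen = set()
    unique = []
    for item in items:
        cleaned = item.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

This only removes exact repeats; near-duplicates such as "green macaron" and "green macarons" are left untouched.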

Framework Versions

Library       Version
TRL           0.27.1
Transformers  4.57.6
PyTorch       2.9.0+cu126
Datasets      4.0.0
Tokenizers    0.22.2

Citation

If you use this model, please cite:

@misc{foodextract-vision-2025,
  title        = {FoodExtract-Vision: Fine-tuned SmolVLM2 for Structured Food Extraction},
  author       = {Jarvis Zhang},
  year         = 2025,
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/CreatorJarvis/FoodExtract-Vision-SmolVLM2-500M-fine-tune}}
}
