---
title: FoodExtract-Vision
emoji: 🍔
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.50.0
python_version: '3.12'
app_file: app.py
pinned: false
---
# 🍔🔍 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction
## 📋 Overview
FoodExtract-Vision is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces structured JSON output classifying whether food/drink items are visible and extracting them into organized lists.
Built on top of SmolVLM2-500M-Video-Instruct, this project demonstrates that even small (~500M parameter) VLMs can be fine-tuned to reliably produce structured outputs for domain-specific tasks, without needing PEFT/LoRA adapters.

> 💡 **Key Insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model reliably generates valid JSON matching the specified schema.
## 🎯 What Does It Do?
| Input | Output |
|---|---|
| 📸 Any image (food or non-food) | Structured JSON |
### Output Schema
```json
{
    "is_food": 1,
    "image_title": "Tandoori chicken with naan bread",
    "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
    "drink_items": ["lassi"]
}
```
| Field | Type | Description |
|---|---|---|
| `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
| `image_title` | `str` | Short food-related caption (blank if no food) |
| `food_items` | `list[str]` | List of visible edible food item nouns |
| `drink_items` | `list[str]` | List of visible edible drink item nouns |
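A minimal sketch of validating a model response against this schema. The helper name `parse_food_json` is illustrative (not part of the project code); it assumes the model may wrap its answer in a Markdown code fence:

```python
import json

def parse_food_json(raw: str) -> dict:
    """Parse a model response and sanity-check it against the schema above."""
    # Strip an optional Markdown ```json fence around the model's answer.
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    data = json.loads(cleaned)
    assert data["is_food"] in (0, 1)
    assert isinstance(data["image_title"], str)
    assert all(isinstance(x, str) for x in data["food_items"])
    assert all(isinstance(x, str) for x in data["drink_items"])
    return data
```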
## 🛠️ What Was Done: End-to-End Pipeline
This project covers the full ML lifecycle from dataset creation to deployment:
### Step 1: 📊 Dataset Creation (`00_create_vlm_dataset.ipynb`)
- 🏷️ Loaded food labels from `data/food_dataset-2.jsonl` (generated via Qwen3-VL-8B inference on Food270 images)
- 📝 Added metadata fields (`image_id`, `image_name`, `food270_class_name`, `image_source`)
- 🖼️ Sampled not-food images from `data/not_food/` and created empty labels with `is_food = 0`
- 🔗 Merged food and not-food labels into a unified dataset
- 📁 Copied all images into `data/food_all/` and wrote `metadata.jsonl` for the HuggingFace `imagefolder` format
- 🚀 Pushed to the HuggingFace Hub as `berkeruveyik/vlm-food-4k-not-food-dataset`
**Final dataset:** ~3,698 image-JSON pairs across 270 food categories + not-food images
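Writing `metadata.jsonl` for the `imagefolder` loader can be sketched as follows. The helper name and the record values are illustrative; the only hard requirement of the format is one JSON object per line with a `file_name` key:

```python
import json
from pathlib import Path

def write_imagefolder_metadata(records, out_dir):
    """Write metadata.jsonl in the layout HuggingFace's imagefolder loader
    expects: one JSON object per line, each with a file_name key pointing
    at an image inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    meta_path = out / "metadata.jsonl"
    with meta_path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return meta_path

# One food and one not-food record (file names and values are illustrative):
records = [
    {"file_name": "0001.jpg", "is_food": 1, "image_title": "Pizza",
     "food_items": ["pizza"], "drink_items": []},
    {"file_name": "9001.jpg", "is_food": 0, "image_title": "",
     "food_items": [], "drink_items": []},
]
```

The resulting folder can then be loaded with `datasets.load_dataset("imagefolder", data_dir=...)` and pushed to the Hub.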
### Step 2: 🧪 Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)
- Tested `SmolVLM2-500M-Video-Instruct` on the food extraction task
- Result: the base model produced unstructured text like "The given image is a food or drink item." instead of valid JSON
- ❌ The base model cannot follow the structured output format
### Step 3: 📝 Data Formatting for SFT
Converted each sample to a conversational message format with three roles:
```
[SYSTEM]    → Expert food extractor persona
[USER]      → Image + JSON extraction prompt
[ASSISTANT] → Ground truth JSON output
```
- Used `PIL.Image` objects directly (not bytes) to preserve image quality
- 80/20 train/validation split with `random.seed(42)` for reproducibility
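The three-role conversion can be sketched with the project's `format_data()` helper (mentioned in the notes below); the exact sample field names (`image`, `label_json`) are assumptions for illustration:

```python
def format_data(sample, system_prompt, user_prompt):
    """Turn one dataset row into the three-role chat format used for SFT.
    The 'image' and 'label_json' field names are illustrative."""
    return {
        "messages": [
            {"role": "system",
             "content": [{"type": "text", "text": system_prompt}]},
            {"role": "user",
             "content": [{"type": "image", "image": sample["image"]},
                         {"type": "text", "text": user_prompt}]},
            {"role": "assistant",
             "content": [{"type": "text", "text": sample["label_json"]}]},
        ]
    }

# Applied as a list comprehension (not dataset.map) so PIL image types survive:
# train_data = [format_data(s, SYSTEM_PROMPT, USER_PROMPT) for s in dataset]
```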
### Step 4: 🧊 Stage 1 Training (Frozen Vision Encoder)
- Froze the vision encoder (`model.model.vision_model`)
- Trained only the LLM + connector layers
- Goal: teach the language model to output valid JSON structure
- Used `SFTTrainer` from TRL with a custom `collate_fn` for image-text batching
### Step 5: 🔥 Stage 2 Training (Full Model Fine-tuning)
- Unfroze the vision encoder
- Trained all parameters with a 100x lower learning rate (`2e-6` vs `2e-4`)
- Goal: allow the vision encoder to adapt for better food recognition without catastrophic forgetting
### Step 6: 📊 Evaluation & Comparison
- Compared outputs from 3 models side-by-side:
  - 🔴 Pre-trained (base model): fails at structured output
  - 🟡 Stage 1 (frozen vision): learns the JSON format
  - 🟢 Stage 2 (full fine-tune): best food recognition + JSON format
### Step 7: 🚀 Deployment
- Uploaded fine-tuned model to HuggingFace Hub
- Built Gradio demo with side-by-side comparison
- Deployed as a HuggingFace Space
## 🏗️ Architecture & Training Details

### 🧠 Base Model
| Property | Value |
|---|---|
| Model | HuggingFaceTB/SmolVLM2-500M-Video-Instruct |
| Parameters | ~500M |
| Precision | bfloat16 |
| Attention | eager |
### 📊 Dataset
| Property | Value |
|---|---|
| Source | berkeruveyik/vlm-food-4k-not-food-dataset |
| Total Samples | ~3,698 image-JSON pairs |
| Train / Val Split | 80% / 20% |
| Food Categories | 270 (from Food270 dataset) |
| Non-food Images | Random internet images |
| Label Source | Qwen3-VL-8B inference outputs |
### 🧠 Two-Stage Training Strategy
Inspired by the SmolVLM Docling paper:
#### 🧊 Stage 1: LLM Alignment (Frozen Vision Encoder)
| Parameter | Value |
|---|---|
| Vision Encoder | ❄️ Frozen |
| Trainable | LLM + connector layers |
| Learning Rate | 2e-4 |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | adamw_torch_fused |
| LR Scheduler | constant |
| Warmup Ratio | 0.03 |
| Precision | bf16 |
#### 🔥 Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)
| Parameter | Value |
|---|---|
| Vision Encoder | 🔥 Unfrozen |
| Trainable | All parameters |
| Learning Rate | 2e-6 (100x lower than Stage 1) |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | adamw_torch_fused |
| LR Scheduler | constant |
| Warmup Ratio | 0.03 |
| Precision | bf16 |
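The hyperparameters above map onto a TRL `SFTConfig` roughly as follows. This is a sketch for Stage 1 only; `output_dir` is illustrative, and the values are taken from the tables above:

```python
from trl import SFTConfig

stage1_args = SFTConfig(
    output_dir="smolvlm2-food-stage1",   # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,       # effective batch size 32
    learning_rate=2e-4,                  # Stage 2 drops this to 2e-6
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    optim="adamw_torch_fused",
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    remove_unused_columns=False,         # required with a custom collate_fn
)
```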
## 🚀 Quick Start

### 📦 Installation

```bash
pip install transformers torch gradio spaces accelerate
```

### 🎮 Inference with Pipeline
````python
import torch
from transformers import pipeline

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"

pipe = pipeline(
    "image-text-to-text",
    model=FINE_TUNED_MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
Only return valid JSON in the following form:
```json
{
    "is_food": 0,
    "image_title": "",
    "food_items": [],
    "drink_items": []
}
```
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
````
### 🧪 Inference without Pipeline
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"

model = AutoModelForImageTextToText.from_pretrained(
    FINE_TUNED_MODEL_ID,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)

image = Image.open("path/to/your/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "YOUR_PROMPT_HERE"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(decoded)
```
## 🎮 Gradio Demo
This Space runs a side-by-side comparison between the base model and the fine-tuned model.
### ▶️ Running Locally

```bash
cd demos/FoodExtract-Vision
pip install -r requirements.txt
python app.py
```
### 🖥️ What the Demo Shows

- 📤 Upload any image
- 🔍 Compare outputs from the base model vs. the fine-tuned model side-by-side
- 📊 See how fine-tuning enables reliable structured JSON extraction

### 📸 Example Images Included
The demo comes with pre-loaded examples to try instantly.
## 📂 Project Structure
```
vlm_finetune/
├── 📓 00_create_vlm_dataset.ipynb            # Dataset creation pipeline
├── 📓 01-fine_tune_vlm.ipynb                 # First fine-tuning experiment (Gemma-3n)
├── 📓 01-fine_tune_vlm-v2-smolVLM.ipynb      # SmolVLM 256M experiment
├── 📓 01_fine_tune_vlm_v3_smolVLM_500m.ipynb # ✅ Final: SmolVLM 500M two-stage training
├── 📓 qwen3-food270-inference-viewer.ipynb   # Dataset visualization tool
├── 📄 README.md                              # Root project README
├── 📁 data/
│   ├── food_dataset-2.jsonl                  # Qwen3-VL-8B inference outputs
│   ├── food_labels_updated.json              # Processed food labels
│   ├── 📁 10_images_270_class/               # 10 sample images per category
│   ├── 📁 food_all/                          # Merged dataset (food + not-food)
│   │   └── metadata.jsonl                    # HuggingFace imagefolder metadata
│   └── 📁 not_food/                          # Non-food images
└── 📁 demos/
    └── 📁 FoodExtract-Vision/
        ├── app.py                            # 🚀 Gradio demo application
        ├── README.md                         # 📄 This file
        ├── requirements.txt                  # 📦 Python dependencies
        └── 📁 examples/                      # 🖼️ Example images
            ├── 36741.jpg
            ├── IMG_3808.JPG
            └── istockphoto-175500494-612x612.jpg
```
## 📚 Key Learnings & Notes

### ✅ What Worked
- 🏗️ Two-stage training significantly improved output quality compared to single-stage
- 🧊 Freezing the vision encoder first let the LLM learn the JSON format without vision interference
- 🔢 The 100x lower learning rate in Stage 2 (`2e-6` vs `2e-4`) prevented catastrophic forgetting
- 🤏 Even a 500M-parameter model can learn reliable structured output generation
- 🔧 A custom `collate_fn` with proper label masking (pad tokens + image tokens → `-100`) was essential
- 📋 `remove_unused_columns = False` is critical when using a custom data collator with `SFTTrainer`
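The label-masking step inside the custom `collate_fn` can be sketched as below; the helper name is illustrative, and the actual token IDs come from the processor's tokenizer:

```python
import torch

def mask_labels(input_ids, pad_token_id, image_token_id):
    """Build SFT labels from input_ids: positions holding padding or image
    placeholder tokens are set to -100 so the loss ignores them."""
    labels = input_ids.clone()
    labels[labels == pad_token_id] = -100
    labels[labels == image_token_id] = -100
    return labels
```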
### ⚠️ Important Notes
- **Dtype consistency:** model inputs must match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model)
- **System prompt handling:** when not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt
- **PIL images over bytes:** using `format_data()` as a list comprehension instead of `dataset.map()` preserves PIL image types
- **Gradient checkpointing:** set `use_reentrant=False` to avoid warnings and ensure compatibility
## 🧪 Experiments Tried

| Notebook | Model | Approach | Result |
|---|---|---|---|
| `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | ✅ Works but larger model |
| `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | 🟡 Limited capacity |
| `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | Two-stage full fine-tune | ✅ Best results |
## 🔗 Links
| Resource | URL |
|---|---|
| 🤗 Fine-tuned Model | berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3 |
| 🤗 Dataset | berkeruveyik/vlm-food-4k-not-food-dataset |
| 🤗 Base Model | HuggingFaceTB/SmolVLM2-500M-Video-Instruct |
| 📄 SmolVLM Docling Paper | arxiv.org/pdf/2503.11576 |
| 📚 TRL Documentation | huggingface.co/docs/trl |
| 📚 PEFT GitHub | github.com/huggingface/peft |
| 📚 HF Vision Fine-tune Guide | ai.google.dev/gemma/docs |
## 📄 License

This project is released under the Apache 2.0 license. Please refer to the respective model and dataset cards for additional licensing information.

Built with ❤️ using 🤗 Transformers, TRL, and Gradio, by Berker Üveyik