---
title: FoodExtract-Vision
emoji: 🍕
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
---
# 🍕🍔 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction
[![Model on HuggingFace](https://img.shields.io/badge/🤗%20Model-FoodExtract--Vision--SmolVLM2--500M-blue)](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
[![Dataset on HuggingFace](https://img.shields.io/badge/🤗%20Dataset-vlm--food--4k--not--food-green)](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
[![Base Model](https://img.shields.io/badge/🧠%20Base-SmolVLM2--500M--Video--Instruct-orange)](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
[![License](https://img.shields.io/badge/📄%20License-Apache%202.0-lightgrey)](https://www.apache.org/licenses/LICENSE-2.0)
---
## 📋 Overview
**FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces **structured JSON output** classifying whether food/drink items are visible and extracting them into organized lists.
Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this project demonstrates that even **small (~500M parameter) VLMs** can be fine-tuned to reliably produce structured outputs for domain-specific tasks, without needing PEFT/LoRA adapters.
> 💡 **Key Insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model **reliably generates valid JSON** matching the specified schema.
---
## 🎯 What Does It Do?
| | Input | Output |
|---|---|---|
| 📸 | Any image (food or non-food) | Structured JSON |
### Output Schema
```json
{
  "is_food": 1,
  "image_title": "Tandoori chicken with naan bread",
  "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
  "drink_items": ["lassi"]
}
```
| Field | Type | Description |
|---|---|---|
| `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
| `image_title` | `str` | Short food-related caption (blank if no food) |
| `food_items` | `list[str]` | List of visible edible food item nouns |
| `drink_items` | `list[str]` | List of visible drink item nouns |
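
Because the model returns its answer as text, downstream code usually needs to parse and check it. The following is a minimal sketch of such a check against the schema table above; `parse_food_json` and `SCHEMA` are hypothetical names that are not part of this project, and the handling of a surrounding Markdown fence is an assumption about what the model may emit.

```python
import json

# Field names and types copied from the schema table above.
SCHEMA = {
    "is_food": int,
    "image_title": str,
    "food_items": list,
    "drink_items": list,
}

FENCE = "`" * 3  # Markdown code-fence marker


def parse_food_json(raw: str) -> dict:
    """Parse model output text and validate it (hypothetical helper)."""
    text = raw.strip()
    # Assumption: the model may wrap its JSON in a Markdown code fence.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    data = json.loads(text)  # raises ValueError on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if type(data["is_food"]) is not int or data["is_food"] not in (0, 1):
        raise ValueError("is_food must be 0 or 1")
    for field in ("food_items", "drink_items"):
        if not all(isinstance(item, str) for item in data[field]):
            raise ValueError(f"{field} must contain only strings")
    return data
```

Unstructured replies, such as the base model's free-text answers, raise `ValueError`, which makes it easy to count how often a model breaks the format.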
---
## 🛠️ What Was Done: End-to-End Pipeline
This project covers the **full ML lifecycle** from dataset creation to deployment:
### Step 1: 📊 Dataset Creation (`00_create_vlm_dataset.ipynb`)
1. 🏷️ Loaded food labels from `data/food_dataset-2.jsonl` (generated via Qwen3-VL-8B inference on Food270 images)
2. 📝 Added metadata fields (`image_id`, `image_name`, `food270_class_name`, `image_source`)
3. 🖼️ Sampled **not-food images** from `data/not_food/` and created empty labels with `is_food = 0`
4. 🔀 Merged food + not-food labels into a unified dataset
5. 📁 Copied all images into `data/food_all/` and wrote `metadata.jsonl` for HuggingFace `imagefolder` format
6. 🚀 Pushed to HuggingFace Hub as [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
**Final dataset:** ~3,698 image-JSON pairs across **270 food categories** + not-food images
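
The label-merging and `metadata.jsonl` steps can be sketched roughly as follows. This is an illustration under assumptions rather than the notebook's code: `make_not_food_label` and `write_metadata` are hypothetical helpers, and the exact column layout of the published dataset may differ. The one fixed requirement is that the `imagefolder` format expects a `file_name` field in each `metadata.jsonl` row.

```python
import json
from pathlib import Path


def make_not_food_label(image_name: str) -> dict:
    """Build an empty label for a not-food image (hypothetical helper)."""
    return {
        "file_name": image_name,  # required by the imagefolder format
        "is_food": 0,
        "image_title": "",
        "food_items": [],
        "drink_items": [],
    }


def write_metadata(records: list[dict], out_dir: str) -> Path:
    """Write one JSON object per line to <out_dir>/metadata.jsonl."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

A merged list of food and not-food records can then be passed to `write_metadata(records, "data/food_all")` before the images are copied alongside the file.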
### Step 2: 🧪 Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)
- Tested `SmolVLM2-500M-Video-Instruct` on the food extraction task
- **Result:** The base model produced unstructured text like *"The given image is a food or drink item."* instead of valid JSON
- ❌ Base model **cannot** follow the structured output format
### Step 3: 📝 Data Formatting for SFT
Converted each sample to a **conversational message format** with three roles:
```
[SYSTEM]    → Expert food extractor persona
[USER]      → Image + JSON extraction prompt
[ASSISTANT] → Ground truth JSON output
```
- Used `PIL.Image` objects directly (not bytes) to preserve image quality
- 80/20 train/validation split with `random.seed(42)` for reproducibility
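
The formatting and split described above could look roughly like the sketch below. The prompt strings are placeholders, the sample keys `image` and `label_json` are assumptions, and `train_val_split` is a hypothetical helper; only the three-role layout, the 80/20 ratio, and `random.seed(42)` come from this README.

```python
import random

# Placeholder prompts; the real wording lives in the training notebook.
SYSTEM_PROMPT = "You are an expert food and drink image extractor."
USER_PROMPT = "Classify the image and return the food/drink JSON."


def format_data(sample: dict) -> dict:
    """Turn one sample into a system/user/assistant conversation."""
    return {
        "messages": [
            {"role": "system",
             "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user",
             "content": [
                 # The PIL image object is passed through untouched.
                 {"type": "image", "image": sample["image"]},
                 {"type": "text", "text": USER_PROMPT},
             ]},
            {"role": "assistant",
             "content": [{"type": "text", "text": sample["label_json"]}]},
        ]
    }


def train_val_split(samples: list, val_fraction: float = 0.2, seed: int = 42):
    """Shuffle with a fixed seed, then cut into train and validation lists."""
    random.seed(seed)
    shuffled = random.sample(samples, k=len(samples))
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[: len(shuffled) - n_val], shuffled[len(shuffled) - n_val:]
```

Calling `[format_data(s) for s in samples]` keeps each image as the same Python object that was loaded, which is the reason a list comprehension is used here.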
### Step 4: 🧊 Stage 1 Training: Frozen Vision Encoder
- **Froze** the vision encoder (`model.model.vision_model`)
- **Trained** only the LLM + connector layers
- **Goal:** Teach the language model to output valid JSON structure
- Used `SFTTrainer` from TRL with custom `collate_fn` for image-text batching
### Step 5: 🔥 Stage 2 Training: Full Model Fine-tuning
- **Unfroze** the vision encoder
- **Trained** all parameters with a **100x lower learning rate** (`2e-6` vs `2e-4`)
- **Goal:** Allow the vision encoder to adapt for better food recognition without catastrophic forgetting
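
Freezing and unfreezing the vision encoder between the two stages comes down to toggling `requires_grad` on its parameters. A minimal sketch, assuming the encoder is reachable at `model.model.vision_model` as noted in Step 4; `set_vision_encoder_trainable` is a hypothetical helper name, not a function from the notebook.

```python
def set_vision_encoder_trainable(model, trainable: bool) -> int:
    """Toggle requires_grad on every vision-encoder parameter.

    Returns the number of parameter tensors that were updated.
    Assumes the encoder lives at model.model.vision_model.
    """
    updated = 0
    for param in model.model.vision_model.parameters():
        param.requires_grad = trainable
        updated += 1
    return updated


# Stage 1: set_vision_encoder_trainable(model, False)  # train LLM + connector only
# Stage 2: set_vision_encoder_trainable(model, True)   # train all parameters
```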
### Step 6: 📈 Evaluation & Comparison
- Compared outputs from three models side-by-side:
  - 🔴 **Pre-trained** (base model): fails at structured output
  - 🟡 **Stage 1** (frozen vision): learns JSON format
  - 🟢 **Stage 2** (full fine-tune): best food recognition + JSON format
### Step 7: 🚀 Deployment
- Uploaded fine-tuned model to HuggingFace Hub
- Built Gradio demo with side-by-side comparison
- Deployed as a HuggingFace Space
---
## 🏗️ Architecture & Training Details
### 🧠 Base Model
| Property | Value |
|---|---|
| Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
| Parameters | ~500M |
| Precision | `bfloat16` |
| Attention | `eager` |
### 📊 Dataset
| Property | Value |
|---|---|
| Source | [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| Total Samples | ~3,698 image-JSON pairs |
| Train / Val Split | 80% / 20% |
| Food Categories | 270 (from Food270 dataset) |
| Non-food Images | Random internet images |
| Label Source | Qwen3-VL-8B inference outputs |
### 🔧 Two-Stage Training Strategy
Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576):
#### 🧊 Stage 1: LLM Alignment (Frozen Vision Encoder)
| Parameter | Value |
|---|---|
| Vision Encoder | ❄️ Frozen |
| Trainable | LLM + connector layers |
| Learning Rate | `2e-4` |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |
#### 🔥 Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)
| Parameter | Value |
|---|---|
| Vision Encoder | 🔥 Unfrozen |
| Trainable | All parameters |
| Learning Rate | `2e-6` (100x lower than Stage 1) |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |
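
The two hyperparameter tables differ only in the learning rate and in whether the vision encoder is frozen. The sketch below collects them in a plain dataclass to make that explicit; `StageConfig` is a hypothetical container (not the TRL config object), its field names are borrowed from the corresponding `TrainingArguments` options, and the effective batch size assumes a single device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (illustrative container)."""
    name: str
    freeze_vision_encoder: bool
    learning_rate: float
    num_train_epochs: int = 2
    per_device_train_batch_size: int = 8
    gradient_accumulation_steps: int = 4
    optim: str = "adamw_torch_fused"
    lr_scheduler_type: str = "constant"
    warmup_ratio: float = 0.03
    bf16: bool = True

    @property
    def effective_batch_size(self) -> int:
        # Samples per optimizer step on a single device (assumption).
        return self.per_device_train_batch_size * self.gradient_accumulation_steps


STAGE_1 = StageConfig("stage1_frozen_vision", True, 2e-4)
STAGE_2 = StageConfig("stage2_full_finetune", False, 2e-6)

# Ratio between the two learning rates, rounded to absorb float error.
LR_RATIO = round(STAGE_1.learning_rate / STAGE_2.learning_rate)
```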
---
## 🚀 Quick Start
### 📦 Installation
```bash
pip install transformers torch gradio spaces accelerate pillow
```
### 🔮 Inference with Pipeline
````python
import torch
from transformers import pipeline
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"

# Load the fine-tuned model as an image-text-to-text pipeline.
pipe = pipeline(
    "image-text-to-text",
    model=FINE_TUNED_MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

# Extraction prompt, including the JSON template the model should fill in.
prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
Only return valid JSON in the following form:
```json
{
"is_food": 0,
"image_title": "",
"food_items": [],
"drink_items": []
}
```
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

# A single conversation returns a list with one result dict;
# the last message of "generated_text" is the assistant reply.
output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
````
### 🧪 Inference without Pipeline
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"

# Load the fine-tuned model and its processor.
model = AutoModelForImageTextToText.from_pretrained(
    FINE_TUNED_MODEL_ID,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)

image = Image.open("path/to/your/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "YOUR_PROMPT_HERE"},
        ],
    }
]

# Tokenize the chat, then move it to the model's device.
# Floating-point inputs are cast to bfloat16 to match the model dtype.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(decoded)
```
---
## 🎮 Gradio Demo
This Space runs a **side-by-side comparison** between the base model and the fine-tuned model.
### ▶️ Running Locally
```bash
cd demos/FoodExtract-Vision
pip install -r requirements.txt
python app.py
```
### 🖥️ What the Demo Shows
1. 📤 **Upload** any image
2. 🔄 **Compare** outputs from the base model vs. the fine-tuned model side-by-side
3. 📊 See how fine-tuning enables **reliable structured JSON extraction**
### 📸 Example Images Included
The demo comes with pre-loaded examples to try instantly.
---
## 📁 Project Structure
```
vlm_finetune/
├── 📓 00_create_vlm_dataset.ipynb  # Dataset creation pipeline
├── 📓 01-fine_tune_vlm.ipynb  # First fine-tuning experiment (Gemma-3n)
├── 📓 01-fine_tune_vlm-v2-smolVLM.ipynb  # SmolVLM 256M experiment
├── 📓 01_fine_tune_vlm_v3_smolVLM_500m.ipynb  # ✅ Final: SmolVLM 500M two-stage training
├── 📓 qwen3-food270-inference-viewer.ipynb  # Dataset visualization tool
├── 📄 README.md  # Root project README
├── 📁 data/
│   ├── food_dataset-2.jsonl  # Qwen3-VL-8B inference outputs
│   ├── food_labels_updated.json  # Processed food labels
│   ├── 📁 10_images_270_class/  # 10 sample images per category
│   ├── 📁 food_all/  # Merged dataset (food + not-food)
│   │   └── metadata.jsonl  # HuggingFace imagefolder metadata
│   └── 📁 not_food/  # Non-food images
└── 📁 demos/
    └── 📁 FoodExtract-Vision/
        ├── app.py  # 🚀 Gradio demo application
        ├── README.md  # 📖 This file
        ├── requirements.txt  # 📦 Python dependencies
        └── 📁 examples/  # 🖼️ Example images
            ├── 36741.jpg
            ├── IMG_3808.JPG
            └── istockphoto-175500494-612x612.jpg
```
---
## 📝 Key Learnings & Notes
### ✅ What Worked
- 🏗️ **Two-stage training** significantly improved output quality compared to single-stage
- 🧊 **Freezing the vision encoder first** let the LLM learn JSON format without vision interference
- 🐢 **100x lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting
- 🤏 Even a **500M parameter model** can learn reliable structured output generation
- 📝 **Custom `collate_fn`** with proper label masking (pad tokens + image tokens → `-100`) was essential
- 🔀 **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer`
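
The label masking mentioned in the list above can be illustrated on plain Python lists. This is a simplified sketch: the real `collate_fn` operates on batched tensors, `mask_labels` is a hypothetical helper, and the token ids in the usage example are made up. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_labels(input_ids: list[int], pad_token_id: int, image_token_id: int) -> list[int]:
    """Copy input_ids into labels, replacing pad and image tokens with -100."""
    masked_ids = {pad_token_id, image_token_id}
    return [IGNORE_INDEX if token in masked_ids else token for token in input_ids]


# Made-up ids: 9 stands for the image placeholder token, 0 for padding.
example_labels = mask_labels([5, 9, 9, 7, 0, 0], pad_token_id=0, image_token_id=9)
```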
### ⚠️ Important Notes
- **Dtype consistency:** Model inputs must match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model)
- **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt
- **PIL images over bytes:** Using `format_data()` as a list comprehension instead of `dataset.map()` preserves PIL image types
- **Gradient checkpointing:** Set `use_reentrant=False` to avoid warnings and ensure compatibility
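
One way to handle the system-prompt note above is to merge the system text into the first user message before applying the chat template. The helper below is a hypothetical sketch, not code from the notebook; it assumes the message layout used elsewhere in this README (a list of `{"role", "content"}` dicts whose content is a list of typed parts).

```python
import copy


def fold_system_prompt(messages: list[dict]) -> list[dict]:
    """Return new messages with system text prepended to the first user text."""
    system_text = " ".join(
        part["text"]
        for message in messages if message["role"] == "system"
        for part in message["content"] if part["type"] == "text"
    )
    # Shallow copies so the caller's message dicts are not modified.
    remaining = [copy.copy(m) for m in messages if m["role"] != "system"]
    if not system_text:
        return remaining
    for message in remaining:
        if message["role"] != "user":
            continue
        new_content, merged = [], False
        for part in message["content"]:
            if part["type"] == "text" and not merged:
                new_content.append(
                    {"type": "text", "text": f"{system_text}\n\n{part['text']}"}
                )
                merged = True
            else:
                new_content.append(part)
        if not merged:
            # User message had no text part: add the system text as one.
            new_content.append({"type": "text", "text": system_text})
        message["content"] = new_content
        break
    return remaining
```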
### 🧪 Experiments Tried
| Notebook | Model | Approach | Result |
|---|---|---|---|
| `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | ✅ Works, but the model is larger |
| `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | 🟡 Limited capacity |
| `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | **Two-stage full fine-tune** | ✅ **Best results** |
---
## 🔗 Links
| Resource | URL |
|---|---|
| 🤗 Fine-tuned Model | [berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3](https://huggingface.co/berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3) |
| 🤗 Dataset | [berkeruveyik/vlm-food-4k-not-food-dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| 🤗 Base Model | [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) |
| 📄 SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
| 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
| 📚 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
| 📚 HF Vision Fine-tune Guide | [ai.google.dev/gemma/docs](https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora?hl=tr) |
---
## 📄 License
This project is released under the Apache 2.0 license. Please refer to the respective model and dataset cards for additional licensing information.
---
*Built with ❤️ using 🤗 Transformers, TRL, and Gradio, by [Berker Üveyik](https://huggingface.co/berkeruveyik)*