---
title: FoodExtract-Vision
emoji: 🍕
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.50.0
python_version: '3.12'
app_file: app.py
pinned: false
---

๐Ÿ•๐Ÿ” FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction



## 📋 Overview

FoodExtract-Vision is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces structured JSON output classifying whether food/drink items are visible and extracting them into organized lists.

Built on top of SmolVLM2-500M-Video-Instruct, this project demonstrates that even small (~500M-parameter) VLMs can be fine-tuned to reliably produce structured outputs for domain-specific tasks, without needing PEFT/LoRA adapters.

> 💡 **Key insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model reliably generates valid JSON matching the specified schema.


## 🎯 What Does It Do?

| Input | Output |
|---|---|
| 📸 Any image (food or non-food) | Structured JSON |

### Output Schema

```json
{
  "is_food": 1,
  "image_title": "Tandoori chicken with naan bread",
  "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
  "drink_items": ["lassi"]
}
```
| Field | Type | Description |
|---|---|---|
| `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
| `image_title` | `str` | Short food-related caption (blank if no food) |
| `food_items` | `list[str]` | List of visible edible food item nouns |
| `drink_items` | `list[str]` | List of visible edible drink item nouns |
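Downstream code can check a response against this schema before using it; a minimal stdlib sketch (the `validate_extraction` helper is illustrative, not part of the project):

```python
import json

# Hypothetical helper: parse a model response and check it against the schema above.
def validate_extraction(raw: str) -> dict:
    data = json.loads(raw)
    assert data["is_food"] in (0, 1)
    assert isinstance(data["image_title"], str)
    assert all(isinstance(x, str) for x in data["food_items"])
    assert all(isinstance(x, str) for x in data["drink_items"])
    return data

sample = ('{"is_food": 1, "image_title": "Tandoori chicken with naan bread", '
          '"food_items": ["tandoori chicken", "naan bread"], "drink_items": ["lassi"]}')
result = validate_extraction(sample)
print(result["food_items"])  # ['tandoori chicken', 'naan bread']
```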

## 🛠️ What Was Done: End-to-End Pipeline

This project covers the full ML lifecycle from dataset creation to deployment:

### Step 1: 📊 Dataset Creation (`00_create_vlm_dataset.ipynb`)

  1. ๐Ÿท๏ธ Loaded food labels from data/food_dataset-2.jsonl (generated via Qwen3-VL-8B inference on Food270 images)
  2. ๐Ÿ“ Added metadata fields (image_id, image_name, food270_class_name, image_source)
  3. ๐Ÿ–ผ๏ธ Sampled not-food images from data/not_food/ and created empty labels with is_food = 0
  4. ๐Ÿ”€ Merged food + not-food labels into a unified dataset
  5. ๐Ÿ“ Copied all images into data/food_all/ and wrote metadata.jsonl for HuggingFace imagefolder format
  6. ๐Ÿš€ Pushed to HuggingFace Hub as berkeruveyik/vlm-food-4k-not-food-dataset

**Final dataset:** ~3,698 image-JSON pairs across 270 food categories, plus not-food images
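The `imagefolder` convention from step 5 can be sketched with the stdlib alone; the records below are invented stand-ins, not rows from the real dataset:

```python
import json
import tempfile
from pathlib import Path

# Sketch of the HuggingFace `imagefolder` layout: all images sit in one folder
# next to a metadata.jsonl whose `file_name` column points at each image.
records = [
    {"file_name": "0001.jpg", "is_food": 1, "image_title": "Pizza margherita",
     "food_items": ["pizza"], "drink_items": []},
    {"file_name": "0002.jpg", "is_food": 0, "image_title": "",
     "food_items": [], "drink_items": []},
]

folder = Path(tempfile.mkdtemp()) / "food_all"
folder.mkdir()
with open(folder / "metadata.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# load_dataset("imagefolder", data_dir=folder) would then pair each row with its image.
lines = (folder / "metadata.jsonl").read_text().splitlines()
print(len(lines))  # 2
```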

### Step 2: 🧪 Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)

  • Tested SmolVLM2-500M-Video-Instruct on the food extraction task
  • Result: The base model produced unstructured text like "The given image is a food or drink item." instead of valid JSON
  • โŒ Base model cannot follow the structured output format

### Step 3: 📝 Data Formatting for SFT

Converted each sample to a conversational message format with three roles:

```
[SYSTEM]    → Expert food extractor persona
[USER]      → Image + JSON extraction prompt
[ASSISTANT] → Ground truth JSON output
```

- Used `PIL.Image` objects directly (not bytes) to preserve image quality
- 80/20 train/validation split with `random.seed(42)` for reproducibility
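A minimal stdlib sketch of this formatting and split; the prompts, field names, and stand-in rows below are placeholders, not the notebook's actual values:

```python
import random

SYSTEM_PROMPT = "You are an expert food extractor."  # placeholder persona text
USER_PROMPT = "Extract food/drink items as JSON."    # placeholder extraction prompt

def format_data(sample):
    # sample["image"] would be a PIL.Image; keeping the object (not bytes) preserves quality
    return {"messages": [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {"role": "user", "content": [{"type": "image", "image": sample["image"]},
                                     {"type": "text", "text": USER_PROMPT}]},
        {"role": "assistant", "content": [{"type": "text", "text": sample["label_json"]}]},
    ]}

dataset = [{"image": f"img_{i}", "label_json": "{}"} for i in range(100)]  # stand-in rows
formatted = [format_data(s) for s in dataset]  # list comprehension, not dataset.map()

random.seed(42)  # reproducible 80/20 split
random.shuffle(formatted)
split = int(0.8 * len(formatted))
train, val = formatted[:split], formatted[split:]
print(len(train), len(val))  # 80 20
```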

### Step 4: 🧊 Stage 1 Training (Frozen Vision Encoder)

  • Froze the vision encoder (model.model.vision_model)
  • Trained only the LLM + connector layers
  • Goal: Teach the language model to output valid JSON structure
  • Used SFTTrainer from TRL with custom collate_fn for image-text batching
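The freeze amounts to disabling gradients on the vision tower; a sketch on a toy module (the `vision_model` / `text_model` attributes mirror the SmolVLM2 layout, but the module itself is a stand-in, not the real model):

```python
import torch.nn as nn

# Toy stand-in with the same top-level split as SmolVLM2 (vision encoder + LLM).
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(16, 16)  # stands in for the vision encoder
        self.text_model = nn.Linear(16, 16)    # stands in for the LLM + connector

model = ToyVLM()

# Stage 1: freeze the vision encoder, train everything else.
for p in model.vision_model.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total}")  # trainable: 272/544
```

Stage 2 then reverses the loop (`p.requires_grad = True`) before the second training run.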

### Step 5: 🔥 Stage 2 Training (Full Model Fine-tuning)

  • Unfroze the vision encoder
  • Trained all parameters with a 100x lower learning rate (2e-6 vs 2e-4)
  • Goal: Allow the vision encoder to adapt for better food recognition without catastrophic forgetting

### Step 6: 📈 Evaluation & Comparison

  • Compared outputs from 3 models side-by-side:
    • ๐Ÿ”ด Pre-trained (base model) โ€” fails at structured output
    • ๐ŸŸก Stage 1 (frozen vision) โ€” learns JSON format
    • ๐ŸŸข Stage 2 (full fine-tune) โ€” best food recognition + JSON format

### Step 7: 🚀 Deployment

  • Uploaded fine-tuned model to HuggingFace Hub
  • Built Gradio demo with side-by-side comparison
  • Deployed as a HuggingFace Space

๐Ÿ—๏ธ Architecture & Training Details

### 🧠 Base Model

| Property | Value |
|---|---|
| Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
| Parameters | ~500M |
| Precision | bfloat16 |
| Attention | eager |

### 📊 Dataset

| Property | Value |
|---|---|
| Source | `berkeruveyik/vlm-food-4k-not-food-dataset` |
| Total Samples | ~3,698 image-JSON pairs |
| Train / Val Split | 80% / 20% |
| Food Categories | 270 (from the Food270 dataset) |
| Non-food Images | Random internet images |
| Label Source | Qwen3-VL-8B inference outputs |

### 🔧 Two-Stage Training Strategy

Inspired by the SmolVLM Docling paper:

#### 🧊 Stage 1: LLM Alignment (Frozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | ❄️ Frozen |
| Trainable | LLM + connector layers |
| Learning Rate | 2e-4 |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | constant |
| Warmup Ratio | 0.03 |
| Precision | bf16 |

#### 🔥 Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | 🔥 Unfrozen |
| Trainable | All parameters |
| Learning Rate | 2e-6 (100× lower than Stage 1) |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | constant |
| Warmup Ratio | 0.03 |
| Precision | bf16 |
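The two stages differ only in the vision-encoder state and the learning rate. Expressed as plain config dicts (illustrative, not the notebook's actual `SFTConfig` objects; `freeze_vision_encoder` is a made-up key for the diff):

```python
# Shared hyperparameters from the tables above, as plain dicts.
common = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,   # effective batch size 8 * 4 = 32
    "num_train_epochs": 2,
    "optim": "adamw_torch_fused",
    "lr_scheduler_type": "constant",
    "warmup_ratio": 0.03,
    "bf16": True,
}

stage1 = {**common, "learning_rate": 2e-4, "freeze_vision_encoder": True}
stage2 = {**common, "learning_rate": 2e-6, "freeze_vision_encoder": False}

effective_batch = stage1["per_device_train_batch_size"] * stage1["gradient_accumulation_steps"]
lr_ratio = stage1["learning_rate"] / stage2["learning_rate"]
print(effective_batch, round(lr_ratio))  # 32 100
```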

## 🚀 Quick Start

### 📦 Installation

```bash
pip install transformers torch gradio spaces accelerate
```

### 🔮 Inference with Pipeline

````python
import torch
from transformers import pipeline

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"

pipe = pipeline(
    "image-text-to-text",
    model=FINE_TUNED_MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.

Only return valid JSON in the following form:

```json
{
  "is_food": 0,
  "image_title": "",
  "food_items": [],
  "drink_items": []
}
```
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
````


### 🧪 Inference without Pipeline

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"

model = AutoModelForImageTextToText.from_pretrained(
    FINE_TUNED_MODEL_ID,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)

image = Image.open("path/to/your/image.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "YOUR_PROMPT_HERE"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)  # cast floating inputs (pixel values) to the model's dtype

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(decoded)
```
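The decoded string may arrive with the JSON wrapped in a Markdown code fence; a small stdlib helper (illustrative, not part of the repo) recovers the dict either way:

```python
import json
import re

# Illustrative helper: pull the first JSON object out of a model response,
# whether or not it is wrapped in a Markdown code fence.
def extract_json(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

decoded = ('```json\n{"is_food": 1, "image_title": "Ramen", '
           '"food_items": ["ramen"], "drink_items": []}\n```')
print(extract_json(decoded)["food_items"])  # ['ramen']
```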
## 🎮 Gradio Demo

This Space runs a side-by-side comparison between the base model and the fine-tuned model.

### ▶️ Running Locally

```bash
cd demos/FoodExtract-Vision
pip install -r requirements.txt
python app.py
```

### 🖥️ What the Demo Shows

1. 📤 Upload any image
2. 🔄 Compare outputs from the base model vs. the fine-tuned model side-by-side
3. 📊 See how fine-tuning enables reliable structured JSON extraction

### 📸 Example Images Included

The demo comes with pre-loaded examples to try instantly.


๐Ÿ“ Project Structure

vlm_finetune/
โ”œโ”€โ”€ ๐Ÿ““ 00_create_vlm_dataset.ipynb          # Dataset creation pipeline
โ”œโ”€โ”€ ๐Ÿ““ 01-fine_tune_vlm.ipynb               # First fine-tuning experiment (Gemma-3n)
โ”œโ”€โ”€ ๐Ÿ““ 01-fine_tune_vlm-v2-smolVLM.ipynb    # SmolVLM 256M experiment
โ”œโ”€โ”€ ๐Ÿ““ 01_fine_tune_vlm_v3_smolVLM_500m.ipynb # โœ… Final: SmolVLM 500M two-stage training
โ”œโ”€โ”€ ๐Ÿ““ qwen3-food270-inference-viewer.ipynb  # Dataset visualization tool
โ”œโ”€โ”€ ๐Ÿ“„ README.md                            # Root project README
โ”œโ”€โ”€ ๐Ÿ“ data/
โ”‚   โ”œโ”€โ”€ food_dataset-2.jsonl                # Qwen3-VL-8B inference outputs
โ”‚   โ”œโ”€โ”€ food_labels_updated.json            # Processed food labels
โ”‚   โ”œโ”€โ”€ ๐Ÿ“ 10_images_270_class/             # 10 sample images per category
โ”‚   โ”œโ”€โ”€ ๐Ÿ“ food_all/                        # Merged dataset (food + not-food)
โ”‚   โ”‚   โ””โ”€โ”€ metadata.jsonl                  # HuggingFace imagefolder metadata
โ”‚   โ””โ”€โ”€ ๐Ÿ“ not_food/                        # Non-food images
โ””โ”€โ”€ ๐Ÿ“ demos/
    โ””โ”€โ”€ ๐Ÿ“ FoodExtract-Vision/
        โ”œโ”€โ”€ app.py                          # ๐Ÿš€ Gradio demo application
        โ”œโ”€โ”€ README.md                       # ๐Ÿ“– This file
        โ”œโ”€โ”€ requirements.txt                # ๐Ÿ“ฆ Python dependencies
        โ””โ”€โ”€ ๐Ÿ“ examples/                    # ๐Ÿ–ผ๏ธ Example images
            โ”œโ”€โ”€ 36741.jpg
            โ”œโ”€โ”€ IMG_3808.JPG
            โ””โ”€โ”€ istockphoto-175500494-612x612.jpg

๐Ÿ“ Key Learnings & Notes

โœ… What Worked

  • ๐Ÿ—๏ธ Two-stage training significantly improved output quality compared to single-stage
  • ๐ŸงŠ Freezing the vision encoder first let the LLM learn JSON format without vision interference
  • ๐Ÿข 100x lower learning rate in Stage 2 (2e-6 vs 2e-4) prevented catastrophic forgetting
  • ๐Ÿค Even a 500M parameter model can learn reliable structured output generation
  • ๐Ÿ“ Custom collate_fn with proper label masking (pad tokens + image tokens โ†’ -100) was essential
  • ๐Ÿ”€ remove_unused_columns = False is critical when using a custom data collator with SFTTrainer

### ⚠️ Important Notes

- **Dtype consistency:** model inputs must match the model's dtype (e.g., bfloat16 inputs for a bfloat16 model)
- **System prompt handling:** when not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt
- **PIL images over bytes:** applying `format_data()` as a list comprehension instead of `dataset.map()` preserves PIL image types
- **Gradient checkpointing:** set `use_reentrant=False` to avoid warnings and ensure compatibility

## 🧪 Experiments Tried

| Notebook | Model | Approach | Result |
|---|---|---|---|
| `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | ✅ Works, but larger model |
| `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | 🟡 Limited capacity |
| `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | Two-stage full fine-tune | ✅ Best results |

## 🔗 Links

| Resource | URL |
|---|---|
| 🤗 Fine-tuned Model | `berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3` |
| 🤗 Dataset | `berkeruveyik/vlm-food-4k-not-food-dataset` |
| 🤗 Base Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
| 📄 SmolVLM Docling Paper | arxiv.org/pdf/2503.11576 |
| 📚 TRL Documentation | huggingface.co/docs/trl |
| 📚 PEFT GitHub | github.com/huggingface/peft |
| 📚 Gemma Vision Fine-tune Guide | ai.google.dev/gemma/docs |

## 📄 License

This project is released under the Apache 2.0 license. Please refer to the respective model and dataset cards for additional licensing information.


Built with ❤️ using 🤗 Transformers, TRL, and Gradio, by Berker Üveyik