---
title: FoodExtract-Vision
emoji: 🍕
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
---

# 🍕🍔 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction

[![Model on HuggingFace](https://img.shields.io/badge/🤗%20Model-FoodExtract--Vision--SmolVLM2--500M-blue)](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
[![Dataset on HuggingFace](https://img.shields.io/badge/🤗%20Dataset-vlm--food--4k--not--food-green)](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
[![Base Model](https://img.shields.io/badge/🧠%20Base-SmolVLM2--500M--Video--Instruct-orange)](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
[![License](https://img.shields.io/badge/📄%20License-Apache%202.0-lightgrey)](https://www.apache.org/licenses/LICENSE-2.0)

---

## 📋 Overview

**FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces **structured JSON output**, classifying whether food/drink items are visible and extracting them into organized lists.

Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this project demonstrates that even **small (~500M parameter) VLMs** can be fine-tuned to reliably produce structured outputs for domain-specific tasks, without needing PEFT/LoRA adapters.

> 💡 **Key Insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model **reliably generates valid JSON** matching the specified schema.

---

## 🎯 What Does It Do?

| | Input | Output |
|---|---|---|
| 📸 | Any image (food or non-food) | Structured JSON |

### Output Schema

```json
{
  "is_food": 1,
  "image_title": "Tandoori chicken with naan bread",
  "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
  "drink_items": ["lassi"]
}
```

| Field | Type | Description |
|---|---|---|
| `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
| `image_title` | `str` | Short food-related caption (blank if no food) |
| `food_items` | `list[str]` | List of visible edible food item nouns |
| `drink_items` | `list[str]` | List of visible drink item nouns |

---

## 🛠️ What Was Done: End-to-End Pipeline

This project covers the **full ML lifecycle**, from dataset creation to deployment:

### Step 1: 📊 Dataset Creation (`00_create_vlm_dataset.ipynb`)

1. 🏷️ Loaded food labels from `data/food_dataset-2.jsonl` (generated via Qwen3-VL-8B inference on Food270 images)
2. 📝 Added metadata fields (`image_id`, `image_name`, `food270_class_name`, `image_source`)
3. 🖼️ Sampled **not-food images** from `data/not_food/` and created empty labels with `is_food = 0`
4. 🔀 Merged food + not-food labels into a unified dataset
5. 📁 Copied all images into `data/food_all/` and wrote `metadata.jsonl` for HuggingFace `imagefolder` format
6. 🚀 Pushed to HuggingFace Hub as [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset), as sketched below

**Final dataset:** ~3,698 image-JSON pairs across **270 food categories** + not-food images
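For reference, a minimal sketch of how steps 5 and 6 can be done with the `datasets` library, assuming the `data/food_all/` + `metadata.jsonl` imagefolder layout described above (the actual notebook may differ in the details):

```python
from datasets import load_dataset

# Load the merged folder (images + metadata.jsonl) as a HuggingFace imagefolder dataset.
# Each record gets an "image" column plus the label fields stored in metadata.jsonl.
dataset = load_dataset("imagefolder", data_dir="data/food_all", split="train")
print(dataset[0])

# Push the dataset to the Hub under the repo used by this project.
dataset.push_to_hub("berkeruveyik/vlm-food-4k-not-food-dataset")
```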
### Step 2: 🧪 Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)

- Tested `SmolVLM2-500M-Video-Instruct` on the food extraction task
- **Result:** The base model produced unstructured text like *"The given image is a food or drink item."* instead of valid JSON
- ❌ Base model **cannot** follow the structured output format

### Step 3: 📝 Data Formatting for SFT

Converted each sample to a **conversational message format** with three roles:

```
[SYSTEM]    → Expert food extractor persona
[USER]      → Image + JSON extraction prompt
[ASSISTANT] → Ground truth JSON output
```

- Used `PIL.Image` objects directly (not bytes) to preserve image quality
- 80/20 train/validation split with `random.seed(42)` for reproducibility

### Step 4: 🧊 Stage 1 Training (Frozen Vision Encoder)

- **Froze** the vision encoder (`model.model.vision_model`)
- **Trained** only the LLM + connector layers
- **Goal:** Teach the language model to output valid JSON structure
- Used `SFTTrainer` from TRL with a custom `collate_fn` for image-text batching

### Step 5: 🔥 Stage 2 Training (Full Model Fine-tuning)

- **Unfroze** the vision encoder
- **Trained** all parameters with a **100x lower learning rate** (`2e-6` vs `2e-4`)
- **Goal:** Allow the vision encoder to adapt for better food recognition without catastrophic forgetting

### Step 6: 📈 Evaluation & Comparison

- Compared outputs from 3 models side-by-side:
  - 🔴 **Pre-trained** (base model): fails at structured output
  - 🟡 **Stage 1** (frozen vision): learns the JSON format
  - 🟢 **Stage 2** (full fine-tune): best food recognition + JSON format

### Step 7: 🚀 Deployment

- Uploaded the fine-tuned model to HuggingFace Hub
- Built a Gradio demo with side-by-side comparison
- Deployed as a HuggingFace Space

---

## 🏗️ Architecture & Training Details

### 🧠 Base Model

| Property | Value |
|---|---|
| Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
| Parameters | ~500M |
| Precision | `bfloat16` |
| Attention | `eager` |

### 📊 Dataset

| Property | Value |
|---|---|
| Source | [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| Total Samples | ~3,698 image-JSON pairs |
| Train / Val Split | 80% / 20% |
| Food Categories | 270 (from Food270 dataset) |
| Non-food Images | Random internet images |
| Label Source | Qwen3-VL-8B inference outputs |

### 🔧 Two-Stage Training Strategy

Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576):

#### 🧊 Stage 1: LLM Alignment (Frozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | ❄️ Frozen |
| Trainable | LLM + connector layers |
| Learning Rate | `2e-4` |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |

#### 🔥 Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | 🔥 Unfrozen |
| Trainable | All parameters |
| Learning Rate | `2e-6` (100x lower than Stage 1) |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |
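To make the two tables above concrete, here is a minimal sketch of how the two-stage setup could be wired with TRL. It is not the notebook's exact code: `model`, `train_split`, `val_split`, and `collate_fn` (the custom image-text collator, sketched under Key Learnings further down) are placeholders for objects built earlier, output paths are illustrative, and some argument names vary across TRL versions:

```python
from trl import SFTConfig, SFTTrainer

def make_config(output_dir: str, learning_rate: float) -> SFTConfig:
    """Shared training arguments; only the learning rate differs between stages."""
    return SFTConfig(
        output_dir=output_dir,                          # illustrative path
        num_train_epochs=2,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,                  # effective batch size 32
        learning_rate=learning_rate,
        lr_scheduler_type="constant",
        warmup_ratio=0.03,
        optim="adamw_torch_fused",
        bf16=True,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        remove_unused_columns=False,                    # required with a custom data collator
        dataset_kwargs={"skip_prepare_dataset": True},  # raw images are handled by collate_fn
    )

def run_stage(config: SFTConfig) -> None:
    trainer = SFTTrainer(
        model=model,
        args=config,
        train_dataset=train_split,
        eval_dataset=val_split,
        data_collator=collate_fn,
    )
    trainer.train()

# Stage 1: freeze the vision encoder so only the LLM + connector learn the JSON format
for param in model.model.vision_model.parameters():
    param.requires_grad = False
run_stage(make_config("smolvlm2-food-stage1", 2e-4))

# Stage 2: unfreeze the vision encoder and continue at a 100x lower learning rate
for param in model.model.vision_model.parameters():
    param.requires_grad = True
run_stage(make_config("smolvlm2-food-stage2", 2e-6))
```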
---

## 🚀 Quick Start

### 📦 Installation

```bash
pip install transformers torch gradio spaces accelerate
```

### 🔮 Inference with Pipeline

````python
import torch
from transformers import pipeline
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

pipe = pipeline(
    "image-text-to-text",
    model=FINE_TUNED_MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists. Only return valid JSON in the following form:
```json
{
  "is_food": 0,
  "image_title": "",
  "food_items": [],
  "drink_items": []
}
```
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0][0]["generated_text"][-1]["content"])
````

### 🧪 Inference without Pipeline

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

model = AutoModelForImageTextToText.from_pretrained(
    FINE_TUNED_MODEL_ID,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)

image = Image.open("path/to/your/image.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "YOUR_PROMPT_HERE"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(decoded)
```

---

## 🎮 Gradio Demo

This Space runs a **side-by-side comparison** between the base model and the fine-tuned model.

### ▶️ Running Locally

```bash
cd demos/FoodExtract-Vision
pip install -r requirements.txt
python app.py
```

### 🖥️ What the Demo Shows

1. 📤 **Upload** any image
2. 🔄 **Compare** outputs from the base model vs. the fine-tuned model side-by-side
3. 📊 See how fine-tuning enables **reliable structured JSON extraction**

### 📸 Example Images Included

The demo comes with pre-loaded examples to try instantly.
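The side-by-side comparison can be expressed as a small Gradio interface. This is only a sketch, not the actual `app.py`: `run_base` and `run_finetuned` are hypothetical helpers standing in for whatever wraps the two inference pipelines shown above.

```python
import gradio as gr

# Hypothetical helpers: each takes a PIL image and returns the model's raw JSON text.
def compare(image):
    return run_base(image), run_finetuned(image)

demo = gr.Interface(
    fn=compare,
    inputs=gr.Image(type="pil", label="Upload an image"),
    outputs=[
        gr.Textbox(label="Base model (SmolVLM2-500M)"),
        gr.Textbox(label="Fine-tuned model (FoodExtract-Vision)"),
    ],
    title="FoodExtract-Vision: base vs. fine-tuned",
)

if __name__ == "__main__":
    demo.launch()
```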
--- ## ๐Ÿ“ Project Structure ``` vlm_finetune/ โ”œโ”€โ”€ ๐Ÿ““ 00_create_vlm_dataset.ipynb # Dataset creation pipeline โ”œโ”€โ”€ ๐Ÿ““ 01-fine_tune_vlm.ipynb # First fine-tuning experiment (Gemma-3n) โ”œโ”€โ”€ ๐Ÿ““ 01-fine_tune_vlm-v2-smolVLM.ipynb # SmolVLM 256M experiment โ”œโ”€โ”€ ๐Ÿ““ 01_fine_tune_vlm_v3_smolVLM_500m.ipynb # โœ… Final: SmolVLM 500M two-stage training โ”œโ”€โ”€ ๐Ÿ““ qwen3-food270-inference-viewer.ipynb # Dataset visualization tool โ”œโ”€โ”€ ๐Ÿ“„ README.md # Root project README โ”œโ”€โ”€ ๐Ÿ“ data/ โ”‚ โ”œโ”€โ”€ food_dataset-2.jsonl # Qwen3-VL-8B inference outputs โ”‚ โ”œโ”€โ”€ food_labels_updated.json # Processed food labels โ”‚ โ”œโ”€โ”€ ๐Ÿ“ 10_images_270_class/ # 10 sample images per category โ”‚ โ”œโ”€โ”€ ๐Ÿ“ food_all/ # Merged dataset (food + not-food) โ”‚ โ”‚ โ””โ”€โ”€ metadata.jsonl # HuggingFace imagefolder metadata โ”‚ โ””โ”€โ”€ ๐Ÿ“ not_food/ # Non-food images โ””โ”€โ”€ ๐Ÿ“ demos/ โ””โ”€โ”€ ๐Ÿ“ FoodExtract-Vision/ โ”œโ”€โ”€ app.py # ๐Ÿš€ Gradio demo application โ”œโ”€โ”€ README.md # ๐Ÿ“– This file โ”œโ”€โ”€ requirements.txt # ๐Ÿ“ฆ Python dependencies โ””โ”€โ”€ ๐Ÿ“ examples/ # ๐Ÿ–ผ๏ธ Example images โ”œโ”€โ”€ 36741.jpg โ”œโ”€โ”€ IMG_3808.JPG โ””โ”€โ”€ istockphoto-175500494-612x612.jpg ``` --- ## ๐Ÿ“ Key Learnings & Notes ### โœ… What Worked - ๐Ÿ—๏ธ **Two-stage training** significantly improved output quality compared to single-stage - ๐ŸงŠ **Freezing the vision encoder first** let the LLM learn JSON format without vision interference - ๐Ÿข **100x lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting - ๐Ÿค Even a **500M parameter model** can learn reliable structured output generation - ๐Ÿ“ **Custom `collate_fn`** with proper label masking (pad tokens + image tokens โ†’ `-100`) was essential - ๐Ÿ”€ **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer` ### โš ๏ธ Important Notes - **Dtype consistency:** Model inputs must match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model) - **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt - **PIL images over bytes:** Using `format_data()` as a list comprehension instead of `dataset.map()` preserves PIL image types - **Gradient checkpointing:** Set `use_reentrant=False` to avoid warnings and ensure compatibility ### ๐Ÿงช Experiments Tried | Notebook | Model | Approach | Result | |---|---|---|---| | `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | โœ… Works but larger model | | `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | ๐ŸŸก Limited capacity | | `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | **Two-stage full fine-tune** | โœ… **Best results** | --- ## ๐Ÿ”— Links | Resource | URL | |---|---| | ๐Ÿค— Fine-tuned Model | [berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3](https://huggingface.co/berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3) | | ๐Ÿค— Dataset | [berkeruveyik/vlm-food-4k-not-food-dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) | | ๐Ÿค— Base Model | [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) | | ๐Ÿ“„ SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) | | ๐Ÿ“š TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) | | ๐Ÿ“š PEFT GitHub | 
---

## 🔗 Links

| Resource | URL |
|---|---|
| 🤗 Fine-tuned Model | [berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3) |
| 🤗 Dataset | [berkeruveyik/vlm-food-4k-not-food-dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| 🤗 Base Model | [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) |
| 📄 SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
| 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
| 📚 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
| 📚 HF Vision Fine-tune Guide | [ai.google.dev/gemma/docs](https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora?hl=tr) |

---

## 📄 License

This project is released under the Apache 2.0 License. Please refer to the respective model and dataset cards for additional licensing information.

---

*Built with ❤️ using 🤗 Transformers, TRL, and Gradio by [Berker Üveyik](https://huggingface.co/berkeruveyik)*