---
title: FoodExtract-Vision
emoji: 🍔
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
---

# 🍔🍕 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction

[Fine-tuned Model](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
[Dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
[Base Model](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

---

## 📖 Overview

**FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces **structured JSON output**, classifying whether food or drink items are visible and extracting them into organized lists.

Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this project demonstrates that even **small (~500M parameter) VLMs** can be fine-tuned to reliably produce structured outputs for domain-specific tasks, without needing PEFT/LoRA adapters.

> 💡 **Key Insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model **reliably generates valid JSON** matching the specified schema.

---

## 🎯 What Does It Do?

| | Input | Output |
|---|---|---|
| 📸 | Any image (food or non-food) | Structured JSON |

### Output Schema

```json
{
  "is_food": 0,
  "image_title": "",
  "food_items": [],
  "drink_items": []
}
```

| Field | Type | Description |
|---|---|---|
| `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
| `image_title` | `str` | Short food-related caption (blank if no food) |
| `food_items` | `list[str]` | List of visible edible food item nouns |
| `drink_items` | `list[str]` | List of visible drink item nouns |

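A model response can be sanity-checked against this schema in a few lines (a minimal sketch; the `validate_extraction` helper is illustrative and not part of the repo):

```python
import json

def validate_extraction(raw: str) -> dict:
    """Parse a model response and assert it matches the expected schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    assert data["is_food"] in (0, 1)
    assert isinstance(data["image_title"], str)
    assert isinstance(data["food_items"], list)
    assert isinstance(data["drink_items"], list)
    return data

result = validate_extraction(
    '{"is_food": 1, "image_title": "Margherita pizza", '
    '"food_items": ["pizza"], "drink_items": []}'
)
print(result["food_items"])  # ['pizza']
```
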
---

## 🛠️ What Was Done: End-to-End Pipeline

This project covers the **full ML lifecycle**, from dataset creation to deployment:

### Step 1: 📊 Dataset Creation (`00_create_vlm_dataset.ipynb`)

1. 🏷️ Loaded food labels from `data/food_dataset-2.jsonl` (generated via Qwen3-VL-8B inference on Food270 images)
2. 📋 Added metadata fields (`image_id`, `image_name`, `food270_class_name`, `image_source`)
3. 🖼️ Sampled **not-food images** from `data/not_food/` and created empty labels with `is_food = 0`
4. 🔀 Merged the food and not-food labels into a unified dataset
5. 📁 Copied all images into `data/food_all/` and wrote `metadata.jsonl` for the HuggingFace `imagefolder` format
6. 🚀 Pushed to the HuggingFace Hub as [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)

**Final dataset:** ~3,698 image-JSON pairs across **270 food categories**, plus not-food images

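The published dataset can be loaded back with standard `datasets` calls (a minimal sketch; the local path assumes the repo layout described here):

```python
from datasets import load_dataset

# Load the published dataset directly from the Hub
ds = load_dataset("berkeruveyik/vlm-food-4k-not-food-dataset")

# Or rebuild it locally from the imagefolder layout; metadata.jsonl
# supplies the per-image labels automatically
local_ds = load_dataset("imagefolder", data_dir="data/food_all")

print(ds)  # inspect splits and features
```
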
### Step 2: 🧪 Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)

- Tested `SmolVLM2-500M-Video-Instruct` on the food extraction task
- **Result:** the base model produced unstructured text like *"The given image is a food or drink item."* instead of valid JSON
- ❌ The base model **cannot** follow the structured output format

### Step 3: 🔄 Data Formatting for SFT

Converted each sample to a **conversational message format** with three roles:

```
[SYSTEM]    → Expert food extractor persona
[USER]      → Image + JSON extraction prompt
[ASSISTANT] → Ground truth JSON output
```

- Used `PIL.Image` objects directly (not bytes) to preserve image quality (see the sketch below)
- 80/20 train/validation split with `random.seed(42)` for reproducibility

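A minimal sketch of this formatting step (the `format_data` helper, prompt strings, and field names below are illustrative; the notebook's exact wording may differ):

```python
SYSTEM_PROMPT = "You are an expert food extractor."                # illustrative persona
USER_PROMPT = "Extract the visible food and drink items as JSON."  # illustrative prompt

def format_data(sample):
    """Turn one dataset row into the three-role conversational format."""
    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image", "image": sample["image"]},  # PIL.Image, not bytes
                {"type": "text", "text": USER_PROMPT},
            ]},
            {"role": "assistant", "content": [{"type": "text", "text": sample["label"]}]},
        ]
    }

# A list comprehension (rather than dataset.map) keeps the PIL image objects intact
train_data = [format_data(s) for s in train_split]  # train_split: the 80% HF split
```
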
### Step 4: 🧊 Stage 1 Training (Frozen Vision Encoder)

- **Froze** the vision encoder (`model.model.vision_model`), as sketched below
- **Trained** only the LLM + connector layers
- **Goal:** teach the language model to output valid JSON structure
- Used `SFTTrainer` from TRL with a custom `collate_fn` for image-text batching

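A minimal sketch of the Stage 1 freeze, using the `model.model.vision_model` path noted above (`model` is assumed to be the already-loaded SmolVLM2 checkpoint):

```python
# Freeze every vision-encoder parameter; only the LLM + connector remain trainable
for param in model.model.vision_model.parameters():
    param.requires_grad = False

# Quick check of what Stage 1 will actually update
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters ({trainable / total:.1%})")
```
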
### Step 5: 🔥 Stage 2 Training (Full Model Fine-tuning)

- **Unfroze** the vision encoder
- **Trained** all parameters with a **100x lower learning rate** (`2e-6` vs `2e-4`)
- **Goal:** allow the vision encoder to adapt for better food recognition without catastrophic forgetting

### Step 6: 📊 Evaluation & Comparison

- Compared outputs from 3 models side-by-side:
  - 🔴 **Pre-trained** (base model) → fails at structured output
  - 🟡 **Stage 1** (frozen vision) → learns the JSON format
  - 🟢 **Stage 2** (full fine-tune) → best food recognition + JSON format

|
| 113 |
+
|
| 114 |
+
- Uploaded fine-tuned model to HuggingFace Hub
|
| 115 |
+
- Built Gradio demo with side-by-side comparison
|
| 116 |
+
- Deployed as a HuggingFace Space
|
| 117 |
+
|
| 118 |
+
---
|
| 119 |
+
|
| 120 |
+
## ๐๏ธ Architecture & Training Details
|
| 121 |
+
|
| 122 |
+
### ๐ง Base Model
|
| 123 |
+
|
| 124 |
+
| Property | Value |
|
| 125 |
+
|---|---|
|
| 126 |
+
| Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
|
| 127 |
+
| Parameters | ~500M |
|
| 128 |
+
| Precision | `bfloat16` |
|
| 129 |
+
| Attention | `eager` |
|
| 130 |
+
|
| 131 |
+
### ๐ Dataset
|
| 132 |
+
|
| 133 |
+
| Property | Value |
|
| 134 |
+
|---|---|
|
| 135 |
+
| Source | [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
|
| 136 |
+
| Total Samples | ~3,698 image-JSON pairs |
|
| 137 |
+
| Train / Val Split | 80% / 20% |
|
| 138 |
+
| Food Categories | 270 (from Food270 dataset) |
|
| 139 |
+
| Non-food Images | Random internet images |
|
| 140 |
+
| Label Source | Qwen3-VL-8B inference outputs |
|
| 141 |
+
|
| 142 |
+
### ๐ง Two-Stage Training Strategy
|
| 143 |
+
|
| 144 |
+
Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576):
|
| 145 |
+
|
| 146 |
+
#### ๐ง Stage 1: LLM Alignment (Frozen Vision Encoder)
|
| 147 |
+
|
| 148 |
+
| Parameter | Value |
|
| 149 |
+
|---|---|
|
| 150 |
+
| Vision Encoder | โ๏ธ Frozen |
|
| 151 |
+
| Trainable | LLM + connector layers |
|
| 152 |
+
| Learning Rate | `2e-4` |
|
| 153 |
+
| Epochs | 2 |
|
| 154 |
+
| Batch Size | 8 ร 4 gradient accumulation = effective 32 |
|
| 155 |
+
| Optimizer | `adamw_torch_fused` |
|
| 156 |
+
| LR Scheduler | `constant` |
|
| 157 |
+
| Warmup Ratio | `0.03` |
|
| 158 |
+
| Precision | `bf16` |
|
| 159 |
+
|
| 160 |
+
#### ๐ฅ Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)
|
| 161 |
+
|
| 162 |
+
| Parameter | Value |
|
| 163 |
+
|---|---|
|
| 164 |
+
| Vision Encoder | ๐ฅ Unfrozen |
|
| 165 |
+
| Trainable | All parameters |
|
| 166 |
+
| Learning Rate | `2e-6` (100x lower than Stage 1) |
|
| 167 |
+
| Epochs | 2 |
|
| 168 |
+
| Batch Size | 8 ร 4 gradient accumulation = effective 32 |
|
| 169 |
+
| Optimizer | `adamw_torch_fused` |
|
| 170 |
+
| LR Scheduler | `constant` |
|
| 171 |
+
| Warmup Ratio | `0.03` |
|
| 172 |
+
| Precision | `bf16` |
|
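As a rough sketch, the hyperparameters above map onto a TRL `SFTConfig` like this (illustrative, not the notebook's exact arguments; `output_dir` and `dataset_kwargs` are assumptions):

```python
from trl import SFTConfig

stage1_args = SFTConfig(
    output_dir="smolvlm2-food-stage1",              # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,                  # 8 x 4 = effective batch of 32
    learning_rate=2e-4,                             # Stage 2 reruns this with 2e-6
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    remove_unused_columns=False,                    # required with a custom collator
    dataset_kwargs={"skip_prepare_dataset": True},  # images are handled in collate_fn
)
```
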
---

## 🚀 Quick Start

### 📦 Installation

```bash
pip install transformers torch gradio spaces accelerate
```

### 🔮 Inference with Pipeline

```python
import torch
from transformers import pipeline
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

# Pipeline setup (reconstructed; the repo's exact options may differ slightly)
pipe = pipeline(
    "image-text-to-text",
    model=FINE_TUNED_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prompt reconstructed from the diff; the repo's exact wording may differ
prompt = """Classify the given input image into food or not, and if edible food or drink items are visible, extract them.

Only return valid JSON in the following form:

{
    "is_food": 0,
    "image_title": "",
    "food_items": [],
    "drink_items": []
}
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("examples/36741.jpg")},
            {"type": "text", "text": prompt},
        ],
    }
]

# Generate and print the assistant's JSON response
output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```

## 🎮 Gradio Demo

This Space runs a **side-by-side comparison** between the base model and the fine-tuned model.

### ▶️ Running Locally

```bash
cd demos/FoodExtract-Vision
pip install -r requirements.txt
python app.py
```

### 🖥️ What the Demo Shows

1. 📤 **Upload** any image
2. 🔍 **Compare** outputs from the base model vs. the fine-tuned model side-by-side
3. 📊 See how fine-tuning enables **reliable structured JSON extraction**

### 📸 Example Images Included

The demo comes with pre-loaded examples to try instantly.

---

## 📁 Project Structure

```
vlm_finetune/
├── 📓 00_create_vlm_dataset.ipynb             # Dataset creation pipeline
├── 📓 01-fine_tune_vlm.ipynb                  # First fine-tuning experiment (Gemma-3n)
├── 📓 01-fine_tune_vlm-v2-smolVLM.ipynb       # SmolVLM 256M experiment
├── 📓 01_fine_tune_vlm_v3_smolVLM_500m.ipynb  # ✅ Final: SmolVLM 500M two-stage training
├── 📓 qwen3-food270-inference-viewer.ipynb    # Dataset visualization tool
├── 📄 README.md                               # Root project README
├── 📁 data/
│   ├── food_dataset-2.jsonl                   # Qwen3-VL-8B inference outputs
│   ├── food_labels_updated.json               # Processed food labels
│   ├── 📁 10_images_270_class/                # 10 sample images per category
│   ├── 📁 food_all/                           # Merged dataset (food + not-food)
│   │   └── metadata.jsonl                     # HuggingFace imagefolder metadata
│   └── 📁 not_food/                           # Non-food images
└── 📁 demos/
    └── 📁 FoodExtract-Vision/
        ├── app.py                             # 🚀 Gradio demo application
        ├── README.md                          # 📄 This file
        ├── requirements.txt                   # 📦 Python dependencies
        └── 📁 examples/                       # 🖼️ Example images
            ├── 36741.jpg
            ├── IMG_3808.JPG
            └── istockphoto-175500494-612x612.jpg
```

---

## 📝 Lessons Learned

### ✅ What Worked

- 🏗️ **Two-stage training** significantly improved output quality compared to single-stage training
- 🧊 **Freezing the vision encoder first** let the LLM learn the JSON format without vision interference
- 🔢 **100x lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting
- 🤖 Even a **500M parameter model** can learn reliable structured output generation
- 🔁 A **custom `collate_fn`** with proper label masking (pad tokens + image tokens → `-100`) was essential, as sketched below
- 📌 **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer`

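A minimal sketch of such a collator, assuming `processor` is the model's `AutoProcessor` and each example carries the `messages` format from Step 3 (the `<image>` token name is an assumption about the tokenizer):

```python
import torch

def collate_fn(examples):
    # Render each conversation to text and gather its images
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [
        [c["image"] for m in ex["messages"] for c in m["content"] if c["type"] == "image"]
        for ex in examples
    ]

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # Labels start as a copy of input_ids; loss-irrelevant positions get -100
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # padding tokens
    image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
    labels[labels == image_token_id] = -100                    # image placeholders
    batch["labels"] = labels
    return batch
```
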
### ⚠️ Important Notes

- **Dtype consistency:** model inputs must match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model), as illustrated below
- **System prompt handling:** when not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt
- **PIL images over bytes:** applying `format_data()` as a list comprehension instead of `dataset.map()` preserves PIL image types
- **Gradient checkpointing:** set `use_reentrant=False` to avoid warnings and ensure compatibility

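The dtype and checkpointing notes translate roughly to the following sketch (`model`, `processor`, `prompt`, and `image` are assumed to already exist):

```python
import torch

# Dtype consistency: move inputs to the model's device and cast the
# floating-point tensors (e.g., pixel_values) to bfloat16
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(model.device, dtype=torch.bfloat16)

# Gradient checkpointing without the reentrant warning
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```
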
### 🧪 Experiments Tried

| Notebook | Model | Approach | Result |
|---|---|---|---|
| `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | ✅ Works, but a larger model |
| `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | 🟡 Limited capacity |
| `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | **Two-stage full fine-tune** | ✅ **Best results** |

---

## 🔗 Resources

| Resource | Link |
|---|---|
| 📄 SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
| 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
| 🐙 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
| 📘 HF Vision Fine-tune Guide | [ai.google.dev/gemma/docs](https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora?hl=tr) |

---

## 📄 License

This project is licensed under Apache 2.0. Please refer to the respective model and dataset cards for additional licensing information.

---

*Built with ❤️ using 🤗 Transformers, TRL, and Gradio by [Berker Üveyik](https://huggingface.co/berkeruveyik)*