---
title: FoodExtract-Vision
emoji: ๐
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
---

# FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction

[Model](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
[Dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
[Base Model](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

---

## Overview

**FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces **structured JSON output**, classifying whether food/drink items are visible and extracting them into organized lists.

Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this project demonstrates that even **small (~500M parameter) VLMs** can be fine-tuned to reliably produce structured outputs for domain-specific tasks, without needing PEFT/LoRA adapters.

> **Key Insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model **reliably generates valid JSON** matching the specified schema.

---

## What Does It Do?

| Input | Output |
|---|---|
| Any image (food or non-food) | Structured JSON |

### Output Schema

```json
{
  "is_food": 1,
  "image_title": "Tandoori chicken with naan bread",
  "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
  "drink_items": ["lassi"]
}
```

| Field | Type | Description |
|---|---|---|
| `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
| `image_title` | `str` | Short food-related caption (blank if no food) |
| `food_items` | `list[str]` | List of visible edible food item nouns |
| `drink_items` | `list[str]` | List of visible drink item nouns |
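
Downstream code can parse and sanity-check model output against this schema; here is a minimal sketch (the `parse_extraction` helper is illustrative, not part of the project):

```python
import json

def parse_extraction(raw_output: str) -> dict:
    """Parse raw model text into the schema above, tolerating a JSON fence."""
    text = raw_output.strip()
    if text.startswith("```"):
        text = text.split("```")[1]       # drop the opening fence
        text = text.removeprefix("json")  # drop the optional language tag
    data = json.loads(text)
    # Light checks mirroring the field table above.
    assert data["is_food"] in (0, 1)
    assert isinstance(data["image_title"], str)
    assert isinstance(data["food_items"], list)
    assert isinstance(data["drink_items"], list)
    return data
```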

---

## What Was Done: End-to-End Pipeline

This project covers the **full ML lifecycle**, from dataset creation to deployment:

### Step 1: Dataset Creation (`00_create_vlm_dataset.ipynb`)

1. Loaded food labels from `data/food_dataset-2.jsonl` (generated via Qwen3-VL-8B inference on Food270 images)
2. Added metadata fields (`image_id`, `image_name`, `food270_class_name`, `image_source`)
3. Sampled **not-food images** from `data/not_food/` and created empty labels with `is_food = 0`
4. Merged food + not-food labels into a unified dataset
5. Copied all images into `data/food_all/` and wrote `metadata.jsonl` for the HuggingFace `imagefolder` format (sketched below)
6. Pushed to the HuggingFace Hub as [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)

**Final dataset:** ~3,698 image-JSON pairs across **270 food categories**, plus not-food images
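
A minimal sketch of step 5's metadata writer, assuming label records that carry the schema fields plus `image_name` (the `write_metadata` helper is illustrative):

```python
import json
from pathlib import Path

def write_metadata(labels: list[dict], out_dir: Path) -> None:
    """Write one JSON line per image so load_dataset("imagefolder", ...)
    picks up the labels. Each record needs a `file_name` key pointing at
    an image that already sits in `out_dir`."""
    with open(out_dir / "metadata.jsonl", "w", encoding="utf-8") as f:
        for label in labels:
            record = {
                "file_name": label["image_name"],
                "is_food": label["is_food"],
                "image_title": label["image_title"],
                "food_items": label["food_items"],
                "drink_items": label["drink_items"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# write_metadata(all_labels, Path("data/food_all"))
```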

### Step 2: Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)

- Tested `SmolVLM2-500M-Video-Instruct` on the food extraction task
- **Result:** The base model produced unstructured text like *"The given image is a food or drink item."* instead of valid JSON
- The base model **cannot** follow the structured output format

### Step 3: Data Formatting for SFT

Converted each sample to a **conversational message format** with three roles (a sketch follows this list):

```
[SYSTEM]    → Expert food extractor persona
[USER]      → Image + JSON extraction prompt
[ASSISTANT] → Ground truth JSON output
```

- Used `PIL.Image` objects directly (not bytes) to preserve image quality
- 80/20 train/validation split with `random.seed(42)` for reproducibility
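
A minimal sketch of that formatting step, assuming the dataset's image column is named `image` (the prompt strings are illustrative placeholders, not the project's exact wording):

```python
import json

SYSTEM_PROMPT = "You are an expert food extractor."  # illustrative persona
USER_PROMPT = "Extract visible food/drink items as JSON."  # illustrative prompt

def format_data(sample: dict) -> dict:
    """Wrap one dataset sample into the three-role chat format."""
    answer = json.dumps({
        "is_food": sample["is_food"],
        "image_title": sample["image_title"],
        "food_items": sample["food_items"],
        "drink_items": sample["drink_items"],
    })
    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image", "image": sample["image"]},  # PIL.Image object, not bytes
                {"type": "text", "text": USER_PROMPT},
            ]},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }

# A list comprehension (rather than dataset.map) keeps the PIL image types intact:
# train_data = [format_data(s) for s in train_split]
```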

### Step 4: Stage 1 Training (Frozen Vision Encoder)

- **Froze** the vision encoder (`model.model.vision_model`)
- **Trained** only the LLM + connector layers
- **Goal:** Teach the language model to output valid JSON structure
- Used `SFTTrainer` from TRL with a custom `collate_fn` for image-text batching
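
Freezing amounts to turning off gradients on the vision tower, along the lines of this sketch (module path as named above; the parameter-count check is illustrative):

```python
# Stage 1: freeze the vision tower so only the LLM and connector receive updates.
for param in model.model.vision_model.parameters():
    param.requires_grad = False

# Quick check of how much of the model still trains.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,} ({trainable / total:.1%})")
```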

### Step 5: Stage 2 Training (Full Model Fine-tuning)

- **Unfroze** the vision encoder
- **Trained** all parameters with a **100x lower learning rate** (`2e-6` vs. `2e-4`)
- **Goal:** Allow the vision encoder to adapt for better food recognition without catastrophic forgetting
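
The Stage 2 unfreeze is the mirror image (a sketch; the 100x learning-rate drop itself is set in the Stage 2 training arguments):

```python
# Stage 2: unfreeze everything; the 100x smaller learning rate (2e-6) does the
# heavy lifting in preventing catastrophic forgetting.
for param in model.parameters():
    param.requires_grad = True
```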

### Step 6: Evaluation & Comparison

- Compared outputs from 3 models side-by-side:
  - **Pre-trained** (base model) → fails at structured output
  - **Stage 1** (frozen vision) → learns the JSON format
  - **Stage 2** (full fine-tune) → best food recognition + JSON format

### Step 7: Deployment

- Uploaded the fine-tuned model to the HuggingFace Hub
- Built a Gradio demo with side-by-side comparison
- Deployed as a HuggingFace Space

---

## Architecture & Training Details

### Base Model

| Property | Value |
|---|---|
| Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
| Parameters | ~500M |
| Precision | `bfloat16` |
| Attention | `eager` |

### Dataset

| Property | Value |
|---|---|
| Source | [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| Total Samples | ~3,698 image-JSON pairs |
| Train / Val Split | 80% / 20% |
| Food Categories | 270 (from the Food270 dataset) |
| Non-food Images | Random internet images |
| Label Source | Qwen3-VL-8B inference outputs |

### Two-Stage Training Strategy

Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576):

#### Stage 1: LLM Alignment (Frozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | Frozen |
| Trainable | LLM + connector layers |
| Learning Rate | `2e-4` |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |

#### Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | Unfrozen |
| Trainable | All parameters |
| Learning Rate | `2e-6` (100x lower than Stage 1) |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |
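
The tables above map onto a TRL `SFTConfig` roughly as in this sketch (the output path is illustrative; Stage 2 reuses the same config with `learning_rate=2e-6`):

```python
from trl import SFTConfig

stage1_config = SFTConfig(
    output_dir="smolvlm2-500m-food-stage1",  # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,           # 8 x 4 = effective batch size 32
    learning_rate=2e-4,                      # Stage 2 drops this to 2e-6
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    remove_unused_columns=False,             # required with a custom collate_fn
)
```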

---

## Quick Start

### Installation

```bash
pip install transformers torch gradio spaces accelerate
```

### Inference with Pipeline

````python
import torch
from transformers import pipeline

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

pipe = pipeline(
    "image-text-to-text",
    model=FINE_TUNED_MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
Only return valid JSON in the following form:
```json
{
  "is_food": 0,
  "image_title": "",
  "food_items": [],
  "drink_items": []
}
```
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
````

### Inference without Pipeline

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

model = AutoModelForImageTextToText.from_pretrained(
    FINE_TUNED_MODEL_ID,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)

image = Image.open("path/to/your/image.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "YOUR_PROMPT_HERE"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(decoded)
```

---

## Gradio Demo

This Space runs a **side-by-side comparison** between the base model and the fine-tuned model.

### Running Locally

```bash
cd demos/FoodExtract-Vision
pip install -r requirements.txt
python app.py
```

### What the Demo Shows

1. **Upload** any image
2. **Compare** outputs from the base model vs. the fine-tuned model side-by-side
3. See how fine-tuning enables **reliable structured JSON extraction**

### Example Images Included

The demo comes with pre-loaded examples to try instantly.

---

## Project Structure

```
vlm_finetune/
├── 00_create_vlm_dataset.ipynb             # Dataset creation pipeline
├── 01-fine_tune_vlm.ipynb                  # First fine-tuning experiment (Gemma-3n)
├── 01-fine_tune_vlm-v2-smolVLM.ipynb       # SmolVLM 256M experiment
├── 01_fine_tune_vlm_v3_smolVLM_500m.ipynb  # Final: SmolVLM 500M two-stage training
├── qwen3-food270-inference-viewer.ipynb    # Dataset visualization tool
├── README.md                               # Root project README
├── data/
│   ├── food_dataset-2.jsonl                # Qwen3-VL-8B inference outputs
│   ├── food_labels_updated.json            # Processed food labels
│   ├── 10_images_270_class/                # 10 sample images per category
│   ├── food_all/                           # Merged dataset (food + not-food)
│   │   └── metadata.jsonl                  # HuggingFace imagefolder metadata
│   └── not_food/                           # Non-food images
└── demos/
    └── FoodExtract-Vision/
        ├── app.py                          # Gradio demo application
        ├── README.md                       # This file
        ├── requirements.txt                # Python dependencies
        └── examples/                       # Example images
            ├── 36741.jpg
            ├── IMG_3808.JPG
            └── istockphoto-175500494-612x612.jpg
```

---

## Key Learnings & Notes

### What Worked

- **Two-stage training** significantly improved output quality compared to single-stage
- **Freezing the vision encoder first** let the LLM learn the JSON format without vision interference
- **100x lower learning rate in Stage 2** (`2e-6` vs. `2e-4`) prevented catastrophic forgetting
- Even a **500M parameter model** can learn reliable structured output generation
- A **custom `collate_fn`** with proper label masking (pad tokens + image tokens → `-100`) was essential (see the sketch after this list)
- **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer`
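
A minimal sketch of such a collator, assuming the three-role `messages` format from Step 3, a `processor` loaded as in the Quick Start, and a SmolVLM-style `<image>` placeholder token:

```python
import torch

def collate_fn(examples):
    # Render each conversation to text; gather the PIL images from the user turn
    # (messages[1] is the user turn in the [system, user, assistant] format).
    texts = [
        processor.apply_chat_template(ex["messages"], tokenize=False)
        for ex in examples
    ]
    images = [
        [part["image"] for part in ex["messages"][1]["content"] if part["type"] == "image"]
        for ex in examples
    ]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # Labels start as a copy of the inputs; mask out tokens the loss should ignore.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100   # padding
    image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
    labels[labels == image_token_id] = -100                     # image placeholders
    batch["labels"] = labels
    return batch
```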

### Important Notes

- **Dtype consistency:** Model inputs must match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model); see the sketch after this list
- **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt
- **PIL images over bytes:** Using `format_data()` in a list comprehension instead of `dataset.map()` preserves PIL image types
- **Gradient checkpointing:** Set `use_reentrant=False` to avoid warnings and ensure compatibility
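
For the dtype point, one way to cast only the floating-point image tensors while leaving integer token ids alone (a sketch continuing the "Inference without Pipeline" snippet; `pixel_values` is the usual key for SmolVLM-style processors):

```python
import torch

# `inputs` comes from processor.apply_chat_template(..., return_tensors="pt").
inputs = inputs.to(model.device)
# Cast pixel values to the model's dtype; input_ids must stay integer.
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
```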

### Experiments Tried

| Notebook | Model | Approach | Result |
|---|---|---|---|
| `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | Works, but a larger model |
| `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | Limited capacity |
| `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | **Two-stage full fine-tune** | **Best results** |

---

## Links

| Resource | URL |
|---|---|
| Fine-tuned Model | [berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3) |
| Dataset | [berkeruveyik/vlm-food-4k-not-food-dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| Base Model | [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) |
| SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
| TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
| PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
| Gemma Vision QLoRA Fine-tune Guide | [ai.google.dev/gemma/docs](https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora?hl=tr) |

---

## License

This project is licensed under Apache 2.0. Please refer to the respective model and dataset cards for additional licensing information.

---

*Built with Transformers, TRL, and Gradio by [Berker Üveyik](https://huggingface.co/berkeruveyik)*