---
title: FoodExtract-Vision
emoji: 🍕
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
---

# ๐Ÿ•๐Ÿ” FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction

[![Model on HuggingFace](https://img.shields.io/badge/🤗%20Model-FoodExtract--Vision--SmolVLM2--500M-blue)](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
[![Dataset on HuggingFace](https://img.shields.io/badge/🤗%20Dataset-vlm--food--4k--not--food-green)](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
[![Base Model](https://img.shields.io/badge/🧠%20Base-SmolVLM2--500M--Video--Instruct-orange)](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
[![License](https://img.shields.io/badge/📄%20License-Apache%202.0-lightgrey)](https://www.apache.org/licenses/LICENSE-2.0)

---

## 📋 Overview

**FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces **structured JSON output** classifying whether food/drink items are visible and extracting them into organized lists.

Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this project demonstrates that even **small (~500M parameter) VLMs** can be fine-tuned to reliably produce structured outputs for domain-specific tasks — without needing PEFT/LoRA adapters.

> 💡 **Key Insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model **reliably generates valid JSON** matching the specified schema.

---

## 🎯 What Does It Do?

| | Input | Output |
|---|---|---|
| 📸 | Any image (food or non-food) | Structured JSON |

### Output Schema

```json
{
  "is_food": 1,
  "image_title": "Tandoori chicken with naan bread",
  "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
  "drink_items": ["lassi"]
}
```

| Field | Type | Description |
|---|---|---|
| `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
| `image_title` | `str` | Short food-related caption (blank if no food) |
| `food_items` | `list[str]` | List of visible edible food item nouns |
| `drink_items` | `list[str]` | List of visible drink item nouns |
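
For downstream use, the model's output can be parsed and lightly validated against this schema. A minimal sketch (the `parse_extraction` helper is illustrative, not part of the project; it also tolerates an optional Markdown fence around the JSON):

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse model output into the schema above, tolerating a Markdown fence."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional "json" tag) and the closing fence.
        text = text.split("```")[1].removeprefix("json").strip()
    data = json.loads(text)
    # Light checks against the documented field types.
    assert data["is_food"] in (0, 1)
    assert isinstance(data["image_title"], str)
    assert isinstance(data["food_items"], list)
    assert isinstance(data["drink_items"], list)
    return data
```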

---

## 🛠️ What Was Done — End-to-End Pipeline

This project covers the **full ML lifecycle** from dataset creation to deployment:

### Step 1: 📊 Dataset Creation (`00_create_vlm_dataset.ipynb`)

1. ๐Ÿท๏ธ Loaded food labels from `data/food_dataset-2.jsonl` (generated via Qwen3-VL-8B inference on Food270 images)
2. ๐Ÿ“ Added metadata fields (`image_id`, `image_name`, `food270_class_name`, `image_source`)
3. ๐Ÿ–ผ๏ธ Sampled **not-food images** from `data/not_food/` and created empty labels with `is_food = 0`
4. ๐Ÿ”€ Merged food + not-food labels into a unified dataset
5. ๐Ÿ“ Copied all images into `data/food_all/` and wrote `metadata.jsonl` for HuggingFace `imagefolder` format
6. ๐Ÿš€ Pushed to HuggingFace Hub as [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)

**Final dataset:** ~3,698 image-JSON pairs across **270 food categories** + not-food images
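
A minimal sketch of step 5's metadata writing (the `write_metadata` helper and the example values are illustrative; `file_name` is the one key the `imagefolder` loader requires, and the label fields follow the output schema above):

```python
import json
from pathlib import Path

def write_metadata(records: list[dict], out_dir: str = "data/food_all") -> None:
    """Write one JSON object per line; 'file_name' is relative to out_dir."""
    out_path = Path(out_dir) / "metadata.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Example record (hypothetical values):
write_metadata([{
    "file_name": "36741.jpg",
    "is_food": 1,
    "image_title": "Tandoori chicken with naan bread",
    "food_items": ["tandoori chicken", "naan bread"],
    "drink_items": [],
}])
```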

### Step 2: 🧪 Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)

- Tested `SmolVLM2-500M-Video-Instruct` on the food extraction task
- **Result:** The base model produced unstructured text like *"The given image is a food or drink item."* instead of valid JSON
- โŒ Base model **cannot** follow the structured output format

### Step 3: 📝 Data Formatting for SFT

Converted each sample to a **conversational message format** with three roles:

```
[SYSTEM] → Expert food extractor persona
[USER]   → Image + JSON extraction prompt
[ASSISTANT] → Ground truth JSON output
```

- Used `PIL.Image` objects directly (not bytes) to preserve image quality
- 80/20 train/validation split with `random.seed(42)` for reproducibility
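
A minimal sketch of this formatting step (prompts abbreviated; `train_split` and the `image`/`label` keys are assumptions about the dataset rows, and `format_data` is the helper referenced in Key Learnings below):

```python
def format_data(sample: dict) -> dict:
    """Turn one dataset row into the three-role conversation used for SFT."""
    return {
        "messages": [
            {"role": "system", "content": [
                {"type": "text", "text": "You are an expert food extractor ..."}]},
            {"role": "user", "content": [
                {"type": "image", "image": sample["image"]},  # PIL.Image, not bytes
                {"type": "text", "text": "Classify the given input image ..."}]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["label"]}]},  # ground-truth JSON string
        ]
    }

# List comprehension rather than dataset.map() so PIL image objects survive intact.
train_rows = [format_data(s) for s in train_split]
```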

### Step 4: 🧊 Stage 1 Training — Frozen Vision Encoder

- **Froze** the vision encoder (`model.model.vision_model`)
- **Trained** only the LLM + connector layers
- **Goal:** Teach the language model to output valid JSON structure
- Used `SFTTrainer` from TRL with custom `collate_fn` for image-text batching
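
The freeze itself is only a few lines; a sketch using the module path named above (`model` is the loaded `AutoModelForImageTextToText` from Quick Start):

```python
# Stage 1: freeze the vision tower so only the LLM + connector receive gradients.
for param in model.model.vision_model.parameters():
    param.requires_grad = False

# Sanity check: report how many parameters remain trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```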

### Step 5: 🔥 Stage 2 Training — Full Model Fine-tuning

- **Unfroze** the vision encoder
- **Trained** all parameters with a **100x lower learning rate** (`2e-6` vs `2e-4`)
- **Goal:** Allow the vision encoder to adapt for better food recognition without catastrophic forgetting

### Step 6: 📈 Evaluation & Comparison

- Compared outputs from 3 models side-by-side:
  - 🔴 **Pre-trained** (base model) — fails at structured output
  - 🟡 **Stage 1** (frozen vision) — learns JSON format
  - 🟢 **Stage 2** (full fine-tune) — best food recognition + JSON format

### Step 7: 🚀 Deployment

- Uploaded fine-tuned model to HuggingFace Hub
- Built Gradio demo with side-by-side comparison
- Deployed as a HuggingFace Space

---

## ๐Ÿ—๏ธ Architecture & Training Details

### 🧠 Base Model

| Property | Value |
|---|---|
| Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
| Parameters | ~500M |
| Precision | `bfloat16` |
| Attention | `eager` |

### 📊 Dataset

| Property | Value |
|---|---|
| Source | [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| Total Samples | ~3,698 image-JSON pairs |
| Train / Val Split | 80% / 20% |
| Food Categories | 270 (from Food270 dataset) |
| Non-food Images | Random internet images |
| Label Source | Qwen3-VL-8B inference outputs |

### 🔧 Two-Stage Training Strategy

Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576):

#### 🧊 Stage 1: LLM Alignment (Frozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | ❄️ Frozen |
| Trainable | LLM + connector layers |
| Learning Rate | `2e-4` |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |

#### 🔥 Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)

| Parameter | Value |
|---|---|
| Vision Encoder | 🔥 Unfrozen |
| Trainable | All parameters |
| Learning Rate | `2e-6` (100x lower than Stage 1) |
| Epochs | 2 |
| Batch Size | 8 × 4 gradient accumulation = effective 32 |
| Optimizer | `adamw_torch_fused` |
| LR Scheduler | `constant` |
| Warmup Ratio | `0.03` |
| Precision | `bf16` |
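
A sketch of how these hyperparameters map onto TRL's `SFTConfig` (Stage 2 values shown; Stage 1 is identical except `learning_rate=2e-4`; the `output_dir` name and the `dataset_kwargs` line are assumptions typical of VLM SFT setups, not confirmed by the notebooks):

```python
from trl import SFTConfig

stage2_args = SFTConfig(
    output_dir="smolvlm2-500m-food-stage2",    # illustrative name
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,             # effective batch size 32
    learning_rate=2e-6,                        # 100x lower than Stage 1's 2e-4
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    optim="adamw_torch_fused",
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    remove_unused_columns=False,               # required with a custom collate_fn
    dataset_kwargs={"skip_prepare_dataset": True},  # hand batching to collate_fn
)
```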

---

## 🚀 Quick Start

### 📦 Installation

```bash
pip install transformers torch gradio spaces accelerate
```

### 🔮 Inference with Pipeline

````python
import torch
from transformers import pipeline
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

pipe = pipeline(
    "image-text-to-text",
    model=FINE_TUNED_MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.

Only return valid JSON in the following form:

```json
{
  "is_food": 0,
  "image_title": "",
  "food_items": [],
  "drink_items": []
}
```
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
````

### 🧪 Inference without Pipeline

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

model = AutoModelForImageTextToText.from_pretrained(
    FINE_TUNED_MODEL_ID,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)

image = Image.open("path/to/your/image.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "YOUR_PROMPT_HERE"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(decoded)
```

---

## 🎮 Gradio Demo

This Space runs a **side-by-side comparison** between the base model and the fine-tuned model.

### ▶️ Running Locally

```bash
cd demos/FoodExtract-Vision
pip install -r requirements.txt
python app.py
```

### 🖥️ What the Demo Shows

1. 📤 **Upload** any image
2. 🔄 **Compare** outputs from the base model vs. the fine-tuned model side-by-side
3. 📊 See how fine-tuning enables **reliable structured JSON extraction** (a minimal sketch of the wiring follows)
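
A minimal sketch of that comparison wiring (not the actual `app.py`; `run_model`, `base_pipe`, and `finetuned_pipe` are illustrative stand-ins for the pipeline code in Quick Start):

```python
import gradio as gr

def compare(image):
    # Run both models on the same image and show the raw JSON side by side.
    return run_model(base_pipe, image), run_model(finetuned_pipe, image)

demo = gr.Interface(
    fn=compare,
    inputs=gr.Image(type="pil"),
    outputs=[gr.Textbox(label="Base model"), gr.Textbox(label="Fine-tuned model")],
    title="FoodExtract-Vision: base vs. fine-tuned",
)
demo.launch()
```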

### 📸 Example Images Included

The demo comes with pre-loaded examples to try instantly.

---

## 📁 Project Structure

```
vlm_finetune/
├── 📓 00_create_vlm_dataset.ipynb          # Dataset creation pipeline
├── 📓 01-fine_tune_vlm.ipynb               # First fine-tuning experiment (Gemma-3n)
├── 📓 01-fine_tune_vlm-v2-smolVLM.ipynb    # SmolVLM 256M experiment
├── 📓 01_fine_tune_vlm_v3_smolVLM_500m.ipynb # ✅ Final: SmolVLM 500M two-stage training
├── 📓 qwen3-food270-inference-viewer.ipynb  # Dataset visualization tool
├── 📄 README.md                            # Root project README
├── 📁 data/
│   ├── food_dataset-2.jsonl                # Qwen3-VL-8B inference outputs
│   ├── food_labels_updated.json            # Processed food labels
│   ├── 📁 10_images_270_class/             # 10 sample images per category
│   ├── 📁 food_all/                        # Merged dataset (food + not-food)
│   │   └── metadata.jsonl                  # HuggingFace imagefolder metadata
│   └── 📁 not_food/                        # Non-food images
└── 📁 demos/
    └── 📁 FoodExtract-Vision/
        ├── app.py                          # 🚀 Gradio demo application
        ├── README.md                       # 📖 This file
        ├── requirements.txt                # 📦 Python dependencies
        └── 📁 examples/                    # 🖼️ Example images
            ├── 36741.jpg
            ├── IMG_3808.JPG
            └── istockphoto-175500494-612x612.jpg
```

---

## 📝 Key Learnings & Notes

### ✅ What Worked

- ๐Ÿ—๏ธ **Two-stage training** significantly improved output quality compared to single-stage
- ๐ŸงŠ **Freezing the vision encoder first** let the LLM learn JSON format without vision interference
- ๐Ÿข **100x lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting
- ๐Ÿค Even a **500M parameter model** can learn reliable structured output generation
- ๐Ÿ“ **Custom `collate_fn`** with proper label masking (pad tokens + image tokens โ†’ `-100`) was essential
- ๐Ÿ”€ **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer`
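
A condensed sketch of that collator (`processor` is the `AutoProcessor` from Quick Start; rows are assumed to be shaped by `format_data` above, and the `<image>` lookup relies on SmolVLM's image placeholder token):

```python
def collate_fn(examples):
    # Render each conversation to text, then batch-process text + images together.
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False)
             for ex in examples]
    # Pull the PIL images back out of each conversation's turns.
    images = [[part["image"]
               for msg in ex["messages"] for part in msg["content"]
               if part["type"] == "image"]
              for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding
    image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
    labels[labels == image_token_id] = -100                    # ignore image placeholders
    batch["labels"] = labels
    return batch
```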

### ⚠️ Important Notes

- **Dtype consistency:** Model inputs must match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model)
- **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt (see the sketch below)
- **PIL images over bytes:** Using `format_data()` as a list comprehension instead of `dataset.map()` preserves PIL image types
- **Gradient checkpointing:** Set `use_reentrant=False` to avoid warnings and ensure compatibility
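
A sketch of that system-prompt folding (assumes the three-role message layout from Step 3; the helper name is illustrative):

```python
def fold_system_prompt(messages: list[dict]) -> list[dict]:
    """Merge a leading system turn into the first user turn's text content."""
    if not messages or messages[0]["role"] != "system":
        return messages
    system_text = messages[0]["content"][0]["text"]
    user_turn = dict(messages[1])
    user_turn["content"] = [
        part if part["type"] != "text"
        else {"type": "text", "text": system_text + "\n\n" + part["text"]}
        for part in user_turn["content"]
    ]
    return [user_turn] + messages[2:]
```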

### 🧪 Experiments Tried

| Notebook | Model | Approach | Result |
|---|---|---|---|
| `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | ✅ Works but larger model |
| `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | 🟡 Limited capacity |
| `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | **Two-stage full fine-tune** | ✅ **Best results** |

---

## 🔗 Links

| Resource | URL |
|---|---|
| 🤗 Fine-tuned Model | [berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3) |
| 🤗 Dataset | [berkeruveyik/vlm-food-4k-not-food-dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
| 🤗 Base Model | [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) |
| 📄 SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
| 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
| 📚 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
| 📚 Gemma Vision Fine-tune Guide (QLoRA) | [ai.google.dev/gemma/docs](https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora?hl=tr) |

---

## 📄 License

This project is released under the Apache 2.0 license. Please refer to the respective model and dataset cards for additional licensing information.

---

*Built with ❤️ using 🤗 Transformers, TRL, and Gradio — by [Berker Üveyik](https://huggingface.co/berkeruveyik)*