berkeruveyik committed on
Commit 4a1815c · verified · 1 Parent(s): 3162c9a

Uploading FoodExtract-Vision demo folder

Files changed (1):
  1. README.md +195 -68

README.md CHANGED
@@ -1,26 +1,41 @@
  # 🍕🍔 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction

  [![Model on HuggingFace](https://img.shields.io/badge/🤗%20Model-FoodExtract--Vision--SmolVLM2--500M-blue)](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
  [![Dataset on HuggingFace](https://img.shields.io/badge/🤗%20Dataset-vlm--food--4k--not--food-green)](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)

- ## 📋 Overview
-
- **FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that classifies images as food/not-food and extracts structured food and drink tags in JSON format. Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this model demonstrates that even small VLMs can be fine-tuned to reliably produce structured outputs for domain-specific tasks.
-
- ### 🎯 What Does It Do?
-
- - **Input:** Any image (food or non-food)
- - **Output:** Structured JSON containing:
-   - `is_food` — binary classification (0 or 1)
-   - `image_title` — short food-related caption
-   - `food_items` — list of visible food item nouns
-   - `drink_items` — list of visible drink item nouns
-
- ### 💡 Example Output
-
  ```json
  {
@@ -31,58 +46,130 @@
  }
  ```

  ---

- ## 🏗️ Architecture & Training Pipeline
-
- ### 🧠 Base Model
-
- - **Model:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- - **Parameters:** ~500M
- - **Precision:** `bfloat16`
-
- ### 📊 Dataset
-
- - **Source:** [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
- - **Size:** ~3,698 image-JSON pairs
- - **Split:** 80% train / 20% validation
- - **Content:**
-   - 🍔 Food images from the Food270 dataset (various cuisines, ingredients, prepared dishes)
-   - 🖼️ Non-food images (random internet images) to teach correct negative classification
-
- ### 🔧 Two-Stage Training Strategy
-
- Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576), the fine-tuning follows a two-stage approach:
-
- #### Stage 1: LLM Alignment (Frozen Vision Encoder) 🧊
-
- - **Goal:** Teach the language model to output the desired JSON structure
- - **Frozen:** Vision encoder parameters
- - **Trainable:** LLM + connector layers
- - **Learning Rate:** `2e-4`
- - **Epochs:** 2
- - **Batch Size:** 8 (with gradient accumulation of 4)
-
- #### Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder) 🔥
-
- - **Goal:** Allow the vision encoder to adapt for better food recognition
- - **Trainable:** All parameters (vision encoder + LLM + connector)
- - **Learning Rate:** `2e-6` (much lower to prevent catastrophic forgetting)
- - **Epochs:** 2
- - **Batch Size:** 8 (with gradient accumulation of 4)
-
- ### ⚙️ Training Configuration
-
- | Parameter | Stage 1 | Stage 2 |
- |---|---|---|
- | Optimizer | `adamw_torch_fused` | `adamw_torch_fused` |
- | Learning Rate | `2e-4` | `2e-6` |
- | LR Scheduler | `constant` | `constant` |
- | Warmup Ratio | `0.03` | `0.03` |
- | Max Grad Norm | `1.0` | `1.0` |
- | Precision | `bf16` | `bf16` |
- | Gradient Checkpointing | ✅ | ✅ |
- | Vision Encoder | ❄️ Frozen | 🔥 Unfrozen |

  ---

@@ -91,7 +178,7 @@ Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576), the f
  ### 📦 Installation

  ```bash
- pip install transformers torch gradio spaces
  ```

  ### 🔮 Inference with Pipeline
@@ -99,6 +186,7 @@ pip install transformers torch gradio spaces
  ```python
  import torch
  from transformers import pipeline

  FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

@@ -113,12 +201,14 @@ prompt = """Classify the given input image into food or not and if edible food o

  Only return valid JSON in the following form:

  {
    "is_food": 0,
    "image_title": "",
    "food_items": [],
    "drink_items": []
  }
  """

  messages = [
@@ -185,31 +275,54 @@ print(decoded)

  ## 🎮 Gradio Demo

  ### ▶️ Running Locally

  ```bash
- cd demos/FoodExtract-Vision-v1
  python app.py
  ```

- The demo launches a Gradio interface that lets you:
-
- 1. 📤 Upload any image
- 2. 🔄 Compare outputs from the **base model** vs. the **fine-tuned model** side-by-side
- 3. 📊 See structured JSON extraction in real-time

  ---

  ## 📁 Project Structure

  ```
- demos/FoodExtract-Vision-v1/
- ├── app.py        # 🚀 Gradio demo application
- ├── README.md     # 📖 This file
- └── examples/     # 🖼️ Example images for the demo
-     ├── 1.jpeg    # 📷 Non-food example
-     ├── 2.jpg     # 🍗 Food example
-     └── 3.jpeg    # 🍟 Food example
  ```

  ---
@@ -218,16 +331,27 @@ demos/FoodExtract-Vision-v1/

  ### ✅ What Worked

- - 🏗️ **Two-stage training** significantly improved output quality compared to single-stage training
- - 🧊 **Freezing the vision encoder first** allowed the LLM to learn the output format without interference
- - 🐢 **Lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting of Stage 1 progress
  - 🤏 Even a **500M parameter model** can learn reliable structured output generation

  ### ⚠️ Important Notes

- - **Dtype consistency:** Ensure model inputs match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model)
- - **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt to avoid errors
- - **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer`

  ---

@@ -241,11 +365,14 @@ demos/FoodExtract-Vision-v1/
  | 📄 SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
  | 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
  | 📚 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |

  ---

  ## 📄 License

- Please refer to the respective model and dataset cards for licensing information. The license is Apache 2.0.

  ---

+ ---
+ title: FoodExtract-Vision
+ emoji: 🍕
+ colorFrom: red
+ colorTo: yellow
+ sdk: gradio
+ sdk_version: "5.50.0"
+ python_version: "3.12"
+ app_file: app.py
+ pinned: false
+ ---
+
  # 🍕🍔 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction

  [![Model on HuggingFace](https://img.shields.io/badge/🤗%20Model-FoodExtract--Vision--SmolVLM2--500M-blue)](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
  [![Dataset on HuggingFace](https://img.shields.io/badge/🤗%20Dataset-vlm--food--4k--not--food-green)](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
+ [![Base Model](https://img.shields.io/badge/🧠%20Base-SmolVLM2--500M--Video--Instruct-orange)](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
+ [![License](https://img.shields.io/badge/📄%20License-Apache%202.0-lightgrey)](https://www.apache.org/licenses/LICENSE-2.0)

+ ---
+
+ ## 📋 Overview
+
+ **FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that takes any image as input and produces **structured JSON output** classifying whether food/drink items are visible and extracting them into organized lists.
+
+ Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this project demonstrates that even **small (~500M parameter) VLMs** can be fine-tuned to reliably produce structured outputs for domain-specific tasks — without needing PEFT/LoRA adapters.
+
+ > 💡 **Key Insight:** The base model often fails to follow the required JSON output structure, producing inconsistent or unstructured responses. After two-stage fine-tuning, the model **reliably generates valid JSON** matching the specified schema.
+
+ ---
+
+ ## 🎯 What Does It Do?
+
+ | | Input | Output |
+ |---|---|---|
+ | 📸 | Any image (food or non-food) | Structured JSON |
+
+ ### Output Schema
+
  ```json
  {
  }
  ```

+ | Field | Type | Description |
+ |---|---|---|
+ | `is_food` | `int` | `0` = no food/drink visible, `1` = food/drink visible |
+ | `image_title` | `str` | Short food-related caption (blank if no food) |
+ | `food_items` | `list[str]` | List of visible food item nouns |
+ | `drink_items` | `list[str]` | List of visible drink item nouns |
+
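The schema above lends itself to mechanical validation. A minimal sketch (not part of the repo; `SCHEMA` and `parse_and_validate` are illustrative names, using only the standard-library `json` module) of checking a model response against the four fields:

```python
import json

# Expected schema: field name -> required Python type (per the table above)
SCHEMA = {"is_food": int, "image_title": str, "food_items": list, "drink_items": list}

def parse_and_validate(raw: str) -> dict:
    """Parse a model response and verify it matches the FoodExtract schema."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, typ in SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], typ):
            raise ValueError(f"{field} should be {typ.__name__}")
    if data["is_food"] not in (0, 1):
        raise ValueError("is_food must be 0 or 1")
    return data

result = parse_and_validate('{"is_food": 1, "image_title": "pepperoni pizza", '
                            '"food_items": ["pizza"], "drink_items": []}')
print(result["food_items"])  # ['pizza']
```

A check like this is useful at evaluation time to count how often each model variant emits schema-conforming output.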
  ---

+ ## 🛠️ What Was Done — End-to-End Pipeline
+
+ This project covers the **full ML lifecycle** from dataset creation to deployment:
+
+ ### Step 1: 📊 Dataset Creation (`00_create_vlm_dataset.ipynb`)
+
+ 1. 🏷️ Loaded food labels from `data/food_dataset-2.jsonl` (generated via Qwen3-VL-8B inference on Food270 images)
+ 2. 📝 Added metadata fields (`image_id`, `image_name`, `food270_class_name`, `image_source`)
+ 3. 🖼️ Sampled **not-food images** from `data/not_food/` and created empty labels with `is_food = 0`
+ 4. 🔀 Merged food + not-food labels into a unified dataset
+ 5. 📁 Copied all images into `data/food_all/` and wrote `metadata.jsonl` for HuggingFace `imagefolder` format
+ 6. 🚀 Pushed to HuggingFace Hub as [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
+
+ **Final dataset:** ~3,698 image-JSON pairs across **270 food categories** + not-food images
+
+ ### Step 2: 🧪 Base Model Evaluation (`01_fine_tune_vlm_v3_smolVLM_500m.ipynb`)
+
+ - Tested `SmolVLM2-500M-Video-Instruct` on the food extraction task
+ - **Result:** The base model produced unstructured text like *"The given image is a food or drink item."* instead of valid JSON
+ - ❌ Base model **cannot** follow the structured output format
+
+ ### Step 3: 📝 Data Formatting for SFT
+
+ Converted each sample to a **conversational message format** with three roles:
+
+ ```
+ [SYSTEM]    → Expert food extractor persona
+ [USER]      → Image + JSON extraction prompt
+ [ASSISTANT] → Ground truth JSON output
+ ```
+
+ - Used `PIL.Image` objects directly (not bytes) to preserve image quality
+ - 80/20 train/validation split with `random.seed(42)` for reproducibility
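A seeded 80/20 split like the one described can be sketched in a few lines (illustrative only; `split_train_val` and `samples` are made-up names standing in for the notebook's formatted message list):

```python
import random

def split_train_val(samples, val_ratio=0.2, seed=42):
    """Shuffle deterministically, then carve off the last val_ratio as validation."""
    samples = list(samples)   # copy so the caller's sequence is untouched
    random.seed(seed)         # fixed seed -> reproducible split
    random.shuffle(samples)
    cut = int(len(samples) * (1 - val_ratio))
    return samples[:cut], samples[cut:]

train, val = split_train_val(range(3698))
print(len(train), len(val))  # 2958 740
```

Because the seed is fixed before shuffling, rerunning the notebook yields the same train/validation membership.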
 
 
 

+ ### Step 4: 🧊 Stage 1 Training — Frozen Vision Encoder
+
+ - **Froze** the vision encoder (`model.model.vision_model`)
+ - **Trained** only the LLM + connector layers
+ - **Goal:** Teach the language model to output valid JSON structure
+ - Used `SFTTrainer` from TRL with custom `collate_fn` for image-text batching
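Freezing a submodule in PyTorch amounts to flipping `requires_grad` off for its parameters. A minimal sketch of the Stage 1 freeze, using a toy `ModuleDict` in place of the real VLM (the real target is `model.model.vision_model`):

```python
import torch.nn as nn

# Toy stand-in for the VLM: a "vision" tower and a "text" tower
model = nn.ModuleDict({
    "vision_model": nn.Linear(16, 16),
    "text_model": nn.Linear(16, 16),
})

# Stage 1: freeze the vision encoder; everything else stays trainable
for param in model["vision_model"].parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only text_model parameters remain trainable
```

The optimizer then only updates parameters with `requires_grad=True`, so the vision weights are untouched during Stage 1.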
+
+ ### Step 5: 🔥 Stage 2 Training — Full Model Fine-tuning
+
+ - **Unfroze** the vision encoder
+ - **Trained** all parameters with a **100x lower learning rate** (`2e-6` vs `2e-4`)
+ - **Goal:** Allow the vision encoder to adapt for better food recognition without catastrophic forgetting
+
+ ### Step 6: 📈 Evaluation & Comparison
+
+ - Compared outputs from 3 models side-by-side:
+   - 🔴 **Pre-trained** (base model) — fails at structured output
+   - 🟡 **Stage 1** (frozen vision) — learns JSON format
+   - 🟢 **Stage 2** (full fine-tune) — best food recognition + JSON format
+
+ ### Step 7: 🚀 Deployment
+
+ - Uploaded fine-tuned model to HuggingFace Hub
+ - Built Gradio demo with side-by-side comparison
+ - Deployed as a HuggingFace Space
+
+ ---
+
+ ## 🏗️ Architecture & Training Details
+
+ ### 🧠 Base Model
+
+ | Property | Value |
+ |---|---|
+ | Model | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
+ | Parameters | ~500M |
+ | Precision | `bfloat16` |
+ | Attention | `eager` |
+
+ ### 📊 Dataset
+
+ | Property | Value |
+ |---|---|
+ | Source | [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
+ | Total Samples | ~3,698 image-JSON pairs |
+ | Train / Val Split | 80% / 20% |
+ | Food Categories | 270 (from Food270 dataset) |
+ | Non-food Images | Random internet images |
+ | Label Source | Qwen3-VL-8B inference outputs |
+
+ ### 🔧 Two-Stage Training Strategy
+
+ Inspired by the [SmolVLM Docling paper](https://arxiv.org/pdf/2503.11576):
+
+ #### 🧊 Stage 1: LLM Alignment (Frozen Vision Encoder)
+
+ | Parameter | Value |
+ |---|---|
+ | Vision Encoder | ❄️ Frozen |
+ | Trainable | LLM + connector layers |
+ | Learning Rate | `2e-4` |
+ | Epochs | 2 |
+ | Batch Size | 8 × 4 gradient accumulation = effective 32 |
+ | Optimizer | `adamw_torch_fused` |
+ | LR Scheduler | `constant` |
+ | Warmup Ratio | `0.03` |
+ | Precision | `bf16` |
+
+ #### 🔥 Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder)
+
+ | Parameter | Value |
+ |---|---|
+ | Vision Encoder | 🔥 Unfrozen |
+ | Trainable | All parameters |
+ | Learning Rate | `2e-6` (100x lower than Stage 1) |
+ | Epochs | 2 |
+ | Batch Size | 8 × 4 gradient accumulation = effective 32 |
+ | Optimizer | `adamw_torch_fused` |
+ | LR Scheduler | `constant` |
+ | Warmup Ratio | `0.03` |
+ | Precision | `bf16` |
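The two stages share every setting except the learning rate and which parameters train. The arithmetic behind the table values can be sanity-checked in plain Python (hypothetical dicts transcribing the tables, not the actual training code):

```python
# Hyperparameters transcribed from the two stage tables above
stage1 = {"lr": 2e-4, "batch_size": 8, "grad_accum": 4, "epochs": 2}
stage2 = {"lr": 2e-6, "batch_size": 8, "grad_accum": 4, "epochs": 2}

# Effective batch size = per-device batch size x gradient accumulation steps
effective_batch = stage1["batch_size"] * stage1["grad_accum"]

# Stage 2 runs at a ~100x lower learning rate to avoid catastrophic forgetting
lr_ratio = round(stage1["lr"] / stage2["lr"])

print(effective_batch, lr_ratio)  # 32 100
```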
 
  ---

  ### 📦 Installation

  ```bash
+ pip install transformers torch gradio spaces accelerate
  ```

  ### 🔮 Inference with Pipeline

  ```python
  import torch
  from transformers import pipeline
+ from PIL import Image

  FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3"

 

  Only return valid JSON in the following form:

+ ```json
  {
    "is_food": 0,
    "image_title": "",
    "food_items": [],
    "drink_items": []
  }
+ ```
  """

  messages = [
 

  ## 🎮 Gradio Demo

+ This Space runs a **side-by-side comparison** between the base model and the fine-tuned model.
+
  ### ▶️ Running Locally

  ```bash
+ cd demos/FoodExtract-Vision
+ pip install -r requirements.txt
  python app.py
  ```

+ ### 🖥️ What the Demo Shows
+
+ 1. 📤 **Upload** any image
+ 2. 🔄 **Compare** outputs from the base model vs. the fine-tuned model side-by-side
+ 3. 📊 See how fine-tuning enables **reliable structured JSON extraction**
+
+ ### 📸 Example Images Included
+
+ The demo comes with pre-loaded examples to try instantly.

  ---

  ## 📁 Project Structure

  ```
+ vlm_finetune/
+ ├── 📓 00_create_vlm_dataset.ipynb            # Dataset creation pipeline
+ ├── 📓 01-fine_tune_vlm.ipynb                 # First fine-tuning experiment (Gemma-3n)
+ ├── 📓 01-fine_tune_vlm-v2-smolVLM.ipynb      # SmolVLM 256M experiment
+ ├── 📓 01_fine_tune_vlm_v3_smolVLM_500m.ipynb # ✅ Final: SmolVLM 500M two-stage training
+ ├── 📓 qwen3-food270-inference-viewer.ipynb   # Dataset visualization tool
+ ├── 📄 README.md                              # Root project README
+ ├── 📁 data/
+ │   ├── food_dataset-2.jsonl                  # Qwen3-VL-8B inference outputs
+ │   ├── food_labels_updated.json              # Processed food labels
+ │   ├── 📁 10_images_270_class/               # 10 sample images per category
+ │   ├── 📁 food_all/                          # Merged dataset (food + not-food)
+ │   │   └── metadata.jsonl                    # HuggingFace imagefolder metadata
+ │   └── 📁 not_food/                          # Non-food images
+ └── 📁 demos/
+     └── 📁 FoodExtract-Vision/
+         ├── app.py                            # 🚀 Gradio demo application
+         ├── README.md                         # 📖 This file
+         ├── requirements.txt                  # 📦 Python dependencies
+         └── 📁 examples/                      # 🖼️ Example images
+             ├── 36741.jpg
+             ├── IMG_3808.JPG
+             └── istockphoto-175500494-612x612.jpg
  ```

  ---

  ### ✅ What Worked

+ - 🏗️ **Two-stage training** significantly improved output quality compared to single-stage
+ - 🧊 **Freezing the vision encoder first** let the LLM learn JSON format without vision interference
+ - 🐢 **100x lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting
  - 🤏 Even a **500M parameter model** can learn reliable structured output generation
+ - 📝 **Custom `collate_fn`** with proper label masking (pad tokens + image tokens → `-100`) was essential
+ - 🔀 **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer`
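The label-masking idea from the collator bullet can be shown framework-free: labels start as a copy of the input ids, then every pad and image-token position is set to `-100` so the cross-entropy loss ignores it. The token ids below are made up for illustration (the real ids come from the processor's tokenizer):

```python
IGNORE_INDEX = -100   # positions with this label are skipped by the loss
PAD_TOKEN_ID = 0      # hypothetical pad id
IMAGE_TOKEN_ID = 9    # hypothetical <image> placeholder id

def mask_labels(input_ids):
    """Copy the inputs as labels, masking out padding and image tokens."""
    return [IGNORE_INDEX if tok in (PAD_TOKEN_ID, IMAGE_TOKEN_ID) else tok
            for tok in input_ids]

batch_row = [9, 9, 5, 6, 7, 0, 0]    # [image, image, text..., pad, pad]
print(mask_labels(batch_row))         # [-100, -100, 5, 6, 7, -100, -100]
```

Without this masking, the model would be trained to "predict" padding and image placeholders, which degrades the learned JSON output.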

  ### ⚠️ Important Notes

+ - **Dtype consistency:** Model inputs must match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model)
+ - **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt
+ - **PIL images over bytes:** Using `format_data()` as a list comprehension instead of `dataset.map()` preserves PIL image types
+ - **Gradient checkpointing:** Set `use_reentrant=False` to avoid warnings and ensure compatibility
+
+ ### 🧪 Experiments Tried
+
+ | Notebook | Model | Approach | Result |
+ |---|---|---|---|
+ | `01-fine_tune_vlm.ipynb` | Gemma-3n-E2B | QLoRA + PEFT | ✅ Works, but a larger model |
+ | `01-fine_tune_vlm-v2-smolVLM.ipynb` | SmolVLM2-256M | Full fine-tune | 🟡 Limited capacity |
+ | `01_fine_tune_vlm_v3_smolVLM_500m.ipynb` | SmolVLM2-500M | **Two-stage full fine-tune** | ✅ **Best results** |
 

  ---

  | 📄 SmolVLM Docling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
  | 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
  | 📚 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
+ | 📚 Gemma Vision Fine-tune Guide (QLoRA) | [ai.google.dev/gemma/docs](https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora?hl=tr) |

  ---

  ## 📄 License

+ This project is licensed under Apache 2.0. Please refer to the respective model and dataset cards for additional licensing information.
+
  ---

+ *Built with ❤️ using 🤗 Transformers, TRL, and Gradio — by [Berker Üveyik](https://huggingface.co/berkeruveyik)*