msaid1976 committed
Commit fb863a7 · verified · 1 Parent(s): 5cc1a61

Update README.md

Files changed (1)
  1. README.md +297 -130
README.md CHANGED
@@ -1,202 +1,369 @@
 ---
 library_name: transformers
- language:
- - en
- base_model:
- - HuggingFaceTB/SmolVLM-Instruct
 ---

- # Model Card for Model ID
- <!-- Provide a quick summary of what the model is/does. -->

- ## Model Details

- ### Model Description
- <!-- Provide a longer summary of what this model is. -->
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- - **Developed by:** [Mohamed Mohamed Said Aly Amin]
- - **Funded by [optional]:** [APU - Asia Pacific University]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [Multi-Modal]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]
- <!-- Provide the basic links for the model. -->
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

- ## Uses
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- ### Direct Use
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- [More Information Needed]

- ### Downstream Use [optional]
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- [More Information Needed]

- ### Out-of-Scope Use
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- [More Information Needed]

- ## Bias, Risks, and Limitations
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
- [More Information Needed]

- ### Recommendations
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

- ## How to Get Started with the Model
- Use the code below to get started with the model.
- [More Information Needed]

- ## Training Details

- ### Training Data
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- [More Information Needed]

- ### Training Procedure
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]
- [More Information Needed]

- #### Training Hyperparameters
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- [More Information Needed]

- ## Evaluation
- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data
- <!-- This should link to a Dataset Card if possible. -->
- [More Information Needed]

- #### Factors
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
- [More Information Needed]

- #### Metrics
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
- [More Information Needed]

- ### Results
- [More Information Needed]

- #### Summary

- ## Model Examination [optional]
- <!-- Relevant interpretability work for the model goes here -->
- [More Information Needed]

- ## Environmental Impact
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

- ### Model Architecture and Objective
- [More Information Needed]

- ### Compute Infrastructure
- [More Information Needed]

- #### Hardware
- [More Information Needed]

- #### Software
- [More Information Needed]

- ## Citation [optional]
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
- **BibTeX:**
- [More Information Needed]
- **APA:**
- [More Information Needed]

- ## Glossary [optional]
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
- [More Information Needed]

- ## More Information [optional]
- [More Information Needed]

- ## Model Card Authors [optional]
- [More Information Needed]

- ## Model Card Contact
- [More Information Needed]
 ---
+ language: en
+ license: apache-2.0
+ base_model: HuggingFaceTB/SmolVLM-500M-Instruct
 library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - Vision
+ - Image-to-text
+ - Multimodal
+ - Vision-language-model
+ - Navigation
+ - Accessibility
+ - Assistive-technology
+ - Blind-assistance
+ - Fine-tuned
+ - SmolVLM
 ---

+ # SmolVLM Navigation Assistant 🦯

+ <div align="center">

+ [![Model](https://img.shields.io/badge/Model-SmolVLM--500M-blue)](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green)](https://www.apache.org/licenses/LICENSE-2.0)
+ [![BERTScore](https://img.shields.io/badge/BERTScore-91.6%25-brightgreen)](https://huggingface.co/metrics/bertscore)

+ **Fine-tuned vision-language model for blind navigation assistance**

+ [Quick Start](#-quick-start) • [Performance](#-performance) • [Usage](#-usage-examples) • [Training](#-training-details) • [Citation](#-citation)

+ </div>

+ ---
+ ## 📋 Overview

+ A fine-tuned version of [SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct) for **vision-based navigation assistance**, built to support blind and visually impaired users. Developed as a Master's thesis project at Asia Pacific University.

+ **Key Results:**
+ - 🎯 **91.6% BERTScore** (semantic accuracy)
+ - 🚀 **+3483% BLEU-1** improvement over the baseline
+ - ⚡ **0.5-1s** inference time
+ - 💾 **2-4GB VRAM** requirement
+ - 📊 **p < 0.001** statistical significance

+ - **Author:** Mohammad Mohamed Said Aly Amin
+ - **Supervisor:** Dr. Raheem Mafas
+ - **Institution:** Asia Pacific University
+ - **Program:** Master's in Data Science & Business Analytics

+ ---

+ ## Features

+ ### Three Navigation Modes

+ | Mode | Purpose | Response Length | Example Query |
+ |------|---------|-----------------|---------------|
+ | **🎯 FOCUSED** | Spatial relationships | 5-15 words | "Is there a chair to my left?" |
+ | **🌍 SCENE** | Environment description | 30-50 words | "Describe what's in front of me" |
+ | **📝 OCR** | Text recognition | Variable | "What does the sign say?" |

+ ### Technical Highlights

+ - ✅ Real-time inference on consumer GPUs
+ - ✅ Low memory footprint (2-4GB VRAM)
+ - ✅ Statistically validated improvements
+ - ✅ Production-ready deployment
+ - ✅ Efficient QLoRA fine-tuning (1.84% of parameters trained)

+ ---

+ ## 📊 Performance

+ ### Evaluation Results (500 samples)

+ | Metric | Fine-tuned | Baseline | Improvement |
+ |--------|-----------|----------|-------------|
+ | **BLEU** | 0.234 | - | - |
+ | **BLEU-1** | 24.89 | 0.69 | **+3483%** 🚀 |
+ | **ROUGE-1** | 55.72 | 13.66 | **+308%** |
+ | **ROUGE-2** | 32.46 | 2.69 | **+1105%** |
+ | **ROUGE-L** | 48.27 | 11.82 | **+308%** |
+ | **BERTScore** | 91.63 | 85.60 | **+7.04%** |
+ | **Length Ratio** | 0.93 | - | Close to reference length |

+ **Statistical Validation:** All improvements are significant at p < 0.001 (paired t-test, n = 500)
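
+ As a rough guide, scores like these can be reproduced with the 🤗 `evaluate` library and SciPy. The sketch below is an illustrative reconstruction, not the exact evaluation script used for this model; `preds_finetuned`, `preds_baseline`, and `references` are assumed to be parallel lists of strings from the 500-sample test split.

+ ```python
+ # Illustrative evaluation sketch (variable names are assumptions)
+ import evaluate
+ import numpy as np
+ from scipy import stats
+
+ bleu = evaluate.load("bleu")        # results["precisions"][0] approximates BLEU-1
+ rouge = evaluate.load("rouge")      # rouge1 / rouge2 / rougeL
+ bertscore = evaluate.load("bertscore")
+
+ def bertscore_f1(predictions, references):
+     # Per-sample BERTScore F1, used both for the mean and the significance test
+     return bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
+
+ ft_f1 = bertscore_f1(preds_finetuned, references)
+ base_f1 = bertscore_f1(preds_baseline, references)
+ print("Fine-tuned BERTScore F1:", np.mean(ft_f1))
+ print("ROUGE:", rouge.compute(predictions=preds_finetuned, references=references))
+ print("BLEU:", bleu.compute(predictions=preds_finetuned, references=references))
+
+ # Paired t-test over the 500 per-sample scores (reported above as p < 0.001)
+ t_stat, p_value = stats.ttest_rel(ft_f1, base_f1)
+ print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
+ ```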

+ ### Loss Convergence

+ - Initial training loss: **0.29** → final: **0.12** (58% reduction)
+ - Initial validation loss: **0.24** → final: **0.13** (46% reduction)

+ ---
+ ## 🚀 Quick Start

+ ### Installation

+ ```bash
+ pip install transformers torch pillow accelerate
+ ```

+ ### Basic Usage

+ ```python
+ from transformers import Idefics3ForConditionalGeneration, AutoProcessor
+ from PIL import Image
+ import torch
+
+ # Load the fine-tuned model and its processor
+ model = Idefics3ForConditionalGeneration.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     torch_dtype=torch.float16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ processor = AutoProcessor.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     trust_remote_code=True
+ )
+
+ # Prepare the input: one image plus a text query in chat format
+ image = Image.open("scene.jpg")
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "What do you see?"}
+     ]
+ }]
+
+ # Generate a response
+ prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+ inputs = processor(text=prompt, images=[image], return_tensors="pt")
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}  # move tensors to the model's device
+
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=150,
+         do_sample=False,
+         pad_token_id=processor.tokenizer.eos_token_id,
+         eos_token_id=processor.tokenizer.eos_token_id
+     )
+
+ # Decode only the newly generated tokens, skipping the prompt
+ response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```

+ ---

+ ## 💡 Usage Examples

+ ### FOCUSED: Spatial Queries

+ ```python
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "Is there a chair to the left of the table?"}
+     ]
+ }]
+ # Output: "Yes, there is a chair to the left of the table."
+ ```

+ ### SCENE: Environment Description

+ ```python
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "Describe the scene in front of me."}
+     ]
+ }]
+ # Output: "The scene shows a living room with a brown sofa on the left,
+ # a wooden coffee table in the center, and a TV on the wall..."
+ ```

+ ### OCR: Text Reading

+ ```python
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "What text is on the sign?"}
+     ]
+ }]
+ # Output: "The sign says 'EXIT' in red letters."
+ ```
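
+ All three modes share the same chat format and differ only in the question, so they can be wrapped in a small convenience function. The helper below is an illustrative sketch, not part of the released code; it assumes the `model` and `processor` loaded in Basic Usage above.

+ ```python
+ # Hypothetical convenience wrapper around the loaded model/processor
+ import torch
+
+ def ask(image, question, max_new_tokens=150):
+     messages = [{
+         "role": "user",
+         "content": [{"type": "image"}, {"type": "text", "text": question}],
+     }]
+     prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+     inputs = processor(text=prompt, images=[image], return_tensors="pt")
+     inputs = {k: v.to(model.device) for k, v in inputs.items()}
+     with torch.no_grad():
+         outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
+     return processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+
+ # All three modes go through the same call:
+ print(ask(image, "Is there a chair to my left?"))        # FOCUSED
+ print(ask(image, "Describe the scene in front of me."))  # SCENE
+ print(ask(image, "What text is on the sign?"))           # OCR
+ ```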

+ ### Memory Optimization

+ ```python
+ # 8-bit quantization (reduces inference memory to roughly 2GB VRAM);
+ # requires the bitsandbytes package
+ from transformers import BitsAndBytesConfig
+
+ model = Idefics3ForConditionalGeneration.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
+     device_map="auto"
+ )
+
+ # Batch processing: one prompt and one image list per sample
+ inputs = processor(
+     text=[prompt1, prompt2, prompt3],
+     images=[[img1], [img2], [img3]],
+     return_tensors="pt",
+     padding=True
+ )
+ ```
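
+ If 8-bit loading is still too large, 4-bit NF4 loading is a further option. The snippet below is a generic bitsandbytes sketch and has not been specifically validated for this checkpoint, so output quality may degrade slightly.

+ ```python
+ # Optional: 4-bit NF4 loading for very low-VRAM inference (generic sketch,
+ # not validated for this checkpoint)
+ import torch
+ from transformers import Idefics3ForConditionalGeneration, BitsAndBytesConfig
+
+ model = Idefics3ForConditionalGeneration.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     quantization_config=BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_quant_type="nf4",
+         bnb_4bit_compute_dtype=torch.float16,
+     ),
+     device_map="auto",
+ )
+ ```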

+ ---

+ ## 🛠️ Training Details

+ ### Configuration

+ | Parameter | Value | Description |
+ |-----------|-------|-------------|
+ | **Base Model** | SmolVLM-500M-Instruct | 500M parameters |
+ | **Method** | QLoRA | 4-bit quantization |
+ | **Trainable Params** | 42M (1.84%) | LoRA adapters only |
+ | **LoRA Rank** | 32 | Adapter dimension |
+ | **LoRA Alpha** | 64 | Scaling factor |
+ | **Epochs** | 3 | Full data passes |
+ | **Batch Size** | 1 (effective: 16) | With gradient accumulation |
+ | **Learning Rate** | 2e-5 | AdamW optimizer |
+ | **Precision** | BF16 | Mixed precision |
+ | **GPU** | RTX 5070 Ti 16GB | Training hardware |
+ | **Training Time** | ~20 hours | Total duration |
+ | **Peak VRAM** | 7-9GB | During training |
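
+ In PEFT terms, the table above corresponds roughly to the configuration sketched below. This is an assumed reconstruction from the listed hyperparameters, not the exact training script; in particular the target modules, dropout, and accumulation split are guesses.

+ ```python
+ # Assumed reconstruction of the QLoRA setup from the table above
+ import torch
+ from transformers import BitsAndBytesConfig
+ from peft import LoraConfig
+
+ # 4-bit base-model quantization (QLoRA)
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ # LoRA adapters: rank 32, alpha 64; target modules and dropout are guesses
+ lora_config = LoraConfig(
+     r=32,
+     lora_alpha=64,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     task_type="CAUSAL_LM",
+ )
+
+ # Trainer-level settings implied by the table: 3 epochs, effective batch size 16,
+ # learning rate 2e-5 with AdamW, bf16 mixed precision
+ training_kwargs = dict(
+     num_train_epochs=3,
+     per_device_train_batch_size=1,
+     gradient_accumulation_steps=16,
+     learning_rate=2e-5,
+     bf16=True,
+ )
+ ```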

+ ### Dataset

+ **Size:** 10,000+ samples across the three modes

+ **Sources:**
+ - GQA Enhanced (spatial reasoning)
+ - Localized Narratives (scene descriptions)
+ - Visual Genome (object relationships)
+ - TextCaps (text-in-image)
+ - VizWiz (accessibility focus)
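
+ Each source is presumably converted into the same image-plus-chat format used at inference time, with the target answer as the assistant turn. The sketch below only illustrates that idea; the field names (`image`, `question`, `answer`) are assumptions, as the released dataset schema is not documented here.

+ ```python
+ # Illustrative conversion of one raw sample into a chat-format training record.
+ # The input field names are assumptions, not the actual dataset schema.
+ def to_chat_sample(raw):
+     return {
+         "images": [raw["image"]],
+         "messages": [
+             {"role": "user", "content": [
+                 {"type": "image"},
+                 {"type": "text", "text": raw["question"]},  # FOCUSED / SCENE / OCR query
+             ]},
+             {"role": "assistant", "content": [
+                 {"type": "text", "text": raw["answer"]},
+             ]},
+         ],
+     }
+ ```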

+ ---

+ ## 💻 Hardware Requirements

+ | Use Case | GPU | RAM | Storage |
+ |----------|-----|-----|---------|
+ | **Inference** | 4GB+ VRAM | 8GB | 5GB |
+ | **Training** | 16GB VRAM | 32GB | 50GB |

+ **Recommended for Inference:** RTX 3060+ or equivalent
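
+ The 0.5-1 s inference figure can be sanity-checked on your own hardware with a quick timing loop such as the one below; it assumes the `model`, `processor`, and `image` from Basic Usage, and the numbers will vary with GPU, image resolution, and output length.

+ ```python
+ # Rough latency check using the model/processor/image from Basic Usage
+ import time
+ import torch
+
+ def time_inference(image, question, runs=5):
+     messages = [{"role": "user",
+                  "content": [{"type": "image"}, {"type": "text", "text": question}]}]
+     prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+     inputs = processor(text=prompt, images=[image], return_tensors="pt")
+     inputs = {k: v.to(model.device) for k, v in inputs.items()}
+     latencies = []
+     for _ in range(runs):
+         start = time.perf_counter()
+         with torch.no_grad():
+             model.generate(**inputs, max_new_tokens=50, do_sample=False)
+         latencies.append(time.perf_counter() - start)
+     return sum(latencies) / len(latencies)
+
+ print(f"Average latency over 5 runs: {time_inference(image, 'What do you see?'):.2f}s")
+ ```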

+ ---

+ ## ⚠️ Limitations

+ 1. **Scope:** Optimized for navigation; may underperform on general VQA
+ 2. **Image Quality:** Works best with well-lit, clear images
+ 3. **OCR:** Works best with printed text; struggles with handwriting
+ 4. **Speed:** Requires a GPU for real-time use (CPU: 10-20s per image)
+ 5. **Language:** English only

+ ### Safety Notice

+ ⚠️ **This is an assistive tool, not a replacement for traditional navigation aids.** Users should:
+ - Combine it with a cane, guide dog, or other mobility aids
+ - Exercise their own judgment
+ - Test it in safe environments first
+ - Be aware of potential errors

+ ---

+ ## 🎓 Model Card

+ ### Model Details

+ - **Type:** Vision-Language Model (Idefics3)
+ - **Parameters:** 500M total, 42M trainable (1.84%)
+ - **Input:** Image + text
+ - **Output:** Text
+ - **License:** Apache 2.0

+ ### Intended Use

+ **Primary:**
+ - Navigation assistance for blind and visually impaired users
+ - Spatial reasoning and object localization
+ - Scene understanding and description
+ - Text recognition in natural environments
+ - Accessibility research

+ **Out of Scope:**
+ - Medical diagnosis
+ - Autonomous navigation without human oversight
+ - Real-time video processing
+ - General-purpose VQA (use the base model instead)

+ ### Ethical Considerations

+ - Designed to enhance independence, not replace human judgment
+ - May carry biases from English-only training data
+ - Requires validation in real-world scenarios
+ - Processes images locally (no data collection)

+ ---

+ ## 📖 Citation

+ ```bibtex
+ @misc{alqahtani2025smolvlm_navigation,
+   author       = {Alqahtani, Muhammad Said},
+   title        = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation},
+   year         = {2025},
+   publisher    = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}}
+ }
+
+ @mastersthesis{alqahtani2025thesis,
+   author  = {Alqahtani, Muhammad Said},
+   title   = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
+   school  = {Asia Pacific University of Technology and Innovation},
+   year    = {2025},
+   address = {Kuala Lumpur, Malaysia}
+ }
+ ```

+ ---

+ ## 🙏 Acknowledgments

+ **Supervision:**
+ - Dr. Raheem Mafas (Research Supervisor)
+ - Asia Pacific University

+ **Technical:**
+ - Hugging Face team (base model & libraries)
+ - Unsloth (training framework)
+ - NVIDIA (GPU hardware)

+ **Datasets:**
+ - Stanford Visual Genome
+ - GQA, VizWiz, TextCaps
+ - Localized Narratives

+ ---

+ ## 📫 Contact

+ - **Author:** Mohammad Mohamed Said Aly Amin
+ - **Institution:** Asia Pacific University
+ - **Issues:** [Model Discussions](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned/discussions)

+ ---

+ <div align="center">

+ **Made with ❤️ for accessibility and inclusion**

+ [![HuggingFace](https://img.shields.io/badge/🤗-Model-yellow)](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

+ *Empowering independence through AI-powered vision assistance*

+ </div>