|
|
--- |
|
|
language: en |
|
|
license: apache-2.0 |
|
|
base_model: HuggingFaceTB/SmolVLM-500M-Instruct |
|
|
library_name: transformers |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- Vision |
|
|
- Image-to-text |
|
|
- Multimodal |
|
|
- Vision-language-model |
|
|
- Navigation |
|
|
- Accessibility |
|
|
- Assistive-technology |
|
|
- Blind-assistance |
|
|
- Fine-tuned |
|
|
- SmolVLM |
|
|
--- |
|
|
|
|
|
# SmolVLM Navigation Assistant 🦯 |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[Base model: SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct)
|
|
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
|
[Metric: BERTScore](https://huggingface.co/metrics/bertscore)
|
|
|
|
|
**Fine-tuned vision-language model for blind navigation assistance** |
|
|
|
|
|
[Quick Start](#-quick-start) • [Performance](#-performance) • [Usage](#-usage-examples) • [Training](#-training-details) • [Citation](#-citation)
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## 📋 Overview |
|
|
|
|
|
A fine-tuned version of [SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct) for **vision-based navigation assistance** for blind and visually impaired users, developed as a Master's thesis project at Asia Pacific University.
|
|
|
|
|
**Key Results:** |
|
|
- 🎯 **91.6% BERTScore** (semantic accuracy) |
|
|
- 🚀 **+3483% BLEU-1** improvement over baseline |
|
|
- ⚡ **0.5-1s inference** time |
|
|
- 💾 **2-4GB VRAM** requirement |
|
|
- 📊 **p < 0.001** statistical significance |
|
|
|
|
|
**Author:** Mohammad Mohamed Said Aly Amin |
|
|
**Supervisor:** Dr. Raheem Mafas |
|
|
**Institution:** Asia Pacific University |
|
|
**Program:** Master's in Data Science & Business Analytics |
|
|
|
|
|
--- |
|
|
|
|
|
## ✨ Features |
|
|
|
|
|
### Three Navigation Modes |
|
|
|
|
|
| Mode | Purpose | Response Length | Example Query | |
|
|
|------|---------|-----------------|---------------| |
|
|
| **🎯 FOCUSED** | Spatial relationships | 5-15 words | "Is there a chair to my left?" | |
|
|
| **🌍 SCENE** | Environment description | 30-50 words | "Describe what's in front of me" | |
|
|
| **📝 OCR** | Text recognition | Variable | "What does the sign say?" | |
|
|
|
|
|
### Technical Highlights |
|
|
|
|
|
- ✅ Real-time inference on consumer GPUs |
|
|
- ✅ Low memory footprint (2-4GB VRAM) |
|
|
- ✅ Statistically validated improvements |
|
|
- ✅ Production-ready deployment |
|
|
- ✅ Efficient QLoRA fine-tuning (only 1.84% of parameters trained)
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Performance |
|
|
|
|
|
### Evaluation Results (500 samples) |
|
|
|
|
|
| Metric | Fine-tuned | Baseline | Improvement | |
|
|
|--------|-----------|----------|-------------| |
|
|
| **BLEU** | 0.234 | - | - | |
|
|
| **BLEU-1** | 24.89 | 0.69 | **+3483%** 🚀 | |
|
|
| **ROUGE-1** | 55.72 | 13.66 | **+308%** | |
|
|
| **ROUGE-2** | 32.46 | 2.69 | **+1105%** | |
|
|
| **ROUGE-L** | 48.27 | 11.82 | **+308%** | |
|
|
| **BERTScore** | 91.63 | 85.60 | **+7.04%** | |
|
|
| **Length Ratio** | 0.93 | - | Close to ideal (1.0) |
|
|
|
|
|
**Statistical Validation:** All improvements significant at p < 0.001 (paired t-test, n=500) |
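
The numbers above come from the thesis evaluation. As a rough illustration of how such a comparison could be reproduced (the metric library and function names below are assumptions for a sketch, not the exact evaluation script), per-sample scores from the fine-tuned and baseline models can be compared with a paired t-test:

```python
# Sketch only: paired significance test on per-sample BERTScore F1.
# Assumes lists of generated answers from both systems plus shared references.
from bert_score import score as bert_score  # pip install bert-score
from scipy.stats import ttest_rel           # pip install scipy

def compare_models(finetuned_preds, baseline_preds, references):
    # Per-sample BERTScore F1 for each system against the same references
    _, _, f1_ft = bert_score(finetuned_preds, references, lang="en")
    _, _, f1_base = bert_score(baseline_preds, references, lang="en")

    # Paired t-test: the same evaluation samples are scored by both systems
    t_stat, p_value = ttest_rel(f1_ft.tolist(), f1_base.tolist())
    return f1_ft.mean().item(), f1_base.mean().item(), p_value
```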
|
|
|
|
|
### Loss Convergence |
|
|
|
|
|
- Initial Training Loss: **0.29** → Final: **0.12** (58% reduction) |
|
|
- Initial Val Loss: **0.24** → Final: **0.13** (46% reduction) |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Quick Start |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch pillow accelerate |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
from transformers import Idefics3ForConditionalGeneration, AutoProcessor |
|
|
from PIL import Image |
|
|
import torch |
|
|
|
|
|
# Load model |
|
|
model = Idefics3ForConditionalGeneration.from_pretrained( |
|
|
"msaid1976/SmolVLM-Instruct-Navigation-FineTuned", |
|
|
torch_dtype=torch.float16, |
|
|
device_map="auto", |
|
|
trust_remote_code=True |
|
|
) |
|
|
processor = AutoProcessor.from_pretrained( |
|
|
"msaid1976/SmolVLM-Instruct-Navigation-FineTuned", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
# Prepare input |
|
|
image = Image.open("scene.jpg") |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "What do you see?"} |
|
|
] |
|
|
}] |
|
|
|
|
|
# Generate |
|
|
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) |
|
|
inputs = processor(text=prompt, images=[image], return_tensors="pt") |
|
|
inputs = {k: v.to("cuda") for k, v in inputs.items()} |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=150, |
|
|
do_sample=False, |
|
|
pad_token_id=processor.tokenizer.eos_token_id, |
|
|
eos_token_id=processor.tokenizer.eos_token_id |
|
|
) |
|
|
|
|
|
response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 💡 Usage Examples |
|
|
|
|
|
### FOCUSED: Spatial Queries |
|
|
|
|
|
```python |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "Is there a chair to the left of the table?"} |
|
|
] |
|
|
}] |
|
|
# Output: "Yes, there is a chair to the left of the table." |
|
|
``` |
|
|
|
|
|
### SCENE: Environment Description |
|
|
|
|
|
```python |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "Describe the scene in front of me."} |
|
|
] |
|
|
}] |
|
|
# Output: "The scene shows a living room with a brown sofa on the left, |
|
|
# a wooden coffee table in the center, and a TV on the wall..." |
|
|
``` |
|
|
|
|
|
### OCR: Text Reading |
|
|
|
|
|
```python |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "What text is on the sign?"} |
|
|
] |
|
|
}] |
|
|
# Output: "The sign says 'EXIT' in red letters." |
|
|
``` |
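
Because response length differs sharply across the three modes, it can help to cap `max_new_tokens` per mode at inference time. The token budgets below are an illustrative assumption, not settings shipped with the model; the helper reuses `model` and `processor` from the Quick Start snippet:

```python
import torch

# Hypothetical per-mode generation budgets (illustrative only):
# short spatial answers for FOCUSED, longer text for SCENE and OCR.
MODE_MAX_NEW_TOKENS = {"FOCUSED": 40, "SCENE": 120, "OCR": 150}

def generate_for_mode(mode, inputs):
    """Greedy decoding with a token budget matched to the navigation mode."""
    with torch.no_grad():
        return model.generate(
            **inputs,
            max_new_tokens=MODE_MAX_NEW_TOKENS[mode],
            do_sample=False,
            pad_token_id=processor.tokenizer.eos_token_id,
        )
```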
|
|
|
|
|
### Memory Optimization |
|
|
|
|
|
```python |
|
|
# 8-bit quantization (reduces to ~2GB VRAM; requires bitsandbytes)
from transformers import BitsAndBytesConfig

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
|
|
|
|
|
# Batch processing |
|
|
inputs = processor( |
|
|
text=[prompt1, prompt2, prompt3], |
|
|
images=[[img1], [img2], [img3]], |
|
|
return_tensors="pt", |
|
|
padding=True |
|
|
) |
|
|
``` |
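
For even tighter memory budgets, 4-bit loading (the same quantization family used for QLoRA training) is another option. This is a sketch assuming `bitsandbytes` is installed, not a configuration validated in the thesis:

```python
import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration

# 4-bit NF4 quantization: further reduces VRAM at some cost in output quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=bnb_config,
    device_map="auto",
)
```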
|
|
|
|
|
--- |
|
|
|
|
|
## 🛠️ Training Details |
|
|
|
|
|
### Configuration |
|
|
|
|
|
| Parameter | Value | Description | |
|
|
|-----------|-------|-------------| |
|
|
| **Base Model** | SmolVLM-500M-Instruct | 500M parameters | |
|
|
| **Method** | QLoRA | 4-bit quantization | |
|
|
| **Trainable Params** | 42M (1.84%) | LoRA adapters only | |
|
|
| **LoRA Rank** | 32 | Adapter dimension | |
|
|
| **LoRA Alpha** | 64 | Scaling factor | |
|
|
| **Epochs** | 3 | Full data passes | |
|
|
| **Batch Size** | 1 (effective: 16) | With gradient accumulation | |
|
|
| **Learning Rate** | 2e-5 | AdamW optimizer | |
|
|
| **Precision** | BF16 | Mixed precision | |
|
|
| **GPU** | RTX 5070 Ti 16GB | Training hardware | |
|
|
| **Training Time** | ~20 hours | Total duration | |
|
|
| **Peak VRAM** | 7-9GB | During training | |
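
The hyperparameters above map onto a standard PEFT/transformers setup. The sketch below mirrors the reported configuration (rank 32, alpha 64, 4-bit base model, BF16, effective batch size 16); the target modules, dropout, and exact trainer wiring are assumptions, since the full training script is not published here:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit (QLoRA) loading of the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters: rank 32, alpha 64 (~42M trainable parameters, 1.84%)
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,                                         # assumption: not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption: not reported
    task_type="CAUSAL_LM",
)

# Effective batch size 16 = per-device batch 1 x 16 gradient-accumulation steps
training_args = TrainingArguments(
    output_dir="smolvlm-navigation-qlora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    bf16=True,
    optim="adamw_torch",
)
```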
|
|
|
|
|
### Dataset |
|
|
|
|
|
**Size:** 10,000+ samples across three modes |
|
|
|
|
|
**Sources:** |
|
|
- GQA Enhanced (spatial reasoning) |
|
|
- Localized Narratives (scene descriptions) |
|
|
- Visual Genome (object relationships) |
|
|
- TextCaps (text-in-image) |
|
|
- VizWiz (accessibility focus) |
|
|
|
|
|
--- |
|
|
|
|
|
## 💻 Hardware Requirements |
|
|
|
|
|
| Use Case | GPU | RAM | Storage | |
|
|
|----------|-----|-----|---------| |
|
|
| **Inference** | 4GB+ VRAM | 8GB | 5GB | |
|
|
| **Training** | 16GB VRAM | 32GB | 50GB | |
|
|
|
|
|
**Recommended for Inference:** RTX 3060+ or equivalent |
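
A minimal device check can pick the right precision automatically; the snippet below simply falls back to CPU (expect the 10-20 s/image latency noted under Limitations) when no GPU is available:

```python
import torch
from transformers import Idefics3ForConditionalGeneration

# Use GPU + fp16 when available, otherwise CPU + fp32 (much slower inference).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    torch_dtype=dtype,
).to(device)
```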
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Limitations |
|
|
|
|
|
1. **Scope:** Optimized for navigation; may underperform on general VQA |
|
|
2. **Image Quality:** Best with well-lit, clear images |
|
|
3. **OCR:** Works best with printed text; struggles with handwriting |
|
|
4. **Speed:** Requires GPU for real-time use (CPU: 10-20s/image) |
|
|
5. **Language:** English only |
|
|
|
|
|
### Safety Notice |
|
|
|
|
|
⚠️ **This is an assistive tool, not a replacement for traditional navigation aids.** Users should: |
|
|
- Combine with cane, guide dog, or other mobility aids |
|
|
- Exercise human judgment |
|
|
- Test in safe environments first |
|
|
- Be aware of potential errors |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎓 Model Card |
|
|
|
|
|
### Model Details |
|
|
|
|
|
- **Type:** Vision-Language Model (Idefics3) |
|
|
- **Parameters:** 500M total, 42M trainable (1.84%) |
|
|
- **Input:** Image + Text |
|
|
- **Output:** Text |
|
|
- **License:** Apache 2.0 |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**Primary:** |
|
|
- Navigation assistance for blind/visually impaired |
|
|
- Spatial reasoning and object localization |
|
|
- Scene understanding and description |
|
|
- Text recognition in natural environments |
|
|
- Accessibility research |
|
|
|
|
|
**Out of Scope:** |
|
|
- Medical diagnosis |
|
|
- Autonomous navigation without human oversight |
|
|
- Real-time video processing |
|
|
- General-purpose VQA (use base model) |
|
|
|
|
|
### Ethical Considerations |
|
|
|
|
|
- Designed to enhance independence, not replace human judgment |
|
|
- May have biases from English-only training data |
|
|
- Requires validation in real-world scenarios |
|
|
- Processes images locally (no data collection) |
|
|
|
|
|
--- |
|
|
|
|
|
## 📖 Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{alqahtani2025smolvlm_navigation, |
|
|
author = {Alqahtani, Muhammad Said}, |
|
|
title = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation}, |
|
|
year = {2025}, |
|
|
publisher = {HuggingFace}, |
|
|
howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}} |
|
|
} |
|
|
|
|
|
@mastersthesis{alqahtani2025thesis, |
|
|
author = {Alqahtani, Muhammad Said}, |
|
|
title = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired}, |
|
|
school = {Asia Pacific University of Technology and Innovation}, |
|
|
year = {2025}, |
|
|
address = {Kuala Lumpur, Malaysia} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
**Supervision:** |
|
|
- Dr. Raheem Mafas (Research Supervisor) |
|
|
- Asia Pacific University |
|
|
|
|
|
**Technical:** |
|
|
- HuggingFace Team (base model & libraries) |
|
|
- Unsloth (training framework) |
|
|
- NVIDIA (GPU hardware) |
|
|
|
|
|
**Datasets:** |
|
|
- Stanford Visual Genome |
|
|
- GQA, VizWiz, TextCaps |
|
|
- Localized Narratives |
|
|
|
|
|
--- |
|
|
|
|
|
## 📫 Contact |
|
|
|
|
|
**Author:** Mohammad Mohamed Said Aly Amin |
|
|
**Institution:** Asia Pacific University |
|
|
**Issues:** [Model Discussions](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned/discussions) |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Made with ❤️ for accessibility and inclusion** |
|
|
|
|
|
[Model on Hugging Face](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned)
|
|
[License: Apache 2.0](LICENSE)
|
|
|
|
|
*Empowering independence through AI-powered vision assistance* |
|
|
|
|
|
</div> |