---
language: en
license: apache-2.0
base_model: HuggingFaceTB/SmolVLM-500M-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- Vision
- Image-to-text
- Multimodal
- Vision-language-model
- Navigation
- Accessibility
- Assistive-technology
- Blind-assistance
- Fine-tuned
- SmolVLM
---
# SmolVLM Navigation Assistant 🦯
[Base Model: SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct)
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
[Metric: BERTScore](https://huggingface.co/metrics/bertscore)
**Fine-tuned vision-language model for blind navigation assistance**
[Quick Start](#-quick-start) • [Performance](#-performance) • [Usage](#-usage) • [Training](#-training-details) • [Citation](#-citation)
---
## 📋 Overview
Fine-tuned [SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct) providing **vision-based navigation assistance** to blind and visually impaired users. Developed as a Master's thesis project at Asia Pacific University.

**Key Results:**
- 🎯 **91.6% BERTScore** (semantic accuracy)
- 🚀 **+3483% BLEU-1** improvement over baseline
- ⚡ **0.5-1s inference** time
- 💾 **2-4GB VRAM** requirement
- 📊 **p < 0.001** statistical significance
- **Author:** Mohammad Mohamed Said Aly Amin
- **Supervisor:** Dr. Raheem Mafas
- **Institution:** Asia Pacific University
- **Program:** Master's in Data Science & Business Analytics
---
## ✨ Features
### Three Navigation Modes
| Mode | Purpose | Response Length | Example Query |
|------|---------|-----------------|---------------|
| **🎯 FOCUSED** | Spatial relationships | 5-15 words | "Is there a chair to my left?" |
| **🌍 SCENE** | Environment description | 30-50 words | "Describe what's in front of me" |
| **📝 OCR** | Text recognition | Variable | "What does the sign say?" |
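Since the three modes differ mainly in expected response length, a deployment can map the detected mode to a generation budget. A minimal, purely illustrative sketch (the keyword heuristics and token counts below are assumptions for demonstration, not part of the released model):

```python
# Illustrative mapping from navigation mode to a generation budget.
# Mode names follow the table above; token counts are rough assumptions
# (roughly 1.3 tokens per English word, plus some headroom).
MODE_BUDGETS = {
    "FOCUSED": 20,   # 5-15 words  -> short spatial answer
    "SCENE":   70,   # 30-50 words -> fuller description
    "OCR":     150,  # variable    -> headroom for long signs
}

def max_tokens_for(query: str) -> int:
    """Pick a max_new_tokens budget from simple keyword heuristics."""
    q = query.lower()
    if any(w in q for w in ("say", "text", "sign", "read", "label")):
        return MODE_BUDGETS["OCR"]
    if any(w in q for w in ("describe", "scene", "around", "front of me")):
        return MODE_BUDGETS["SCENE"]
    return MODE_BUDGETS["FOCUSED"]

print(max_tokens_for("What does the sign say?"))       # 150
print(max_tokens_for("Describe the scene ahead."))     # 70
print(max_tokens_for("Is there a chair to my left?"))  # 20
```

The budget can then be passed as `max_new_tokens` to `model.generate` in the Quick Start snippet.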
### Technical Highlights
- ✅ Real-time inference on consumer GPUs
- ✅ Low memory footprint (2-4GB VRAM)
- ✅ Statistically validated improvements
- ✅ Production-ready deployment
- ✅ QLoRA efficient fine-tuning (1.84% parameters)
---
## 📊 Performance
### Evaluation Results (500 samples)
| Metric | Fine-tuned | Baseline | Improvement |
|--------|-----------|----------|-------------|
| **BLEU** | 0.234 | - | - |
| **BLEU-1** | 24.89 | 0.69 | **+3483%** 🚀 |
| **ROUGE-1** | 55.72 | 13.66 | **+308%** |
| **ROUGE-2** | 32.46 | 2.69 | **+1105%** |
| **ROUGE-L** | 48.27 | 11.82 | **+308%** |
| **BERTScore** | 91.63 | 85.60 | **+7.04%** |
| **Length Ratio** | 0.93 | - | Close to the ideal of 1.0 |
**Statistical Validation:** All improvements significant at p < 0.001 (paired t-test, n=500)
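The paired t-test behind these p-values compares per-sample metric differences between the two models. A self-contained sketch with toy scores (illustrative numbers, not the actual evaluation data):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Paired-samples t statistic: t = mean(d) / (sd(d) / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Toy per-sample BERTScores (illustrative only, NOT the thesis data)
finetuned = [0.92, 0.90, 0.93, 0.91, 0.94]
baseline  = [0.85, 0.86, 0.84, 0.86, 0.85]

t = paired_t_statistic(finetuned, baseline)
print(f"t = {t:.2f} on {len(finetuned) - 1} degrees of freedom")
```

The actual evaluation compares n = 500 paired samples per metric; with that many degrees of freedom even modest per-sample gains produce very small p-values.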
### Loss Convergence
- Initial Training Loss: **0.29** → Final: **0.12** (58% reduction)
- Initial Val Loss: **0.24** → Final: **0.13** (46% reduction)
---
## 🚀 Quick Start
### Installation
```bash
pip install transformers torch pillow accelerate
```
### Basic Usage
```python
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    trust_remote_code=True
)

# Prepare input
image = Image.open("scene.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What do you see?"}
    ]
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # works on CPU or GPU

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
---
## 💡 Usage Examples
### FOCUSED: Spatial Queries
```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is there a chair to the left of the table?"}
    ]
}]
# Output: "Yes, there is a chair to the left of the table."
```
### SCENE: Environment Description
```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the scene in front of me."}
    ]
}]
# Output: "The scene shows a living room with a brown sofa on the left,
# a wooden coffee table in the center, and a TV on the wall..."
```
### OCR: Text Reading
```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What text is on the sign?"}
    ]
}]
# Output: "The sign says 'EXIT' in red letters."
```
### Memory Optimization
```python
from transformers import BitsAndBytesConfig

# 8-bit quantization (reduces to ~2GB VRAM)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Batch processing
inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[[img1], [img2], [img3]],
    return_tensors="pt",
    padding=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
responses = processor.batch_decode(outputs, skip_special_tokens=True)
```
---
## 🛠️ Training Details
### Configuration
| Parameter | Value | Description |
|-----------|-------|-------------|
| **Base Model** | SmolVLM-500M-Instruct | 500M parameters |
| **Method** | QLoRA | 4-bit quantization |
| **Trainable Params** | 42M (1.84%) | LoRA adapters only |
| **LoRA Rank** | 32 | Adapter dimension |
| **LoRA Alpha** | 64 | Scaling factor |
| **Epochs** | 3 | Full data passes |
| **Batch Size** | 1 (effective: 16) | With gradient accumulation |
| **Learning Rate** | 2e-5 | AdamW optimizer |
| **Precision** | BF16 | Mixed precision |
| **GPU** | RTX 5070 Ti 16GB | Training hardware |
| **Training Time** | ~20 hours | Total duration |
| **Peak VRAM** | 7-9GB | During training |
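The LoRA rank and alpha above define the standard low-rank update W' = W + (alpha/r)·B·A, giving a scaling factor of 64/32 = 2.0. A toy pure-Python illustration (tiny 2×2 matrices purely for readability; the model itself uses r=32 adapters, and which projection layers they target is not specified here):

```python
# Toy LoRA update: W' = W + (alpha / r) * (B @ A).
# The fine-tune uses r=32, alpha=64, so the scaling factor alpha/r = 2.0;
# the 2x2 matrices below are illustrative only.
r, alpha = 32, 64
scaling = alpha / r  # 2.0

def matmul(X, Y):
    """Naive matrix multiply for small nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
B = [[0.1], [0.2]]            # low-rank factor (out_dim x rank)
A = [[0.5, 0.5]]              # low-rank factor (rank x in_dim)

delta = matmul(B, A)
W_prime = [[W[i][j] + scaling * delta[i][j] for j in range(2)] for i in range(2)]
print([[round(x, 6) for x in row] for row in W_prime])  # [[1.1, 0.1], [0.2, 1.2]]
```

Only B and A are trained; W stays frozen, which is why just 42M of the parameters receive gradients.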
### Dataset
**Size:** 10,000+ samples across three modes

**Sources:**
- GQA Enhanced (spatial reasoning)
- Localized Narratives (scene descriptions)
- Visual Genome (object relationships)
- TextCaps (text-in-image)
- VizWiz (accessibility focus)
---
## 💻 Hardware Requirements
| Use Case | GPU | RAM | Storage |
|----------|-----|-----|---------|
| **Inference** | 4GB+ VRAM | 8GB | 5GB |
| **Training** | 16GB VRAM | 32GB | 50GB |
**Recommended for Inference:** RTX 3060+ or equivalent
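The VRAM figures can be sanity-checked with a back-of-envelope weight-memory estimate (weights only; activations, KV cache, and framework overhead account for the rest of the 2-4 GB inference budget):

```python
# Back-of-envelope weight memory for a 500M-parameter model.
# Weights alone do not tell the whole story: activations, KV cache,
# and CUDA overhead are why the table above budgets 4GB+ for inference.
params = 500_000_000

def weight_gib(params: int, bytes_per_param: float) -> float:
    """Weight memory in GiB for a given precision."""
    return params * bytes_per_param / 1024**3

fp16 = weight_gib(params, 2)  # ~0.93 GiB
int8 = weight_gib(params, 1)  # ~0.47 GiB
print(f"fp16: {fp16:.2f} GiB, int8: {int8:.2f} GiB")
```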
---
## ⚠️ Limitations
1. **Scope:** Optimized for navigation; may underperform on general VQA
2. **Image Quality:** Best with well-lit, clear images
3. **OCR:** Works best with printed text; struggles with handwriting
4. **Speed:** Requires GPU for real-time use (CPU: 10-20s/image)
5. **Language:** English only
### Safety Notice
⚠️ **This is an assistive tool, not a replacement for traditional navigation aids.** Users should:
- Combine with cane, guide dog, or other mobility aids
- Exercise human judgment
- Test in safe environments first
- Be aware of potential errors
---
## 🎓 Model Card
### Model Details
- **Type:** Vision-Language Model (Idefics3)
- **Parameters:** 500M total, 42M trainable (1.84%)
- **Input:** Image + Text
- **Output:** Text
- **License:** Apache 2.0
### Intended Use
**Primary:**
- Navigation assistance for blind/visually impaired
- Spatial reasoning and object localization
- Scene understanding and description
- Text recognition in natural environments
- Accessibility research
**Out of Scope:**
- Medical diagnosis
- Autonomous navigation without human oversight
- Real-time video processing
- General-purpose VQA (use base model)
### Ethical Considerations
- Designed to enhance independence, not replace human judgment
- May have biases from English-only training data
- Requires validation in real-world scenarios
- Processes images locally (no data collection)
---
## 📖 Citation
```bibtex
@misc{alqahtani2025smolvlm_navigation,
  author       = {Alqahtani, Muhammad Said},
  title        = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}}
}

@mastersthesis{alqahtani2025thesis,
  author  = {Alqahtani, Muhammad Said},
  title   = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school  = {Asia Pacific University of Technology and Innovation},
  year    = {2025},
  address = {Kuala Lumpur, Malaysia}
}
```
---
## 🙏 Acknowledgments
**Supervision:**
- Dr. Raheem Mafas (Research Supervisor)
- Asia Pacific University
**Technical:**
- HuggingFace Team (base model & libraries)
- Unsloth (training framework)
- NVIDIA (GPU hardware)
**Datasets:**
- Stanford Visual Genome
- GQA, VizWiz, TextCaps
- Localized Narratives
---
## 📫 Contact
- **Author:** Mohammad Mohamed Said Aly Amin
- **Institution:** Asia Pacific University
- **Issues:** [Model Discussions](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned/discussions)
---
**Made with ❤️ for accessibility and inclusion**
[Model Page](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned)
[License](LICENSE)
*Empowering independence through AI-powered vision assistance*