|
|
--- |
|
|
language: en |
|
|
license: apache-2.0 |
|
|
base_model: HuggingFaceTB/SmolVLM-500M-Instruct |
|
|
library_name: transformers |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- Vision |
|
|
- Image-to-text |
|
|
- Multimodal |
|
|
- Vision-language-model |
|
|
- Navigation |
|
|
- Accessibility |
|
|
- Assistive-technology |
|
|
- Blind-assistance |
|
|
- Fine-tuned |
|
|
- SmolVLM |
|
|
--- |
|
|
|
|
|
# SmolVLM Navigation Assistant 🦯 |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[Base model: SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct)
|
|
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
|
[Metric: BERTScore](https://huggingface.co/metrics/bertscore)
|
|
|
|
|
**Fine-tuned vision-language model for blind navigation assistance** |
|
|
|
|
|
[Quick Start](#-quick-start) • [Performance](#-performance) • [Usage](#-usage-examples) • [Training](#-training-details) • [Citation](#-citation)
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## 📋 Overview |
|
|
|
|
|
A fine-tuned version of [SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct) for **vision-based navigation assistance** for blind and visually impaired users, developed as a Master's thesis project at Asia Pacific University.
|
|
|
|
|
**Key Results:** |
|
|
- 🎯 **91.6% BERTScore** (semantic accuracy) |
|
|
- 🚀 **+3483% BLEU-1** improvement over baseline |
|
|
- ⚡ **0.5-1s inference** time |
|
|
- 💾 **2-4GB VRAM** requirement |
|
|
- 📊 **p < 0.001** statistical significance |
|
|
|
|
|
**Author:** Mohammad Mohamed Said Aly Amin |
|
|
**Supervisor:** Dr. Raheem Mafas |
|
|
**Institution:** Asia Pacific University |
|
|
**Program:** Master's in Data Science & Business Analytics |
|
|
|
|
|
--- |
|
|
|
|
|
## ✨ Features |
|
|
|
|
|
### Three Navigation Modes |
|
|
|
|
|
| Mode | Purpose | Response Length | Example Query | |
|
|
|------|---------|-----------------|---------------| |
|
|
| **🎯 FOCUSED** | Spatial relationships | 5-15 words | "Is there a chair to my left?" | |
|
|
| **🌍 SCENE** | Environment description | 30-50 words | "Describe what's in front of me" | |
|
|
| **📝 OCR** | Text recognition | Variable | "What does the sign say?" | |
|
|
|
|
|
### Technical Highlights |
|
|
|
|
|
- ✅ Real-time inference on consumer GPUs |
|
|
- ✅ Low memory footprint (2-4GB VRAM) |
|
|
- ✅ Statistically validated improvements |
|
|
- ✅ Production-ready deployment |
|
|
- ✅ Efficient QLoRA fine-tuning (only 1.84% of parameters trained)
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Performance |
|
|
|
|
|
### Evaluation Results (500 samples) |
|
|
|
|
|
| Metric | Fine-tuned | Baseline | Improvement | |
|
|
|--------|-----------|----------|-------------| |
|
|
| **BLEU** | 0.234 | - | - | |
|
|
| **BLEU-1** | 24.89 | 0.69 | **+3483%** 🚀 | |
|
|
| **ROUGE-1** | 55.72 | 13.66 | **+308%** | |
|
|
| **ROUGE-2** | 32.46 | 2.69 | **+1105%** | |
|
|
| **ROUGE-L** | 48.27 | 11.82 | **+308%** | |
|
|
| **BERTScore** | 91.63 | 85.60 | **+7.04%** | |
|
|
| **Length Ratio** | 0.93 | - | Close to ideal (1.0) |
|
|
|
|
|
**Statistical Validation:** All improvements significant at p < 0.001 (paired t-test, n=500) |
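
The numbers above come from the thesis evaluation. As a rough illustration of how such a comparison could be reproduced (the metric library and function names below are assumptions for a sketch, not the exact evaluation script), per-sample scores from the fine-tuned and baseline models can be compared with a paired t-test:

```python
# Sketch only: paired significance test on per-sample BERTScore F1.
# Assumes lists of generated answers from both systems plus shared references.
from bert_score import score as bert_score  # pip install bert-score
from scipy.stats import ttest_rel           # pip install scipy

def compare_models(finetuned_preds, baseline_preds, references):
    # Per-sample BERTScore F1 for each system against the same references
    _, _, f1_ft = bert_score(finetuned_preds, references, lang="en")
    _, _, f1_base = bert_score(baseline_preds, references, lang="en")

    # Paired t-test: the same evaluation samples are scored by both systems
    t_stat, p_value = ttest_rel(f1_ft.tolist(), f1_base.tolist())
    return f1_ft.mean().item(), f1_base.mean().item(), p_value
```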
|
|
|
|
|
### Loss Convergence |
|
|
|
|
|
- Initial Training Loss: **0.29** → Final: **0.12** (58% reduction) |
|
|
- Initial Val Loss: **0.24** → Final: **0.13** (46% reduction) |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Quick Start |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch pillow accelerate |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
from transformers import Idefics3ForConditionalGeneration, AutoProcessor |
|
|
from PIL import Image |
|
|
import torch |
|
|
|
|
|
# Load model |
|
|
model = Idefics3ForConditionalGeneration.from_pretrained( |
|
|
"msaid1976/SmolVLM-Instruct-Navigation-FineTuned", |
|
|
torch_dtype=torch.float16, |
|
|
device_map="auto", |
|
|
trust_remote_code=True |
|
|
) |
|
|
processor = AutoProcessor.from_pretrained( |
|
|
"msaid1976/SmolVLM-Instruct-Navigation-FineTuned", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
# Prepare input |
|
|
image = Image.open("scene.jpg") |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "What do you see?"} |
|
|
] |
|
|
}] |
|
|
|
|
|
# Generate |
|
|
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) |
|
|
inputs = processor(text=prompt, images=[image], return_tensors="pt") |
|
|
inputs = {k: v.to("cuda") for k, v in inputs.items()} |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=150, |
|
|
do_sample=False, |
|
|
pad_token_id=processor.tokenizer.eos_token_id, |
|
|
eos_token_id=processor.tokenizer.eos_token_id |
|
|
) |
|
|
|
|
|
response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 💡 Usage Examples |
|
|
|
|
|
### FOCUSED: Spatial Queries |
|
|
|
|
|
```python |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "Is there a chair to the left of the table?"} |
|
|
] |
|
|
}] |
|
|
# Output: "Yes, there is a chair to the left of the table." |
|
|
``` |
|
|
|
|
|
### SCENE: Environment Description |
|
|
|
|
|
```python |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "Describe the scene in front of me."} |
|
|
] |
|
|
}] |
|
|
# Output: "The scene shows a living room with a brown sofa on the left, |
|
|
# a wooden coffee table in the center, and a TV on the wall..." |
|
|
``` |
|
|
|
|
|
### OCR: Text Reading |
|
|
|
|
|
```python |
|
|
messages = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image"}, |
|
|
{"type": "text", "text": "What text is on the sign?"} |
|
|
] |
|
|
}] |
|
|
# Output: "The sign says 'EXIT' in red letters." |
|
|
``` |
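
Because response length differs sharply across the three modes, it can help to cap `max_new_tokens` per mode at inference time. The token budgets below are an illustrative assumption, not settings shipped with the model; the helper reuses `model` and `processor` from the Quick Start snippet:

```python
import torch

# Hypothetical per-mode generation budgets (illustrative only):
# short spatial answers for FOCUSED, longer text for SCENE and OCR.
MODE_MAX_NEW_TOKENS = {"FOCUSED": 40, "SCENE": 120, "OCR": 150}

def generate_for_mode(mode, inputs):
    """Greedy decoding with a token budget matched to the navigation mode."""
    with torch.no_grad():
        return model.generate(
            **inputs,
            max_new_tokens=MODE_MAX_NEW_TOKENS[mode],
            do_sample=False,
            pad_token_id=processor.tokenizer.eos_token_id,
        )
```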
|
|
|
|
|
### Memory Optimization |
|
|
|
|
|
```python |
|
|
# 8-bit quantization (reduces to ~2GB VRAM; requires bitsandbytes)
from transformers import BitsAndBytesConfig

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
|
|
|
|
|
# Batch processing |
|
|
inputs = processor( |
|
|
text=[prompt1, prompt2, prompt3], |
|
|
images=[[img1], [img2], [img3]], |
|
|
return_tensors="pt", |
|
|
padding=True |
|
|
) |
|
|
``` |
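
For even tighter memory budgets, 4-bit loading (the same quantization family used for QLoRA training) is another option. This is a sketch assuming `bitsandbytes` is installed, not a configuration validated in the thesis:

```python
import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration

# 4-bit NF4 quantization: further reduces VRAM at some cost in output quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=bnb_config,
    device_map="auto",
)
```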
|
|
|
|
|
--- |
|
|
|
|
|
## 🛠️ Training Details |
|
|
|
|
|
### Configuration |
|
|
|
|
|
| Parameter | Value | Description | |
|
|
|-----------|-------|-------------| |
|
|
| **Base Model** | SmolVLM-500M-Instruct | 500M parameters | |
|
|
| **Method** | QLoRA | 4-bit quantization | |
|
|
| **Trainable Params** | 42M (1.84%) | LoRA adapters only | |
|
|
| **LoRA Rank** | 32 | Adapter dimension | |
|
|
| **LoRA Alpha** | 64 | Scaling factor | |
|
|
| **Epochs** | 3 | Full data passes | |
|
|
| **Batch Size** | 1 (effective: 16) | With gradient accumulation | |
|
|
| **Learning Rate** | 2e-5 | AdamW optimizer | |
|
|
| **Precision** | BF16 | Mixed precision | |
|
|
| **GPU** | RTX 5070 Ti 16GB | Training hardware | |
|
|
| **Training Time** | ~20 hours | Total duration | |
|
|
| **Peak VRAM** | 7-9GB | During training | |
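
The hyperparameters above map onto a standard PEFT/transformers setup. The sketch below mirrors the reported configuration (rank 32, alpha 64, 4-bit base model, BF16, effective batch size 16); the target modules, dropout, and exact trainer wiring are assumptions, since the full training script is not published here:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit (QLoRA) loading of the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters: rank 32, alpha 64 (~42M trainable parameters, 1.84%)
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,                                         # assumption: not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption: not reported
    task_type="CAUSAL_LM",
)

# Effective batch size 16 = per-device batch 1 x 16 gradient-accumulation steps
training_args = TrainingArguments(
    output_dir="smolvlm-navigation-qlora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    bf16=True,
    optim="adamw_torch",
)
```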
|
|
|
|
|
### Dataset |
|
|
|
|
|
**Size:** 10,000+ samples across three modes |
|
|
|
|
|
**Sources:** |
|
|
- GQA Enhanced (spatial reasoning) |
|
|
- Localized Narratives (scene descriptions) |
|
|
- Visual Genome (object relationships) |
|
|
- TextCaps (text-in-image) |
|
|
- VizWiz (accessibility focus) |
|
|
|
|
|
--- |
|
|
|
|
|
## 💻 Hardware Requirements |
|
|
|
|
|
| Use Case | GPU | RAM | Storage | |
|
|
|----------|-----|-----|---------| |
|
|
| **Inference** | 4GB+ VRAM | 8GB | 5GB | |
|
|
| **Training** | 16GB VRAM | 32GB | 50GB | |
|
|
|
|
|
**Recommended for Inference:** RTX 3060+ or equivalent |
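
A minimal device check can pick the right precision automatically; the snippet below simply falls back to CPU (expect the 10-20 s/image latency noted under Limitations) when no GPU is available:

```python
import torch
from transformers import Idefics3ForConditionalGeneration

# Use GPU + fp16 when available, otherwise CPU + fp32 (much slower inference).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    torch_dtype=dtype,
).to(device)
```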
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Limitations |
|
|
|
|
|
1. **Scope:** Optimized for navigation; may underperform on general VQA |
|
|
2. **Image Quality:** Best with well-lit, clear images |
|
|
3. **OCR:** Works best with printed text; struggles with handwriting |
|
|
4. **Speed:** Requires GPU for real-time use (CPU: 10-20s/image) |
|
|
5. **Language:** English only |
|
|
|
|
|
### Safety Notice |
|
|
|
|
|
⚠️ **This is an assistive tool, not a replacement for traditional navigation aids.** Users should: |
|
|
- Combine with cane, guide dog, or other mobility aids |
|
|
- Exercise human judgment |
|
|
- Test in safe environments first |
|
|
- Be aware of potential errors |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎓 Model Card |
|
|
|
|
|
### Model Details |
|
|
|
|
|
- **Type:** Vision-Language Model (Idefics3) |
|
|
- **Parameters:** 500M total, 42M trainable (1.84%) |
|
|
- **Input:** Image + Text |
|
|
- **Output:** Text |
|
|
- **License:** Apache 2.0 |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**Primary:** |
|
|
- Navigation assistance for blind/visually impaired |
|
|
- Spatial reasoning and object localization |
|
|
- Scene understanding and description |
|
|
- Text recognition in natural environments |
|
|
- Accessibility research |
|
|
|
|
|
**Out of Scope:** |
|
|
- Medical diagnosis |
|
|
- Autonomous navigation without human oversight |
|
|
- Real-time video processing |
|
|
- General-purpose VQA (use base model) |
|
|
|
|
|
### Ethical Considerations |
|
|
|
|
|
- Designed to enhance independence, not replace human judgment |
|
|
- May have biases from English-only training data |
|
|
- Requires validation in real-world scenarios |
|
|
- Processes images locally (no data collection) |
|
|
|
|
|
--- |
|
|
|
|
|
## 📖 Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{alqahtani2025smolvlm_navigation, |
|
|
author = {Alqahtani, Muhammad Said}, |
|
|
title = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation}, |
|
|
year = {2025}, |
|
|
publisher = {HuggingFace}, |
|
|
howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}} |
|
|
} |
|
|
|
|
|
@mastersthesis{alqahtani2025thesis, |
|
|
author = {Alqahtani, Muhammad Said}, |
|
|
title = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired}, |
|
|
school = {Asia Pacific University of Technology and Innovation}, |
|
|
year = {2025}, |
|
|
address = {Kuala Lumpur, Malaysia} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
**Supervision:** |
|
|
- Dr. Raheem Mafas (Research Supervisor) |
|
|
- Asia Pacific University |
|
|
|
|
|
**Technical:** |
|
|
- HuggingFace Team (base model & libraries) |
|
|
- Unsloth (training framework) |
|
|
- NVIDIA (GPU hardware) |
|
|
|
|
|
**Datasets:** |
|
|
- Stanford Visual Genome |
|
|
- GQA, VizWiz, TextCaps |
|
|
- Localized Narratives |
|
|
|
|
|
--- |
|
|
|
|
|
## 📫 Contact |
|
|
|
|
|
**Author:** Mohammad Mohamed Said Aly Amin |
|
|
**Institution:** Asia Pacific University |
|
|
**Issues:** [Model Discussions](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned/discussions) |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Made with ❤️ for accessibility and inclusion** |
|
|
|
|
|
[Model on Hugging Face](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned)
|
|
[License: Apache 2.0](LICENSE)
|
|
|
|
|
*Empowering independence through AI-powered vision assistance* |
|
|
|
|
|
</div> |