SmolVLM Navigation Assistant
Fine-tuned vision-language model for blind navigation assistance
Quick Start • Performance • Usage • Training • Citation
Overview
This model is SmolVLM-500M-Instruct fine-tuned for vision-based navigation assistance for blind and visually impaired users. It was developed as a Master's thesis project at Asia Pacific University.
Key Results:
- 91.6% BERTScore (semantic accuracy)
- +3483% BLEU-1 improvement over baseline
- 0.5-1s inference time
- 2-4GB VRAM requirement
- p < 0.001 statistical significance
Author: Mohammad Mohamed Said Aly Amin
Supervisor: Dr. Raheem Mafas
Institution: Asia Pacific University
Program: Master's in Data Science & Business Analytics
Features
Three Navigation Modes
| Mode | Purpose | Response Length | Example Query |
|---|---|---|---|
| FOCUSED | Spatial relationships | 5-15 words | "Is there a chair to my left?" |
| SCENE | Environment description | 30-50 words | "Describe what's in front of me" |
| OCR | Text recognition | Variable | "What does the sign say?" |
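The three modes differ mainly in expected response length, so one way to wire them into an application is a small lookup of per-mode generation settings. The sketch below is illustrative only: `ModeSettings`, `MODE_SETTINGS`, `settings_for`, and the token budgets are assumptions (roughly two tokens of headroom per word), not values taken from the model's training or evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeSettings:
    purpose: str
    max_new_tokens: int  # assumed budget, not a value from the model card


# Hypothetical budgets: about 2 tokens per word of the upper response length.
MODE_SETTINGS = {
    "FOCUSED": ModeSettings("Spatial relationships", 32),    # 5-15 words
    "SCENE": ModeSettings("Environment description", 100),   # 30-50 words
    "OCR": ModeSettings("Text recognition", 150),             # variable length
}


def settings_for(mode: str) -> ModeSettings:
    """Return the settings for a mode name, case-insensitively."""
    key = mode.strip().upper()
    if key not in MODE_SETTINGS:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(MODE_SETTINGS)}")
    return MODE_SETTINGS[key]
```

The returned `max_new_tokens` could then be passed to `model.generate` in place of a fixed value.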
Technical Highlights
- Real-time inference on consumer GPUs
- Low memory footprint (2-4GB VRAM)
- Statistically validated improvements
- Production-ready deployment
- Parameter-efficient QLoRA fine-tuning (1.84% of parameters trainable)
Performance
Evaluation Results (500 samples)
| Metric | Fine-tuned | Baseline | Improvement |
|---|---|---|---|
| BLEU | 0.234 | - | - |
| BLEU-1 | 24.89 | 0.69 | +3483% |
| ROUGE-1 | 55.72 | 13.66 | +308% |
| ROUGE-2 | 32.46 | 2.69 | +1105% |
| ROUGE-L | 48.27 | 11.82 | +308% |
| BERTScore | 91.63 | 85.60 | +7.04% |
| Length Ratio | 0.93 | - | Nearly perfect |
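The Improvement column reads as a relative change over the baseline. As a minimal sketch (the function name `relative_improvement` is an assumption, not part of any released evaluation code), the figures can be recomputed from the table. Because the table values are rounded, a recomputation can differ slightly from the reported percentages, especially for very small baselines such as BLEU-1.

```python
def relative_improvement(fine_tuned: float, baseline: float) -> float:
    """Percentage change of the fine-tuned score relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (fine_tuned - baseline) / baseline * 100.0


# (fine-tuned, baseline) pairs copied from the table above.
rows = {
    "ROUGE-1": (55.72, 13.66),
    "ROUGE-L": (48.27, 11.82),
    "BERTScore": (91.63, 85.60),
}

for name, (ft, base) in rows.items():
    print(f"{name}: {relative_improvement(ft, base):+.2f}%")
```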
Statistical Validation: All improvements significant at p < 0.001 (paired t-test, n=500)
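For readers unfamiliar with the paired t-test, the sketch below computes its test statistic from per-sample scores using only the standard library. The scores are made-up numbers for illustration, not evaluation data, and `paired_t_statistic` is a hypothetical helper. The p-value would come from a Student t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` does both steps in one call if SciPy is available.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Made-up per-sample scores, for illustration only.
fine_tuned = [0.9, 0.8, 0.7, 0.9]
baseline = [0.5, 0.6, 0.4, 0.5]
t = paired_t_statistic(fine_tuned, baseline)
print(f"t = {t:.2f} with {len(fine_tuned) - 1} degrees of freedom")
```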
Loss Convergence
- Initial Training Loss: 0.29 → Final: 0.12 (58% reduction)
- Initial Val Loss: 0.24 → Final: 0.13 (46% reduction)
Quick Start
Installation
pip install transformers torch pillow accelerate
Basic Usage
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    trust_remote_code=True
)

# Prepare input (convert to RGB so grayscale or RGBA files also work)
image = Image.open("scene.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What do you see?"}
    ]
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
# Move tensors to wherever device_map="auto" placed the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, skipping the prompt
response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
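To check the quoted 0.5-1s inference time on your own hardware, a small timing helper can wrap the generation call. This is a sketch under assumptions: `time_calls` is a hypothetical helper, and the warm-up and repeat counts are arbitrary defaults.

```python
import statistics
import time


def time_calls(fn, warmup=1, repeats=5):
    """Return the median wall-clock seconds of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as CUDA initialization
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

With the model loaded as above, it could be used as `time_calls(lambda: model.generate(**inputs, max_new_tokens=150, do_sample=False))`. On a GPU, calling `torch.cuda.synchronize()` at the end of the timed function is a cautious way to make sure all queued work has finished before the clock stops.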
Usage Examples
FOCUSED: Spatial Queries
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is there a chair to the left of the table?"}
    ]
}]
# Output: "Yes, there is a chair to the left of the table."
SCENE: Environment Description
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the scene in front of me."}
    ]
}]
# Output: "The scene shows a living room with a brown sofa on the left,
# a wooden coffee table in the center, and a TV on the wall..."
OCR: Text Reading
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What text is on the sign?"}
    ]
}]
# Output: "The sign says 'EXIT' in red letters."
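The three examples share the same message structure and differ only in the question text, so the structure can be factored into a helper. The sketch below is one possible way to do it; `build_messages` is a hypothetical name and is not part of `transformers` or of this model's repository.

```python
def build_messages(question: str, num_images: int = 1) -> list:
    """Build a single-turn chat message list with image placeholders first."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # One {"type": "image"} placeholder per image, followed by the text query.
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_messages("What text is on the sign?")
```

The result can be passed to `processor.apply_chat_template` exactly like the hand-written lists above.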
Memory Optimization
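# 8-bit quantization (reduces to ~2GB VRAM); requires: pip install bitsandbytes
from transformers import BitsAndBytesConfig

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Batch processing
# (prompt1..prompt3 and img1..img3 are placeholders for your own
# chat-templated prompts and PIL images)
inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[[img1], [img2], [img3]],
    return_tensors="pt",
    padding=True
)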
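The batch example above handles three prompts in a single call. For longer lists, splitting the work into fixed-size chunks keeps peak VRAM bounded. The sketch below assumes a hypothetical `batched` helper; the batch size is something to tune per GPU and is not a recommendation from the model authors.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Example with placeholder (prompt, image) pairs; real code would pass
# chat-templated prompts and PIL images instead of strings.
pairs = [("p1", "img1"), ("p2", "img2"), ("p3", "img3"), ("p4", "img4"), ("p5", "img5")]
for chunk in batched(pairs, 2):
    prompts = [p for p, _ in chunk]
    images = [[img] for _, img in chunk]  # one image list per prompt
```

Each `prompts` and `images` pair has the shape expected by the `processor(text=..., images=..., padding=True)` call shown above.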
Training Details
Configuration
| Parameter | Value | Description |
|---|---|---|
| Base Model | SmolVLM-500M-Instruct | 500M parameters |
| Method | QLoRA | 4-bit quantization |
| Trainable Params | 42M (1.84%) | LoRA adapters only |
| LoRA Rank | 32 | Adapter dimension |
| LoRA Alpha | 64 | Scaling factor |
| Epochs | 3 | Full data passes |
| Batch Size | 1 (effective: 16) | With gradient accumulation |
| Learning Rate | 2e-5 | AdamW optimizer |
| Precision | BF16 | Mixed precision |
| GPU | RTX 5070 Ti 16GB | Training hardware |
| Training Time | ~20 hours | Total duration |
| Peak VRAM | 7-9GB | During training |
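Two rows of the table imply derived quantities: the effective batch size of 16 with a per-device batch of 1 implies 16 gradient accumulation steps on a single GPU, and in the standard LoRA formulation the adapter update is scaled by alpha / rank. The sketch below only does that arithmetic; `TrainConfig` is a hypothetical container, not the actual training script, and the single-GPU assumption comes from the hardware row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    per_device_batch_size: int = 1
    effective_batch_size: int = 16
    lora_rank: int = 32
    lora_alpha: int = 64

    @property
    def gradient_accumulation_steps(self) -> int:
        # Assumes a single GPU, as listed in the table.
        return self.effective_batch_size // self.per_device_batch_size

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scaling factor: alpha / rank.
        return self.lora_alpha / self.lora_rank


cfg = TrainConfig()
print(cfg.gradient_accumulation_steps, cfg.lora_scale)
```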
Dataset
Size: 10,000+ samples across three modes
Sources:
- GQA Enhanced (spatial reasoning)
- Localized Narratives (scene descriptions)
- Visual Genome (object relationships)
- TextCaps (text-in-image)
- VizWiz (accessibility focus)
Hardware Requirements
| Use Case | GPU | RAM | Storage |
|---|---|---|---|
| Inference | 4GB+ VRAM | 8GB | 5GB |
| Training | 16GB VRAM | 32GB | 50GB |
Recommended for Inference: RTX 3060+ or equivalent
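As a rough cross-check of the inference figures, the memory taken by the weights alone can be estimated from the parameter count and the bytes per parameter. This is a back-of-the-envelope sketch: `weight_memory_gib` is a hypothetical helper, and it ignores activations, the KV cache, image features, and CUDA context overhead, which is likely why real-world usage is higher than the weights-only number.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype {dtype!r}")
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# 500M parameters, as stated for the base model.
for dtype in ("fp16", "int8"):
    print(f"{dtype}: {weight_memory_gib(500e6, dtype):.2f} GiB")
```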
Limitations
- Scope: Optimized for navigation; may underperform on general VQA
- Image Quality: Best with well-lit, clear images
- OCR: Works best with printed text; struggles with handwriting
- Speed: Requires GPU for real-time use (CPU: 10-20s/image)
- Language: English only
Safety Notice
⚠️ This is an assistive tool, not a replacement for traditional navigation aids. Users should:
- Combine with cane, guide dog, or other mobility aids
- Exercise human judgment
- Test in safe environments first
- Be aware of potential errors
Model Card
Model Details
- Type: Vision-Language Model (Idefics3)
- Parameters: 500M total, 42M trainable (1.84%)
- Input: Image + Text
- Output: Text
- License: Apache 2.0
Intended Use
Primary:
- Navigation assistance for blind/visually impaired
- Spatial reasoning and object localization
- Scene understanding and description
- Text recognition in natural environments
- Accessibility research
Out of Scope:
- Medical diagnosis
- Autonomous navigation without human oversight
- Real-time video processing
- General-purpose VQA (use base model)
Ethical Considerations
- Designed to enhance independence, not replace human judgment
- May have biases from English-only training data
- Requires validation in real-world scenarios
- Processes images locally (no data collection)
Citation
@misc{alqahtani2025smolvlm_navigation,
  author = {Alqahtani, Muhammad Said},
  title = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}}
}
@mastersthesis{alqahtani2025thesis,
  author = {Alqahtani, Muhammad Said},
  title = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school = {Asia Pacific University of Technology and Innovation},
  year = {2025},
  address = {Kuala Lumpur, Malaysia}
}
Acknowledgments
Supervision:
- Dr. Raheem Mafas (Research Supervisor)
- Asia Pacific University
Technical:
- HuggingFace Team (base model & libraries)
- Unsloth (training framework)
- NVIDIA (GPU hardware)
Datasets:
- Stanford Visual Genome
- GQA, VizWiz, TextCaps
- Localized Narratives
Contact
Author: Mohammad Mohamed Said Aly Amin
Institution: Asia Pacific University
Issues: Model Discussions