---
language: en
license: apache-2.0
base_model: HuggingFaceTB/SmolVLM-500M-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- Vision
- Image-to-text
- Multimodal
- Vision-language-model
- Navigation
- Accessibility
- Assistive-technology
- Blind-assistance
- Fine-tuned
- SmolVLM
---

# SmolVLM Navigation Assistant 🦯
[![Model](https://img.shields.io/badge/Model-SmolVLM--500M-blue)](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](https://www.apache.org/licenses/LICENSE-2.0)
[![BERTScore](https://img.shields.io/badge/BERTScore-91.6%25-brightgreen)](https://huggingface.co/metrics/bertscore)

**Fine-tuned vision-language model for blind navigation assistance**

[Quick Start](#-quick-start) • [Performance](#-performance) • [Usage](#-usage-examples) • [Training](#-training-details) • [Citation](#-citation)
---

## 📋 Overview

Fine-tuned [SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct) for **vision-based navigation assistance** for blind and visually impaired users. Developed as a Master's thesis project at Asia Pacific University.

**Key Results:**
- 🎯 **91.6% BERTScore** (semantic accuracy)
- 🚀 **+3483% BLEU-1** improvement over baseline
- ⚡ **0.5-1s inference** time
- 💾 **2-4GB VRAM** requirement
- 📊 **p < 0.001** statistical significance

**Author:** Mohammad Mohamed Said Aly Amin
**Supervisor:** Dr. Raheem Mafas
**Institution:** Asia Pacific University
**Program:** Master's in Data Science & Business Analytics

---

## ✨ Features

### Three Navigation Modes

| Mode | Purpose | Response Length | Example Query |
|------|---------|-----------------|---------------|
| **🎯 FOCUSED** | Spatial relationships | 5-15 words | "Is there a chair to my left?" |
| **🌍 SCENE** | Environment description | 30-50 words | "Describe what's in front of me" |
| **📝 OCR** | Text recognition | Variable | "What does the sign say?" |

### Technical Highlights

- ✅ Real-time inference on consumer GPUs
- ✅ Low memory footprint (2-4GB VRAM)
- ✅ Statistically validated improvements
- ✅ Production-ready deployment
- ✅ Efficient QLoRA fine-tuning (1.84% of parameters trainable)

---

## 📊 Performance

### Evaluation Results (500 samples)

| Metric | Fine-tuned | Baseline | Improvement |
|--------|-----------|----------|-------------|
| **BLEU** | 0.234 | - | - |
| **BLEU-1** | 24.89 | 0.69 | **+3483%** 🚀 |
| **ROUGE-1** | 55.72 | 13.66 | **+308%** |
| **ROUGE-2** | 32.46 | 2.69 | **+1105%** |
| **ROUGE-L** | 48.27 | 11.82 | **+308%** |
| **BERTScore** | 91.63 | 85.60 | **+7.04%** |
| **Length Ratio** | 0.93 | - | Nearly perfect |

**Statistical Validation:** All improvements significant at p < 0.001 (paired t-test, n = 500)

### Loss Convergence

- Training loss: initial **0.29** → final **0.12** (58% reduction)
- Validation loss: initial **0.24** → final **0.13** (46% reduction)

---

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch pillow accelerate
```

### Basic Usage

```python
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    trust_remote_code=True
)

# Prepare input
image = Image.open("scene.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What do you see?"}
    ]
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # follow the model's device placement

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )

response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```

---

## 💡 Usage Examples

### FOCUSED: Spatial Queries

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is there a chair to the left of the table?"}
    ]
}]
# Output: "Yes, there is a chair to the left of the table."
```

### SCENE: Environment Description

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the scene in front of me."}
    ]
}]
# Output: "The scene shows a living room with a brown sofa on the left,
# a wooden coffee table in the center, and a TV on the wall..."
```

### OCR: Text Reading

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What text is on the sign?"}
    ]
}]
# Output: "The sign says 'EXIT' in red letters."
```

### Memory Optimization

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization (reduces to ~2GB VRAM; requires the bitsandbytes package)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Batch processing
inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[[img1], [img2], [img3]],
    return_tensors="pt",
    padding=True
)
```

---

## 🛠️ Training Details

### Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| **Base Model** | SmolVLM-500M-Instruct | 500M parameters |
| **Method** | QLoRA | 4-bit quantization |
| **Trainable Params** | 42M (1.84%) | LoRA adapters only |
| **LoRA Rank** | 32 | Adapter dimension |
| **LoRA Alpha** | 64 | Scaling factor |
| **Epochs** | 3 | Full data passes |
| **Batch Size** | 1 (effective: 16) | With gradient accumulation |
| **Learning Rate** | 2e-5 | AdamW optimizer |
| **Precision** | BF16 | Mixed precision |
| **GPU** | RTX 5070 Ti 16GB | Training hardware |
| **Training Time** | ~20 hours | Total duration |
| **Peak VRAM** | 7-9GB | During training |

### Dataset

**Size:** 10,000+ samples across three modes

**Sources:**
- GQA Enhanced (spatial reasoning)
- Localized Narratives (scene descriptions)
- Visual Genome (object relationships)
- TextCaps (text-in-image)
- VizWiz (accessibility focus)

---

## 💻 Hardware Requirements

| Use Case | GPU | RAM | Storage |
|----------|-----|-----|---------|
| **Inference** | 4GB+ VRAM | 8GB | 5GB |
| **Training** | 16GB VRAM | 32GB | 50GB |

**Recommended for Inference:** RTX 3060+ or equivalent

---

## ⚠️ Limitations

1. **Scope:** Optimized for navigation; may underperform on general VQA
2. **Image Quality:** Best with well-lit, clear images
3. **OCR:** Works best with printed text; struggles with handwriting
4. **Speed:** Requires GPU for real-time use (CPU: 10-20s/image)
5. **Language:** English only

### Safety Notice

⚠️ **This is an assistive tool, not a replacement for traditional navigation aids.** Users should:
- Combine with a cane, guide dog, or other mobility aids
- Exercise human judgment
- Test in safe environments first
- Be aware of potential errors

---

## 🎓 Model Card

### Model Details

- **Type:** Vision-Language Model (Idefics3)
- **Parameters:** 500M total, 42M trainable (1.84%)
- **Input:** Image + Text
- **Output:** Text
- **License:** Apache 2.0

### Intended Use

**Primary:**
- Navigation assistance for blind/visually impaired users
- Spatial reasoning and object localization
- Scene understanding and description
- Text recognition in natural environments
- Accessibility research

**Out of Scope:**
- Medical diagnosis
- Autonomous navigation without human oversight
- Real-time video processing
- General-purpose VQA (use the base model)

### Ethical Considerations

- Designed to enhance independence, not replace human judgment
- May carry biases from English-only training data
- Requires validation in real-world scenarios
- Processes images locally (no data collection)

---

## 📖 Citation

```bibtex
@misc{alqahtani2025smolvlm_navigation,
  author       = {Alqahtani, Muhammad Said},
  title        = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}}
}

@mastersthesis{alqahtani2025thesis,
  author  = {Alqahtani, Muhammad Said},
  title   = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school  = {Asia Pacific University of Technology and Innovation},
  year    = {2025},
  address = {Kuala Lumpur, Malaysia}
}
```

---

## 🙏 Acknowledgments

**Supervision:**
- Dr. Raheem Mafas (Research Supervisor)
- Asia Pacific University

**Technical:**
- HuggingFace Team (base model & libraries)
- Unsloth (training framework)
- NVIDIA (GPU hardware)

**Datasets:**
- Stanford Visual Genome
- GQA, VizWiz, TextCaps
- Localized Narratives

---

## 📫 Contact

**Author:** Mohammad Mohamed Said Aly Amin
**Institution:** Asia Pacific University
**Issues:** [Model Discussions](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned/discussions)

---
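## 🧪 Environment Check

Before downloading the model, it can help to confirm that the packages from the Installation section are importable. Below is a minimal, stdlib-only sketch (the `missing_packages` helper is illustrative, not part of this repository; note that `pillow` installs under the module name `PIL`):

```python
import importlib.util

# pip package name -> importable module name (from the Installation section)
REQUIRED = {"transformers": "transformers", "torch": "torch", "pillow": "PIL"}

def missing_packages():
    """Return the pip names of required packages that are not importable."""
    return [pip_name for pip_name, module in REQUIRED.items()
            if importlib.util.find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All required packages found.")
```

This only checks importability, not versions; `accelerate` is omitted since it is needed only for `device_map="auto"` placement.

---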
**Made with ❤️ for accessibility and inclusion**

[![HuggingFace](https://img.shields.io/badge/🤗-Model-yellow)](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

*Empowering independence through AI-powered vision assistance*