SmolVLM Navigation Assistant 🦯


Fine-tuned vision-language model for blind navigation assistance

Quick Start • Performance • Usage • Training • Citation


📋 Overview

SmolVLM-500M-Instruct fine-tuned to provide vision-based navigation assistance to blind and visually impaired users. Developed as a Master's thesis project at Asia Pacific University.

Key Results:

  • 🎯 91.6% BERTScore (semantic accuracy)
  • 🚀 +3483% BLEU-1 improvement over baseline
  • ⚡ 0.5-1s inference time
  • 💾 2-4GB VRAM requirement
  • 📊 p < 0.001 statistical significance

Author: Mohammad Mohamed Said Aly Amin
Supervisor: Dr. Raheem Mafas
Institution: Asia Pacific University
Program: Master's in Data Science & Business Analytics


✨ Features

Three Navigation Modes

| Mode | Purpose | Response Length | Example Query |
|------|---------|-----------------|---------------|
| 🎯 FOCUSED | Spatial relationships | 5-15 words | "Is there a chair to my left?" |
| 🌍 SCENE | Environment description | 30-50 words | "Describe what's in front of me" |
| 📝 OCR | Text recognition | Variable | "What does the sign say?" |

Technical Highlights

  • ✅ Real-time inference on consumer GPUs
  • ✅ Low memory footprint (2-4GB VRAM)
  • ✅ Statistically validated improvements
  • ✅ Production-ready deployment
  • ✅ QLoRA efficient fine-tuning (1.84% parameters)

📊 Performance

Evaluation Results (500 samples)

| Metric | Fine-tuned | Baseline | Improvement |
|--------|------------|----------|-------------|
| BLEU | 0.234 | - | - |
| BLEU-1 | 24.89 | 0.69 | +3483% 🚀 |
| ROUGE-1 | 55.72 | 13.66 | +308% |
| ROUGE-2 | 32.46 | 2.69 | +1105% |
| ROUGE-L | 48.27 | 11.82 | +308% |
| BERTScore | 91.63 | 85.60 | +7.04% |
| Length Ratio | 0.93 | - | Nearly perfect |

Statistical Validation: All improvements significant at p < 0.001 (paired t-test, n=500)
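The paired t-test behind that claim compares per-sample scores from the two models. A minimal pure-Python sketch (synthetic scores shown for illustration; the evaluation used n=500 real samples):

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test over per-sample metric scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Synthetic per-sample BERTScores: fine-tuned consistently above baseline.
fine_tuned = [0.92, 0.90, 0.93, 0.91, 0.94, 0.89, 0.92, 0.90]
baseline = [0.85, 0.84, 0.87, 0.86, 0.85, 0.83, 0.86, 0.84]
t = paired_t_statistic(fine_tuned, baseline)  # large positive t
```

At n=500, a consistently positive per-sample difference of this kind pushes the t statistic far past the p < 0.001 threshold.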

Loss Convergence

  • Initial Training Loss: 0.29 → Final: 0.12 (58% reduction)
  • Initial Val Loss: 0.24 → Final: 0.13 (46% reduction)

🚀 Quick Start

Installation

```bash
pip install transformers torch pillow accelerate
```

Basic Usage

```python
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    trust_remote_code=True
)

# Prepare input
image = Image.open("scene.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What do you see?"}
    ]
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # follow device_map placement

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )

# Decode only the newly generated tokens (skip the prompt)
response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```

💡 Usage Examples

FOCUSED: Spatial Queries

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is there a chair to the left of the table?"}
    ]
}]
# Output: "Yes, there is a chair to the left of the table."
```

SCENE: Environment Description

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the scene in front of me."}
    ]
}]
# Output: "The scene shows a living room with a brown sofa on the left,
# a wooden coffee table in the center, and a TV on the wall..."
```

OCR: Text Reading

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What text is on the sign?"}
    ]
}]
# Output: "The sign says 'EXIT' in red letters."
```

Memory Optimization

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization (reduces to ~2GB VRAM); the bare load_in_8bit= argument
# is deprecated in recent transformers versions in favor of quantization_config
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Batch processing
inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[[img1], [img2], [img3]],
    return_tensors="pt",
    padding=True
)
```
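When decoding batched outputs, each row of `outputs` still begins with that sample's prompt tokens, so a per-sample prefix must be skipped before decoding. A pure-Python stand-in for the tensor slicing (lists of token ids instead of tensors; real code would take the prompt lengths from the attention mask):

```python
def trim_prompts(output_rows, prompt_lengths):
    """Drop each sample's prompt tokens so only the generated tail is decoded."""
    return [row[n:] for row, n in zip(output_rows, prompt_lengths)]

# Token-id stand-ins: two samples with prompts of length 3 and 2.
batch_outputs = [[101, 7, 8, 42, 43], [101, 7, 55, 56, 57]]
generated = trim_prompts(batch_outputs, [3, 2])  # [[42, 43], [55, 56, 57]]
```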

πŸ› οΈ Training Details

Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| Base Model | SmolVLM-500M-Instruct | 500M parameters |
| Method | QLoRA | 4-bit quantization |
| Trainable Params | 42M (1.84%) | LoRA adapters only |
| LoRA Rank | 32 | Adapter dimension |
| LoRA Alpha | 64 | Scaling factor |
| Epochs | 3 | Full data passes |
| Batch Size | 1 (effective: 16) | With gradient accumulation |
| Learning Rate | 2e-5 | AdamW optimizer |
| Precision | BF16 | Mixed precision |
| GPU | RTX 5070 Ti 16GB | Training hardware |
| Training Time | ~20 hours | Total duration |
| Peak VRAM | 7-9GB | During training |
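Two of the table's values are derived rather than set directly: the effective batch size comes from gradient accumulation, and the rank/alpha pair fixes the LoRA scaling factor alpha/r (variable names below are illustrative):

```python
# Derived values implied by the configuration table.
micro_batch_size = 1
gradient_accumulation_steps = 16
effective_batch_size = micro_batch_size * gradient_accumulation_steps  # 16

lora_rank = 32
lora_alpha = 64
lora_scaling = lora_alpha / lora_rank  # adapter updates scaled by alpha/r = 2.0
```

Keeping alpha at twice the rank is a common QLoRA convention; the effective batch of 16 lets a single-sample micro-batch fit the 16GB training GPU.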

Dataset

Size: 10,000+ samples across three modes

Sources:

  • GQA Enhanced (spatial reasoning)
  • Localized Narratives (scene descriptions)
  • Visual Genome (object relationships)
  • TextCaps (text-in-image)
  • VizWiz (accessibility focus)

💻 Hardware Requirements

| Use Case | GPU | RAM | Storage |
|----------|-----|-----|---------|
| Inference | 4GB+ VRAM | 8GB | 5GB |
| Training | 16GB VRAM | 32GB | 50GB |

Recommended for Inference: RTX 3060+ or equivalent
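A back-of-envelope check (weights only; activations, image features, and KV cache account for the rest of the 2-4GB budget) shows why a 500M-parameter model fits comfortably:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights alone; excludes activations and KV cache."""
    return n_params * bytes_per_param / 1024**3

fp16_gb = weight_memory_gb(500e6, 2)  # float16: ~0.93 GB
int8_gb = weight_memory_gb(500e6, 1)  # 8-bit:   ~0.47 GB
```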


⚠️ Limitations

  1. Scope: Optimized for navigation; may underperform on general VQA
  2. Image Quality: Best with well-lit, clear images
  3. OCR: Works best with printed text; struggles with handwriting
  4. Speed: Requires GPU for real-time use (CPU: 10-20s/image)
  5. Language: English only

Safety Notice

⚠️ This is an assistive tool, not a replacement for traditional navigation aids. Users should:

  • Combine with cane, guide dog, or other mobility aids
  • Exercise human judgment
  • Test in safe environments first
  • Be aware of potential errors

🎓 Model Card

Model Details

  • Type: Vision-Language Model (Idefics3)
  • Parameters: 500M total, 42M trainable (1.84%)
  • Input: Image + Text
  • Output: Text
  • License: Apache 2.0

Intended Use

Primary:

  • Navigation assistance for blind/visually impaired
  • Spatial reasoning and object localization
  • Scene understanding and description
  • Text recognition in natural environments
  • Accessibility research

Out of Scope:

  • Medical diagnosis
  • Autonomous navigation without human oversight
  • Real-time video processing
  • General-purpose VQA (use base model)

Ethical Considerations

  • Designed to enhance independence, not replace human judgment
  • May have biases from English-only training data
  • Requires validation in real-world scenarios
  • Processes images locally (no data collection)

📖 Citation

```bibtex
@misc{alqahtani2025smolvlm_navigation,
  author = {Alqahtani, Muhammad Said},
  title = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}}
}
```

```bibtex
@mastersthesis{alqahtani2025thesis,
  author = {Alqahtani, Muhammad Said},
  title = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school = {Asia Pacific University of Technology and Innovation},
  year = {2025},
  address = {Kuala Lumpur, Malaysia}
}
```

πŸ™ Acknowledgments

Supervision:

  • Dr. Raheem Mafas (Research Supervisor)
  • Asia Pacific University

Technical:

  • HuggingFace Team (base model & libraries)
  • Unsloth (training framework)
  • NVIDIA (GPU hardware)

Datasets:

  • Stanford Visual Genome
  • GQA, VizWiz, TextCaps
  • Localized Narratives

📫 Contact

Author: Mohammad Mohamed Said Aly Amin
Institution: Asia Pacific University
Issues: Model Discussions


Made with ❤️ for accessibility and inclusion


Empowering independence through AI-powered vision assistance
