Qwen3-VL Navigation Assistant 🦯


Fine-tuned vision-language model for blind navigation assistance



📋 Overview

A fine-tuned version of Qwen3-VL-2B-Instruct providing vision-based navigation assistance for blind and visually impaired users. The model generates comprehensive, navigation-focused scene descriptions (25–60 words) optimized for text-to-speech delivery during active navigation. Developed as a Master's thesis project at Asia Pacific University.

Key Results:

  • 🎯 82.0% BERTScore (semantic accuracy)
  • 🚀 +2,031% BLEU improvement over baseline
  • 📏 Near-perfect length calibration (0.98 ratio to reference)
  • 📊 p < 0.001 statistical significance
  • 🏆 Grade: A− (4-level improvement from D baseline)

Author: Mohammad Mohamed Said Aly Amin
Supervisor: Dr. Raheem Mafas
Institution: Asia Pacific University of Technology and Innovation
Program: Master's in Data Science & Business Analytics


✨ Features

Comprehensive Navigation Descriptions

The model produces detailed, actionable descriptions (25–60 words) covering spatial relationships, obstacles, safety cues, and environmental context — optimized for blind navigation via text-to-speech.

| Capability | Description | Example Query |
|------------|-------------|---------------|
| 🌍 Scene Description | Comprehensive environment narratives | "Describe this scene for a blind person." |
| 🎯 Spatial Reasoning | Object relationships & positioning | "What obstacles should I be aware of?" |
| 🧭 Navigation Context | Pathways, directions & safety cues | "Describe the navigation context of this scene." |
| 📝 Text Recognition | Signs, labels & written information | "What text or signs can you see?" |
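
For client code that cycles through these capabilities, the example queries above can be kept in a small lookup. A minimal sketch — the dictionary keys are our own illustrative labels, not part of the model API; only the query strings come from the table:

```python
# Hypothetical capability→prompt lookup for a client app; keys are
# made-up labels, the query strings are the example queries above.
NAV_QUERIES = {
    "scene": "Describe this scene for a blind person.",
    "obstacles": "What obstacles should I be aware of?",
    "navigation": "Describe the navigation context of this scene.",
    "text": "What text or signs can you see?",
}

print(NAV_QUERIES["obstacles"])  # → What obstacles should I be aware of?
```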

Technical Highlights

  • ✅ Comprehensive 25–60 word navigation descriptions
  • ✅ Near-perfect length calibration (36.2 words vs 36.9 reference)
  • ✅ Real-time inference on consumer GPUs (2–3s/image)
  • ✅ Statistically validated improvements (p < 0.001)
  • ✅ Parameter-efficient fine-tuning via LoRA (~1.6% trainable parameters)
  • ✅ Part of multi-model navigation system (Qwen3-VL + Florence-2 + FastVLM)

📊 Performance

Evaluation Results (50 samples, paired comparison)

| Metric | Fine-tuned | Baseline | Improvement |
|--------|-----------|----------|-------------|
| BLEU | 0.164 | 0.008 | +2,031% 🚀 |
| ROUGE-1 | 39.30 | 19.37 | +103% |
| ROUGE-2 | 17.74 | 3.60 | +393% |
| ROUGE-L | 30.92 | 12.48 | +148% |
| METEOR | 0.335 | 0.211 | +59% |
| BERTScore F1 | 82.0 | 73.4 | +12% |
| Avg Response Length | 36.2 words | 92.8 words | −61% (optimal) |
| Length Ratio (vs ref) | 0.98 | 2.52 | Near-perfect ✅ |
| Performance Grade | A− | D | +4 levels 🏆 |

Statistical Validation: Paired t-test: t = 4.31, p = 7.88 × 10⁻⁵ (p << 0.05, n = 50)
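
A paired t-test of this form can be reproduced from per-sample metric scores. A minimal pure-Python sketch — the two score lists below are made-up placeholders for illustration, not the thesis data:

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic: mean of per-sample differences
    divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Placeholder per-sample BERTScore F1 values (illustrative only)
finetuned = [0.84, 0.81, 0.79, 0.85, 0.82, 0.80]
baseline  = [0.74, 0.72, 0.75, 0.73, 0.71, 0.76]
print(round(paired_t(finetuned, baseline), 2))  # t ≈ 5.83 for these placeholders
```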

Loss Convergence

| Epoch | Training Loss | Validation Loss | Train–Val Gap |
|-------|---------------|-----------------|---------------|
| 1 | 1.275 | 1.254 | 0.021 |
| 2 | 1.208 | 1.208 | 0.000 |
| 3 | 1.091 | 1.195 | 0.096 |
| 4 | 0.952 | 1.203 | 0.251 |
| 5 | 0.935 | 1.217 | 0.282 |
  • Training Loss: 1.275 → 0.935 (26.7% reduction)
  • Validation Loss: 1.254 → 1.217 (plateaus near 1.20; lowest at epoch 3)
  • Total Training Time: ~1h 46m (3,295 steps)
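
The convergence table suggests the best checkpoint by validation loss is epoch 3, with the train–val gap widening afterwards; a quick sanity check over the reported numbers:

```python
# Loss values copied from the convergence table above
train = [1.275, 1.208, 1.091, 0.952, 0.935]
val = [1.254, 1.208, 1.195, 1.203, 1.217]

best_epoch = min(range(len(val)), key=val.__getitem__) + 1
train_reduction = (train[0] - train[-1]) / train[0]

print(f"lowest validation loss at epoch {best_epoch}")     # epoch 3
print(f"training loss reduction: {train_reduction:.1%}")   # 26.7%
```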

🚀 Quick Start

Installation

```bash
pip install transformers torch pillow accelerate unsloth
```

Basic Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    trust_remote_code=True
)

# Prepare input
image = Image.open("scene.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this scene for a blind person."}
    ]
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )

response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```
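
Because the model targets 25–60 word descriptions for TTS delivery, a client can flag out-of-range generations before speaking them. A hypothetical helper (not part of the model API):

```python
def within_target_length(description: str, lo: int = 25, hi: int = 60) -> bool:
    """True if the description falls in the 25–60 word window the
    model was fine-tuned to produce (hypothetical client-side check)."""
    return lo <= len(description.split()) <= hi

print(within_target_length("Clear hallway ahead."))  # False: too short to trust
```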

💡 Usage Examples

Scene Description

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this scene for a blind person."}
    ]
}]
# Output: "A busy intersection with pedestrians waiting at the corner of a crosswalk.
# Traffic lights control vehicle movement and there are several cars stopped in
# adjacent lanes. Buildings with retail stores line both sides of the street,
# with wide sidewalks leading in multiple directions."
```

Safety & Obstacle Awareness

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What safety concerns or obstacles should I be aware of?"}
    ]
}]
# Output: "There are several orange traffic cones placed along the sidewalk creating
# a narrow walking path, with construction barriers on the left side that you
# should stay away from for safety."
```

Navigation Context

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe the navigation context of this scene for someone who is blind."}
    ]
}]
# Output: "A long corridor extending straight ahead with doorways on both the left
# and right sides. The hallway is well-lit from overhead lights and the floor
# appears clear and even. There are approximately four doors visible before
# the corridor turns."
```

Text & Sign Reading

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What text or signs can you see in this image?"}
    ]
}]
# Output: "The sign says 'EXIT' in red letters above the doorway on the right side."
```

Memory Optimization

```python
# 8-bit quantization (reduces VRAM usage)
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
```

🛠️ Training Details

Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| Base Model | Qwen3-VL-2B-Instruct | 2.16B total parameters |
| Framework | Unsloth | Optimized training framework |
| Method | LoRA | Low-Rank Adaptation |
| Quantization | 4-bit (QLoRA compatible) | Memory-efficient training |
| Trainable Params | 34.9M (1.6%) | LoRA adapters only |
| LoRA Rank (r) | 32 | Adapter dimension |
| LoRA Alpha | 64 | Scaling factor (α/r = 2) |
| LoRA Dropout | 0.1 | Regularization |
| Target Modules | q, k, v, o, gate, up, down proj | Attention + MLP layers |
| Epochs | 5 | Full data passes |
| Batch Size | 2 (effective: 16) | With 8× gradient accumulation |
| Learning Rate | 5 × 10⁻⁵ | AdamW 8-bit optimizer |
| LR Schedule | Cosine with warmup | 10% warmup ratio |
| Weight Decay | 0.01 | L2 regularization |
| Max Grad Norm | 0.1 | Gradient clipping |
| Max Seq Length | 4,096 tokens | Context window |
| Precision | FP16 | Mixed precision training |
| GPU | RTX 5070 Ti 16GB | Training hardware |
| Training Time | ~1h 46m | 3,295 steps |
| Peak VRAM | ~14.2GB | During training |
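
The LoRA settings above correspond roughly to a peft `LoraConfig` like the following. This is a sketch for reference only — the actual training used Unsloth's wrapper, and the exact target module names are an assumption based on standard Qwen-style projection layers:

```python
from peft import LoraConfig

# Assumed reconstruction of the table's LoRA hyperparameters
lora_config = LoraConfig(
    r=32,                     # adapter rank
    lora_alpha=64,            # scaling factor, alpha/r = 2
    lora_dropout=0.1,         # regularization
    target_modules=[          # attention + MLP projections (assumed names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```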

Dataset

Size: 21,000 samples (17,000 train / 4,000 validation)
Average Response Length: 34 words (100% ≥ 25 words)
Format: Qwen3-VL native with embedded base64 images

Four-Stage Data Preparation Pipeline:

  1. Stage 1 — Intelligent Collection: Adaptive filtering with custom navigation relevance scoring across 7 source datasets
  2. Stage 2 — Quality Selection: Top samples selected by navigation score with balanced dataset distribution
  3. Stage 3 — Processing & Splitting: 81/19 stratified train/val split with base64 image embedding and context-aware question generation
  4. Stage 4 — Format Validation: 100% Qwen3-VL format compliance, zero errors
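
Stage 3's stratified split can be sketched in pure Python. In this illustration, `key` picks the stratum (here a hypothetical source-dataset tag), and each stratum is divided 81/19 independently so the validation set keeps the same source mix as training:

```python
import random
from collections import defaultdict

def stratified_split(samples, key, train_frac=0.81, seed=42):
    """Shuffle within each stratum, then cut each stratum 81/19 so
    both splits preserve the per-source distribution (sketch)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for s in samples:
        by_stratum[key(s)].append(s)
    train, val = [], []
    for group in by_stratum.values():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        val.extend(group[cut:])
    return train, val

# Hypothetical toy records tagged with their source dataset
data = [{"src": "vizwiz"}] * 100 + [{"src": "gqa"}] * 100
train, val = stratified_split(data, key=lambda s: s["src"])
print(len(train), len(val))  # 162 38
```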

Sources (7 datasets):

  • Localized Narratives COCO (comprehensive spatial narratives)
  • Localized Narratives Flickr30k (detailed scene descriptions)
  • Visual Genome Combined (multi-region spatial relationships)
  • A-OKVQA (reasoning-based navigation analysis)
  • OK-VQA (knowledge-based environmental understanding)
  • GQA Enhanced (spatial reasoning Q&A)
  • VizWiz Enhanced (authentic blind user scenarios)

💻 Hardware Requirements

| Use Case | GPU | RAM | Storage |
|----------|-----|-----|---------|
| Inference (FP16) | 6GB+ VRAM | 16GB | 8GB |
| Inference (8-bit) | 4GB+ VRAM | 8GB | 6GB |
| Training (QLoRA) | 16GB VRAM | 32GB | 50GB |

Recommended for Inference: RTX A2000+ or equivalent


⚠️ Limitations

  1. Scope: Optimized for navigation descriptions; may underperform on general VQA tasks
  2. Spatial Precision: Uses relative terms ("nearby," "ahead") rather than quantitative distances
  3. Dynamic Elements: Limited tracking of moving objects (pedestrians, vehicles in motion)
  4. Indoor vs Outdoor: Stronger outdoor performance reflecting training data composition
  5. Lighting Conditions: May struggle with directional shadows, glare, or backlit subjects
  6. Temporary Hazards: Construction barriers, spilled liquids, and transient obstacles may receive insufficient emphasis
  7. Language: English only
  8. Speed: Requires GPU for real-time use (2–3s on GPU; slower on CPU)

Safety Notice

⚠️ This is an assistive tool, not a replacement for traditional navigation aids. Users should:

  • Combine with cane, guide dog, or other mobility aids
  • Exercise human judgment in all navigation decisions
  • Test in safe environments first
  • Be aware of potential errors and limitations

🎓 Model Card

Model Details

  • Type: Vision-Language Model (Qwen3-VL)
  • Architecture: Unified vision-language transformer with high-resolution ViT and multi-scale processing
  • Parameters: 2.16B total, ~34.9M trainable (1.6%)
  • Input: Image + Text
  • Output: Text (25–60 word navigation descriptions)
  • License: Apache 2.0

Intended Use

Primary:

  • Navigation assistance for blind/visually impaired users
  • Comprehensive scene description and spatial reasoning
  • Obstacle detection and safety-critical environment analysis
  • Text and sign recognition in natural environments
  • Accessibility and assistive technology research

Out of Scope:

  • Medical diagnosis
  • Autonomous navigation without human oversight
  • Real-time video processing
  • General-purpose VQA (use base model instead)

System Context

This model serves as the primary navigation description generator within a multi-model assistive system:

  • Qwen3-VL-2B (this model): Fine-tuned comprehensive navigation descriptions
  • Florence-2 Small: Specialized object detection and spatial reasoning (pre-trained)
  • FastVLM-0.5B-ONNX: Real-time preliminary scene assessment (pre-trained)

Ethical Considerations

  • Designed to enhance independence, not replace human judgment
  • May have biases from English-only training data
  • Requires validation in real-world navigation scenarios
  • Processes images locally (no data collection)
  • Trained on publicly available datasets with appropriate licenses

📖 Citation

```bibtex
@misc{amin2025qwen3vl_navigation,
  author = {Amin, Mohammad Mohamed Said Aly},
  title = {Qwen3-VL Navigation Assistant: Fine-tuned for Blind Navigation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/Qwen3-VL-2B-Navigation-FineTuned}}
}

@mastersthesis{amin2025thesis,
  author = {Amin, Mohammad Mohamed Said Aly},
  title = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school = {Asia Pacific University of Technology and Innovation},
  year = {2025},
  address = {Kuala Lumpur, Malaysia}
}
```

🙏 Acknowledgments

Supervision:

  • Dr. Raheem Mafas (Research Supervisor)
  • Asia Pacific University of Technology and Innovation
  • raheem@apu.edu.my

Technical:

  • Alibaba Cloud (Qwen3-VL base model)
  • HuggingFace Team (model hosting & libraries)
  • Unsloth (optimized training framework)
  • NVIDIA (GPU hardware)

Datasets:

  • COCO
  • Flickr30k
  • Stanford Visual Genome
  • OK-VQA
  • GQA
  • VizWiz (authentic blind user data)

📫 Contact

Author: Mohammad Mohamed Said Aly Amin
Institution: Asia Pacific University of Technology and Innovation
Email: TP079177@mail.apu.edu.my
Issues: Model Discussions


Made with ❤️ for accessibility and inclusion


Empowering independence through AI-powered vision assistance
