# Qwen3-VL Navigation Assistant 🦯
*Fine-tuned vision-language model for blind navigation assistance*
Quick Start • Performance • Usage • Training • Citation
## 📋 Overview
Fine-tuned Qwen3-VL-2B-Instruct for vision-based navigation assistance for blind and visually impaired users. The model generates comprehensive, navigation-focused scene descriptions (25–60 words) optimized for text-to-speech delivery during active navigation. Developed as a Master's thesis project at Asia Pacific University.
Key Results:
- 🎯 82.0% BERTScore (semantic accuracy)
- 🚀 +2,031% BLEU improvement over baseline
- 📏 Near-perfect length calibration (0.98 ratio to reference)
- 📊 p < 0.001 statistical significance
- 🏆 Grade: A− (4-level improvement from D baseline)
Author: Mohammad Mohamed Said Aly Amin
Supervisor: Dr. Raheem Mafas
Institution: Asia Pacific University of Technology and Innovation
Program: Master's in Data Science & Business Analytics
## ✨ Features
### Comprehensive Navigation Descriptions
The model produces detailed, actionable descriptions (25–60 words) covering spatial relationships, obstacles, safety cues, and environmental context — optimized for blind navigation via text-to-speech.
| Capability | Description | Example Query |
|---|---|---|
| 🌍 Scene Description | Comprehensive environment narratives | "Describe this scene for a blind person." |
| 🎯 Spatial Reasoning | Object relationships & positioning | "What obstacles should I be aware of?" |
| 🧭 Navigation Context | Pathways, directions & safety cues | "Describe the navigation context of this scene." |
| 📝 Text Recognition | Signs, labels & written information | "What text or signs can you see?" |
### Technical Highlights
- ✅ Comprehensive 25–60 word navigation descriptions
- ✅ Near-perfect length calibration (36.2 words vs 36.9 reference)
- ✅ Real-time inference on consumer GPUs (2–3s/image)
- ✅ Statistically validated improvements (p < 0.001)
- ✅ Parameter-efficient fine-tuning via LoRA (~1.6% trainable parameters)
- ✅ Part of multi-model navigation system (Qwen3-VL + Florence-2 + FastVLM)
## 📊 Performance
### Evaluation Results (50 samples, paired comparison)
| Metric | Fine-tuned | Baseline | Improvement |
|---|---|---|---|
| BLEU | 0.164 | 0.008 | +2,031% 🚀 |
| ROUGE-1 | 39.30 | 19.37 | +103% |
| ROUGE-2 | 17.74 | 3.60 | +393% |
| ROUGE-L | 30.92 | 12.48 | +148% |
| METEOR | 0.335 | 0.211 | +59% |
| BERTScore F1 | 82.0 | 73.4 | +12% |
| Avg Response Length | 36.2 words | 92.8 words | −61% (optimal) |
| Length Ratio (vs ref) | 0.98 | 2.52 | Near-perfect ✅ |
| Performance Grade | A− | D | +4 levels 🏆 |
Statistical Validation: paired t-test, t = 4.31, p = 7.88 × 10⁻⁵ (p ≪ 0.05, n = 50)
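The paired comparison above can be reproduced with `scipy.stats.ttest_rel`, which tests per-sample metric scores for the two models on the same images. A minimal sketch with synthetic placeholder scores (the real per-sample scores are not published here, so the arrays below are illustrative only):

```python
# Sketch of a paired t-test over per-image metric scores.
# The score arrays are synthetic placeholders, NOT the actual eval data.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n = 50  # paired samples, as in the evaluation
baseline_scores = rng.normal(loc=0.73, scale=0.05, size=n)      # placeholder
finetuned_scores = baseline_scores + rng.normal(0.09, 0.04, n)  # placeholder

# Each pair scores the SAME image under both models, so a paired test applies.
t_stat, p_value = ttest_rel(finetuned_scores, baseline_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```

A paired test is the right choice here because both models are evaluated on an identical 50-image set, which removes per-image difficulty as a source of variance.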
### Loss Convergence
| Epoch | Training Loss | Validation Loss | Train-Val Gap |
|---|---|---|---|
| 1 | 1.275 | 1.254 | 0.021 |
| 2 | 1.208 | 1.208 | 0.000 |
| 3 | 1.091 | 1.195 | 0.096 |
| 4 | 0.952 | 1.203 | 0.251 |
| 5 | 0.935 | 1.217 | 0.282 |
- Training Loss: 1.275 → 0.935 (26.7% reduction)
- Validation Loss: 1.254 → 1.217 (stable convergence)
- Total Training Time: ~1h 46m (3,295 steps)
## 🚀 Quick Start
### Installation
```bash
pip install transformers torch pillow accelerate unsloth
```
### Basic Usage
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

# Load model (the Auto class resolves the correct Qwen3-VL architecture)
model = AutoModelForImageTextToText.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    trust_remote_code=True,
)

# Prepare input
image = Image.open("scene.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this scene for a blind person."},
    ],
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
## 💡 Usage Examples
### Scene Description
```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this scene for a blind person."},
    ],
}]
# Output: "A busy intersection with pedestrians waiting at the corner of a crosswalk.
# Traffic lights control vehicle movement and there are several cars stopped in
# adjacent lanes. Buildings with retail stores line both sides of the street,
# with wide sidewalks leading in multiple directions."
```
### Safety & Obstacle Awareness
```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What safety concerns or obstacles should I be aware of?"},
    ],
}]
# Output: "There are several orange traffic cones placed along the sidewalk creating
# a narrow walking path, with construction barriers on the left side that you
# should stay away from for safety."
```
### Navigation Context
```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe the navigation context of this scene for someone who is blind."},
    ],
}]
# Output: "A long corridor extending straight ahead with doorways on both the left
# and right sides. The hallway is well-lit from overhead lights and the floor
# appears clear and even. There are approximately four doors visible before
# the corridor turns."
```
### Text & Sign Reading
```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What text or signs can you see in this image?"},
    ],
}]
# Output: "The sign says 'EXIT' in red letters above the doorway on the right side."
```
### Memory Optimization
```python
# 8-bit quantization (reduces VRAM usage)
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForImageTextToText.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
```
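Since training used 4-bit QLoRA-compatible quantization, 4-bit inference is a natural further step on very small GPUs. A sketch with common NF4 defaults (these specific settings are assumptions, not author-confirmed values); pass the config to `from_pretrained` exactly as in the 8-bit example above:

```python
# 4-bit quantization config sketch (QLoRA-style NF4; settings are
# common defaults, not values confirmed by the model authors).
from transformers import BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # normalized-float 4-bit
    bnb_4bit_compute_dtype=torch.float16, # matmuls still run in FP16
)
```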
## 🛠️ Training Details
### Configuration
| Parameter | Value | Description |
|---|---|---|
| Base Model | Qwen3-VL-2B-Instruct | 2.16B total parameters |
| Framework | Unsloth | Optimized training framework |
| Method | LoRA | Low-Rank Adaptation |
| Quantization | 4-bit (QLoRA compatible) | Memory-efficient training |
| Trainable Params | ~34.9M (1.6%) | LoRA adapters only |
| LoRA Rank (r) | 32 | Adapter dimension |
| LoRA Alpha | 64 | Scaling factor (α/r = 2) |
| LoRA Dropout | 0.1 | Regularization |
| Target Modules | q, k, v, o, gate, up, down proj | Attention + MLP layers |
| Epochs | 5 | Full data passes |
| Batch Size | 2 (effective: 16) | With 8× gradient accumulation |
| Learning Rate | 5 × 10⁻⁵ | AdamW 8-bit optimizer |
| LR Schedule | Cosine with warmup | 10% warmup ratio |
| Weight Decay | 0.01 | L2 regularization |
| Max Grad Norm | 0.1 | Gradient clipping |
| Max Seq Length | 4,096 tokens | Context window |
| Precision | FP16 | Mixed precision training |
| GPU | RTX 5070 Ti 16GB | Training hardware |
| Training Time | ~1h 46m | 3,295 steps |
| Peak VRAM | ~14.2GB | During training |
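The ~1.6% trainable-parameter figure comes from the LoRA adapters alone: each adapted weight matrix of shape (d_out × d_in) gains two low-rank factors, adding r·(d_in + d_out) parameters. A quick sketch (the 2048×2048 dimension below is illustrative, not the exact Qwen3-VL-2B projection shape):

```python
# LoRA adds A (r x d_in) and B (d_out x r) per adapted matrix,
# so the extra parameter count is r*d_in + d_out*r.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * d_in + d_out * r

# e.g. a hypothetical 2048x2048 attention projection at rank 32:
print(lora_params(2048, 2048, 32))  # 131072 extra parameters
```

Summed over the q/k/v/o and gate/up/down projections across all layers, adapters of this size reach the tens of millions of parameters while the 2.16B base weights stay frozen.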
### Dataset
Size: 21,000 samples (17,000 train / 4,000 validation)
Average Response Length: 34 words (100% ≥ 25 words)
Format: Qwen3-VL native with embedded base64 images
Four-Stage Data Preparation Pipeline:
- Stage 1 — Intelligent Collection: Adaptive filtering with custom navigation relevance scoring across 7 source datasets
- Stage 2 — Quality Selection: Top samples selected by navigation score with balanced dataset distribution
- Stage 3 — Processing & Splitting: 81/19 stratified train/val split with base64 image embedding and context-aware question generation
- Stage 4 — Format Validation: 100% Qwen3-VL format compliance, zero errors
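Stage 1's "navigation relevance scoring" is not specified in detail; a hypothetical keyword-density scorer illustrates the general idea (the term list, threshold, and `navigation_score` function are illustrative assumptions, not the project's actual implementation):

```python
# Hypothetical sketch of navigation-relevance filtering: score captions
# by the density of navigation-related vocabulary, keep high scorers.
NAV_TERMS = {"left", "right", "ahead", "door", "stairs", "sidewalk",
             "crosswalk", "obstacle", "sign", "exit", "corridor"}

def navigation_score(caption: str) -> float:
    words = caption.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,") in NAV_TERMS)
    return hits / len(words)  # fraction of navigation-relevant words

samples = ["A man holding a cake.", "Stairs ahead with a door on the left."]
kept = [s for s in samples if navigation_score(s) > 0.1]
```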
Sources (7 datasets):
- Localized Narratives COCO (comprehensive spatial narratives)
- Localized Narratives Flickr30k (detailed scene descriptions)
- Visual Genome Combined (multi-region spatial relationships)
- A-OKVQA (reasoning-based navigation analysis)
- OK-VQA (knowledge-based environmental understanding)
- GQA Enhanced (spatial reasoning Q&A)
- VizWiz Enhanced (authentic blind user scenarios)
## 💻 Hardware Requirements
| Use Case | GPU | RAM | Storage |
|---|---|---|---|
| Inference (FP16) | 6GB+ VRAM | 16GB | 8GB |
| Inference (8-bit) | 4GB+ VRAM | 8GB | 6GB |
| Training (QLoRA) | 16GB VRAM | 32GB | 50GB |
Recommended for Inference: RTX A2000+ or equivalent
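The VRAM rows are consistent with simple weight-size arithmetic: 2.16B parameters at 2 bytes each (FP16) is about 4 GiB of weights before activations and KV cache, hence the 6GB+ recommendation. A quick check:

```python
# Weight-only memory at different precisions; activations and KV cache
# add overhead on top, which is why the table leaves headroom.
params = 2.16e9  # total parameters (from the model card)
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}
for name, b in bytes_per_param.items():
    print(f"{name}: {params * b / 2**30:.1f} GiB")
```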
## ⚠️ Limitations
- Scope: Optimized for navigation descriptions; may underperform on general VQA tasks
- Spatial Precision: Uses relative terms ("nearby," "ahead") rather than quantitative distances
- Dynamic Elements: Limited tracking of moving objects (pedestrians, vehicles in motion)
- Indoor vs Outdoor: Stronger outdoor performance, reflecting the outdoor-heavy composition of the training data
- Lighting Conditions: May struggle with directional shadows, glare, or backlit subjects
- Temporary Hazards: Construction barriers, spilled liquids, and transient obstacles may receive insufficient emphasis
- Language: English only
- Speed: Requires GPU for real-time use (2–3s on GPU; slower on CPU)
### Safety Notice
⚠️ This is an assistive tool, not a replacement for traditional navigation aids. Users should:
- Combine with cane, guide dog, or other mobility aids
- Exercise human judgment in all navigation decisions
- Test in safe environments first
- Be aware of potential errors and limitations
## 🎓 Model Card
### Model Details
- Type: Vision-Language Model (Qwen3-VL)
- Architecture: Unified vision-language transformer with high-resolution ViT and multi-scale processing
- Parameters: 2.16B total, ~34.9M trainable (1.6%)
- Input: Image + Text
- Output: Text (25–60 word navigation descriptions)
- License: Apache 2.0
### Intended Use
Primary:
- Navigation assistance for blind/visually impaired users
- Comprehensive scene description and spatial reasoning
- Obstacle detection and safety-critical environment analysis
- Text and sign recognition in natural environments
- Accessibility and assistive technology research
Out of Scope:
- Medical diagnosis
- Autonomous navigation without human oversight
- Real-time video processing
- General-purpose VQA (use base model instead)
### System Context
This model serves as the primary navigation description generator within a multi-model assistive system:
- Qwen3-VL-2B (this model): Fine-tuned comprehensive navigation descriptions
- Florence-2 Small: Specialized object detection and spatial reasoning (pre-trained)
- FastVLM-0.5B-ONNX: Real-time preliminary scene assessment (pre-trained)
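One plausible way the three models could be orchestrated (this routing logic and the callable interfaces are hypothetical illustrations, not the system's documented design): the fast preliminary model gates whether the slower, fine-tuned description model runs at all.

```python
# Hypothetical multi-model routing sketch. The model interfaces are
# placeholder callables, not the real FastVLM/Qwen3-VL APIs.
from typing import Callable

def describe(image, fast_check: Callable, detailed: Callable,
             threshold: float = 0.5) -> str:
    # FastVLM-style quick assessment first (e.g. hazard likelihood in [0, 1])...
    urgency = fast_check(image)
    # ...then the fine-tuned description model only when warranted.
    if urgency >= threshold:
        return detailed(image)
    return "Path appears clear."

print(describe("scene.jpg", lambda i: 0.9,
               lambda i: "Obstacle ahead on the left."))
```

A gating design like this keeps average latency low, which matters for the 2–3s/image budget quoted above.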
### Ethical Considerations
- Designed to enhance independence, not replace human judgment
- May have biases from English-only training data
- Requires validation in real-world navigation scenarios
- Processes images locally (no data collection)
- Trained on publicly available datasets with appropriate licenses
## 📖 Citation
```bibtex
@misc{amin2025qwen3vl_navigation,
  author       = {Amin, Mohammad Mohamed Said Aly},
  title        = {Qwen3-VL Navigation Assistant: Fine-tuned for Blind Navigation},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/Qwen3-VL-2B-Navigation-FineTuned}}
}

@mastersthesis{amin2025thesis,
  author  = {Amin, Mohammad Mohamed Said Aly},
  title   = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school  = {Asia Pacific University of Technology and Innovation},
  year    = {2025},
  address = {Kuala Lumpur, Malaysia}
}
```
## 🙏 Acknowledgments
Supervision:
- Dr. Raheem Mafas (Research Supervisor)
- Asia Pacific University of Technology and Innovation
- raheem@apu.edu.my
Technical:
- Alibaba Cloud (Qwen3-VL base model)
- HuggingFace Team (model hosting & libraries)
- Unsloth (optimized training framework)
- NVIDIA (GPU hardware)
Datasets:
- COCO
- Flickr30k
- Stanford Visual Genome
- OK-VQA
- GQA
- VizWiz (authentic blind user data)
## 📫 Contact
Author: Mohammad Mohamed Said Aly Amin
Institution: Asia Pacific University of Technology and Innovation
Email: TP079177@mail.apu.edu.my
Issues: Model Discussions