Qwen3-VL Navigation Assistant 🦯


Fine-tuned vision-language model for blind navigation assistance



📋 Overview

A fine-tuned version of Qwen3-VL-2B-Instruct providing vision-based navigation assistance for blind and visually impaired users. The model generates comprehensive, navigation-focused scene descriptions (25–60 words) optimized for text-to-speech delivery during active navigation. Developed as a Master's thesis project at Asia Pacific University.

Key Results:

  • 🎯 82.0% BERTScore (semantic accuracy)
  • 🚀 +2,031% BLEU improvement over baseline
  • 📏 Near-perfect length calibration (0.98 ratio to reference)
  • 📊 p < 0.001 statistical significance
  • 🏆 Grade: A− (4-level improvement from D baseline)

Author: Mohammad Mohamed Said Aly Amin
Supervisor: Dr. Raheem Mafas
Institution: Asia Pacific University of Technology and Innovation
Program: Master's in Data Science & Business Analytics


✨ Features

Comprehensive Navigation Descriptions

The model produces detailed, actionable descriptions (25–60 words) covering spatial relationships, obstacles, safety cues, and environmental context — optimized for blind navigation via text-to-speech.

| Capability | Description | Example Query |
|------------|-------------|---------------|
| 🌍 Scene Description | Comprehensive environment narratives | "Describe this scene for a blind person." |
| 🎯 Spatial Reasoning | Object relationships & positioning | "What obstacles should I be aware of?" |
| 🧭 Navigation Context | Pathways, directions & safety cues | "Describe the navigation context of this scene." |
| 📝 Text Recognition | Signs, labels & written information | "What text or signs can you see?" |
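
For client code that cycles through these capabilities, the example queries above can be kept in a small lookup. A minimal sketch — the dictionary keys are our own illustrative labels, not part of the model API; only the query strings come from the table:

```python
# Hypothetical capability→prompt lookup for a client app; keys are
# made-up labels, the query strings are the example queries above.
NAV_QUERIES = {
    "scene": "Describe this scene for a blind person.",
    "obstacles": "What obstacles should I be aware of?",
    "navigation": "Describe the navigation context of this scene.",
    "text": "What text or signs can you see?",
}

print(NAV_QUERIES["obstacles"])  # → What obstacles should I be aware of?
```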

Technical Highlights

  • ✅ Comprehensive 25–60 word navigation descriptions
  • ✅ Near-perfect length calibration (36.2 words vs 36.9 reference)
  • ✅ Real-time inference on consumer GPUs (2–3s/image)
  • ✅ Statistically validated improvements (p < 0.001)
  • ✅ Parameter-efficient fine-tuning via LoRA (~1.6% trainable parameters)
  • ✅ Part of multi-model navigation system (Qwen3-VL + Florence-2 + FastVLM)

📊 Performance

Evaluation Results (50 samples, paired comparison)

| Metric | Fine-tuned | Baseline | Improvement |
|--------|-----------|----------|-------------|
| BLEU | 0.164 | 0.008 | +2,031% 🚀 |
| ROUGE-1 | 39.30 | 19.37 | +103% |
| ROUGE-2 | 17.74 | 3.60 | +393% |
| ROUGE-L | 30.92 | 12.48 | +148% |
| METEOR | 0.335 | 0.211 | +59% |
| BERTScore F1 | 82.0 | 73.4 | +12% |
| Avg Response Length | 36.2 words | 92.8 words | −61% (optimal) |
| Length Ratio (vs ref) | 0.98 | 2.52 | Near-perfect ✅ |
| Performance Grade | A− | D | +4 levels 🏆 |

Statistical Validation: Paired t-test: t = 4.31, p = 7.88 × 10⁻⁵ (p << 0.05, n = 50)
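
A paired t-test of this form can be reproduced from per-sample metric scores. A minimal pure-Python sketch — the two score lists below are made-up placeholders for illustration, not the thesis data:

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic: mean of per-sample differences
    divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Placeholder per-sample BERTScore F1 values (illustrative only)
finetuned = [0.84, 0.81, 0.79, 0.85, 0.82, 0.80]
baseline  = [0.74, 0.72, 0.75, 0.73, 0.71, 0.76]
print(round(paired_t(finetuned, baseline), 2))  # t ≈ 5.83 for these placeholders
```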

Loss Convergence

| Epoch | Training Loss | Validation Loss | Train–Val Gap |
|-------|---------------|-----------------|---------------|
| 1 | 1.275 | 1.254 | 0.021 |
| 2 | 1.208 | 1.208 | 0.000 |
| 3 | 1.091 | 1.195 | 0.096 |
| 4 | 0.952 | 1.203 | 0.251 |
| 5 | 0.935 | 1.217 | 0.282 |
  • Training Loss: 1.275 → 0.935 (26.7% reduction)
  • Validation Loss: 1.254 → 1.217 (plateaus near 1.20; lowest at epoch 3)
  • Total Training Time: ~1h 46m (3,295 steps)
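
The convergence table suggests the best checkpoint by validation loss is epoch 3, with the train–val gap widening afterwards; a quick sanity check over the reported numbers:

```python
# Loss values copied from the convergence table above
train = [1.275, 1.208, 1.091, 0.952, 0.935]
val = [1.254, 1.208, 1.195, 1.203, 1.217]

best_epoch = min(range(len(val)), key=val.__getitem__) + 1
train_reduction = (train[0] - train[-1]) / train[0]

print(f"lowest validation loss at epoch {best_epoch}")     # epoch 3
print(f"training loss reduction: {train_reduction:.1%}")   # 26.7%
```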

🚀 Quick Start

Installation

```bash
pip install transformers torch pillow accelerate unsloth
```

Basic Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    trust_remote_code=True
)

# Prepare input
image = Image.open("scene.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this scene for a blind person."}
    ]
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )

response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```
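
Because the model targets 25–60 word descriptions for TTS delivery, a client can flag out-of-range generations before speaking them. A hypothetical helper (not part of the model API):

```python
def within_target_length(description: str, lo: int = 25, hi: int = 60) -> bool:
    """True if the description falls in the 25–60 word window the
    model was fine-tuned to produce (hypothetical client-side check)."""
    return lo <= len(description.split()) <= hi

print(within_target_length("Clear hallway ahead."))  # False: too short to trust
```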

💡 Usage Examples

Scene Description

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this scene for a blind person."}
    ]
}]
# Output: "A busy intersection with pedestrians waiting at the corner of a crosswalk.
# Traffic lights control vehicle movement and there are several cars stopped in
# adjacent lanes. Buildings with retail stores line both sides of the street,
# with wide sidewalks leading in multiple directions."
```

Safety & Obstacle Awareness

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What safety concerns or obstacles should I be aware of?"}
    ]
}]
# Output: "There are several orange traffic cones placed along the sidewalk creating
# a narrow walking path, with construction barriers on the left side that you
# should stay away from for safety."
```

Navigation Context

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe the navigation context of this scene for someone who is blind."}
    ]
}]
# Output: "A long corridor extending straight ahead with doorways on both the left
# and right sides. The hallway is well-lit from overhead lights and the floor
# appears clear and even. There are approximately four doors visible before
# the corridor turns."
```

Text & Sign Reading

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What text or signs can you see in this image?"}
    ]
}]
# Output: "The sign says 'EXIT' in red letters above the doorway on the right side."
```

Memory Optimization

```python
# 8-bit quantization (reduces VRAM usage)
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "msaid1976/Qwen3-VL-2B-Navigation-FineTuned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
```

🛠️ Training Details

Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| Base Model | Qwen3-VL-2B-Instruct | 2.16B total parameters |
| Framework | Unsloth | Optimized training framework |
| Method | LoRA | Low-Rank Adaptation |
| Quantization | 4-bit (QLoRA compatible) | Memory-efficient training |
| Trainable Params | 34.9M (1.6%) | LoRA adapters only |
| LoRA Rank (r) | 32 | Adapter dimension |
| LoRA Alpha | 64 | Scaling factor (α/r = 2) |
| LoRA Dropout | 0.1 | Regularization |
| Target Modules | q, k, v, o, gate, up, down proj | Attention + MLP layers |
| Epochs | 5 | Full data passes |
| Batch Size | 2 (effective: 16) | With 8× gradient accumulation |
| Learning Rate | 5 × 10⁻⁵ | AdamW 8-bit optimizer |
| LR Schedule | Cosine with warmup | 10% warmup ratio |
| Weight Decay | 0.01 | L2 regularization |
| Max Grad Norm | 0.1 | Gradient clipping |
| Max Seq Length | 4,096 tokens | Context window |
| Precision | FP16 | Mixed precision training |
| GPU | RTX 5070 Ti 16GB | Training hardware |
| Training Time | ~1h 46m | 3,295 steps |
| Peak VRAM | ~14.2GB | During training |
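
The LoRA settings above correspond roughly to a peft `LoraConfig` like the following. This is a sketch for reference only — the actual training used Unsloth's wrapper, and the exact target module names are an assumption based on standard Qwen-style projection layers:

```python
from peft import LoraConfig

# Assumed reconstruction of the table's LoRA hyperparameters
lora_config = LoraConfig(
    r=32,                     # adapter rank
    lora_alpha=64,            # scaling factor, alpha/r = 2
    lora_dropout=0.1,         # regularization
    target_modules=[          # attention + MLP projections (assumed names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```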

Dataset

Size: 21,000 samples (17,000 train / 4,000 validation)
Average Response Length: 34 words (100% ≥ 25 words)
Format: Qwen3-VL native with embedded base64 images

Four-Stage Data Preparation Pipeline:

  1. Stage 1 — Intelligent Collection: Adaptive filtering with custom navigation relevance scoring across 7 source datasets
  2. Stage 2 — Quality Selection: Top samples selected by navigation score with balanced dataset distribution
  3. Stage 3 — Processing & Splitting: 81/19 stratified train/val split with base64 image embedding and context-aware question generation
  4. Stage 4 — Format Validation: 100% Qwen3-VL format compliance, zero errors
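
Stage 3's stratified split can be sketched in pure Python. In this illustration, `key` picks the stratum (here a hypothetical source-dataset tag), and each stratum is divided 81/19 independently so the validation set keeps the same source mix as training:

```python
import random
from collections import defaultdict

def stratified_split(samples, key, train_frac=0.81, seed=42):
    """Shuffle within each stratum, then cut each stratum 81/19 so
    both splits preserve the per-source distribution (sketch)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for s in samples:
        by_stratum[key(s)].append(s)
    train, val = [], []
    for group in by_stratum.values():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        val.extend(group[cut:])
    return train, val

# Hypothetical toy records tagged with their source dataset
data = [{"src": "vizwiz"}] * 100 + [{"src": "gqa"}] * 100
train, val = stratified_split(data, key=lambda s: s["src"])
print(len(train), len(val))  # 162 38
```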

Sources (7 datasets):

  • Localized Narratives COCO (comprehensive spatial narratives)
  • Localized Narratives Flickr30k (detailed scene descriptions)
  • Visual Genome Combined (multi-region spatial relationships)
  • A-OKVQA (reasoning-based navigation analysis)
  • OK-VQA (knowledge-based environmental understanding)
  • GQA Enhanced (spatial reasoning Q&A)
  • VizWiz Enhanced (authentic blind user scenarios)

💻 Hardware Requirements

| Use Case | GPU | RAM | Storage |
|----------|-----|-----|---------|
| Inference (FP16) | 6GB+ VRAM | 16GB | 8GB |
| Inference (8-bit) | 4GB+ VRAM | 8GB | 6GB |
| Training (QLoRA) | 16GB VRAM | 32GB | 50GB |

Recommended for Inference: RTX A2000+ or equivalent


⚠️ Limitations

  1. Scope: Optimized for navigation descriptions; may underperform on general VQA tasks
  2. Spatial Precision: Uses relative terms ("nearby," "ahead") rather than quantitative distances
  3. Dynamic Elements: Limited tracking of moving objects (pedestrians, vehicles in motion)
  4. Indoor vs Outdoor: Stronger outdoor performance reflecting training data composition
  5. Lighting Conditions: May struggle with directional shadows, glare, or backlit subjects
  6. Temporary Hazards: Construction barriers, spilled liquids, and transient obstacles may receive insufficient emphasis
  7. Language: English only
  8. Speed: Requires GPU for real-time use (2–3s on GPU; slower on CPU)

Safety Notice

⚠️ This is an assistive tool, not a replacement for traditional navigation aids. Users should:

  • Combine with cane, guide dog, or other mobility aids
  • Exercise human judgment in all navigation decisions
  • Test in safe environments first
  • Be aware of potential errors and limitations

🎓 Model Card

Model Details

  • Type: Vision-Language Model (Qwen3-VL)
  • Architecture: Unified vision-language transformer with high-resolution ViT and multi-scale processing
  • Parameters: 2.16B total, ~34.9M trainable (1.6%)
  • Input: Image + Text
  • Output: Text (25–60 word navigation descriptions)
  • License: Apache 2.0

Intended Use

Primary:

  • Navigation assistance for blind/visually impaired users
  • Comprehensive scene description and spatial reasoning
  • Obstacle detection and safety-critical environment analysis
  • Text and sign recognition in natural environments
  • Accessibility and assistive technology research

Out of Scope:

  • Medical diagnosis
  • Autonomous navigation without human oversight
  • Real-time video processing
  • General-purpose VQA (use base model instead)

System Context

This model serves as the primary navigation description generator within a multi-model assistive system:

  • Qwen3-VL-2B (this model): Fine-tuned comprehensive navigation descriptions
  • Florence-2 Small: Specialized object detection and spatial reasoning (pre-trained)
  • FastVLM-0.5B-ONNX: Real-time preliminary scene assessment (pre-trained)

Ethical Considerations

  • Designed to enhance independence, not replace human judgment
  • May have biases from English-only training data
  • Requires validation in real-world navigation scenarios
  • Processes images locally (no data collection)
  • Trained on publicly available datasets with appropriate licenses

📖 Citation

```bibtex
@misc{amin2025qwen3vl_navigation,
  author = {Amin, Mohammad Mohamed Said Aly},
  title = {Qwen3-VL Navigation Assistant: Fine-tuned for Blind Navigation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/Qwen3-VL-2B-Navigation-FineTuned}}
}

@mastersthesis{amin2025thesis,
  author = {Amin, Mohammad Mohamed Said Aly},
  title = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school = {Asia Pacific University of Technology and Innovation},
  year = {2025},
  address = {Kuala Lumpur, Malaysia}
}
```

🙏 Acknowledgments

Supervision:

  • Dr. Raheem Mafas (Research Supervisor)
  • Asia Pacific University of Technology and Innovation
  • raheem@apu.edu.my

Technical:

  • Alibaba Cloud (Qwen3-VL base model)
  • HuggingFace Team (model hosting & libraries)
  • Unsloth (optimized training framework)
  • NVIDIA (GPU hardware)

Datasets:

  • COCO
  • Flickr30k
  • Stanford Visual Genome
  • OK-VQA
  • GQA
  • VizWiz (authentic blind user data)

📫 Contact

Author: Mohammad Mohamed Said Aly Amin
Institution: Asia Pacific University of Technology and Innovation
Email: TP079177@mail.apu.edu.my
Issues: Model Discussions


Made with ❤️ for accessibility and inclusion


Empowering independence through AI-powered vision assistance
