
Libraries

How to use msaid1976/SmolVLM-Instruct-Navigation-FineTuned with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="msaid1976/SmolVLM-Instruct-Navigation-FineTuned")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("msaid1976/SmolVLM-Instruct-Navigation-FineTuned", dtype="auto")


vLLM

How to use msaid1976/SmolVLM-Instruct-Navigation-FineTuned with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "msaid1976/SmolVLM-Instruct-Navigation-FineTuned"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
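The same request can be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. The helper names `build_chat_payload` and `post_chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "msaid1976/SmolVLM-Instruct-Navigation-FineTuned"


def build_chat_payload(question: str, image_url: str) -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_payload(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# Uncomment once the server is running:
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```

For the SGLang server described further down, the same helper should work with `base_url="http://localhost:30000"`.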

Use Docker

docker model run hf.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned

SGLang

How to use msaid1976/SmolVLM-Instruct-Navigation-FineTuned with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "msaid1976/SmolVLM-Instruct-Navigation-FineTuned" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "msaid1976/SmolVLM-Instruct-Navigation-FineTuned" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request shown above for the pip install (same port, 30000).

Docker Model Runner
How to use msaid1976/SmolVLM-Instruct-Navigation-FineTuned with Docker Model Runner:
```
docker model run hf.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned
```

SmolVLM Navigation Assistant 🦯

Fine-tuned vision-language model for blind navigation assistance


📋 Overview

This model is a fine-tuned SmolVLM-500M-Instruct that provides vision-based navigation assistance for blind and visually impaired users. It was developed as a Master's thesis project at Asia Pacific University.

Key Results:

🎯 91.6% BERTScore (semantic accuracy)
🚀 +3483% BLEU-1 improvement over baseline
⚡ 0.5-1s inference time
💾 2-4GB VRAM requirement
📊 p < 0.001 statistical significance

Author: Mohammad Mohamed Said Aly Amin
Supervisor: Dr. Raheem Mafas
Institution: Asia Pacific University
Program: Master's in Data Science & Business Analytics

✨ Features

Three Navigation Modes

Mode	Purpose	Response Length	Example Query
🎯 FOCUSED	Spatial relationships	5-15 words	"Is there a chair to my left?"
🌍 SCENE	Environment description	30-50 words	"Describe what's in front of me"
📝 OCR	Text recognition	Variable	"What does the sign say?"
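One way to work with the three modes in client code is a small lookup that pairs each mode with a generation budget and a length check. This is a sketch under assumptions: the word limits come from the table above, but the `max_new_tokens` values for FOCUSED and SCENE are illustrative guesses, and `ModeSpec`, `build_request`, and `within_length` are hypothetical names. The model card does not describe a mode flag, so the mode here only affects client-side settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModeSpec:
    purpose: str
    max_words: Optional[int]  # upper bound from the table; None means variable
    max_new_tokens: int       # assumed generation budget, not from the model card


MODES = {
    "FOCUSED": ModeSpec("Spatial relationships", 15, 32),
    "SCENE": ModeSpec("Environment description", 50, 96),
    "OCR": ModeSpec("Text recognition", None, 150),
}


def build_request(mode: str, question: str) -> tuple[list[dict], int]:
    """Return chat messages plus a max_new_tokens value for the given mode."""
    spec = MODES[mode.upper()]
    messages = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": question}],
    }]
    return messages, spec.max_new_tokens


def within_length(mode: str, response: str) -> bool:
    """Check a response against the mode's word limit (always True for OCR)."""
    limit = MODES[mode.upper()].max_words
    return limit is None or len(response.split()) <= limit


messages, budget = build_request("focused", "Is there a chair to my left?")
print(budget, within_length("FOCUSED", "Yes, there is a chair to your left."))
```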

Technical Highlights

✅ Real-time inference on consumer GPUs
✅ Low memory footprint (2-4GB VRAM)
✅ Statistically validated improvements
✅ Production-ready deployment
✅ QLoRA efficient fine-tuning (1.84% parameters)

📊 Performance

Evaluation Results (500 samples)

Metric	Fine-tuned	Baseline	Improvement
BLEU	0.234	-	-
BLEU-1	24.89	0.69	+3483% 🚀
ROUGE-1	55.72	13.66	+308%
ROUGE-2	32.46	2.69	+1105%
ROUGE-L	48.27	11.82	+308%
BERTScore	91.63	85.60	+7.04%
Length Ratio	0.93	-	Nearly perfect
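The Improvement column can be read as relative change against the baseline. The sketch below recomputes it from the rounded table values, so results can differ slightly from the reported percentages (most visibly for BLEU-1, where the baseline is very small and rounding has a large effect). The function name `relative_improvement` is illustrative.

```python
def relative_improvement(fine_tuned: float, baseline: float) -> float:
    """Percentage change of the fine-tuned score relative to the baseline."""
    return (fine_tuned - baseline) / baseline * 100.0


# (fine-tuned, baseline) pairs copied from the table above.
scores = {
    "BLEU-1": (24.89, 0.69),
    "ROUGE-1": (55.72, 13.66),
    "ROUGE-2": (32.46, 2.69),
    "ROUGE-L": (48.27, 11.82),
    "BERTScore": (91.63, 85.60),
}

improvements = {
    name: relative_improvement(ft, base) for name, (ft, base) in scores.items()
}
for name, pct in improvements.items():
    print(f"{name}: {pct:+.1f}%")
```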

Statistical Validation: All improvements significant at p < 0.001 (paired t-test, n=500)
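For readers unfamiliar with the paired t-test, the sketch below computes the t statistic from per-sample score pairs using only the standard library. The scores in it are made up for illustration and are not the thesis data, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def paired_t_statistic(fine_tuned: list[float], baseline: list[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(fine_tuned) != len(baseline):
        raise ValueError("paired samples must have the same length")
    diffs = [ft - base for ft, base in zip(fine_tuned, baseline)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n))


# Made-up per-sample scores, for illustration only.
baseline_scores = [0.10, 0.12, 0.08, 0.15, 0.11]
fine_tuned_scores = [0.52, 0.55, 0.47, 0.61, 0.50]
t_stat = paired_t_statistic(fine_tuned_scores, baseline_scores)
print(f"t = {t_stat:.2f} with {len(baseline_scores) - 1} degrees of freedom")
```

If SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value for the same kind of paired input.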

Loss Convergence

Initial Training Loss: 0.29 → Final: 0.12 (58% reduction)
Initial Val Loss: 0.24 → Final: 0.13 (46% reduction)

🚀 Quick Start

Installation

pip install transformers torch pillow accelerate

Basic Usage

from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    trust_remote_code=True
)

# Prepare input
image = Image.open("scene.jpg").convert("RGB")  # ensure a 3-channel RGB image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What do you see?"}
    ]
}]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
# Move tensors to wherever device_map placed the model instead of hardcoding "cuda"
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )

response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
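The slice in the decode step is there because `generate` returns the prompt tokens followed by the newly generated ones, so the prompt length is cut off before decoding. A minimal illustration with plain lists standing in for tensors (the token IDs are arbitrary and `strip_prompt` is a hypothetical helper):

```python
def strip_prompt(output_ids: list[int], prompt_length: int) -> list[int]:
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Stand-in token IDs: the first four belong to the prompt, the rest are new.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]
new_tokens = strip_prompt(output_ids, len(prompt_ids))
print(new_tokens)  # [42, 43, 2]
```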

💡 Usage Examples

FOCUSED: Spatial Queries

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is there a chair to the left of the table?"}
    ]
}]
# Output: "Yes, there is a chair to the left of the table."

SCENE: Environment Description

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the scene in front of me."}
    ]
}]
# Output: "The scene shows a living room with a brown sofa on the left, 
# a wooden coffee table in the center, and a TV on the wall..."

OCR: Text Reading

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What text is on the sign?"}
    ]
}]
# Output: "The sign says 'EXIT' in red letters."

Memory Optimization

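# 8-bit quantization (reduces to ~2GB VRAM); requires the bitsandbytes package
from transformers import BitsAndBytesConfig

model = Idefics3ForConditionalGeneration.from_pretrained(
    "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Batch processing: one prompt and one image list per sample.
# prompt1..prompt3 and img1..img3 are placeholders, prepared as in Basic Usage.
inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[[img1], [img2], [img3]],
    return_tensors="pt",
    padding=True
)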
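For more than a handful of images, the batch call above can be fed from a small chunking helper. This is a sketch, not part of the model's API: `make_batches` is a hypothetical name, and strings stand in for PIL images so the example runs without any files.

```python
from typing import Iterator, Sequence


def make_batches(samples: Sequence[tuple[str, object]], batch_size: int) -> Iterator[dict]:
    """Yield processor keyword arguments for consecutive chunks of (prompt, image) pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        yield {
            "text": [prompt for prompt, _ in chunk],
            "images": [[image] for _, image in chunk],  # one image list per prompt
            "return_tensors": "pt",
            "padding": True,
        }


# Placeholder samples; in practice each image would be a PIL.Image.
samples = [("prompt1", "img1"), ("prompt2", "img2"), ("prompt3", "img3")]
batches = list(make_batches(samples, batch_size=2))
print(len(batches))  # 2
```

Each yielded dictionary can then be passed on as `processor(**batch)`.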

🛠️ Training Details

Configuration

Parameter	Value	Description
Base Model	SmolVLM-500M-Instruct	500M parameters
Method	QLoRA	4-bit quantization
Trainable Params	42M (1.84%)	LoRA adapters only
LoRA Rank	32	Adapter dimension
LoRA Alpha	64	Scaling factor
Epochs	3	Full data passes
Batch Size	1 (effective: 16)	With gradient accumulation
Learning Rate	2e-5	AdamW optimizer
Precision	BF16	Mixed precision
GPU	RTX 5070 Ti 16GB	Training hardware
Training Time	~20 hours	Total duration
Peak VRAM	7-9GB	During training
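A few derived quantities follow from the table by simple arithmetic. The sketch below assumes exactly 10,000 training samples (the Dataset section states "10,000+", so the real step counts are likely somewhat higher) and the standard LoRA convention of scaling the adapter update by alpha divided by rank.

```python
import math

# Values from the configuration table.
per_device_batch_size = 1
effective_batch_size = 16
epochs = 3
lora_rank = 32
lora_alpha = 64

# Assumption: exactly 10,000 training samples.
num_samples = 10_000

grad_accum_steps = effective_batch_size // per_device_batch_size
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_optimizer_steps = steps_per_epoch * epochs
lora_scale = lora_alpha / lora_rank  # standard LoRA scaling factor alpha / r

print(grad_accum_steps, steps_per_epoch, total_optimizer_steps, lora_scale)
# 16 625 1875 2.0
```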

Dataset

Size: 10,000+ samples across three modes

Sources:

GQA Enhanced (spatial reasoning)
Localized Narratives (scene descriptions)
Visual Genome (object relationships)
TextCaps (text-in-image)
VizWiz (accessibility focus)
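Samples drawn from several sources are easier to mix when they share one record layout. The sketch below shows one possible layout and its conversion to chat turns. The `NavSample` schema, its field names, and the pairing of TextCaps with the OCR mode in the example are assumptions for illustration, not the format used in the thesis.

```python
from dataclasses import dataclass


@dataclass
class NavSample:
    mode: str        # "FOCUSED", "SCENE", or "OCR"
    image_path: str
    question: str
    answer: str
    source: str      # name of the originating dataset

    def to_messages(self) -> list[dict]:
        """Convert the sample into user and assistant chat turns."""
        return [
            {
                "role": "user",
                "content": [{"type": "image"}, {"type": "text", "text": self.question}],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": self.answer}],
            },
        ]


# Hypothetical sample; the path and the source assignment are placeholders.
sample = NavSample(
    mode="OCR",
    image_path="images/sign.jpg",
    question="What text is on the sign?",
    answer="The sign says 'EXIT' in red letters.",
    source="TextCaps",
)
turns = sample.to_messages()
print(turns[1]["content"][0]["text"])
```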

💻 Hardware Requirements

Use Case	GPU	RAM	Storage
Inference	4GB+ VRAM	8GB	5GB
Training	16GB VRAM	32GB	50GB

Recommended for Inference: RTX 3060+ or equivalent

⚠️ Limitations

Scope: Optimized for navigation; may underperform on general VQA
Image Quality: Best with well-lit, clear images
OCR: Works best with printed text; struggles with handwriting
Speed: Requires GPU for real-time use (CPU: 10-20s/image)
Language: English only
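To check the speed figures on your own hardware, a simple timing harness is enough. The sketch below is generic: `measure_latency` is a hypothetical helper, and the lambda is a stand-in workload to be replaced with a call that runs `model.generate(...)`.

```python
import statistics
import time
from typing import Callable


def measure_latency(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> dict:
    """Time repeated calls to fn and return median and worst-case seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "max_s": max(timings),
        "runs": runs,
    }


# Stand-in workload so the sketch runs without a model or GPU.
stats = measure_latency(lambda: sum(range(100_000)))
print(f"median {stats['median_s']:.4f}s, max {stats['max_s']:.4f}s")
```

When timing on a GPU, call `torch.cuda.synchronize()` inside the timed function after generation, since CUDA work is launched asynchronously.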

Safety Notice

⚠️ This is an assistive tool, not a replacement for traditional navigation aids. Users should:

Combine with cane, guide dog, or other mobility aids
Exercise human judgment
Test in safe environments first
Be aware of potential errors

🎓 Model Card

Model Details

Type: Vision-Language Model (Idefics3)
Parameters: 500M total, 42M trainable (1.84%)
Input: Image + Text
Output: Text
License: Apache 2.0

Intended Use

Primary:

Navigation assistance for blind/visually impaired
Spatial reasoning and object localization
Scene understanding and description
Text recognition in natural environments
Accessibility research

Out of Scope:

Medical diagnosis
Autonomous navigation without human oversight
Real-time video processing
General-purpose VQA (use base model)

Ethical Considerations

Designed to enhance independence, not replace human judgment
May have biases from English-only training data
Requires validation in real-world scenarios
Processes images locally (no data collection)

📖 Citation

@misc{alqahtani2025smolvlm_navigation,
  author = {Alqahtani, Muhammad Said},
  title = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}}
}

@mastersthesis{alqahtani2025thesis,
  author = {Alqahtani, Muhammad Said},
  title = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
  school = {Asia Pacific University of Technology and Innovation},
  year = {2025},
  address = {Kuala Lumpur, Malaysia}
}