msaid1976 committed
Commit fb863a7 · verified · 1 Parent(s): 5cc1a61

Update README.md

Files changed (1)
  1. README.md +297 -130
README.md CHANGED
@@ -1,202 +1,369 @@
 ---
 library_name: transformers
- language:
- - en
- base_model:
- - HuggingFaceTB/SmolVLM-Instruct
 ---

- # Model Card for Model ID
- <!-- Provide a quick summary of what the model is/does. -->

- ## Model Details

- ### Model Description
- <!-- Provide a longer summary of what this model is. -->
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- - **Developed by:** [Mohamed Mohamed Said Aly Amin]
- - **Funded by [optional]:** [APU - Asia Pacific University]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [Multi-Modal]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]
- <!-- Provide the basic links for the model. -->
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

- ## Uses
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- ### Direct Use
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- [More Information Needed]

- ### Downstream Use [optional]
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- [More Information Needed]

- ### Out-of-Scope Use
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- [More Information Needed]

- ## Bias, Risks, and Limitations
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
- [More Information Needed]

- ### Recommendations
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

- ## How to Get Started with the Model
- Use the code below to get started with the model.
- [More Information Needed]

- ## Training Details

- ### Training Data
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- [More Information Needed]

- ### Training Procedure
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]
- [More Information Needed]

- #### Training Hyperparameters
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- [More Information Needed]

- ## Evaluation
- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data
- <!-- This should link to a Dataset Card if possible. -->
- [More Information Needed]

- #### Factors
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
- [More Information Needed]

- #### Metrics
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
- [More Information Needed]

- ### Results
- [More Information Needed]

- #### Summary

- ## Model Examination [optional]
- <!-- Relevant interpretability work for the model goes here -->
- [More Information Needed]

- ## Environmental Impact
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

- ### Model Architecture and Objective
- [More Information Needed]

- ### Compute Infrastructure
- [More Information Needed]

- #### Hardware
- [More Information Needed]

- #### Software
- [More Information Needed]

- ## Citation [optional]
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
- **BibTeX:**
- [More Information Needed]
- **APA:**
- [More Information Needed]

- ## Glossary [optional]
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
- [More Information Needed]

- ## More Information [optional]
- [More Information Needed]

- ## Model Card Authors [optional]
- [More Information Needed]

- ## Model Card Contact
- [More Information Needed]
 ---
+ language: en
+ license: apache-2.0
+ base_model: HuggingFaceTB/SmolVLM-500M-Instruct
 library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - Vision
+ - Image-to-text
+ - Multimodal
+ - Vision-language-model
+ - Navigation
+ - Accessibility
+ - Assistive-technology
+ - Blind-assistance
+ - Fine-tuned
+ - SmolVLM
 ---

+ # SmolVLM Navigation Assistant 🦯

+ <div align="center">

+ [![Model](https://img.shields.io/badge/Model-SmolVLM--500M-blue)](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green)](https://www.apache.org/licenses/LICENSE-2.0)
+ [![BERTScore](https://img.shields.io/badge/BERTScore-91.6%25-brightgreen)](https://huggingface.co/metrics/bertscore)

+ **Fine-tuned vision-language model for blind navigation assistance**

+ [Quick Start](#-quick-start) • [Performance](#-performance) • [Usage](#-usage-examples) • [Training](#-training-details) • [Citation](#-citation)

+ </div>

+ ---
+ ## 📋 Overview

+ A fine-tuned version of [SmolVLM-500M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct) for **vision-based navigation assistance**, built to support blind and visually impaired users. Developed as a Master's thesis project at Asia Pacific University.

+ **Key Results:**
+ - 🎯 **91.6% BERTScore** (semantic accuracy)
+ - 🚀 **+3483% BLEU-1** improvement over the baseline
+ - ⚡ **0.5-1s** inference time
+ - 💾 **2-4GB VRAM** requirement
+ - 📊 **p < 0.001** statistical significance

+ - **Author:** Mohammad Mohamed Said Aly Amin
+ - **Supervisor:** Dr. Raheem Mafas
+ - **Institution:** Asia Pacific University
+ - **Program:** Master's in Data Science & Business Analytics

+ ---

+ ## Features

+ ### Three Navigation Modes

+ | Mode | Purpose | Response Length | Example Query |
+ |------|---------|-----------------|---------------|
+ | **🎯 FOCUSED** | Spatial relationships | 5-15 words | "Is there a chair to my left?" |
+ | **🌍 SCENE** | Environment description | 30-50 words | "Describe what's in front of me" |
+ | **📝 OCR** | Text recognition | Variable | "What does the sign say?" |

+ ### Technical Highlights

+ - ✅ Real-time inference on consumer GPUs
+ - ✅ Low memory footprint (2-4GB VRAM)
+ - ✅ Statistically validated improvements
+ - ✅ Production-ready deployment
+ - ✅ Efficient QLoRA fine-tuning (1.84% of parameters trained)

+ ---

+ ## 📊 Performance

+ ### Evaluation Results (500 samples)

+ | Metric | Fine-tuned | Baseline | Improvement |
+ |--------|-----------|----------|-------------|
+ | **BLEU** | 0.234 | - | - |
+ | **BLEU-1** | 24.89 | 0.69 | **+3483%** 🚀 |
+ | **ROUGE-1** | 55.72 | 13.66 | **+308%** |
+ | **ROUGE-2** | 32.46 | 2.69 | **+1105%** |
+ | **ROUGE-L** | 48.27 | 11.82 | **+308%** |
+ | **BERTScore** | 91.63 | 85.60 | **+7.04%** |
+ | **Length Ratio** | 0.93 | - | Close to reference length |

+ **Statistical Validation:** All improvements are significant at p < 0.001 (paired t-test, n = 500)
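
+ As a rough guide, scores like these can be reproduced with the 🤗 `evaluate` library and SciPy. The sketch below is an illustrative reconstruction, not the exact evaluation script used for this model; `preds_finetuned`, `preds_baseline`, and `references` are assumed to be parallel lists of strings from the 500-sample test split.

+ ```python
+ # Illustrative evaluation sketch (variable names are assumptions)
+ import evaluate
+ import numpy as np
+ from scipy import stats
+
+ bleu = evaluate.load("bleu")        # results["precisions"][0] approximates BLEU-1
+ rouge = evaluate.load("rouge")      # rouge1 / rouge2 / rougeL
+ bertscore = evaluate.load("bertscore")
+
+ def bertscore_f1(predictions, references):
+     # Per-sample BERTScore F1, used both for the mean and the significance test
+     return bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
+
+ ft_f1 = bertscore_f1(preds_finetuned, references)
+ base_f1 = bertscore_f1(preds_baseline, references)
+ print("Fine-tuned BERTScore F1:", np.mean(ft_f1))
+ print("ROUGE:", rouge.compute(predictions=preds_finetuned, references=references))
+ print("BLEU:", bleu.compute(predictions=preds_finetuned, references=references))
+
+ # Paired t-test over the 500 per-sample scores (reported above as p < 0.001)
+ t_stat, p_value = stats.ttest_rel(ft_f1, base_f1)
+ print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
+ ```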

+ ### Loss Convergence

+ - Initial training loss: **0.29** → final: **0.12** (58% reduction)
+ - Initial validation loss: **0.24** → final: **0.13** (46% reduction)

+ ---
+ ## 🚀 Quick Start

+ ### Installation

+ ```bash
+ pip install transformers torch pillow accelerate
+ ```

+ ### Basic Usage

+ ```python
+ from transformers import Idefics3ForConditionalGeneration, AutoProcessor
+ from PIL import Image
+ import torch
+
+ # Load the fine-tuned model and its processor
+ model = Idefics3ForConditionalGeneration.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     torch_dtype=torch.float16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ processor = AutoProcessor.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     trust_remote_code=True
+ )
+
+ # Prepare the input: one image plus a text query in chat format
+ image = Image.open("scene.jpg")
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "What do you see?"}
+     ]
+ }]
+
+ # Generate a response
+ prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+ inputs = processor(text=prompt, images=[image], return_tensors="pt")
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}  # move tensors to the model's device
+
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=150,
+         do_sample=False,
+         pad_token_id=processor.tokenizer.eos_token_id,
+         eos_token_id=processor.tokenizer.eos_token_id
+     )
+
+ # Decode only the newly generated tokens, skipping the prompt
+ response = processor.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```

+ ---

+ ## 💡 Usage Examples

+ ### FOCUSED: Spatial Queries

+ ```python
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "Is there a chair to the left of the table?"}
+     ]
+ }]
+ # Output: "Yes, there is a chair to the left of the table."
+ ```

+ ### SCENE: Environment Description

+ ```python
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "Describe the scene in front of me."}
+     ]
+ }]
+ # Output: "The scene shows a living room with a brown sofa on the left,
+ # a wooden coffee table in the center, and a TV on the wall..."
+ ```

+ ### OCR: Text Reading

+ ```python
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "What text is on the sign?"}
+     ]
+ }]
+ # Output: "The sign says 'EXIT' in red letters."
+ ```
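
+ All three modes share the same chat format and differ only in the question, so they can be wrapped in a small convenience function. The helper below is an illustrative sketch, not part of the released code; it assumes the `model` and `processor` loaded in Basic Usage above.

+ ```python
+ # Hypothetical convenience wrapper around the loaded model/processor
+ import torch
+
+ def ask(image, question, max_new_tokens=150):
+     messages = [{
+         "role": "user",
+         "content": [{"type": "image"}, {"type": "text", "text": question}],
+     }]
+     prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+     inputs = processor(text=prompt, images=[image], return_tensors="pt")
+     inputs = {k: v.to(model.device) for k, v in inputs.items()}
+     with torch.no_grad():
+         outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
+     return processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+
+ # All three modes go through the same call:
+ print(ask(image, "Is there a chair to my left?"))        # FOCUSED
+ print(ask(image, "Describe the scene in front of me."))  # SCENE
+ print(ask(image, "What text is on the sign?"))           # OCR
+ ```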

+ ### Memory Optimization

+ ```python
+ # 8-bit quantization (reduces inference memory to roughly 2GB VRAM);
+ # requires the bitsandbytes package
+ from transformers import BitsAndBytesConfig
+
+ model = Idefics3ForConditionalGeneration.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
+     device_map="auto"
+ )
+
+ # Batch processing: one prompt and one image list per sample
+ inputs = processor(
+     text=[prompt1, prompt2, prompt3],
+     images=[[img1], [img2], [img3]],
+     return_tensors="pt",
+     padding=True
+ )
+ ```
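
+ If 8-bit loading is still too large, 4-bit NF4 loading is a further option. The snippet below is a generic bitsandbytes sketch and has not been specifically validated for this checkpoint, so output quality may degrade slightly.

+ ```python
+ # Optional: 4-bit NF4 loading for very low-VRAM inference (generic sketch,
+ # not validated for this checkpoint)
+ import torch
+ from transformers import Idefics3ForConditionalGeneration, BitsAndBytesConfig
+
+ model = Idefics3ForConditionalGeneration.from_pretrained(
+     "msaid1976/SmolVLM-Instruct-Navigation-FineTuned",
+     quantization_config=BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_quant_type="nf4",
+         bnb_4bit_compute_dtype=torch.float16,
+     ),
+     device_map="auto",
+ )
+ ```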

+ ---

+ ## 🛠️ Training Details

+ ### Configuration

+ | Parameter | Value | Description |
+ |-----------|-------|-------------|
+ | **Base Model** | SmolVLM-500M-Instruct | 500M parameters |
+ | **Method** | QLoRA | 4-bit quantization |
+ | **Trainable Params** | 42M (1.84%) | LoRA adapters only |
+ | **LoRA Rank** | 32 | Adapter dimension |
+ | **LoRA Alpha** | 64 | Scaling factor |
+ | **Epochs** | 3 | Full data passes |
+ | **Batch Size** | 1 (effective: 16) | With gradient accumulation |
+ | **Learning Rate** | 2e-5 | AdamW optimizer |
+ | **Precision** | BF16 | Mixed precision |
+ | **GPU** | RTX 5070 Ti 16GB | Training hardware |
+ | **Training Time** | ~20 hours | Total duration |
+ | **Peak VRAM** | 7-9GB | During training |
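
+ In PEFT terms, the table above corresponds roughly to the configuration sketched below. This is an assumed reconstruction from the listed hyperparameters, not the exact training script; in particular the target modules, dropout, and accumulation split are guesses.

+ ```python
+ # Assumed reconstruction of the QLoRA setup from the table above
+ import torch
+ from transformers import BitsAndBytesConfig
+ from peft import LoraConfig
+
+ # 4-bit base-model quantization (QLoRA)
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ # LoRA adapters: rank 32, alpha 64; target modules and dropout are guesses
+ lora_config = LoraConfig(
+     r=32,
+     lora_alpha=64,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     task_type="CAUSAL_LM",
+ )
+
+ # Trainer-level settings implied by the table: 3 epochs, effective batch size 16,
+ # learning rate 2e-5 with AdamW, bf16 mixed precision
+ training_kwargs = dict(
+     num_train_epochs=3,
+     per_device_train_batch_size=1,
+     gradient_accumulation_steps=16,
+     learning_rate=2e-5,
+     bf16=True,
+ )
+ ```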

+ ### Dataset

+ **Size:** 10,000+ samples across the three modes

+ **Sources:**
+ - GQA Enhanced (spatial reasoning)
+ - Localized Narratives (scene descriptions)
+ - Visual Genome (object relationships)
+ - TextCaps (text-in-image)
+ - VizWiz (accessibility focus)
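
+ Each source is presumably converted into the same image-plus-chat format used at inference time, with the target answer as the assistant turn. The sketch below only illustrates that idea; the field names (`image`, `question`, `answer`) are assumptions, as the released dataset schema is not documented here.

+ ```python
+ # Illustrative conversion of one raw sample into a chat-format training record.
+ # The input field names are assumptions, not the actual dataset schema.
+ def to_chat_sample(raw):
+     return {
+         "images": [raw["image"]],
+         "messages": [
+             {"role": "user", "content": [
+                 {"type": "image"},
+                 {"type": "text", "text": raw["question"]},  # FOCUSED / SCENE / OCR query
+             ]},
+             {"role": "assistant", "content": [
+                 {"type": "text", "text": raw["answer"]},
+             ]},
+         ],
+     }
+ ```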

+ ---

+ ## 💻 Hardware Requirements

+ | Use Case | GPU | RAM | Storage |
+ |----------|-----|-----|---------|
+ | **Inference** | 4GB+ VRAM | 8GB | 5GB |
+ | **Training** | 16GB VRAM | 32GB | 50GB |

+ **Recommended for Inference:** RTX 3060+ or equivalent
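
+ The 0.5-1 s inference figure can be sanity-checked on your own hardware with a quick timing loop such as the one below; it assumes the `model`, `processor`, and `image` from Basic Usage, and the numbers will vary with GPU, image resolution, and output length.

+ ```python
+ # Rough latency check using the model/processor/image from Basic Usage
+ import time
+ import torch
+
+ def time_inference(image, question, runs=5):
+     messages = [{"role": "user",
+                  "content": [{"type": "image"}, {"type": "text", "text": question}]}]
+     prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+     inputs = processor(text=prompt, images=[image], return_tensors="pt")
+     inputs = {k: v.to(model.device) for k, v in inputs.items()}
+     latencies = []
+     for _ in range(runs):
+         start = time.perf_counter()
+         with torch.no_grad():
+             model.generate(**inputs, max_new_tokens=50, do_sample=False)
+         latencies.append(time.perf_counter() - start)
+     return sum(latencies) / len(latencies)
+
+ print(f"Average latency over 5 runs: {time_inference(image, 'What do you see?'):.2f}s")
+ ```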

+ ---

+ ## ⚠️ Limitations

+ 1. **Scope:** Optimized for navigation; may underperform on general VQA
+ 2. **Image Quality:** Works best with well-lit, clear images
+ 3. **OCR:** Works best with printed text; struggles with handwriting
+ 4. **Speed:** Requires a GPU for real-time use (CPU: 10-20s per image)
+ 5. **Language:** English only

+ ### Safety Notice

+ ⚠️ **This is an assistive tool, not a replacement for traditional navigation aids.** Users should:
+ - Combine it with a cane, guide dog, or other mobility aids
+ - Exercise their own judgment
+ - Test it in safe environments first
+ - Be aware of potential errors

+ ---

+ ## 🎓 Model Card

+ ### Model Details

+ - **Type:** Vision-Language Model (Idefics3)
+ - **Parameters:** 500M total, 42M trainable (1.84%)
+ - **Input:** Image + text
+ - **Output:** Text
+ - **License:** Apache 2.0

+ ### Intended Use

+ **Primary:**
+ - Navigation assistance for blind and visually impaired users
+ - Spatial reasoning and object localization
+ - Scene understanding and description
+ - Text recognition in natural environments
+ - Accessibility research

+ **Out of Scope:**
+ - Medical diagnosis
+ - Autonomous navigation without human oversight
+ - Real-time video processing
+ - General-purpose VQA (use the base model instead)

+ ### Ethical Considerations

+ - Designed to enhance independence, not replace human judgment
+ - May carry biases from English-only training data
+ - Requires validation in real-world scenarios
+ - Processes images locally (no data collection)

+ ---

+ ## 📖 Citation

+ ```bibtex
+ @misc{alqahtani2025smolvlm_navigation,
+   author       = {Alqahtani, Muhammad Said},
+   title        = {SmolVLM Navigation Assistant: Fine-tuned for Blind Navigation},
+   year         = {2025},
+   publisher    = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned}}
+ }
+
+ @mastersthesis{alqahtani2025thesis,
+   author  = {Alqahtani, Muhammad Said},
+   title   = {An Efficient Multi-Object Detection and Smart Navigation Using Vision Language Models for Visually Impaired},
+   school  = {Asia Pacific University of Technology and Innovation},
+   year    = {2025},
+   address = {Kuala Lumpur, Malaysia}
+ }
+ ```

+ ---

+ ## 🙏 Acknowledgments

+ **Supervision:**
+ - Dr. Raheem Mafas (Research Supervisor)
+ - Asia Pacific University

+ **Technical:**
+ - Hugging Face team (base model & libraries)
+ - Unsloth (training framework)
+ - NVIDIA (GPU hardware)

+ **Datasets:**
+ - Stanford Visual Genome
+ - GQA, VizWiz, TextCaps
+ - Localized Narratives

+ ---

+ ## 📫 Contact

+ - **Author:** Mohammad Mohamed Said Aly Amin
+ - **Institution:** Asia Pacific University
+ - **Issues:** [Model Discussions](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned/discussions)

+ ---

+ <div align="center">

+ **Made with ❤️ for accessibility and inclusion**

+ [![HuggingFace](https://img.shields.io/badge/🤗-Model-yellow)](https://huggingface.co/msaid1976/SmolVLM-Instruct-Navigation-FineTuned)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

+ *Empowering independence through AI-powered vision assistance*

+ </div>