---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLM
- LLM
- DriveFusion
- Vision
- MultiModal
pipeline_tag: image-text-to-text
---

# DriveFusionQA Model Card

<div align="center">

<img src="drivefusion_logo.png" alt="DriveFusion Logo" width="300"/>

<h1>DriveFusionQA</h1>

<p><strong>An Autonomous Driving Vision-Language Model for Scenario Understanding & Decision Reasoning.</strong></p>

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Base Model: Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)

</div>

---

## Model Description

**DriveFusionQA** is a specialized Vision-Language Model (VLM) fine-tuned to interpret complex driving scenes and explain vehicle decision-making. Built on the **Qwen2.5-VL** architecture, it bridges the gap between raw sensor data and human-understandable reasoning.

Unlike general-purpose models, DriveFusionQA is specifically optimized to answer the "why" behind driving maneuvers, making it a practical tool for safety analysis, simulation, and interactive driving support.

## GitHub Repository

Find the full implementation, training scripts, and preprocessing logic here:

* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion)
* **Data Pipeline:** [DriveFusion/data-preprocessing](https://github.com/DriveFusion/data-preprocessing)

### Core Capabilities

* **Scenario Explanation:** Identifies traffic participants, road signs, and environmental hazards.
* **Decision Reasoning:** Justifies driving actions (e.g., "Braking due to a pedestrian entering the crosswalk").
* **Multi-Dataset Expertise:** Leverages a unified pipeline of world-class driving benchmarks.
* **Interactive Dialogue:** Supports multi-turn conversations regarding road safety and navigation.

---

## Model Performance

DriveFusionQA demonstrates significant improvements over the base model across all key driving-related language metrics. The substantial increase in **Lingo-Judge** scores reflects its superior ability to generate human-aligned driving reasoning.

| Model | Lingo-Judge | METEOR | CIDEr | BLEU |
| :--- | :---: | :---: | :---: | :---: |
| **DriveFusionQA** | **53.2** | **0.3327** | **0.1602** | **0.0853** |
| Qwen2.5-VL Base | 38.1 | 0.2577 | 0.1024 | 0.0259 |
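
As context for the table, BLEU scores n-gram overlap between a generated answer and a reference answer. The following is a minimal sentence-level sketch with uniform weights and crude smoothing; it is illustrative only and is not the evaluation script that produced the scores above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_grams, ref_grams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_grams & ref_grams).values())  # clipped matches
        total = max(sum(cand_grams.values()), 1)
        # Crude smoothing so a zero match at higher orders does not zero the score
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0, while partial word overlap yields a small positive score, which is why short-answer driving QA tends to produce the low absolute BLEU values seen in the table.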

---

## Training & Data

The model was trained using the [DriveFusion Data Preprocessing](https://github.com/DriveFusion/data-preprocessing) pipeline, which standardizes diverse autonomous driving datasets into a unified format.

**Key Datasets Included:**

* **LingoQA:** Action-focused scenery and decision components.
* **DriveGPT4 + BDD-X:** Human-like driving explanations and logic.
* **DriveLM:** Graph-based reasoning for autonomous driving.
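
The exact unified schema lives in the data-preprocessing repository; as a rough, hypothetical illustration (all field names here are invented for the example), each sample could be normalized into an image/question/answer record and rendered as Qwen-style chat messages for fine-tuning:

```python
# Hypothetical unified QA record; the actual schema is defined in the
# DriveFusion/data-preprocessing repository.
sample = {
    "source_dataset": "LingoQA",          # which benchmark the sample came from
    "images": ["frames/scene_0421.jpg"],  # one or more camera frames
    "question": "Why is the ego vehicle braking?",
    "answer": "A pedestrian is entering the crosswalk ahead.",
}

def to_chat(sample):
    """Convert a unified record into Qwen-style chat messages for supervised fine-tuning."""
    return [
        {"role": "user", "content": [
            *({"type": "image", "image": p} for p in sample["images"]),
            {"type": "text", "text": sample["question"]},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": sample["answer"]}]},
    ]
```

A shared record shape like this is what lets heterogeneous benchmarks (captioned video, graph QA, action explanations) be mixed in a single training run.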

---

## Quick Start

Ensure you have a recent `transformers` release installed that supports the Qwen2.5-VL architecture.

### Installation

```bash
pip install transformers accelerate pillow torch
```

### Inference Example

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "DriveFusion/DriveFusionQA"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load a driving scene
image = Image.open("driving_sample.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe the current driving scenario and any potential risks."},
        ],
    }
]

# Generate a response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the newly generated answer is decoded
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(response[0])
```

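For multi-turn use, append the assistant's reply to the message history and add a new user turn before re-applying the chat template. A structural sketch (the assistant text here is an invented placeholder, not real model output):

```python
# Hypothetical follow-up turn; the assistant answer is placeholder text.
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": "driving_sample.jpg"},
        {"type": "text", "text": "Describe the current driving scenario and any potential risks."},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "A pedestrian is approaching the crosswalk ahead; slowing down is advisable."},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "What should the vehicle do if the pedestrian stops at the curb?"},
    ]},
]

# Re-render the full history before each new generation, e.g.:
# text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
```

The image only needs to appear in the turn that introduced it; the template carries the full conversation context on each call.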
---

## Intended Use

* **Safety Analysis:** Generating natural-language reports for dashcam footage and near-miss events.
* **Training & Simulation:** Providing ground-truth explanations for AI driver training.
* **Interactive Assistants:** Assisting human operators or passengers with scene descriptions.

## ⚠️ Limitations

* **Hallucination:** Like all VLMs, it may occasionally misinterpret distant objects or complex social traffic cues.
* **Geographical Bias:** Performance may vary in regions or weather conditions not heavily represented in the training data.
* **Non-Control:** This model is for **reasoning and explanation**, not for direct vehicle control.