---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLM
- LLM
- DriveFusion
- Vision
- MultiModal
pipeline_tag: image-text-to-text
---

# DriveFusionQA Model Card
*(DriveFusion logo)*


*An Autonomous Driving Vision-Language Model for Scenario Understanding & Decision Reasoning.*

[![Model License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0) [![Base Model](https://img.shields.io/badge/Base%20Model-Qwen2.5--VL-blue)](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) [![Status](https://img.shields.io/badge/Status-Active-success.svg)]()
---

## 🚙 Model Description

**DriveFusionQA** is a specialized Vision-Language Model (VLM) fine-tuned to interpret complex driving scenes and explain vehicle decision-making. Built on the **Qwen2.5-VL** architecture, it bridges the gap between raw sensor data and human-understandable reasoning. Unlike general-purpose models, DriveFusionQA is specifically optimized to answer the "why" behind driving maneuvers, making it a useful tool for safety analysis, simulation, and interactive driving support.

## 🔗 GitHub Repository

Find the full implementation, training scripts, and preprocessing logic here:

* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion)
* **Data Pipeline:** [DriveFusion/data-preprocessing](https://github.com/DriveFusion/data-preprocessing)

### Core Capabilities

* **Scenario Explanation:** Identifies traffic participants, road signs, and environmental hazards.
* **Decision Reasoning:** Justifies driving actions (e.g., "Braking due to a pedestrian entering the crosswalk").
* **Multi-Dataset Expertise:** Trained on a unified pipeline spanning several established driving QA benchmarks.
* **Interactive Dialogue:** Supports multi-turn conversations about road safety and navigation.

---

## 📊 Model Performance

DriveFusionQA shows significant improvements over the base model across all evaluated driving-related language metrics. The substantial increase in the **Lingo-Judge** score reflects its stronger ability to generate human-aligned driving reasoning.

| Model | Lingo-Judge | METEOR | CIDEr | BLEU |
| :--- | :---: | :---: | :---: | :---: |
| **DriveFusionQA** | **53.2** | **0.3327** | **0.1602** | **0.0853** |
| Qwen2.5-VL Base | 38.1 | 0.2577 | 0.1024 | 0.0259 |

---

## 📚 Training & Data

The model was trained using the [DriveFusion Data Preprocessing](https://github.com/DriveFusion/data-preprocessing) pipeline, which standardizes diverse autonomous driving datasets into a unified format.

**Key Datasets Included:**

* **LingoQA:** Action-focused scenery and decision components.
* **DriveGPT4 + BDD-X:** Human-like driving explanations and logic.
* **DriveLM:** Graph-based reasoning for autonomous driving.

---

## 🚀 Quick Start

Ensure you have a recent `transformers` release installed that supports the Qwen2.5-VL architecture.

### Installation

```bash
pip install transformers accelerate pillow torch
```

### Inference Example

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "DriveFusion/DriveFusionQA"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load a driving scene
image = Image.open("driving_sample.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe the current driving scenario and any potential risks."},
        ],
    }
]

# Build the prompt and generate a response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)

# Trim the prompt tokens so only the newly generated answer is decoded
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
response = processor.batch_decode(trimmed_ids, skip_special_tokens=True)
print(response[0])
```
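### Multi-Turn Dialogue Example

Since multi-turn dialogue is listed as a core capability, a follow-up question can be asked by appending the first answer to the message history and generating again. The snippet below is a minimal sketch that continues from the example above; the follow-up prompt is illustrative, not a documented part of the model's API.

```python
# Append the assistant's first answer, then ask a follow-up about the same scene
messages.append({"role": "assistant", "content": [{"type": "text", "text": response[0]}]})
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "Given those risks, should the vehicle slow down? Explain why."}],
})

# Re-apply the chat template over the full history; the image is still the one from turn 1
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0])
```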
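### Safety Report Sketch

For the safety-analysis use case described under Intended Use below, one option is to pass several key frames from a clip as a multi-image prompt, since Qwen2.5-VL accepts multiple images in a single turn. The sketch below reuses `model` and `processor` from the Quick Start example and assumes the frames (`frame_0.jpg` through `frame_3.jpg`) have already been extracted from the footage; the file names and prompt are illustrative.

```python
from PIL import Image

# Hypothetical pre-extracted key frames from a dashcam clip, in chronological order
frame_paths = ["frame_0.jpg", "frame_1.jpg", "frame_2.jpg", "frame_3.jpg"]
frames = [Image.open(p) for p in frame_paths]

report_messages = [
    {
        "role": "user",
        "content": [
            *[{"type": "image", "image": f} for f in frames],
            {"type": "text", "text": "These frames are in chronological order. "
                                     "Write a short incident report: what happened, "
                                     "what risks were present, and whether the ego "
                                     "vehicle's response was appropriate."},
        ],
    }
]

text = processor.apply_chat_template(report_messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=frames, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0])
```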
---

## 🛠 Intended Use

* **Safety Analysis:** Generating natural-language reports for dashcam footage and near-miss events.
* **Training & Simulation:** Providing ground-truth explanations for AI driver training.
* **Interactive Assistants:** Assisting human operators or passengers with scene descriptions.

## ⚠️ Limitations

* **Hallucination:** Like all VLMs, it may occasionally misinterpret distant objects or complex social traffic cues.
* **Geographical Bias:** Performance may vary in regions or weather conditions not well represented in the training data.
* **Non-Control:** This model is for **reasoning and explanation**, not for direct vehicle control.