---
library_name: transformers
license: apache-2.0
language:
  - en
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
tags:
  - VLM
  - LLM
  - DriveFusion
  - Vision
  - MultiModal
pipeline_tag: image-text-to-text
---

# DriveFusionQA Model Card


*An Autonomous Driving Vision-Language Model for Scenario Understanding & Decision Reasoning.*



## 🚙 Model Description

DriveFusionQA is a specialized Vision-Language Model (VLM) fine-tuned to interpret complex driving scenes and explain vehicle decision-making. Built on the Qwen2.5-VL architecture, it bridges the gap between raw sensor data and human-understandable reasoning.

Unlike general-purpose models, DriveFusionQA is specifically optimized to answer the "why" behind driving maneuvers, making it an essential tool for safety analysis, simulation, and interactive driving support.

## 🔗 GitHub Repository

The full implementation, training scripts, and preprocessing logic are available in the project's GitHub repository.

## Core Capabilities

- **Scenario Explanation:** Identifies traffic participants, road signs, and environmental hazards.
- **Decision Reasoning:** Justifies driving actions (e.g., "Braking due to a pedestrian entering the crosswalk").
- **Multi-Dataset Expertise:** Leverages a unified pipeline of world-class driving benchmarks.
- **Interactive Dialogue:** Supports multi-turn conversations regarding road safety and navigation (see the multi-turn sketch after the Quick Start inference example).

## 📊 Model Performance

DriveFusionQA demonstrates significant improvements over the base model across all key driving-related language metrics. The substantial increase in Lingo-Judge scores reflects its superior ability to generate human-aligned driving reasoning.

| Model           | Lingo-Judge | METEOR | CIDEr  | BLEU   |
|-----------------|-------------|--------|--------|--------|
| DriveFusionQA   | 53.2        | 0.3327 | 0.1602 | 0.0853 |
| Qwen2.5-VL Base | 38.1        | 0.2577 | 0.1024 | 0.0259 |
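For rough comparability checks, the n-gram metrics above can be recomputed with the Hugging Face `evaluate` library. The snippet below is a minimal scoring sketch with made-up prediction and reference strings; it is not the exact evaluation pipeline behind the table (Lingo-Judge and CIDEr require their own dedicated tooling).

```python
# Minimal scoring sketch using the `evaluate` library (pip install evaluate nltk).
# The prediction/reference strings are made up for illustration; this is not the
# exact pipeline used to produce the numbers in the table above.
import evaluate

predictions = ["The car is braking because a pedestrian is entering the crosswalk."]
references = [["The vehicle slows down since a pedestrian steps onto the crosswalk."]]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=[r[0] for r in references]))
```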

## 📚 Training & Data

The model was trained using the DriveFusion Data Preprocessing pipeline, which standardizes diverse autonomous driving datasets into a unified format (a hedged sketch of such a unified record follows the list below).

**Key Datasets Included:**

- **LingoQA:** Action-focused scenery and decision components.
- **DriveGPT4 + BDD-X:** Human-like driving explanations and logic.
- **DriveLM:** Graph-based reasoning for autonomous driving.
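As a rough illustration of what a unified record can look like, the sketch below converts a single driving-QA sample into Qwen-style chat messages. The field names (`frames`, `question`, `answer`) and the helper `to_chat_sample` are illustrative assumptions, not the actual schema produced by the preprocessing repository.

```python
# Hedged sketch: convert one driving-QA sample into Qwen-style chat messages.
# The record fields ("frames", "question", "answer") are illustrative assumptions,
# not the actual schema produced by the DriveFusion preprocessing pipeline.
def to_chat_sample(record: dict) -> list[dict]:
    user_content = [{"type": "image", "image": path} for path in record["frames"]]
    user_content.append({"type": "text", "text": record["question"]})
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["answer"]},
    ]

sample = {
    "frames": ["scene_0001_front.jpg"],
    "question": "Why did the ego vehicle brake?",
    "answer": "Braking due to a pedestrian entering the crosswalk.",
}
print(to_chat_sample(sample))
```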

## 🚀 Quick Start

Ensure you have a recent version of the `transformers` library installed to support the Qwen2.5-VL architecture.

### Installation

```bash
pip install transformers accelerate pillow torch
```

### Inference Example

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "DriveFusion/DriveFusionQA"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load a driving scene
image = Image.open("driving_sample.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe the current driving scenario and any potential risks."},
        ],
    }
]

# Build the chat prompt and prepare model inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (drop the echoed prompt)
output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(response[0])
```
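
For the Interactive Dialogue capability, append earlier turns to `messages` and re-apply the chat template. The snippet below is a minimal multi-turn sketch; the follow-up question and the example assistant reply are illustrative, and it reuses the `model`, `processor`, and `image` objects from the example above.

```python
# Hedged multi-turn sketch: the assistant reply and follow-up question are
# illustrative; reuses `model`, `processor`, and `image` from the example above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is the ego vehicle doing right now?"},
        ],
    },
    {"role": "assistant", "content": "The ego vehicle is slowing down as it approaches a marked crosswalk."},
    {
        "role": "user",
        "content": [{"type": "text", "text": "Why is it slowing down instead of maintaining speed?"}],
    },
]

# Re-apply the chat template with the full history so the model sees prior turns
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```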

## 🛠 Intended Use

- **Safety Analysis:** Generating natural language reports for dashcam footage and near-miss events (see the batch-report sketch after this list).
- **Training & Simulation:** Providing ground-truth explanations for AI driver training.
- **Interactive Assistants:** Assisting human operators or passengers with scene descriptions.
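As one possible safety-analysis workflow, the sketch below batch-generates short risk reports for a folder of dashcam frames. It reuses the `model` and `processor` from the Quick Start; the folder name, prompt wording, and `risk_report` helper are illustrative assumptions rather than released tooling.

```python
# Hedged sketch: batch-generate short risk reports for a folder of dashcam frames.
# Assumes the `model` and `processor` objects from the Quick Start; the folder
# name and prompt wording are illustrative, not part of the released tooling.
from pathlib import Path
from PIL import Image

def risk_report(image_path: str) -> str:
    image = Image.open(image_path)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Summarize this scene and flag any near-miss or safety risk."},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    return processor.batch_decode(
        output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]

for frame in sorted(Path("dashcam_frames").glob("*.jpg")):
    print(frame.name, "->", risk_report(str(frame)))
```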

## ⚠️ Limitations

- **Hallucination:** Like all VLMs, it may occasionally misinterpret distant objects or complex social traffic cues.
- **Geographical Bias:** Performance may vary in regions or weather conditions not heavily represented in the training data.
- **Non-Control:** This model is for reasoning and explanation, not for direct vehicle control.