---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLA
- VLM
- LLM
- DriveFusion
- Vision
- MultiModal
pipeline_tag: image-text-to-text
---
# DriveFusion-V0.2 Model Card
<div align="center">
<img src="drivefusion_logo.png" alt="DriveFusion Logo" width="300"/>
<h1>DriveFusion-V0.2</h1>
<p><strong>A Multimodal Autonomous Driving Model: Vision + Language + GPS + Speed.</strong></p>
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
[Base Model: Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
</div>
---
## Model Overview
**DriveFusion-V0.2** is a multimodal model designed for autonomous vehicle applications. Unlike standard Vision-Language models, V0.2 integrates **telemetry data** (GPS and Speed) directly into the transformer architecture to perform dual tasks:
1. **Natural Language Reasoning**: Describing scenes and explaining driving decisions.
2. **Trajectory & Speed Prediction**: Outputting coordinates for future waypoints and target velocity profiles.
Built on the **Qwen2.5-VL** foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving.
## GitHub Repository
Find the full implementation, training scripts, and preprocessing logic here:
* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion)
* **Data Collection:** [DriveFusion/data-collection](https://github.com/DriveFusion/carla-data-collection.git)
### Core Features
- **Vision Processing**: Handles images and videos via a 32-layer Vision Transformer.
- **Context Fusion**: Custom `SpeedMLP` and `GPSTargetPointsMLP` integrate vehicle telemetry.
- **Predictive Heads**: Generates **20 trajectory waypoints** and **10 target speed values**.
- **Reasoning**: Full natural language generation for "Chain of Thought" driving explanations.
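The exact layer definitions for `SpeedMLP` and `GPSTargetPointsMLP` live in the GitHub repository; conceptually, each one projects raw telemetry values into an extra token in the transformer's hidden space. The sketch below is a minimal illustration, not the published implementation: the hidden size (2048) and intermediate widths are assumptions, and the real modules may differ in depth and activation.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # illustrative; the real value comes from the Qwen2.5-VL config


class SpeedMLP(nn.Module):
    """Projects a scalar speed reading into one telemetry token (hypothetical sketch)."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, hidden_size),
        )

    def forward(self, speed: torch.Tensor) -> torch.Tensor:
        # speed: (batch, 1) -> (batch, 1, hidden): one extra token in the sequence
        return self.net(speed).unsqueeze(1)


class GPSTargetPointsMLP(nn.Module):
    """Encodes a short lat/lon history into one telemetry token (hypothetical sketch)."""

    def __init__(self, num_points: int = 2, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * 2, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, hidden_size),
        )

    def forward(self, gps: torch.Tensor) -> torch.Tensor:
        # gps: (batch, num_points, 2), flattened -> (batch, 1, hidden)
        return self.net(gps.flatten(1)).unsqueeze(1)


speed_token = SpeedMLP()(torch.tensor([[30.5]]))
gps_token = GPSTargetPointsMLP()(torch.tensor([[[40.7128, -74.0060], [40.7130, -74.0058]]]))
print(speed_token.shape, gps_token.shape)  # both (1, 1, 2048)
```

Tokens produced this way can simply be concatenated with the visual and text embeddings before the transformer layers, which is the usual pattern for fusing low-dimensional context into a VLM.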
---
## Architecture
DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer.
<div align="left">
<img src="drivefusion_architectures.png" alt="DriveFusion Architecture" width="700"/>
<p><i>The DriveFusion-V0.2 Architecture: Integrating visual tokens with telemetry-encoded tokens for dual-head output.</i></p>
</div>
### Technical Specifications
- **Text Encoder**: Qwen2.5-VL (36 Transformer layers).
- **Vision Encoder**: 32-layer ViT with configurable patch sizes.
- **Driving MLPs**:
  - `TrajectoryMLP`: Generates batch × 20 × 2 waypoint coordinates.
  - `TargetSpeedMLP`: Generates batch × 10 × 2 velocity values.
- **Context Window**: 128k tokens.
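To make the output shapes above concrete, here is a minimal sketch of the two predictive heads. The class names and shapes follow the spec (batch × 20 × 2 waypoints, batch × 10 × 2 velocities), but the single-linear-layer internals and the 2048 hidden size are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # illustrative stand-in for the language model's hidden size


class TrajectoryMLP(nn.Module):
    """Maps a pooled hidden state to 20 (x, y) waypoints."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 20 * 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> (batch, 20, 2)
        return self.proj(h).view(-1, 20, 2)


class TargetSpeedMLP(nn.Module):
    """Maps the same pooled hidden state to 10 x 2 velocity values."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 10 * 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> (batch, 10, 2)
        return self.proj(h).view(-1, 10, 2)


h = torch.randn(4, HIDDEN_SIZE)  # pooled hidden state for a batch of 4
print(TrajectoryMLP()(h).shape)   # torch.Size([4, 20, 2])
print(TargetSpeedMLP()(h).shape)  # torch.Size([4, 10, 2])
```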
---
## Quick Start
Using DriveFusion-V0.2 requires the custom `DriveFusionProcessor` to handle the fusion of text, images, and telemetry.
### Installation
```bash
pip install transformers accelerate qwen-vl-utils torch
# The `drivefusion` Python package itself is provided by the GitHub repository linked above.
```
### Inference Example
```python
import torch
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor

# Load the model and its custom processor
model_id = "DriveFusion/DriveFusion-V0.2"
model = DriveFusionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = DriveFusionProcessor.from_pretrained(model_id)

# Define the input: image + prompt + telemetry
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]]  # Lat/Lon history
speed_context = [[30.5]]  # Current speed in m/s

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "highway_scene.jpg"},
        {"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."},
    ],
}]

# Fuse text, image, and telemetry into model inputs, then generate
inputs = processor(
    text=messages,
    images="highway_scene.jpg",
    gps=gps_context,
    speed=speed_context,
    return_tensors="pt",
).to("cuda")
output = model.generate(**inputs, max_new_tokens=512)

# The dual-head output contains both the reasoning text and the numeric predictions
print("Reasoning:", output["text"])
print("Predicted Trajectory (20 pts):", output["trajectory"])
print("Target Speeds:", output["target_speeds"])
```
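The `gps_context` above is raw global lat/lon. Driving models commonly convert such a history into metric offsets in a local frame before encoding it; whether DriveFusion's processor does this internally is not documented here, so the helper below (name and approach are our own, using an equirectangular approximation) is only a sketch of that preprocessing step.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def latlon_to_local_xy(history, origin=None):
    """Convert a lat/lon history to (east, north) offsets in meters,
    relative to the first point, via an equirectangular approximation.
    Adequate for the short distances a waypoint history covers.
    """
    if origin is None:
        origin = history[0]
    lat0, lon0 = math.radians(origin[0]), math.radians(origin[1])
    points = []
    for lat, lon in history:
        lat_r, lon_r = math.radians(lat), math.radians(lon)
        x = (lon_r - lon0) * math.cos(lat0) * EARTH_RADIUS_M  # east
        y = (lat_r - lat0) * EARTH_RADIUS_M                   # north
        points.append((x, y))
    return points


pts = latlon_to_local_xy([[40.7128, -74.0060], [40.7130, -74.0058]])
print(pts[0])  # (0.0, 0.0) — the first point is the local origin
print(pts[1])  # roughly 17 m east, 22 m north of it
```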
---
## Intended Use
- **End-to-End Autonomous Driving**: Acting as a primary planner or a redundant safety checker.
- **Explainable AI (XAI)**: Providing human-readable justifications for automated maneuvers.
- **Sim-to-Real Transfer**: Using the model as a sophisticated "expert" driver in simulated environments.