---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLA
- VLM
- LLM
- DriveFusion
- Vision
- MultiModal
pipeline_tag: image-text-to-text
---
# DriveFusion-V0.2 Model Card
<div align="center">
<img src="drivefusion_logo.png" alt="DriveFusion Logo" width="300"/>
<h1>DriveFusion-V0.2</h1>
<p><strong>A Multimodal Autonomous Driving Model: Vision + Language + GPS + Speed.</strong></p>
[![Model License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base%20Model-Qwen2.5--VL-blue)](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
[![Status](https://img.shields.io/badge/Status-Active-success.svg)]()
</div>
---
## πŸš€ Model Overview
**DriveFusion-V0.2** is a multimodal model designed for autonomous vehicle applications. Unlike standard Vision-Language models, V0.2 integrates **telemetry data** (GPS and Speed) directly into the transformer architecture to perform dual tasks:
1. **Natural Language Reasoning**: Describing scenes and explaining driving decisions.
2. **Trajectory & Speed Prediction**: Outputting coordinates for future waypoints and target velocity profiles.
Built on the **Qwen2.5-VL** foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving.
## πŸ”— GitHub Repository
Find the full implementation, training scripts, and preprocessing logic here:
* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion)
* **Data Collection:** [DriveFusion/data-collection](https://github.com/DriveFusion/carla-data-collection.git)
### Core Features
- **Vision Processing**: Handles images and videos via a 32-layer Vision Transformer.
- **Context Fusion**: Custom `SpeedMLP` and `GPSTargetPointsMLP` integrate vehicle telemetry.
- **Predictive Heads**: Generates **20 trajectory waypoints** and **10 target speed values**.
- **Reasoning**: Full natural language generation for "Chain of Thought" driving explanations.
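
The fusion MLPs above are not published with signatures, but the idea of projecting telemetry into the language model's embedding space can be sketched as follows. The layer sizes, activation, and the hidden size of 2048 are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # assumed LLM embedding width


class SpeedMLP(nn.Module):
    """Sketch: project a scalar speed reading into one telemetry token."""

    def __init__(self, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, speed):    # speed: (batch, 1), e.g. m/s
        return self.net(speed)   # (batch, hidden_size) token embedding


class GPSTargetPointsMLP(nn.Module):
    """Sketch: encode a short lat/lon history into one telemetry token."""

    def __init__(self, num_points=2, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * 2, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, gps):              # gps: (batch, num_points, 2)
        return self.net(gps.flatten(1))  # (batch, hidden_size)


speed_tok = SpeedMLP()(torch.tensor([[30.5]]))
gps_tok = GPSTargetPointsMLP()(torch.rand(1, 2, 2))
print(speed_tok.shape, gps_tok.shape)
```

The resulting telemetry tokens would then be concatenated with the visual and text token sequence before the transformer layers.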
---
## πŸ— Architecture
DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer.
<div align="left">
<img src="drivefusion_architectures.png" alt="DriveFusion Architecture" width="700"/>
<p><i>The DriveFusion-V0.2 Architecture: Integrating visual tokens with telemetry-encoded tokens for dual-head output.</i></p>
</div>
### Technical Specifications
- **Text Encoder**: Qwen2.5-VL (36 Transformer layers).
- **Vision Encoder**: 32-layer ViT with configurable patch sizes.
- **Driving MLPs**:
- `TrajectoryMLP`: Generates batch Γ— 20 Γ— 2 coordinates.
- `TargetSpeedMLP`: Generates batch Γ— 10 Γ— 2 velocity values.
- **Context Window**: 128k tokens.
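
The two output heads can be sketched as simple projections from the model's final hidden state to the shapes listed above. This is a minimal illustration assuming a hidden size of 2048 and a single linear projection per head; the released heads may be deeper:

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # assumed LLM hidden width


class TrajectoryMLP(nn.Module):
    """Sketch: decode a hidden state into 20 (x, y) waypoints."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN_SIZE, 20 * 2)

    def forward(self, h):                    # h: (batch, HIDDEN_SIZE)
        return self.proj(h).view(-1, 20, 2)  # (batch, 20, 2)


class TargetSpeedMLP(nn.Module):
    """Sketch: decode a hidden state into 10 x 2 target-speed values."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN_SIZE, 10 * 2)

    def forward(self, h):
        return self.proj(h).view(-1, 10, 2)  # (batch, 10, 2)


h = torch.rand(1, HIDDEN_SIZE)
traj = TrajectoryMLP()(h)
speeds = TargetSpeedMLP()(h)
print(traj.shape, speeds.shape)
```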
---
## πŸš€ Quick Start
Using DriveFusion-V0.2 requires the custom `DriveFusionProcessor` to handle the fusion of text, images, and telemetry.
### Installation
```bash
pip install transformers accelerate qwen-vl-utils torch
```
The custom `drivefusion` package (model and processor classes) is provided in the GitHub repository linked above.
### Inference Example
```python
import torch

from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor

# Load the model and its custom processor
model_id = "DriveFusion/DriveFusion-V0.2"
model = DriveFusionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = DriveFusionProcessor.from_pretrained(model_id)

# Define input: image + prompt + telemetry
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]]  # lat/lon history
speed_context = [[30.5]]  # current speed in m/s

message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "highway_scene.jpg"},
        {"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."},
    ],
}]

# Generate
inputs = processor(
    text=message,
    images="highway_scene.jpg",
    gps=gps_context,
    speed=speed_context,
    return_tensors="pt",
).to("cuda")
output = model.generate(**inputs, max_new_tokens=512)

# Results
print("Reasoning:", output["text"])
print("Predicted Trajectory (20 pts):", output["trajectory"])
print("Target Speeds:", output["target_speeds"])
```
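Downstream code can consume the predicted waypoints directly. As a small illustration, assuming each waypoint is an `[x, y]` pair in ego-frame metres (the units are not specified in this card), the total path length of a predicted trajectory can be computed like this:

```python
import math

def path_length(waypoints):
    """Total distance along a trajectory given as a list of [x, y] points."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:])
    )

# Dummy straight-line trajectory: 20 waypoints spaced 1 m apart along x.
traj = [[float(i), 0.0] for i in range(20)]
print(path_length(traj))  # 19.0
```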
---
## πŸ›  Intended Use
- **End-to-End Autonomous Driving**: Acting as a primary planner or a redundant safety checker.
- **Explainable AI (XAI)**: Providing human-readable justifications for automated maneuvers.
- **Sim-to-Real Transfer**: Using the model as a sophisticated "expert" driver in simulated environments.