|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
|
tags: |
|
|
- VLA |
|
|
- VLM |
|
|
- LLM |
|
|
- DriveFusion |
|
|
- Vision |
|
|
- MultiModal |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
|
|
|
# DriveFusion-V0.2 Model Card |
|
|
|
|
|
<div align="center"> |
|
|
<img src="drivefusion_logo.png" alt="DriveFusion Logo" width="300"/> |
|
|
<h1>DriveFusion-V0.2</h1> |
|
|
<p><strong>A Multimodal Autonomous Driving Model: Vision + Language + GPS + Speed.</strong></p> |
|
|
|
|
|
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
|
|
[![Base Model: Qwen2.5-VL-3B-Instruct](https://img.shields.io/badge/Base_Model-Qwen2.5--VL--3B--Instruct-orange)](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview
|
|
|
|
|
**DriveFusion-V0.2** is a multimodal model designed for autonomous vehicle applications. Unlike standard vision-language models, V0.2 integrates **telemetry data** (GPS and speed) directly into the transformer architecture to perform two tasks:
|
|
1. **Natural Language Reasoning**: Describing scenes and explaining driving decisions. |
|
|
2. **Trajectory & Speed Prediction**: Outputting coordinates for future waypoints and target velocity profiles. |
|
|
|
|
|
Built on the **Qwen2.5-VL** foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving. |
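
The telemetry fusion described above can be sketched as two small projection MLPs whose outputs are appended to the token sequence. This is a minimal illustration, not the released implementation: the module names `SpeedMLP` and `GPSTargetPointsMLP` come from this card, but the layer sizes (hidden size 2048, matching Qwen2.5-VL-3B) and internals are assumptions.

```python
import torch
import torch.nn as nn

class SpeedMLP(nn.Module):
    """Sketch: projects a scalar speed reading into the LLM embedding space."""
    def __init__(self, hidden_size=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, speed):  # speed: (batch, 1)
        return self.net(speed).unsqueeze(1)  # (batch, 1, hidden)

class GPSTargetPointsMLP(nn.Module):
    """Sketch: projects a (lat, lon) target point into the LLM embedding space."""
    def __init__(self, hidden_size=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, gps):  # gps: (batch, 2)
        return self.net(gps).unsqueeze(1)  # (batch, 1, hidden)

# Telemetry tokens are concatenated with the visual/text token sequence.
speed_tok = SpeedMLP()(torch.tensor([[30.5]]))
gps_tok = GPSTargetPointsMLP()(torch.tensor([[40.7128, -74.0060]]))
fused = torch.cat([speed_tok, gps_tok], dim=1)  # (1, 2, hidden)
```

Encoding each telemetry reading as one extra token lets the transformer attend to physical context the same way it attends to image patches.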
|
|
|
|
|
## GitHub Repository
|
|
|
|
|
Find the full implementation, training scripts, and preprocessing logic here: |
|
|
* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion) |
|
|
* **Data Collection:** [DriveFusion/data-collection](https://github.com/DriveFusion/carla-data-collection.git) |
|
|
|
|
|
### Core Features |
|
|
- **Vision Processing**: Handles images and videos via a 32-layer Vision Transformer. |
|
|
- **Context Fusion**: Custom `SpeedMLP` and `GPSTargetPointsMLP` integrate vehicle telemetry. |
|
|
- **Predictive Heads**: Generates **20 trajectory waypoints** and **10 target speed values**. |
|
|
- **Reasoning**: Full natural language generation for "Chain of Thought" driving explanations. |
|
|
|
|
|
--- |
|
|
|
|
|
## Architecture
|
|
|
|
|
DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer. |
|
|
|
|
|
<div align="left"> |
|
|
<img src="drivefusion_architectures.png" alt="DriveFusion Architecture" width="700"/> |
|
|
<p><i>The DriveFusion-V0.2 Architecture: Integrating visual tokens with telemetry-encoded tokens for dual-head output.</i></p> |
|
|
</div> |
|
|
|
|
|
### Technical Specifications |
|
|
- **Language Backbone**: Qwen2.5-VL (36 transformer layers).
|
|
- **Vision Encoder**: 32-layer ViT with configurable patch sizes. |
|
|
- **Driving MLPs**: |
|
|
  - `TrajectoryMLP`: Generates batch × 20 × 2 coordinates.
|
|
  - `TargetSpeedMLP`: Generates batch × 10 × 2 velocity values.
|
|
- **Context Window**: 128k tokens. |
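
The two prediction heads can be sketched directly from the shapes above. The head names and output shapes (batch × 20 × 2 and batch × 10 × 2) come from this card; the layer structure and the assumption that each head reads a pooled 2048-dimensional decoder state are illustrative only.

```python
import torch
import torch.nn as nn

class TrajectoryMLP(nn.Module):
    """Sketch: maps a pooled decoder state to 20 (x, y) waypoints."""
    def __init__(self, hidden_size=2048, num_waypoints=20):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_waypoints * 2),
        )

    def forward(self, h):  # h: (batch, hidden)
        return self.head(h).view(-1, self.num_waypoints, 2)

class TargetSpeedMLP(nn.Module):
    """Sketch: maps a pooled decoder state to 10 velocity value pairs."""
    def __init__(self, hidden_size=2048, num_steps=10):
        super().__init__()
        self.num_steps = num_steps
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_steps * 2),
        )

    def forward(self, h):  # h: (batch, hidden)
        return self.head(h).view(-1, self.num_steps, 2)

h = torch.randn(4, 2048)        # pooled decoder states for a batch of 4
traj = TrajectoryMLP()(h)       # (4, 20, 2)
speeds = TargetSpeedMLP()(h)    # (4, 10, 2)
```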
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Start
|
|
|
|
|
Using DriveFusion-V0.2 requires the custom `DriveFusionProcessor` to handle the fusion of text, images, and telemetry. |
|
|
|
|
|
### Installation |
|
|
```bash |
|
|
pip install transformers accelerate qwen-vl-utils torch
# DriveFusion package (provides DriveFusionForConditionalGeneration / DriveFusionProcessor),
# installable from the GitHub repository above:
pip install git+https://github.com/DriveFusion/drivefusion.git
|
|
``` |
|
|
|
|
|
### Inference Example |
|
|
```python |
|
|
import torch |
|
|
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor |
|
|
|
|
|
# Load Model |
|
|
model_id = "DriveFusion/DriveFusion-V0.2" |
|
|
model = DriveFusionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
|
|
processor = DriveFusionProcessor.from_pretrained(model_id) |
|
|
|
|
|
# Define Input: Image + Prompt + Telemetry |
|
|
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]] # Lat/Lon history |
|
|
speed_context = [[30.5]] # Current speed in m/s |
|
|
|
|
|
message = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image", "image": "highway_scene.jpg"}, |
|
|
{"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."} |
|
|
] |
|
|
}] |
|
|
|
|
|
# Generate |
|
|
inputs = processor(
    text=message,
    images="highway_scene.jpg",
    gps=gps_context,
    speed=speed_context,
    return_tensors="pt",
).to("cuda")
|
|
output = model.generate(**inputs, max_new_tokens=512) |
|
|
|
|
|
# Results |
|
|
print("Reasoning:", output["text"]) |
|
|
print("Predicted Trajectory (20 pts):", output["trajectory"]) |
|
|
print("Target Speeds:", output["target_speeds"]) |
|
|
``` |
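
Once the trajectory head has produced waypoints, a quick sanity check is to measure the cumulative path length of the predicted horizon. The snippet below is a standalone illustration with hypothetical waypoint data; it does not depend on the model output format above.

```python
import math

# Hypothetical prediction for illustration: 20 (x, y) waypoints in
# ego-frame metres, curving gently to the left.
trajectory = [(0.5 * i, 0.02 * i * i) for i in range(20)]

def path_length(points):
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

length_m = path_length(trajectory)
```

Checks like this (path length, maximum curvature, per-step displacement versus current speed) are a cheap way to reject implausible predictions before they reach a planner.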
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use
|
|
- **End-to-End Autonomous Driving**: Acting as a primary planner or a redundant safety checker. |
|
|
- **Explainable AI (XAI)**: Providing human-readable justifications for automated maneuvers. |
|
|
- **Sim-to-Real Transfer**: Using the model as a sophisticated "expert" driver in simulated environments. |