---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLA
- VLM
- LLM
- DriveFusion
- Vision
- MultiModal
pipeline_tag: image-text-to-text
---
# DriveFusion-V0.2 Model Card
<div align="center">
<img src="drivefusion_logo.png" alt="DriveFusion Logo" width="300"/>
<h1>DriveFusion-V0.2</h1>
<p><strong>A Multimodal Autonomous Driving Model: Vision + Language + GPS + Speed.</strong></p>
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
[Base Model: Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
</div>
---
## Model Overview
**DriveFusion-V0.2** is a multimodal model designed for autonomous vehicle applications. Unlike standard Vision-Language models, V0.2 integrates **telemetry data** (GPS and Speed) directly into the transformer architecture to perform dual tasks:
1. **Natural Language Reasoning**: Describing scenes and explaining driving decisions.
2. **Trajectory & Speed Prediction**: Outputting coordinates for future waypoints and target velocity profiles.
Built on the **Qwen2.5-VL** foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving.
## GitHub Repository
Find the full implementation, training scripts, and preprocessing logic here:
* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion)
* **Data Collection:** [DriveFusion/data-collection](https://github.com/DriveFusion/carla-data-collection.git)
### Core Features
- **Vision Processing**: Handles images and videos via a 32-layer Vision Transformer.
- **Context Fusion**: Custom `SpeedMLP` and `GPSTargetPointsMLP` integrate vehicle telemetry.
- **Predictive Heads**: Generates **20 trajectory waypoints** and **10 target speed values**.
- **Reasoning**: Full natural language generation for "Chain of Thought" driving explanations.
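The exact layer definitions for `SpeedMLP` and `GPSTargetPointsMLP` live in the GitHub repository; conceptually, each one projects raw telemetry values into an extra token in the transformer's hidden space. The sketch below is a minimal illustration, not the published implementation: the hidden size (2048) and intermediate widths are assumptions, and the real modules may differ in depth and activation.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # illustrative; the real value comes from the Qwen2.5-VL config


class SpeedMLP(nn.Module):
    """Projects a scalar speed reading into one telemetry token (hypothetical sketch)."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, hidden_size),
        )

    def forward(self, speed: torch.Tensor) -> torch.Tensor:
        # speed: (batch, 1) -> (batch, 1, hidden): one extra token in the sequence
        return self.net(speed).unsqueeze(1)


class GPSTargetPointsMLP(nn.Module):
    """Encodes a short lat/lon history into one telemetry token (hypothetical sketch)."""

    def __init__(self, num_points: int = 2, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * 2, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, hidden_size),
        )

    def forward(self, gps: torch.Tensor) -> torch.Tensor:
        # gps: (batch, num_points, 2), flattened -> (batch, 1, hidden)
        return self.net(gps.flatten(1)).unsqueeze(1)


speed_token = SpeedMLP()(torch.tensor([[30.5]]))
gps_token = GPSTargetPointsMLP()(torch.tensor([[[40.7128, -74.0060], [40.7130, -74.0058]]]))
print(speed_token.shape, gps_token.shape)  # both (1, 1, 2048)
```

Tokens produced this way can simply be concatenated with the visual and text embeddings before the transformer layers, which is the usual pattern for fusing low-dimensional context into a VLM.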
---
## Architecture
DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer.
<div align="left">
<img src="drivefusion_architectures.png" alt="DriveFusion Architecture" width="700"/>
<p><i>The DriveFusion-V0.2 Architecture: Integrating visual tokens with telemetry-encoded tokens for dual-head output.</i></p>
</div>
### Technical Specifications
- **Text Encoder**: Qwen2.5-VL (36 Transformer layers).
- **Vision Encoder**: 32-layer ViT with configurable patch sizes.
- **Driving MLPs**:
  - `TrajectoryMLP`: Generates batch × 20 × 2 waypoint coordinates.
  - `TargetSpeedMLP`: Generates batch × 10 × 2 velocity values.
- **Context Window**: 128k tokens.
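To make the output shapes above concrete, here is a minimal sketch of the two predictive heads. The class names and shapes follow the spec (batch × 20 × 2 waypoints, batch × 10 × 2 velocities), but the single-linear-layer internals and the 2048 hidden size are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # illustrative stand-in for the language model's hidden size


class TrajectoryMLP(nn.Module):
    """Maps a pooled hidden state to 20 (x, y) waypoints."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 20 * 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> (batch, 20, 2)
        return self.proj(h).view(-1, 20, 2)


class TargetSpeedMLP(nn.Module):
    """Maps the same pooled hidden state to 10 x 2 velocity values."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 10 * 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> (batch, 10, 2)
        return self.proj(h).view(-1, 10, 2)


h = torch.randn(4, HIDDEN_SIZE)  # pooled hidden state for a batch of 4
print(TrajectoryMLP()(h).shape)   # torch.Size([4, 20, 2])
print(TargetSpeedMLP()(h).shape)  # torch.Size([4, 10, 2])
```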
---
## Quick Start
Using DriveFusion-V0.2 requires the custom `DriveFusionProcessor` to handle the fusion of text, images, and telemetry.
### Installation
```bash
pip install transformers accelerate qwen-vl-utils torch
# The `drivefusion` Python package itself is provided by the GitHub repository linked above.
```
### Inference Example
```python
import torch
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor

# Load the model and its custom processor
model_id = "DriveFusion/DriveFusion-V0.2"
model = DriveFusionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = DriveFusionProcessor.from_pretrained(model_id)

# Define the input: image + prompt + telemetry
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]]  # Lat/Lon history
speed_context = [[30.5]]  # Current speed in m/s

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "highway_scene.jpg"},
        {"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."},
    ],
}]

# Fuse text, image, and telemetry into model inputs, then generate
inputs = processor(
    text=messages,
    images="highway_scene.jpg",
    gps=gps_context,
    speed=speed_context,
    return_tensors="pt",
).to("cuda")
output = model.generate(**inputs, max_new_tokens=512)

# The dual-head output contains both the reasoning text and the numeric predictions
print("Reasoning:", output["text"])
print("Predicted Trajectory (20 pts):", output["trajectory"])
print("Target Speeds:", output["target_speeds"])
```
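The `gps_context` above is raw global lat/lon. Driving models commonly convert such a history into metric offsets in a local frame before encoding it; whether DriveFusion's processor does this internally is not documented here, so the helper below (name and approach are our own, using an equirectangular approximation) is only a sketch of that preprocessing step.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def latlon_to_local_xy(history, origin=None):
    """Convert a lat/lon history to (east, north) offsets in meters,
    relative to the first point, via an equirectangular approximation.
    Adequate for the short distances a waypoint history covers.
    """
    if origin is None:
        origin = history[0]
    lat0, lon0 = math.radians(origin[0]), math.radians(origin[1])
    points = []
    for lat, lon in history:
        lat_r, lon_r = math.radians(lat), math.radians(lon)
        x = (lon_r - lon0) * math.cos(lat0) * EARTH_RADIUS_M  # east
        y = (lat_r - lat0) * EARTH_RADIUS_M                   # north
        points.append((x, y))
    return points


pts = latlon_to_local_xy([[40.7128, -74.0060], [40.7130, -74.0058]])
print(pts[0])  # (0.0, 0.0) — the first point is the local origin
print(pts[1])  # roughly 17 m east, 22 m north of it
```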
---
## Intended Use
- **End-to-End Autonomous Driving**: Acting as a primary planner or a redundant safety checker.
- **Explainable AI (XAI)**: Providing human-readable justifications for automated maneuvers.
- **Sim-to-Real Transfer**: Using the model as a sophisticated "expert" driver in simulated environments.