|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
|
tags: |
|
|
- VLA |
|
|
- VLM |
|
|
- LLM |
|
|
- DriveFusion |
|
|
- Vision |
|
|
- MultiModal |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
|
|
|
# DriveFusion-V0.2 Model Card |
|
|
|
|
|
<div align="center"> |
|
|
<img src="drivefusion_logo.png" alt="DriveFusion Logo" width="300"/> |
|
|
<h1>DriveFusion-V0.2</h1> |
|
|
<p><strong>A Multimodal Autonomous Driving Model: Vision + Language + GPS + Speed.</strong></p> |
|
|
|
|
|
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
|
|
[![Base Model: Qwen2.5-VL-3B-Instruct](https://img.shields.io/badge/Base_Model-Qwen2.5--VL--3B--Instruct-orange)](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview
|
|
|
|
|
**DriveFusion-V0.2** is a multimodal model designed for autonomous vehicle applications. Unlike standard vision-language models, V0.2 integrates **telemetry data** (GPS and speed) directly into the transformer architecture to perform two tasks:
|
|
1. **Natural Language Reasoning**: Describing scenes and explaining driving decisions. |
|
|
2. **Trajectory & Speed Prediction**: Outputting coordinates for future waypoints and target velocity profiles. |
|
|
|
|
|
Built on the **Qwen2.5-VL** foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving. |
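
The telemetry fusion described above can be sketched as two small projection MLPs whose outputs are appended to the token sequence. This is a minimal illustration, not the released implementation: the module names `SpeedMLP` and `GPSTargetPointsMLP` come from this card, but the layer sizes (hidden size 2048, matching Qwen2.5-VL-3B) and internals are assumptions.

```python
import torch
import torch.nn as nn

class SpeedMLP(nn.Module):
    """Sketch: projects a scalar speed reading into the LLM embedding space."""
    def __init__(self, hidden_size=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, speed):  # speed: (batch, 1)
        return self.net(speed).unsqueeze(1)  # (batch, 1, hidden)

class GPSTargetPointsMLP(nn.Module):
    """Sketch: projects a (lat, lon) target point into the LLM embedding space."""
    def __init__(self, hidden_size=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, gps):  # gps: (batch, 2)
        return self.net(gps).unsqueeze(1)  # (batch, 1, hidden)

# Telemetry tokens are concatenated with the visual/text token sequence.
speed_tok = SpeedMLP()(torch.tensor([[30.5]]))
gps_tok = GPSTargetPointsMLP()(torch.tensor([[40.7128, -74.0060]]))
fused = torch.cat([speed_tok, gps_tok], dim=1)  # (1, 2, hidden)
```

Encoding each telemetry reading as one extra token lets the transformer attend to physical context the same way it attends to image patches.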
|
|
|
|
|
## GitHub Repository
|
|
|
|
|
Find the full implementation, training scripts, and preprocessing logic here: |
|
|
* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion) |
|
|
* **Data Collection:** [DriveFusion/data-collection](https://github.com/DriveFusion/carla-data-collection.git) |
|
|
|
|
|
### Core Features |
|
|
- **Vision Processing**: Handles images and videos via a 32-layer Vision Transformer. |
|
|
- **Context Fusion**: Custom `SpeedMLP` and `GPSTargetPointsMLP` integrate vehicle telemetry. |
|
|
- **Predictive Heads**: Generates **20 trajectory waypoints** and **10 target speed values**. |
|
|
- **Reasoning**: Full natural language generation for "Chain of Thought" driving explanations. |
|
|
|
|
|
--- |
|
|
|
|
|
## Architecture
|
|
|
|
|
DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer. |
|
|
|
|
|
<div align="left"> |
|
|
<img src="drivefusion_architectures.png" alt="DriveFusion Architecture" width="700"/> |
|
|
<p><i>The DriveFusion-V0.2 Architecture: Integrating visual tokens with telemetry-encoded tokens for dual-head output.</i></p> |
|
|
</div> |
|
|
|
|
|
### Technical Specifications |
|
|
- **Language Backbone**: Qwen2.5-VL (36 transformer layers).
|
|
- **Vision Encoder**: 32-layer ViT with configurable patch sizes. |
|
|
- **Driving MLPs**: |
|
|
  - `TrajectoryMLP`: Generates batch × 20 × 2 coordinates.
|
|
  - `TargetSpeedMLP`: Generates batch × 10 × 2 velocity values.
|
|
- **Context Window**: 128k tokens. |
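
The two prediction heads can be sketched directly from the shapes above. The head names and output shapes (batch × 20 × 2 and batch × 10 × 2) come from this card; the layer structure and the assumption that each head reads a pooled 2048-dimensional decoder state are illustrative only.

```python
import torch
import torch.nn as nn

class TrajectoryMLP(nn.Module):
    """Sketch: maps a pooled decoder state to 20 (x, y) waypoints."""
    def __init__(self, hidden_size=2048, num_waypoints=20):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_waypoints * 2),
        )

    def forward(self, h):  # h: (batch, hidden)
        return self.head(h).view(-1, self.num_waypoints, 2)

class TargetSpeedMLP(nn.Module):
    """Sketch: maps a pooled decoder state to 10 velocity value pairs."""
    def __init__(self, hidden_size=2048, num_steps=10):
        super().__init__()
        self.num_steps = num_steps
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_steps * 2),
        )

    def forward(self, h):  # h: (batch, hidden)
        return self.head(h).view(-1, self.num_steps, 2)

h = torch.randn(4, 2048)        # pooled decoder states for a batch of 4
traj = TrajectoryMLP()(h)       # (4, 20, 2)
speeds = TargetSpeedMLP()(h)    # (4, 10, 2)
```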
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Start
|
|
|
|
|
Using DriveFusion-V0.2 requires the custom `DriveFusionProcessor` to handle the fusion of text, images, and telemetry. |
|
|
|
|
|
### Installation |
|
|
```bash |
|
|
pip install transformers accelerate qwen-vl-utils torch
# DriveFusion package (provides DriveFusionForConditionalGeneration / DriveFusionProcessor),
# installable from the GitHub repository above:
pip install git+https://github.com/DriveFusion/drivefusion.git
|
|
``` |
|
|
|
|
|
### Inference Example |
|
|
```python |
|
|
import torch |
|
|
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor |
|
|
|
|
|
# Load Model |
|
|
model_id = "DriveFusion/DriveFusion-V0.2" |
|
|
model = DriveFusionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
|
|
processor = DriveFusionProcessor.from_pretrained(model_id) |
|
|
|
|
|
# Define Input: Image + Prompt + Telemetry |
|
|
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]] # Lat/Lon history |
|
|
speed_context = [[30.5]] # Current speed in m/s |
|
|
|
|
|
message = [{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image", "image": "highway_scene.jpg"}, |
|
|
{"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."} |
|
|
] |
|
|
}] |
|
|
|
|
|
# Generate |
|
|
inputs = processor(
    text=message,
    images="highway_scene.jpg",
    gps=gps_context,
    speed=speed_context,
    return_tensors="pt",
).to("cuda")
|
|
output = model.generate(**inputs, max_new_tokens=512) |
|
|
|
|
|
# Results |
|
|
print("Reasoning:", output["text"]) |
|
|
print("Predicted Trajectory (20 pts):", output["trajectory"]) |
|
|
print("Target Speeds:", output["target_speeds"]) |
|
|
``` |
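
Once the trajectory head has produced waypoints, a quick sanity check is to measure the cumulative path length of the predicted horizon. The snippet below is a standalone illustration with hypothetical waypoint data; it does not depend on the model output format above.

```python
import math

# Hypothetical prediction for illustration: 20 (x, y) waypoints in
# ego-frame metres, curving gently to the left.
trajectory = [(0.5 * i, 0.02 * i * i) for i in range(20)]

def path_length(points):
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

length_m = path_length(trajectory)
```

Checks like this (path length, maximum curvature, per-step displacement versus current speed) are a cheap way to reject implausible predictions before they reach a planner.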
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use
|
|
- **End-to-End Autonomous Driving**: Acting as a primary planner or a redundant safety checker. |
|
|
- **Explainable AI (XAI)**: Providing human-readable justifications for automated maneuvers. |
|
|
- **Sim-to-Real Transfer**: Using the model as a sophisticated "expert" driver in simulated environments. |