---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLA
- VLM
- LLM
- DriveFusion
- Vision
- MultiModal
pipeline_tag: image-text-to-text
---
# DriveFusion-V0.2 Model Card

## Model Overview
DriveFusion-V0.2 is a multimodal model designed for autonomous vehicle applications. Unlike standard Vision-Language models, V0.2 integrates telemetry data (GPS and speed) directly into the transformer architecture to perform two complementary tasks:
- Natural Language Reasoning: Describing scenes and explaining driving decisions.
- Trajectory & Speed Prediction: Outputting coordinates for future waypoints and target velocity profiles.
Built on the Qwen2.5-VL foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving.
## GitHub Repository
Find the full implementation, training scripts, and preprocessing logic here:
- Main Model Code: DriveFusion/drivefusion
- Data Collection: DriveFusion/data-collection
## Core Features
- Vision Processing: Handles images and videos via a 32-layer Vision Transformer.
- Context Fusion: Custom `SpeedMLP` and `GPSTargetPointsMLP` modules integrate vehicle telemetry.
- Predictive Heads: Generates 20 trajectory waypoints and 10 target speed values.
- Reasoning: Full natural language generation for "Chain of Thought" driving explanations.
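The telemetry fusion above can be sketched roughly as follows. The module names `SpeedMLP` and `GPSTargetPointsMLP` come from the feature list, but the layer sizes and the token-concatenation strategy shown here are assumptions for illustration, not the released implementation:

```python
# Hypothetical sketch of DriveFusion-style telemetry fusion: speed and GPS
# readings are projected into the transformer's hidden dimension and appended
# to the sequence as extra context tokens. (Layer sizes are illustrative.)
import torch
import torch.nn as nn


class SpeedMLP(nn.Module):
    """Projects a scalar speed reading into one hidden-dim token."""

    def __init__(self, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, speed: torch.Tensor) -> torch.Tensor:
        # speed: (batch, 1) -> (batch, 1, hidden_dim)
        return self.net(speed).unsqueeze(1)


class GPSTargetPointsMLP(nn.Module):
    """Projects each (lat, lon) pair into one hidden-dim token."""

    def __init__(self, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, gps: torch.Tensor) -> torch.Tensor:
        # gps: (batch, n_points, 2) -> (batch, n_points, hidden_dim)
        return self.net(gps)


# Telemetry tokens could then be concatenated with visual/text embeddings:
speed_tok = SpeedMLP(64)(torch.tensor([[30.5]]))       # (1, 1, 64)
gps_tok = GPSTargetPointsMLP(64)(torch.rand(1, 2, 2))  # (1, 2, 64)
fused = torch.cat([speed_tok, gps_tok], dim=1)         # (1, 3, 64)
```

In a real pipeline these fused tokens would sit alongside the visual tokens in the language model's input sequence, so attention can condition both reasoning and prediction on the vehicle's physical state.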
## Architecture
DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer.
*The DriveFusion-V0.2 architecture: integrating visual tokens with telemetry-encoded tokens for dual-head output.*
### Technical Specifications
- Text Encoder: Qwen2.5-VL (36 Transformer layers).
- Vision Encoder: 32-layer ViT with configurable patch sizes.
- Driving MLPs:
  - `TrajectoryMLP`: generates batch × 20 × 2 coordinates.
  - `TargetSpeedMLP`: generates batch × 10 × 2 velocity values.
- Context Window: 128k tokens.
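The dual prediction heads can be illustrated with a minimal sketch. The output shapes (batch × 20 × 2 and batch × 10 × 2) follow the specification above, but the internal layer structure is an assumption, not the released code:

```python
# Illustrative sketch of the two prediction heads. Shapes match the spec
# (20 waypoints x 2 coordinates, 10 speed entries x 2 values); the MLP
# internals are assumed for demonstration.
import torch
import torch.nn as nn


class TrajectoryMLP(nn.Module):
    """Maps a pooled hidden state to 20 future (x, y) waypoints."""

    def __init__(self, hidden_dim: int = 2048, n_waypoints: int = 20):
        super().__init__()
        self.n_waypoints = n_waypoints
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, n_waypoints * 2),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -> (batch, 20, 2)
        return self.net(h).view(-1, self.n_waypoints, 2)


class TargetSpeedMLP(nn.Module):
    """Maps a pooled hidden state to 10 pairs of velocity values."""

    def __init__(self, hidden_dim: int = 2048, n_speeds: int = 10):
        super().__init__()
        self.n_speeds = n_speeds
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, n_speeds * 2),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -> (batch, 10, 2)
        return self.net(h).view(-1, self.n_speeds, 2)


h = torch.randn(4, 128)                   # pooled hidden states, batch of 4
traj = TrajectoryMLP(hidden_dim=128)(h)   # (4, 20, 2)
spd = TargetSpeedMLP(hidden_dim=128)(h)   # (4, 10, 2)
```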
## Quick Start
Using DriveFusion-V0.2 requires the custom `DriveFusionProcessor` to handle the fusion of text, images, and telemetry.
### Installation
```bash
pip install transformers accelerate qwen-vl-utils torch
```
### Inference Example
```python
import torch
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor

# Load model and processor
model_id = "DriveFusion/DriveFusion-V0.2"
model = DriveFusionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = DriveFusionProcessor.from_pretrained(model_id)

# Define input: image + prompt + telemetry
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]]  # lat/lon history
speed_context = [[30.5]]  # current speed in m/s

message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "highway_scene.jpg"},
        {"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."},
    ],
}]

# Generate
inputs = processor(
    text=message,
    images="highway_scene.jpg",
    gps=gps_context,
    speed=speed_context,
    return_tensors="pt",
).to("cuda")
output = model.generate(**inputs, max_new_tokens=512)

# Results
print("Reasoning:", output["text"])
print("Predicted Trajectory (20 pts):", output["trajectory"])
print("Target Speeds:", output["target_speeds"])
```
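Once you have the predicted waypoints, simple geometric post-processing is often useful. A minimal sketch, assuming the trajectory arrives as a (20, 2) array of planar waypoints in metres (the coordinate frame and units are not specified by this card):

```python
# Post-processing sketch: cumulative path length of a predicted trajectory.
# Assumes waypoints are (x, y) positions in metres.
import numpy as np


def path_length(waypoints: np.ndarray) -> float:
    """Sum of Euclidean distances between consecutive waypoints."""
    deltas = np.diff(waypoints, axis=0)  # (N-1, 2) segment vectors
    return float(np.linalg.norm(deltas, axis=1).sum())


# Demo: a straight line of 20 points spaced 1 m apart -> 19 m total.
demo = np.stack([np.arange(20.0), np.zeros(20)], axis=1)
print(path_length(demo))  # 19.0
```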
## Intended Use
- End-to-End Autonomous Driving: Acting as a primary planner or a redundant safety checker.
- Explainable AI (XAI): Providing human-readable justifications for automated maneuvers.
- Sim-to-Real Transfer: Using the model as a sophisticated "expert" driver in simulated environments.