DriveFusion-V0.2 Model Card


DriveFusion-V0.2

A Multimodal Autonomous Driving Model: Vision + Language + GPS + Speed.



πŸš€ Model Overview

DriveFusion-V0.2 is a multimodal model designed for autonomous vehicle applications. Unlike standard Vision-Language models, V0.2 integrates telemetry data (GPS and Speed) directly into the transformer architecture to perform dual tasks:

  1. Natural Language Reasoning: Describing scenes and explaining driving decisions.
  2. Trajectory & Speed Prediction: Outputting coordinates for future waypoints and target velocity profiles.

Built on the Qwen2.5-VL foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving.

πŸ”— GitHub Repository

Find the full implementation, training scripts, and preprocessing logic here:

Core Features

  • Vision Processing: Handles images and videos via a 32-layer Vision Transformer.
  • Context Fusion: Custom SpeedMLP and GPSTargetPointsMLP integrate vehicle telemetry.
  • Predictive Heads: Generates 20 trajectory waypoints and 10 target speed values.
  • Reasoning: Full natural language generation for "Chain of Thought" driving explanations.
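The card names SpeedMLP and GPSTargetPointsMLP but does not publish their internals. A minimal sketch of how such telemetry encoders could project raw readings into the backbone's token space follows; the layer widths, activation, and hidden size are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class SpeedMLP(nn.Module):
    """Projects a scalar speed reading into one hidden-space token.
    hidden_size is an assumed value, not confirmed by the card."""
    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, hidden_size),
        )

    def forward(self, speed: torch.Tensor) -> torch.Tensor:
        # speed: (batch, 1) -> one telemetry token of shape (batch, 1, hidden)
        return self.net(speed).unsqueeze(1)

class GPSTargetPointsMLP(nn.Module):
    """Encodes a short lat/lon history into one hidden-space token."""
    def __init__(self, num_points: int = 2, hidden_size: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * 2, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, hidden_size),
        )

    def forward(self, gps: torch.Tensor) -> torch.Tensor:
        # gps: (batch, num_points, 2) -> (batch, 1, hidden)
        return self.net(gps.flatten(1)).unsqueeze(1)
```

Encoders of this shape would let telemetry be appended to the visual and text token sequence before the transformer stack, which matches the fusion behaviour the card describes.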

πŸ— Architecture

DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer.


The DriveFusion-V0.2 Architecture: Integrating visual tokens with telemetry-encoded tokens for dual-head output.

Technical Specifications

  • Language Backbone: Qwen2.5-VL decoder (36 Transformer layers).
  • Vision Encoder: 32-layer ViT with configurable patch sizes.
  • Driving MLPs:
    • TrajectoryMLP: outputs a (batch, 20, 2) tensor of waypoint coordinates.
    • TargetSpeedMLP: outputs a (batch, 10, 2) tensor of velocity values.
  • Context Window: 128k tokens.
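The prediction heads reduce to simple projections from the backbone's hidden state to the fixed output shapes above. A hedged sketch, assuming the heads read a pooled hidden vector (the pooling strategy and hidden size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TrajectoryMLP(nn.Module):
    """Projects a pooled hidden state to (batch, 20, 2) waypoint coordinates."""
    def __init__(self, hidden_size: int = 2048, num_waypoints: int = 20):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_waypoints * 2),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> (batch, num_waypoints, 2)
        return self.head(h).view(-1, self.num_waypoints, 2)

class TargetSpeedMLP(nn.Module):
    """Projects a pooled hidden state to (batch, 10, 2) velocity values."""
    def __init__(self, hidden_size: int = 2048, num_steps: int = 10):
        super().__init__()
        self.num_steps = num_steps
        self.head = nn.Linear(hidden_size, num_steps * 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> (batch, num_steps, 2)
        return self.head(h).view(-1, self.num_steps, 2)
```

A `view` over a single linear output is the conventional way to emit fixed-length regression targets from a transformer hidden state; the released heads may differ in depth and width.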

πŸš€ Quick Start

Using DriveFusion-V0.2 requires the custom DriveFusionProcessor to handle the fusion of text, images, and telemetry.

Installation

pip install transformers accelerate qwen-vl-utils torch

Inference Example

import torch
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor

# Load Model
model_id = "DriveFusion/DriveFusion-V0.2"
model = DriveFusionForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
processor = DriveFusionProcessor.from_pretrained(model_id)

# Define Input: Image + Prompt + Telemetry
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]] # Lat/Lon history
speed_context = [[30.5]] # Current speed in m/s

message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "highway_scene.jpg"},
        {"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."}
    ]
}]

# Generate
inputs = processor(text=message, images="highway_scene.jpg", gps=gps_context, speed=speed_context, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=512)

# Results
print("Reasoning:", output["text"])
print("Predicted Trajectory (20 pts):", output["trajectory"])
print("Target Speeds:", output["target_speeds"])
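The example passes raw lat/lon pairs as GPS context. The repository is said to include preprocessing logic; a common step for driving models (an assumption here, not a confirmed detail of DriveFusion) is converting lat/lon history into local metric offsets relative to the ego vehicle:

```python
import math

def latlon_to_local_xy(history, ref=None):
    """Convert (lat, lon) points to metres relative to a reference point
    (by default the last, i.e. ego, position) using a simple
    equirectangular approximation. Illustrative only; the released
    preprocessing may differ."""
    ref_lat, ref_lon = ref if ref is not None else history[-1]
    earth_radius_m = 6371000.0
    out = []
    for lat, lon in history:
        # East/west offset shrinks with latitude via cos(ref_lat)
        x = math.radians(lon - ref_lon) * earth_radius_m * math.cos(math.radians(ref_lat))
        # North/south offset is directly proportional to delta-latitude
        y = math.radians(lat - ref_lat) * earth_radius_m
        out.append((x, y))
    return out

local_pts = latlon_to_local_xy([[40.7128, -74.0060], [40.7130, -74.0058]])
```

This keeps trajectory targets in a small, well-scaled numeric range, which is generally easier for regression heads than raw geodetic coordinates.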

πŸ›  Intended Use

  • End-to-End Autonomous Driving: Acting as a primary planner or a redundant safety checker.
  • Explainable AI (XAI): Providing human-readable justifications for automated maneuvers.
  • Sim-to-Real Transfer: Using the model as a sophisticated "expert" driver in simulated environments.
Model size: 4B parameters (Safetensors, F16).