---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLA
- VLM
- LLM
- DriveFusion
- Vision
- MultiModal
pipeline_tag: image-text-to-text
---

# DriveFusion-V0.2 Model Card
*A Multimodal Autonomous Driving Model: Vision + Language + GPS + Speed.*

[![Model License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0) [![Base Model](https://img.shields.io/badge/Base%20Model-Qwen2.5--VL-blue)](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) [![Status](https://img.shields.io/badge/Status-Active-success.svg)]()
---

## 🚀 Model Overview

**DriveFusion-V0.2** is a multimodal model designed for autonomous vehicle applications. Unlike standard Vision-Language models, V0.2 integrates **telemetry data** (GPS and Speed) directly into the transformer architecture to perform dual tasks:

1. **Natural Language Reasoning**: Describing scenes and explaining driving decisions.
2. **Trajectory & Speed Prediction**: Outputting coordinates for future waypoints and target velocity profiles.

Built on the **Qwen2.5-VL** foundation, DriveFusion-V0.2 adds specialized MLP heads to fuse physical context with visual features, enabling a comprehensive "world model" for driving.

## 🔗 GitHub Repository

Find the full implementation, training scripts, and preprocessing logic here:

* **Main Model Code:** [DriveFusion/drivefusion](https://github.com/DriveFusion/drivefusion)
* **Data Collection:** [DriveFusion/data-collection](https://github.com/DriveFusion/carla-data-collection.git)

### Core Features

- **Vision Processing**: Handles images and videos via a 32-layer Vision Transformer.
- **Context Fusion**: Custom `SpeedMLP` and `GPSTargetPointsMLP` modules integrate vehicle telemetry.
- **Predictive Heads**: Generates **20 trajectory waypoints** and **10 target speed values**.
- **Reasoning**: Full natural language generation for "Chain of Thought" driving explanations.

---

## 🏗 Architecture

DriveFusion-V0.2 extends the Qwen2.5-VL architecture with a modular "Driving Intelligence" layer.
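Conceptually, the telemetry-encoded tokens are spliced into the same sequence as the visual and text tokens before the language model. The snippet below is a minimal sketch of that concatenation only; all shapes (token counts, hidden size) are hypothetical and not specified by this card:

```python
import torch

B, HIDDEN = 1, 2048  # hypothetical batch size and hidden dimension

vision_tokens = torch.randn(B, 256, HIDDEN)  # stand-in for ViT output tokens
text_tokens = torch.randn(B, 32, HIDDEN)     # stand-in for embedded prompt tokens
speed_token = torch.randn(B, 1, HIDDEN)      # one token from a SpeedMLP-style encoder
gps_token = torch.randn(B, 1, HIDDEN)        # one token from a GPSTargetPointsMLP-style encoder

# Telemetry tokens are appended to the multimodal sequence fed to the LM
fused = torch.cat([vision_tokens, speed_token, gps_token, text_tokens], dim=1)
print(fused.shape)  # (1, 290, 2048)
```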
*Figure: The DriveFusion-V0.2 architecture, integrating visual tokens with telemetry-encoded tokens for dual-head output.*
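The module names below (`SpeedMLP`, `GPSTargetPointsMLP`, `TrajectoryMLP`, `TargetSpeedMLP`) come from this card, but their internals are not published here, so the following PyTorch sketch is purely illustrative: it assumes a 2048-dim hidden state and simple two-layer encoders, and only demonstrates the input/output shapes the card describes.

```python
import torch
import torch.nn as nn

HIDDEN = 2048  # assumed LM hidden size (hypothetical)

class SpeedMLP(nn.Module):
    """Projects a scalar speed reading into one token in the LM embedding space."""
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, hidden))
    def forward(self, speed):                     # speed: (B, 1)
        return self.net(speed).unsqueeze(1)       # (B, 1, hidden)

class GPSTargetPointsMLP(nn.Module):
    """Encodes a short lat/lon history into one telemetry token."""
    def __init__(self, n_points=2, hidden=HIDDEN):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_points * 2, hidden), nn.GELU(), nn.Linear(hidden, hidden))
    def forward(self, gps):                       # gps: (B, n_points, 2)
        return self.net(gps.flatten(1)).unsqueeze(1)  # (B, 1, hidden)

class TrajectoryMLP(nn.Module):
    """Decodes a pooled hidden state into 20 (x, y) waypoints."""
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        self.net = nn.Linear(hidden, 20 * 2)
    def forward(self, h):                         # h: (B, hidden)
        return self.net(h).view(-1, 20, 2)        # (B, 20, 2)

class TargetSpeedMLP(nn.Module):
    """Decodes a pooled hidden state into 10 target-speed pairs."""
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        self.net = nn.Linear(hidden, 10 * 2)
    def forward(self, h):
        return self.net(h).view(-1, 10, 2)        # (B, 10, 2)

# Shape check with dummy tensors
speed_tok = SpeedMLP()(torch.tensor([[30.5]]))
gps_tok = GPSTargetPointsMLP()(torch.randn(1, 2, 2))
pooled = torch.randn(1, HIDDEN)                   # stand-in for the LM's pooled state
traj = TrajectoryMLP()(pooled)
speeds = TargetSpeedMLP()(pooled)
print(speed_tok.shape, gps_tok.shape, traj.shape, speeds.shape)
```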

### Technical Specifications

- **Text Encoder**: Qwen2.5-VL (36 Transformer layers).
- **Vision Encoder**: 32-layer ViT with configurable patch sizes.
- **Driving MLPs**:
  - `TrajectoryMLP`: Generates batch × 20 × 2 coordinates.
  - `TargetSpeedMLP`: Generates batch × 10 × 2 velocity values.
- **Context Window**: 128k tokens.

---

## 🚀 Quick Start

Using DriveFusion-V0.2 requires the custom `DriveFusionProcessor` to handle the fusion of text, images, and telemetry.

### Installation

```bash
pip install transformers accelerate qwen-vl-utils torch
```

### Inference Example

```python
import torch

from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor

# Load the model and processor
model_id = "DriveFusion/DriveFusion-V0.2"
model = DriveFusionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = DriveFusionProcessor.from_pretrained(model_id)

# Define input: image + prompt + telemetry
gps_context = [[40.7128, -74.0060], [40.7130, -74.0058]]  # Lat/Lon history
speed_context = [[30.5]]  # Current speed in m/s

message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "highway_scene.jpg"},
        {"type": "text", "text": "Analyze the scene and predict the next trajectory based on our current speed."},
    ],
}]

# Generate
inputs = processor(
    text=message,
    images="highway_scene.jpg",
    gps=gps_context,
    speed=speed_context,
    return_tensors="pt",
).to("cuda")
output = model.generate(**inputs, max_new_tokens=512)

# Results
print("Reasoning:", output["text"])
print("Predicted Trajectory (20 pts):", output["trajectory"])
print("Target Speeds:", output["target_speeds"])
```

---

## 🛠 Intended Use

- **End-to-End Autonomous Driving**: Acting as a primary planner or a redundant safety checker.
- **Explainable AI (XAI)**: Providing human-readable justifications for automated maneuvers.
- **Sim-to-Real Transfer**: Using the model as a sophisticated "expert" driver in simulated environments.
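For downstream use, the trajectory head's output (the 20 × 2 waypoint array shown in the inference example) can be post-processed into planner-friendly quantities such as path length and heading changes. The snippet below uses a synthetic curve as a stand-in for `output["trajectory"]` and assumes the waypoints are (x, y) positions in the ego frame, which this card does not explicitly state:

```python
import numpy as np

# Stand-in for output["trajectory"]: 20 (x, y) waypoints in the ego frame,
# here a gentle left curve (hypothetical data for illustration)
x = np.linspace(0, 20, 20)
trajectory = np.stack([x, 0.02 * x ** 2], axis=1)  # shape (20, 2)

# Segment lengths and headings between consecutive waypoints
deltas = np.diff(trajectory, axis=0)                      # (19, 2)
seg_len = np.linalg.norm(deltas, axis=1)                  # metres per segment
headings = np.degrees(np.arctan2(deltas[:, 1], deltas[:, 0]))

path_length = seg_len.sum()
max_turn = np.abs(np.diff(headings)).max()                # largest heading change (deg)

print(f"Path length: {path_length:.1f} m, max heading change: {max_turn:.1f} deg")
```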