---
language:
- en
tags:
- autonomous-vehicles
- driving
- representation-learning
- multi-task-learning
- computer-vision
- safety
license: mit
---
# DriveBench: General-Purpose Driving Scene Encoder

**Author:** Nikhil Upadhyay | MSc Business Analytics | Dublin Business School

**Project:** [PRECOG-AV](https://github.com/TrazeMaG/PRECOG-AV)
|
|
## Overview

DriveBench is the first general-purpose driving scene encoder trained with
safety-focused multi-task supervision across **25 countries and 298,326 real
driving clips**, the largest geographic scale in driving representation learning.

Each clip is encoded into a **256-dimensional DriveBench embedding** that
simultaneously captures danger context, geographic driving patterns,
time-of-day risk, radar sensor health, and traffic density.
Use these embeddings the way you would use ImageNet features, but for driving scenes.
|
|
## Results

| Task | Metric | Score | Random Baseline |
|------|--------|-------|-----------------|
| Danger Anticipation | AUC | **0.8385** | 0.500 |
| Geographic Region | Accuracy | **0.4438** | 0.167 (6 classes) |
| Time of Day | Accuracy | **0.5168** | 0.250 (4 classes) |
| Radar Health | AUC | **1.0000** | 0.500 |
| TTC Regression | Pearson r | **0.3009** | 0.000 |

All scores are measured on clips from Greece and Bulgaria, two countries never seen during training.
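
For readers who want to reproduce the two less familiar metrics in the table, both can be computed without any dependencies. The sketch below is not the author's evaluation code. It is a minimal pure-Python illustration of ROC AUC (the probability that a random positive outscores a random negative, with ties counted as half) and of Pearson r, run on invented toy data.

```python
from math import sqrt

def roc_auc(labels, scores):
    """AUC as P(random positive scores higher than random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data, invented for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
print(auc)
```

A scorer that assigns every clip the same value gets an AUC of 0.5 under this definition, which is where the 0.500 random baseline in the table comes from.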
|
|
## What makes this different

Existing driving pre-training approaches (DriveWorld, DriveTok, GASP) use geometric
proxy tasks such as depth prediction, occupancy, and reconstruction, on data from 1 to 3 cities.

DriveBench uses **safety-relevant supervision signals** across **25 countries**:
- Danger labels from physics-based TTC analysis (not manual annotation)
- Radar sensor health as a training signal
- Geographic region (6 regions, 25 countries)
- Time-of-day risk patterns (peak danger confirmed at 13:00-15:00)
- Traffic density
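
This card does not spell out the TTC analysis behind the danger labels. As a rough illustration of how a physics-based label can be derived without manual annotation, the sketch below uses the textbook constant-velocity definition (TTC = gap / closing speed) and flags a clip as dangerous when its minimum TTC falls below a threshold. The function names, the 2.0 s threshold, and the sample values are assumptions made for illustration, not the procedure used to build PRECOG-Labels.

```python
import math

def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC in seconds; infinite when the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed

def danger_label(samples, ttc_threshold_s=2.0):
    """Return 1 if any (gap, ego speed, lead speed) sample has TTC below the threshold."""
    min_ttc = min(time_to_collision(*s) for s in samples)
    return int(min_ttc < ttc_threshold_s)

# Hypothetical clip: the ego vehicle closes on a slower lead vehicle.
clip = [(30.0, 20.0, 15.0), (20.0, 20.0, 12.0), (12.0, 20.0, 10.0)]
ttcs = [time_to_collision(*s) for s in clip]  # 6.0 s, 2.5 s, 1.2 s
label = danger_label(clip)                    # 1, because 1.2 s < 2.0 s
print(ttcs, label)
```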
|
|
## Architecture

ViT-B/16 features (5 frames × 768-dim)

↓

TransformerEncoder (3 layers, 8 heads, 2048 FFN)

↓

DriveBench Embedding (256-dim) ← use this downstream

↓

5 multi-task heads:
- Danger head → AUC 0.84
- Region head → Acc 0.44 (6 regions)
- Time-of-day head → Acc 0.52 (4 buckets)
- Radar head → AUC 1.00
- TTC regression head → r = 0.30
|
|
## Usage

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


class DriveBenchModel(nn.Module):
    """Temporal encoder: per-frame ViT-B/16 features in, one clip embedding out."""

    # n_regions is kept for signature compatibility; the encoder path below does not use it.
    def __init__(self, embed_dim=256, n_frames=5, n_regions=6):
        super().__init__()
        # Learned CLS token plus one position per frame (and one for the CLS slot).
        self.cls_token = nn.Parameter(torch.randn(1, 1, 768))
        self.pos_embed = nn.Embedding(n_frames + 1, 768)
        layer = nn.TransformerEncoderLayer(
            d_model=768, nhead=8, dim_feedforward=2048,
            dropout=0.1, batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)
        self.norm = nn.LayerNorm(768)
        # Project the CLS output down to the 256-dim DriveBench embedding.
        self.projector = nn.Sequential(
            nn.Linear(768, 512), nn.GELU(), nn.Dropout(0.15),
            nn.Linear(512, embed_dim), nn.LayerNorm(embed_dim))

    def encode(self, x):
        """x: (batch, n_frames, 768) -> (batch, embed_dim)."""
        B = x.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)
        pos = torch.arange(x.shape[1], device=x.device)
        x = x + self.pos_embed(pos)
        x = self.norm(self.transformer(x))
        return self.projector(x[:, 0])  # embedding is read from the CLS position


path = hf_hub_download("Trazemag/DriveBench", "drivebench_best.pt")
model = DriveBenchModel()
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
ckpt = torch.load(path, map_location="cpu", weights_only=False)
# If the checkpoint also stores task-head weights, load_state_dict will report
# unexpected keys; passing strict=False skips them.
model.load_state_dict(ckpt["model_state"])
model.eval()

# Input:  (batch, 5, 768) ViT-B/16 features from 5 consecutive frames
# Output: (batch, 256) DriveBench embedding
# Use as features for any downstream driving task
with torch.no_grad():
    features = torch.randn(2, 5, 768)   # placeholder for real ViT-B/16 features
    embedding = model.encode(features)  # shape: (2, 256)
```
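
To get a feel for the size of the encoder defined above, its parameter count can be worked out by hand from the layer shapes. The sketch below assumes PyTorch's standard layout for `nn.TransformerEncoderLayer` (a fused Q/K/V projection, an output projection, two feed-forward linears, and two LayerNorms, all with biases). It counts only the modules in the `DriveBenchModel` class as written in this card; any task heads stored in the checkpoint are not included.

```python
def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(dim):
    """Scale plus shift vectors."""
    return 2 * dim

d_model, d_ffn, n_layers, n_frames, embed_dim = 768, 2048, 3, 5, 256

encoder_layer = (
    linear(d_model, 3 * d_model)   # fused Q, K, V projection
    + linear(d_model, d_model)     # attention output projection
    + linear(d_model, d_ffn)       # feed-forward expansion
    + linear(d_ffn, d_model)       # feed-forward contraction
    + 2 * layer_norm(d_model)      # the two per-layer norms
)

total = (
    d_model                        # learned CLS token
    + (n_frames + 1) * d_model     # positional embedding table
    + n_layers * encoder_layer
    + layer_norm(d_model)          # final norm after the transformer
    + linear(d_model, 512)         # projector, first linear
    + linear(512, embed_dim)       # projector, second linear
    + layer_norm(embed_dim)        # projector output norm
)
print(f"{total:,} parameters")  # 17,074,432
```

At roughly 17M parameters, the temporal encoder is small next to the roughly 86M parameters of a ViT-B/16 backbone, so most of the inference cost sits in extracting the per-frame features.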
|
|
## Pre-computed Embeddings

All 298,326 embeddings are already computed; download and use them directly:

```python
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "Trazemag/DriveBench-Embeddings",
    "drivebench_embeddings.npz",
    repo_type="dataset")
data = np.load(path)
embeddings = data["embeddings"]  # shape: (298326, 256)
```
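
One simple way to use the embeddings as off-the-shelf features is nearest-neighbour retrieval by cosine similarity. The sketch below is an illustration, not part of the released code. It uses a small random array as a stand-in for `data["embeddings"]` so that it runs without the download, and `top_k_similar` is a hypothetical helper name.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=5):
    """Indices of the k clips most cosine-similar to clip `query_idx`, excluding itself."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit[query_idx]                    # shape: (n_clips,)
    sims[query_idx] = -np.inf                        # never return the query itself
    return np.argsort(-sims)[:k]

# Stand-in for data["embeddings"]: 1,000 random 256-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 256)).astype(np.float32)
# Make clip 7 a slightly perturbed copy of clip 3 so the expected top hit is known.
embeddings[7] = embeddings[3] + 0.01 * rng.standard_normal(256).astype(np.float32)

neighbours = top_k_similar(embeddings, query_idx=3, k=5)
print(neighbours)
```

With the real file, the same function can be called on the loaded array unchanged. For repeated queries it is cheaper to normalise the embeddings once and reuse the result.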
|
|
## Training Data

Built on the [NVIDIA PhysicalAI-AV](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles)
dataset (gated; request access on Hugging Face).

Danger labels are available at [Trazemag/PRECOG-Labels](https://huggingface.co/datasets/Trazemag/PRECOG-Labels).
|
|
## Related Models

| Model | Task | Link |
|-------|------|------|
| PRECOG-SENSE | Radar health from camera | [Trazemag/PRECOG-SENSE](https://huggingface.co/Trazemag/PRECOG-SENSE) |
| PRECOG-HERALD | Danger anticipation | [Trazemag/PRECOG-HERALD](https://huggingface.co/Trazemag/PRECOG-HERALD) |
| DriveBench | General scene encoder | This model |
|
|
## Citation

```bibtex
@misc{upadhyay2026drivebench,
  title  = {DriveBench: General-Purpose Driving Scene Encoder
            via Multi-Task Safety-Focused Pre-training across 25 Countries},
  author = {Upadhyay, Nikhil},
  year   = {2026},
  url    = {https://github.com/TrazeMaG/PRECOG-AV}
}
```