---
language:
- en
tags:
- autonomous-vehicles
- driving
- representation-learning
- multi-task-learning
- computer-vision
- safety
license: mit
---
# DriveBench: General-Purpose Driving Scene Encoder
**Author:** Nikhil Upadhyay | MSc Business Analytics | Dublin Business School
**Project:** [PRECOG-AV](https://github.com/TrazeMaG/PRECOG-AV)
## Overview
DriveBench is the first general-purpose driving scene encoder trained with
safety-focused multi-task supervision across **25 countries and 298,326 real
driving clips**, the largest geographic scale in driving representation learning.
Each clip is encoded into a **256-dimensional DriveBench embedding** that
simultaneously captures danger context, geographic driving patterns,
time-of-day risk, radar sensor health, and traffic density.
Use these embeddings like ImageNet features, but for driving scenes.
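As a minimal sketch of what using the embeddings "like ImageNet features" could look like, the snippet below fits a linear probe on frozen 256-dimensional vectors. The data here is synthetic stand-in data, and `fit_linear_probe` and `predict` are hypothetical helpers that are not part of DriveBench. Replace `X` with real embeddings and `y` with labels for your own task.

```python
import numpy as np

def fit_linear_probe(X, y, l2=1e-2):
    """Closed-form ridge regression on frozen embeddings (hypothetical helper)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])      # append a bias column
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])           # regularised normal equations
    return np.linalg.solve(A, Xb.T @ y)

def predict(W, X):
    """Return real-valued probe scores; threshold at 0.5 for a 0/1 label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W

# Synthetic stand-in for DriveBench embeddings and a binary downstream label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))
w_true = rng.normal(size=256)
y = (X @ w_true > 0).astype(np.float64)

W = fit_linear_probe(X[:800], y[:800])                 # train on 800 clips
acc = ((predict(W, X[800:]) > 0.5) == (y[800:] > 0.5)).mean()
print(f"held-out probe accuracy: {acc:.3f}")
```

The encoder stays frozen in this setup. Only the 257 probe weights are fitted, which keeps downstream experiments cheap.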
## Results
| Task | Metric | Score | Random Baseline |
|------|--------|-------|-----------------|
| Danger Anticipation | AUC | **0.8385** | 0.500 |
| Geographic Region | Accuracy | **0.4438** | 0.167 (6 classes) |
| Time of Day | Accuracy | **0.5168** | 0.250 (4 classes) |
| Radar Health | AUC | **1.0000** | 0.500 |
| TTC Regression | Pearson r | **0.3009** | 0.000 |
Tested on Greece and Bulgaria, two countries never seen during training.
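For readers checking the table: uniform random guessing over k classes gives an accuracy of 1/k, which matches the 0.167 and 0.250 baselines. AUC is the probability that a randomly chosen positive clip is scored above a randomly chosen negative one. The sketch below computes both on toy data. `roc_auc` and `chance_accuracy` are illustrative helpers, and it is an assumption that the reported scores were computed in this way.

```python
def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing over n_classes."""
    return 1.0 / n_classes

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    so this is meant for illustration rather than large evaluations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not DriveBench outputs.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]

print(round(chance_accuracy(6), 3))   # 0.167, the 6-region baseline
print(chance_accuracy(4))             # 0.25, the 4-bucket time-of-day baseline
print(roc_auc(labels, scores))        # 5 of 6 pairs are ranked correctly
```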
## What makes this different
Existing driving pre-training methods (DriveWorld, DriveTok, GASP) use geometric
proxy tasks such as depth prediction, occupancy, and reconstruction, on 1 to 3 cities.
DriveBench uses **safety-relevant supervision signals** across **25 countries**:
- Danger labels from physics-based TTC analysis (not manual annotation)
- Radar sensor health as a training signal
- Geographic region (6 regions, 25 countries)
- Time-of-day risk patterns (peak danger 13:00-15:00 confirmed)
- Traffic density
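The first bullet refers to time-to-collision (TTC). As a minimal sketch, assuming the common constant-velocity definition (gap divided by closing speed), a physics-based danger label could be derived as below. The 2.0-second threshold and both function names are illustrative assumptions. The source does not state the exact rule used to produce the PRECOG labels.

```python
import math

def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC in seconds; infinite if the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed

def danger_label(ttc_s, threshold_s=2.0):
    """Binary label: 1 if TTC is under the (assumed) threshold, else 0."""
    return int(ttc_s < threshold_s)

# Ego at 20 m/s, lead vehicle at 10 m/s, 15 m ahead: closing at 10 m/s.
ttc = time_to_collision(gap_m=15.0, ego_speed_mps=20.0, lead_speed_mps=10.0)
print(ttc, danger_label(ttc))             # 1.5 1

# Lead vehicle pulling away: no collision course, so the label is 0.
ttc_safe = time_to_collision(gap_m=30.0, ego_speed_mps=15.0, lead_speed_mps=20.0)
print(ttc_safe, danger_label(ttc_safe))   # inf 0
```

Labels of this kind come from measured kinematics rather than human judgment, which is what allows them to be produced at the scale of hundreds of thousands of clips.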
## Architecture
ViT-B/16 features (5 frames × 768-dim)
↓
TransformerEncoder (3 layers, 8 heads, 2048 FFN)
↓
DriveBench Embedding (256-dim) ← use this downstream
↓
5 multi-task heads:
Danger head → AUC 0.84
Region head → Acc 0.44 (6 regions)
Time-of-day → Acc 0.52 (4 buckets)
Radar head → AUC 1.00
TTC regression → r = 0.30
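The last stage of the diagram can be sketched as a set of small heads on top of the shared 256-dimensional embedding. This is an illustration under stated assumptions: the class name `MultiTaskHeads`, the use of a single linear layer per head, and the output names are all hypothetical, since the published usage code exposes only the encoder.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical task heads over a shared embedding (not the released code)."""

    def __init__(self, embed_dim=256, n_regions=6, n_time_buckets=4):
        super().__init__()
        self.danger = nn.Linear(embed_dim, 1)              # binary logit
        self.region = nn.Linear(embed_dim, n_regions)      # 6-way logits
        self.time_of_day = nn.Linear(embed_dim, n_time_buckets)  # 4-way logits
        self.radar = nn.Linear(embed_dim, 1)               # binary logit
        self.ttc = nn.Linear(embed_dim, 1)                 # scalar regression

    def forward(self, z):
        return {
            "danger": self.danger(z).squeeze(-1),
            "region": self.region(z),
            "time_of_day": self.time_of_day(z),
            "radar": self.radar(z).squeeze(-1),
            "ttc": self.ttc(z).squeeze(-1),
        }

heads = MultiTaskHeads()
z = torch.randn(4, 256)          # stand-in for a batch of DriveBench embeddings
out = heads(z)
for name, tensor in out.items():
    print(name, tuple(tensor.shape))
```

Because every head reads the same vector, the embedding has to carry information for all five tasks at once. That shared vector is what downstream users consume.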
## Usage
```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


class DriveBenchModel(nn.Module):
    """Maps ViT-B/16 frame features to a 256-dim DriveBench embedding."""

    def __init__(self, embed_dim=256, n_frames=5, n_regions=6):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, 768))
        self.pos_embed = nn.Embedding(n_frames + 1, 768)
        layer = nn.TransformerEncoderLayer(
            d_model=768, nhead=8, dim_feedforward=2048,
            dropout=0.1, batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)
        self.norm = nn.LayerNorm(768)
        self.projector = nn.Sequential(
            nn.Linear(768, 512), nn.GELU(), nn.Dropout(0.15),
            nn.Linear(512, embed_dim), nn.LayerNorm(embed_dim))

    def encode(self, x):
        # x: (batch, n_frames, 768)
        B = x.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)              # prepend the CLS token
        pos = torch.arange(x.shape[1], device=x.device)
        x = x + self.pos_embed(pos)                 # learned positional embedding
        x = self.norm(self.transformer(x))
        return self.projector(x[:, 0])              # project the CLS output


path = hf_hub_download("Trazemag/DriveBench", "drivebench_best.pt")
model = DriveBenchModel()
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load(path, map_location="cpu", weights_only=False)
# If the checkpoint also stores task-head weights, this strict load raises an
# error for the unexpected keys; pass strict=False in that case.
model.load_state_dict(ckpt["model_state"])
model.eval()

# Input: (batch, 5, 768) ViT-B/16 features from 5 consecutive frames
# Output: (batch, 256) DriveBench embedding
with torch.no_grad():
    feats = torch.randn(2, 5, 768)   # placeholder; replace with real ViT-B/16 features
    emb = model.encode(feats)        # shape (2, 256)
# Use as features for any downstream driving task
```
## Pre-computed Embeddings
298,326 embeddings are already computed; download and use them directly:
```python
import numpy as np
from huggingface_hub import hf_hub_download
path = hf_hub_download(
    "Trazemag/DriveBench-Embeddings",
    "drivebench_embeddings.npz",
    repo_type="dataset")
data = np.load(path)
embeddings = data["embeddings"] # (298326, 256)
```
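One way to use the pre-computed matrix is nearest-neighbour retrieval of similar scenes. The sketch below runs on random stand-in vectors so that it works without the download. `top_k_similar` is a hypothetical helper, and cosine similarity is an assumed choice of metric rather than one prescribed by the source.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=5):
    """Indices of the k rows most cosine-similar to row query_idx (excluding itself)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]      # cosine similarity to the query row
    sims[query_idx] = -np.inf              # never return the query itself
    return np.argsort(-sims)[:k]

# Random stand-in for the (298326, 256) matrix loaded from the .npz file.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 256)).astype(np.float32)

# Plant a near-duplicate of clip 7 at index 42 to make the result checkable.
emb[42] = emb[7] + 0.01 * rng.normal(size=256)

neighbours = top_k_similar(emb, query_idx=7, k=5)
print(neighbours)   # index 42 is expected to come first
```

With the real file, pass `data["embeddings"]` in place of `emb`.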
## Training Data
Built on the [NVIDIA PhysicalAI-AV](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles)
dataset (gated; request access on Hugging Face).
Danger labels are available at [Trazemag/PRECOG-Labels](https://huggingface.co/datasets/Trazemag/PRECOG-Labels).
## Related Models
| Model | Task | Link |
|-------|------|------|
| PRECOG-SENSE | Radar health from camera | [Trazemag/PRECOG-SENSE](https://huggingface.co/Trazemag/PRECOG-SENSE) |
| PRECOG-HERALD | Danger anticipation | [Trazemag/PRECOG-HERALD](https://huggingface.co/Trazemag/PRECOG-HERALD) |
| DriveBench | General scene encoder | This model |
## Citation
```bibtex
@misc{upadhyay2026drivebench,
  title  = {DriveBench: General-Purpose Driving Scene Encoder
            via Multi-Task Safety-Focused Pre-training across 25 Countries},
  author = {Upadhyay, Nikhil},
  year   = {2026},
  url    = {https://github.com/TrazeMaG/PRECOG-AV}
}
```