---
language:
- en
tags:
- autonomous-vehicles
- driving
- representation-learning
- multi-task-learning
- computer-vision
- safety
license: mit
---

# DriveBench: General-Purpose Driving Scene Encoder

**Author:** Nikhil Upadhyay | MSc Business Analytics | Dublin Business School
**Project:** [PRECOG-AV](https://github.com/TrazeMaG/PRECOG-AV)

## Overview

DriveBench is the first general-purpose driving scene encoder trained with
safety-focused multi-task supervision across **25 countries and 298,326 real
driving clips**, the largest geographic scale in driving representation learning.

Each clip is encoded into a **256-dimensional DriveBench embedding** that
simultaneously captures danger context, geographic driving patterns,
time-of-day risk, radar sensor health, and traffic density.
Use these embeddings like ImageNet features, but for driving scenes.

## Results

| Task | Metric | Score | Random Baseline |
|------|--------|-------|-----------------|
| Danger Anticipation | AUC | **0.8385** | 0.500 |
| Geographic Region | Accuracy | **0.4438** | 0.167 (6 classes) |
| Time of Day | Accuracy | **0.5168** | 0.250 (4 classes) |
| Radar Health | AUC | **1.0000** | 0.500 |
| TTC Regression | Pearson r | **0.3009** | 0.000 |

Tested on Greece and Bulgaria, two countries never seen during training.
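As a reading aid for the table above, here is a minimal sketch of the AUC and Pearson r metrics. It assumes AUC means the usual ROC AUC and that Pearson r is the standard sample correlation; the function names `roc_auc` and `pearson_r` and the toy values are illustrative and are not taken from the DriveBench evaluation code.

```python
import math

def roc_auc(labels, scores):
    """ROC AUC as the chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy data (illustrative only)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)                      # 3 of 4 pairs ranked correctly
constant_auc = roc_auc(labels, [0.5] * 4)          # uninformative scorer
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

A scorer that assigns the same score to every clip lands at exactly 0.5 under this definition, which is where the random baseline for the AUC rows comes from.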

## What makes this different

All existing driving pre-training (DriveWorld, DriveTok, GASP) uses geometric
proxy tasks (depth prediction, occupancy, reconstruction) on 1 to 3 cities.

DriveBench uses **safety-relevant supervision signals** across **25 countries**:
- Danger labels from physics-based TTC analysis (not manual annotation)
- Radar sensor health as a training signal
- Geographic region (6 regions, 25 countries)
- Time-of-day risk patterns (peak danger 13:00-15:00 confirmed)
- Traffic density
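The source does not spell out how the TTC-based danger labels are derived. As a rough illustration only, the sketch below uses the textbook constant-velocity definition of time-to-collision (gap divided by closing speed) and a hypothetical 2-second threshold. The function names, the threshold, and the lead-vehicle framing are assumptions, not the PRECOG labelling pipeline.

```python
import math

def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC in seconds; infinite when the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed

def danger_label(ttc_s, threshold_s=2.0):
    """Binary label: 1 if TTC falls below the (hypothetical) threshold, else 0."""
    return int(ttc_s < threshold_s)

# Ego at 15 m/s, lead vehicle at 5 m/s, 20 m ahead: closing at 10 m/s
ttc = time_to_collision(20.0, 15.0, 5.0)
label = danger_label(ttc)          # TTC equals the threshold, so not flagged
close_call = danger_label(time_to_collision(10.0, 15.0, 5.0))
```

Labels of this kind need only speed and distance measurements, which is what makes them available without manual annotation.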

## Architecture
ViT-B/16 features (5 frames × 768-dim)

↓

TransformerEncoder (3 layers, 8 heads, 2048 FFN)

↓

DriveBench Embedding (256-dim)  ← use this downstream

↓

5 multi-task heads:

- Danger head → AUC 0.84
- Region head → Acc 0.44 (6 regions)
- Time-of-day → Acc 0.52 (4 buckets)
- Radar head → AUC 1.00
- TTC regression → r = 0.30

## Usage

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

class DriveBenchModel(nn.Module):
    def __init__(self, embed_dim=256, n_frames=5, n_regions=6):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1,1,768))
        self.pos_embed = nn.Embedding(n_frames+1, 768)
        layer = nn.TransformerEncoderLayer(
            d_model=768, nhead=8, dim_feedforward=2048,
            dropout=0.1, batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)
        self.norm = nn.LayerNorm(768)
        self.projector = nn.Sequential(
            nn.Linear(768,512), nn.GELU(), nn.Dropout(0.15),
            nn.Linear(512,embed_dim), nn.LayerNorm(embed_dim))

    def encode(self, x):
        B = x.shape[0]
        cls = self.cls_token.expand(B,-1,-1)
        x = torch.cat([cls,x],dim=1)
        pos = torch.arange(x.shape[1], device=x.device)
        x = x + self.pos_embed(pos)
        x = self.norm(self.transformer(x))
        return self.projector(x[:,0])

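path = hf_hub_download("Trazemag/DriveBench", "drivebench_best.pt")
model = DriveBenchModel()
# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints you trust.
ckpt = torch.load(path, map_location="cpu", weights_only=False)
# If the checkpoint also stores the task-head weights, this strict load will
# raise on unexpected keys; passing strict=False would skip them.
model.load_state_dict(ckpt["model_state"])
model.eval()

# Input:  (batch, 5, 768) ViT-B/16 features from 5 consecutive frames
# Output: (batch, 256) DriveBench embedding
feats = torch.randn(2, 5, 768)  # placeholder; substitute real ViT-B/16 features
with torch.no_grad():
    emb = model.encode(feats)   # shape (2, 256)
# Use `emb` as features for any downstream driving task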
```
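One common way to use a frozen encoder's output as features is a linear probe. The sketch below trains one on random stand-in tensors so that it runs without the checkpoint; in practice `features` would hold the 256-dim embeddings returned by `model.encode` and `labels` would come from your own task. The synthetic target, learning rate, and step count are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for frozen DriveBench embeddings: (N, 256) features, binary labels.
N, D = 512, 256
features = torch.randn(N, D)
true_w = torch.randn(D)
labels = (features @ true_w > 0).float()  # synthetic, linearly separable target

probe = nn.Linear(D, 1)  # the only trainable part; the encoder stays frozen
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(probe(features).squeeze(1), labels)
    loss.backward()
    optimizer.step()

# Training accuracy of the probe on the synthetic data
with torch.no_grad():
    preds = (probe(features).squeeze(1) > 0).float()
    train_acc = (preds == labels).float().mean().item()
```

Because only the single linear layer is trained, the probe's accuracy on held-out data is a direct measure of how much task-relevant information the embedding already carries.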

## Pre-computed Embeddings

298,326 embeddings are already computed; download and use them directly:

```python
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "Trazemag/DriveBench-Embeddings",
    "drivebench_embeddings.npz",
    repo_type="dataset")
data = np.load(path)
embeddings = data["embeddings"]  # (298326, 256)
```
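A simple use of the pre-computed array is nearest-neighbour retrieval by cosine similarity. The sketch below is an assumption-laden illustration: `top_k_similar` is a hypothetical helper, and a random stand-in array replaces the downloaded `embeddings` so the example runs offline.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=5):
    """Return indices of the k rows most cosine-similar to row `query_idx`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero rows
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # exclude the query clip itself
    return np.argsort(-sims)[:k]

# Random stand-in with the same width as the real (298326, 256) array
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256)).astype(np.float32)
# Make clip 1 a near-duplicate of clip 0 so the expected neighbour is known
embeddings[1] = embeddings[0] + 0.01 * rng.normal(size=256).astype(np.float32)

neighbours = top_k_similar(embeddings, query_idx=0, k=5)
```

With the real array, the same call returns the clips whose embeddings lie closest to a chosen query clip.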

## Training Data

Built on the [NVIDIA PhysicalAI-AV](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles)
dataset (gated; request access on Hugging Face).

Danger labels available at [Trazemag/PRECOG-Labels](https://huggingface.co/datasets/Trazemag/PRECOG-Labels).

## Related Models

| Model | Task | Link |
|-------|------|------|
| PRECOG-SENSE | Radar health from camera | [Trazemag/PRECOG-SENSE](https://huggingface.co/Trazemag/PRECOG-SENSE) |
| PRECOG-HERALD | Danger anticipation | [Trazemag/PRECOG-HERALD](https://huggingface.co/Trazemag/PRECOG-HERALD) |
| DriveBench | General scene encoder | This model |

## Citation

```bibtex
@misc{upadhyay2026drivebench,
  title  = {DriveBench: General-Purpose Driving Scene Encoder
            via Multi-Task Safety-Focused Pre-training across 25 Countries},
  author = {Upadhyay, Nikhil},
  year   = {2026},
  url    = {https://github.com/TrazeMaG/PRECOG-AV}
}
```