---
language:
- en
tags:
- autonomous-vehicles
- driving
- representation-learning
- multi-task-learning
- computer-vision
- safety
license: mit
---

# DriveBench: General-Purpose Driving Scene Encoder

**Author:** Nikhil Upadhyay | MSc Business Analytics | Dublin Business School

**Project:** [PRECOG-AV](https://github.com/TrazeMaG/PRECOG-AV)

## Overview

DriveBench is a general-purpose driving scene encoder trained with
safety-focused multi-task supervision on **298,326 real driving clips from
25 countries**. To the author's knowledge, it is the first encoder of this kind
and covers the widest geographic range in driving representation learning.

Each clip is encoded into a **256-dimensional DriveBench embedding** that
simultaneously captures danger context, geographic driving patterns,
time-of-day risk, radar sensor health, and traffic density.
Use these embeddings like ImageNet features — but for driving scenes.

## Results

| Task | Metric | Score | Random Baseline |
|------|--------|-------|-----------------|
| Danger Anticipation | AUC | **0.8385** | 0.500 |
| Geographic Region | Accuracy | **0.4438** | 0.167 (6 classes) |
| Time of Day | Accuracy | **0.5168** | 0.250 (4 classes) |
| Radar Health | AUC | **1.0000** | 0.500 |
| TTC Regression | Pearson r | **0.3009** | 0.000 |

Scores are measured on clips from Greece and Bulgaria, two countries held out of training.
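For readers who want to reproduce the metrics in the table, here is a minimal sketch using NumPy only. The toy arrays are made up and are not DriveBench outputs, and the function names `roc_auc`, `pearson_r`, and `chance_accuracy` are illustrative. AUC is computed as the pairwise rank statistic: the fraction of positive/negative pairs in which the positive gets the higher score.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive with every negative; ties count as half a win.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing over n_classes."""
    return 1.0 / n_classes

# Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 0.75
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # 1.0 for a perfect linear relation
region_chance = chance_accuracy(6)                   # about 0.167, as in the table
time_chance = chance_accuracy(4)                     # 0.25, as in the table
```

The pairwise comparison is quadratic in the number of samples, so for large evaluation sets a rank-based implementation would be a better choice.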

## What makes this different

Existing driving pre-training approaches such as DriveWorld, DriveTok, and GASP
use geometric proxy tasks (depth prediction, occupancy, reconstruction) on data
from 1 to 3 cities.

DriveBench uses **safety-relevant supervision signals** across **25 countries**:
- Danger labels from physics-based TTC analysis (not manual annotation)
- Radar sensor health as a training signal
- Geographic region (6 regions, 25 countries)
- Time-of-day risk patterns (danger peaks between 13:00 and 15:00)
- Traffic density
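The card does not describe the physics-based TTC analysis in detail. A common constant-velocity definition is TTC = gap / closing speed, with a clip flagged as dangerous when TTC drops below a threshold. The sketch below illustrates that general idea only: the function names, the 2.0 s threshold, and the constant-velocity assumption are hypothetical and are not taken from PRECOG-AV.

```python
import math

def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC in seconds; infinite when the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed

def danger_label(ttc_series_s, threshold_s=2.0):
    """1 if any TTC sample in the clip falls below the (hypothetical) threshold."""
    return int(min(ttc_series_s) < threshold_s)

# A 30 m gap closing at 15 m/s gives a TTC of 2 s.
ttc = time_to_collision(gap_m=30.0, ego_speed_mps=25.0, lead_speed_mps=10.0)

risky_clip = danger_label([5.0, 3.1, 1.4])      # minimum TTC is below the threshold
safe_clip = danger_label([6.0, math.inf])       # gap never closes fast enough
```

Deriving labels this way needs no manual annotation, which matches the motivation stated in the list above.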

## Architecture
1. ViT-B/16 features (5 frames × 768-dim)
2. TransformerEncoder (3 layers, 8 heads, 2048 FFN)
3. DriveBench embedding (256-dim), the output to use downstream
4. 5 multi-task heads:
   - Danger head → AUC 0.84
   - Region head → Acc 0.44 (6 regions)
   - Time-of-day head → Acc 0.52 (4 buckets)
   - Radar head → AUC 1.00
   - TTC regression head → r = 0.30
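The usage code on this card defines only the encoder. The five task heads could look like the sketch below. Only the dimensions (256-dim embedding, 6 regions, 4 time-of-day buckets) come from the card; the single-linear-layer design, the class name `MultiTaskHeads`, the attribute names, and the output conventions (logits for classification, one scalar for TTC) are assumptions, not the released training code.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical task heads on top of the 256-dim DriveBench embedding."""

    def __init__(self, embed_dim=256, n_regions=6, n_time_buckets=4):
        super().__init__()
        self.danger = nn.Linear(embed_dim, 1)                   # binary logit
        self.region = nn.Linear(embed_dim, n_regions)           # class logits
        self.time_of_day = nn.Linear(embed_dim, n_time_buckets) # class logits
        self.radar = nn.Linear(embed_dim, 1)                    # binary logit
        self.ttc = nn.Linear(embed_dim, 1)                      # regression output

    def forward(self, z):
        # z: (batch, embed_dim) embeddings from the encoder
        return {
            "danger": self.danger(z).squeeze(-1),
            "region": self.region(z),
            "time_of_day": self.time_of_day(z),
            "radar": self.radar(z).squeeze(-1),
            "ttc": self.ttc(z).squeeze(-1),
        }

heads = MultiTaskHeads()
out = heads(torch.randn(4, 256))  # random stand-in for real embeddings
shapes = {name: tuple(t.shape) for name, t in out.items()}
```

Because every head reads the same embedding, gradients from all five tasks shape the shared 256-dim representation during training.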

## Usage

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

class DriveBenchModel(nn.Module):
    def __init__(self, embed_dim=256, n_frames=5, n_regions=6):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1,1,768))
        self.pos_embed = nn.Embedding(n_frames+1, 768)
        layer = nn.TransformerEncoderLayer(
            d_model=768, nhead=8, dim_feedforward=2048,
            dropout=0.1, batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)
        self.norm = nn.LayerNorm(768)
        self.projector = nn.Sequential(
            nn.Linear(768,512), nn.GELU(), nn.Dropout(0.15),
            nn.Linear(512,embed_dim), nn.LayerNorm(embed_dim))

    def encode(self, x):
        B = x.shape[0]
        cls = self.cls_token.expand(B,-1,-1)
        x = torch.cat([cls,x],dim=1)
        pos = torch.arange(x.shape[1], device=x.device)
        x = x + self.pos_embed(pos)
        x = self.norm(self.transformer(x))
        return self.projector(x[:,0])

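path = hf_hub_download("Trazemag/DriveBench", "drivebench_best.pt")
model = DriveBenchModel()
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
ckpt = torch.load(path, map_location="cpu", weights_only=False)
# strict=False ignores extra checkpoint keys (for example task-head weights, if present)
missing, unexpected = model.load_state_dict(ckpt["model_state"], strict=False)
assert not missing, f"encoder weights missing from checkpoint: {missing}"
model.eval()

# Input:  (batch, 5, 768) ViT-B/16 features from 5 consecutive frames
# Output: (batch, 256) DriveBench embedding
feats = torch.randn(2, 5, 768)  # placeholder; replace with real ViT-B/16 features
with torch.no_grad():
    emb = model.encode(feats)   # shape (2, 256)
# Use emb as features for any downstream driving task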
```
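The encoder expects clips of shape `(batch, 5, 768)`. If you already have per-frame ViT-B/16 features for a video as an array of shape `(num_frames, 768)`, one way to group them into 5-frame windows is sketched below. The helper name `make_clips` and the stride between windows are assumptions; the card does not say how frames were sampled for training.

```python
import numpy as np

def make_clips(frame_feats, n_frames=5, stride=5):
    """Group (num_frames, dim) features into (num_clips, n_frames, dim) windows."""
    frame_feats = np.asarray(frame_feats)
    if frame_feats.shape[0] < n_frames:
        raise ValueError("need at least n_frames frames")
    # sliding_window_view appends the window axis last: (num_windows, dim, n_frames)
    windows = np.lib.stride_tricks.sliding_window_view(frame_feats, n_frames, axis=0)
    # Reorder to (num_windows, n_frames, dim) and keep every `stride`-th window.
    return np.ascontiguousarray(windows.transpose(0, 2, 1)[::stride])

feats = np.random.rand(23, 768).astype(np.float32)  # stand-in for real ViT-B/16 features
clips = make_clips(feats)                           # non-overlapping 5-frame windows
dense_clips = make_clips(feats, stride=1)           # one window per starting frame
```

`torch.from_numpy(clips)` then yields a tensor that can be passed to `model.encode`.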

## Pre-computed Embeddings

Embeddings for all 298,326 clips are pre-computed and can be downloaded and used directly:

```python
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "Trazemag/DriveBench-Embeddings",
    "drivebench_embeddings.npz",
    repo_type="dataset")
data = np.load(path)
embeddings = data["embeddings"]  # (298326, 256)
```
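One simple use of the pre-computed embeddings is nearest-neighbor retrieval by cosine similarity. The sketch below runs on a small random stand-in array so that it works without the download; swap in the real `embeddings` array to search actual clips. The helper name `top_k_similar` is illustrative.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=5):
    """Indices of the k rows most cosine-similar to embeddings[query_idx]."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf  # exclude the query itself
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256)).astype(np.float32)  # stand-in data
# Plant a near-duplicate of clip 0 at index 1 so the result is easy to check.
embeddings[1] = embeddings[0] + 0.01 * rng.normal(size=256).astype(np.float32)

neighbors = top_k_similar(embeddings, query_idx=0, k=5)
```

For the full 298,326 × 256 array this brute-force search is a single matrix-vector product per query, which is usually fast enough on a CPU.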

## Training Data

Built on the [NVIDIA PhysicalAI-AV](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles)
dataset (gated; request access on Hugging Face).

Danger labels are available at [Trazemag/PRECOG-Labels](https://huggingface.co/datasets/Trazemag/PRECOG-Labels).

## Related Models

| Model | Task | Link |
|-------|------|------|
| PRECOG-SENSE | Radar health from camera | [Trazemag/PRECOG-SENSE](https://huggingface.co/Trazemag/PRECOG-SENSE) |
| PRECOG-HERALD | Danger anticipation | [Trazemag/PRECOG-HERALD](https://huggingface.co/Trazemag/PRECOG-HERALD) |
| DriveBench | General scene encoder | This model |

## Citation

```bibtex
@misc{upadhyay2026drivebench,
  title  = {DriveBench: General-Purpose Driving Scene Encoder
            via Multi-Task Safety-Focused Pre-training across 25 Countries},
  author = {Upadhyay, Nikhil},
  year   = {2026},
  url    = {https://github.com/TrazeMaG/PRECOG-AV}
}
```