StreetVision Roadwork Detector: Architecture-Diversity Ensemble

Production model used for mining on Bittensor Subnet 72 (StreetVision / NATIX).

This repository hosts a 2-model architecture-diversity ensemble that classifies whether a street-view image contains roadwork (cones, drums, vertical panels, work vehicles, TTC signs, barriers, workers, etc.). At inference time the Roadwork probabilities of the two members are averaged, then remapped monotonically so that the average's optimal threshold (0.72) lands on the validator's fixed threshold of 0.5.

Members

| Subfolder | Backbone | Pretraining | Input | Params |
|---|---|---|---|---|
| convnextv2-base/ | facebook/convnextv2-base-22k-224 | ImageNet-22k supervised | 224×224 | 87M |
| dinov2-base/ | facebook/dinov2-with-registers-base-imagenet1k-1-layer | LVD-142M self-supervised + IN-1k linear | 224×224 | 87M |

Both members were fine-tuned on natix-network-org/roadwork (5,625 train / 626 val) with identical training recipes (12 epochs, lr 5e-5, weight-decay 0.05, warmup 10%, label smoothing 0.05, class weighting for None/Roadwork imbalance, bf16, validator-mirroring augmentations, training-matched letterbox preprocessing).
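The recipe above can be summarised as a plain configuration dictionary. This is a minimal sketch, not the training script used for these models: the key names and the `balanced_class_weights` helper are hypothetical, the inverse-frequency formula is an assumption (the card only states that class weighting was used), and the label counts in the usage line are made up because the per-class split is not given here.

```python
# Hyperparameters as listed in the model card; the key names are illustrative.
RECIPE = {
    "epochs": 12,
    "learning_rate": 5e-5,
    "weight_decay": 0.05,
    "warmup_fraction": 0.10,
    "label_smoothing": 0.05,
    "precision": "bf16",
}


def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * count_c).

    Assumption: the card does not say which weighting scheme was used;
    this is one common choice for imbalanced binary data.
    """
    if any(c <= 0 for c in counts):
        raise ValueError("every class needs at least one sample")
    total = sum(counts)
    n_classes = len(counts)
    return [total / (n_classes * c) for c in counts]


# Made-up (None, Roadwork) counts, for illustration only.
print(balanced_class_weights([3000, 1000]))
```

With this scheme the rarer class receives the larger weight, and a perfectly balanced dataset yields a weight of 1.0 for every class.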

Headline results

Balanced 50/50 (235 None / 235 Roadwork) test split with letterbox preprocessing (training-matched, no center-crop), validator decision threshold 0.5 after monotonic calibration:

| Config | MCC | Acc | Spec | Recall | FP | FN |
|---|---|---|---|---|---|---|
| convnextv2-base alone (cal 0.70) | 0.8634 | 0.9277 | 0.860 | 0.996 | 33 | 1 |
| dinov2-base alone (cal 0.65) | 0.8596 | 0.9255 | 0.855 | 0.996 | 34 | 1 |
| Ensemble (cal 0.72) | 0.8710 | 0.9319 | 0.868 | 0.996 | 31 | 1 |

The ROC-AUC of dinov2-base (0.961) is higher than that of the supervised backbones we evaluated, convnextv2-base and swinv2-base (both ~0.93), which suggests that the self-supervised features rank borderline cases more accurately. The probability correlation between the two members is r=0.97; the small remaining disagreement is enough to fix the 2 lowest-confidence false positives (33 FP for convnextv2-base alone versus 31 for the ensemble).
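A correlation figure like the one above can be computed from the two members' per-image Roadwork probabilities. This is a minimal sketch using only the standard library; `pearson_r` is an illustrative helper and the probability lists are made-up values, not outputs of the actual models.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


# Made-up probabilities for five images, for illustration only.
p_convnext = [0.05, 0.20, 0.65, 0.90, 0.97]
p_dinov2 = [0.08, 0.15, 0.55, 0.93, 0.95]
print(round(pearson_r(p_convnext, p_dinov2), 3))
```

A value close to 1 means the members mostly agree, so averaging can only change decisions on the few images where they diverge near the threshold.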

Why architecture diversity matters here

Same-architecture seed ensembles (e.g. ConvNeXtV2 seed=42 + ConvNeXtV2 seed=1337) produced no MCC gain because their predictions were too correlated. Same-paradigm cross-architecture ensembles (ConvNeXtV2 + SwinV2, both supervised IN-22k) gained only +0.001 MCC, within noise. The +0.0076 MCC gain over convnextv2-base alone (0.8634 to 0.8710) only materialised once we paired a supervised backbone with a self-supervised backbone whose feature space was learned with a fundamentally different objective.
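One way to see why highly correlated members add little: if two members make errors with equal variance and correlation r, the variance of their averaged error is (1 + r) / 2 times that of a single member. The sketch below just evaluates that textbook identity. It is an idealisation, and the r = 0.97 reported above is a correlation of output probabilities rather than of errors, so plugging it in is only a rough illustration.

```python
def averaged_error_variance_ratio(r):
    """Variance of the mean of two equal-variance errors with correlation r,
    relative to the variance of one member: Var((e1 + e2) / 2) / Var(e1).
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return (1.0 + r) / 2.0


# Independent members halve the variance; identical members change nothing.
for r in (0.0, 0.5, 0.97, 1.0):
    print(f"r={r:.2f} -> variance ratio {averaged_error_variance_ratio(r):.3f}")
```

Under these assumptions, lowering the correlation between members is what buys variance reduction, which is consistent with the seed ensembles showing no gain.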

Inference

The full inference pipeline (letterbox preprocessing + ensemble averaging + threshold calibration) is implemented in base_miner/detectors/vit_detector.py of the SN72 miner repo, configured by ConvNextV2_DINOv2_ensemble.yaml (also included in this repository at the root).

Standalone usage:

import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF
from transformers import AutoImageProcessor, AutoModelForImageClassification

REPO = "ThaoTran7/streetvision-roadwork-ensemble"

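class LetterboxTo224:
    """Resize to fit inside 224x224 (aspect ratio kept), then zero-pad to a square."""

    def __call__(self, img):
        img = img.convert("RGB")
        w, h = img.size
        # Scale factor that brings the longer side down (or up) to 224.
        s = min(224 / w, 224 / h)
        new_w, new_h = max(int(w * s), 1), max(int(h * s), 1)
        resized = TF.resize(img, [new_h, new_w])
        # Centre the image; any odd leftover pixel goes to the right/bottom.
        pl, pt = (224 - new_w) // 2, (224 - new_h) // 2
        # Padding order is [left, top, right, bottom].
        return TF.pad(resized, [pl, pt, 224 - new_w - pl, 224 - new_h - pt], fill=0)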

def build_member(subfolder):
    model = AutoModelForImageClassification.from_pretrained(REPO, subfolder=subfolder).eval()
    proc = AutoImageProcessor.from_pretrained(REPO, subfolder=subfolder, use_fast=True)
    tf = transforms.Compose([
        LetterboxTo224(),
        transforms.ToTensor(),
        transforms.Normalize(mean=proc.image_mean, std=proc.image_std),
    ])
    return model, tf

m1, tf1 = build_member("convnextv2-base")
m2, tf2 = build_member("dinov2-base")

@torch.no_grad()
def predict(img):
    """Return the calibrated ensemble Roadwork score for a PIL image."""
    # Index 1 of the softmax output is taken as the Roadwork probability.
    p1 = torch.softmax(m1(pixel_values=tf1(img).unsqueeze(0)).logits, dim=-1)[0, 1].item()
    p2 = torch.softmax(m2(pixel_values=tf2(img).unsqueeze(0)).logits, dim=-1)[0, 1].item()
    p = (p1 + p2) / 2.0
    # calibration: piecewise-linear map, model-optimal 0.72 -> validator-effective 0.5
    if p <= 0.72:
        return p * 0.5 / 0.72
    return 0.5 + (p - 0.72) * 0.5 / (1 - 0.72)

print(predict(Image.open("test.jpg")))
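The calibration step inside `predict` can be isolated and checked without loading any model. The sketch below generalises the same piecewise-linear map to an arbitrary model threshold, which would also cover the single-member thresholds (0.70 and 0.65) from the results table. `calibrate` and `is_roadwork` are hypothetical helper names, and using `>=` at the validator threshold is an assumption; the validator's exact comparison is not specified here.

```python
def calibrate(p, model_threshold):
    """Piecewise-linear, monotonic map sending model_threshold to 0.5.

    [0, model_threshold] is stretched onto [0, 0.5] and
    [model_threshold, 1] onto [0.5, 1].
    """
    if not 0.0 < model_threshold < 1.0:
        raise ValueError("model_threshold must be strictly between 0 and 1")
    if p <= model_threshold:
        return p * 0.5 / model_threshold
    return 0.5 + (p - model_threshold) * 0.5 / (1.0 - model_threshold)


def is_roadwork(p, model_threshold, validator_threshold=0.5):
    """Decision the validator would reach (assumes a >= comparison)."""
    return calibrate(p, model_threshold) >= validator_threshold


ENSEMBLE_THRESHOLD = 0.72  # ensemble optimum from the results table
for p in (0.50, 0.72, 0.90):
    print(p, round(calibrate(p, ENSEMBLE_THRESHOLD), 4), is_roadwork(p, ENSEMBLE_THRESHOLD))
```

Because the map is monotonic, it changes only where the decision boundary sits; the ranking of images, and therefore ROC-AUC, is unaffected.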

License

Apache 2.0. Base weights are subject to the licenses of facebook/convnextv2-base-22k-224 and facebook/dinov2-with-registers-base-imagenet1k-1-layer.
