StreetVision Roadwork Detector: Architecture-Diversity Ensemble
Production model used for mining on Bittensor Subnet 72 (StreetVision / NATIX).
This repository hosts a two-model architecture-diversity ensemble that classifies whether a street-view image contains roadwork (cones, drums, vertical panels, work vehicles, TTC signs, barriers, workers, etc.). At inference time the Roadwork probabilities of the two members are averaged, then passed through a monotonic piecewise-linear calibration so that the ensemble's optimal threshold (0.72) maps to the validator's fixed threshold of 0.5.
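The threshold remapping described above can be sketched as a small standalone function. This is a minimal illustration that follows the same piecewise-linear formula as the standalone snippet later in this card; the function name `remap_to_validator` and the parameter `t` are illustrative and not part of the miner code.

```python
def remap_to_validator(p: float, t: float = 0.72) -> float:
    """Map probability p so that the model-optimal threshold t lands on 0.5.

    Scores in [0, t] are stretched linearly onto [0, 0.5] and scores in
    (t, 1] onto (0.5, 1], so the ordering of scores is preserved.
    """
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    if p <= t:
        return p * 0.5 / t
    return 0.5 + (p - t) * 0.5 / (1.0 - t)


for p in (0.0, 0.36, 0.72, 0.86, 1.0):
    print(f"{p:.2f} -> {remap_to_validator(p):.3f}")
```

Because the map is monotonic, it changes only where the 0.5 cut falls, not how images are ranked: an averaged score of 0.72 maps to 0.5, and 0 and 1 stay fixed.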
Members
| Subfolder | Backbone | Pretraining | Input | Params |
|---|---|---|---|---|
| `convnextv2-base/` | `facebook/convnextv2-base-22k-224` | ImageNet-22k supervised | 224×224 | 87M |
| `dinov2-base/` | `facebook/dinov2-with-registers-base-imagenet1k-1-layer` | LVD-142M self-supervised + IN-1k linear | 224×224 | 87M |
Both members were fine-tuned on natix-network-org/roadwork (5,625 train / 626 val) with identical training recipes (12 epochs, lr 5e-5, weight-decay 0.05, warmup 10%, label smoothing 0.05, class weighting for None/Roadwork imbalance, bf16, validator-mirroring augmentations, training-matched letterbox preprocessing).
Headline results
Balanced 50/50 (235 None / 235 Roadwork) test split with letterbox preprocessing (training-matched, no center-crop), validator decision threshold 0.5 after monotonic calibration:
| Config | MCC | Acc | Spec | Recall | FP | FN |
|---|---|---|---|---|---|---|
| convnextv2-base alone (cal 0.70) | 0.8634 | 0.9277 | 0.860 | 0.996 | 33 | 1 |
| dinov2-base alone (cal 0.65) | 0.8596 | 0.9255 | 0.855 | 0.996 | 34 | 1 |
| Ensemble (cal 0.72) | 0.8710 | 0.9319 | 0.868 | 0.996 | 31 | 1 |
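The table's metrics can be recomputed from the FP and FN counts alone, since the split has 235 images per class. The sketch below uses only the standard library; the helper name `metrics_from_errors` is illustrative.

```python
import math


def metrics_from_errors(fp: int, fn: int, n_neg: int = 235, n_pos: int = 235) -> dict:
    """Derive MCC, accuracy, specificity and recall from error counts."""
    tn, tp = n_neg - fp, n_pos - fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "acc": (tp + tn) / (n_pos + n_neg),
        "spec": tn / n_neg,
        "recall": tp / n_pos,
    }


rows = [("convnextv2-base", 33, 1), ("dinov2-base", 34, 1), ("ensemble", 31, 1)]
for name, fp, fn in rows:
    m = metrics_from_errors(fp, fn)
    print(
        f"{name:16s} MCC={m['mcc']:.4f} Acc={m['acc']:.4f} "
        f"Spec={m['spec']:.3f} Recall={m['recall']:.3f}"
    )
```

With FN fixed at 1 for all three configurations, the differences in MCC come entirely from the false-positive count.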
ROC-AUC of dinov2-base (0.961) is meaningfully higher than that of convnextv2-base and swinv2-base (~0.93), which suggests the SSL features rank borderline cases more accurately. Member probabilities are highly correlated (r = 0.97), but the small remaining disagreement is enough to correct the 2 lowest-confidence false positives (33 to 31, relative to convnextv2-base alone).
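To illustrate the mechanism, the sketch below shows how averaging two highly correlated members can still remove borderline false positives. All probabilities are made up for the example and are not taken from the evaluation set; the thresholds 0.70 and 0.72 are the calibration points from the results table, and the `pearson` helper is illustrative.

```python
import math

SINGLE_T = 0.70    # convnextv2-base operating point from the results table
ENSEMBLE_T = 0.72  # ensemble operating point from the results table

# Hypothetical Roadwork probabilities for six images whose true label is None.
convnext = [0.05, 0.10, 0.30, 0.74, 0.78, 0.95]
dinov2 = [0.04, 0.12, 0.25, 0.62, 0.60, 0.93]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


averaged = [(a + b) / 2 for a, b in zip(convnext, dinov2)]
fp_single = sum(p > SINGLE_T for p in convnext)
fp_ensemble = sum(p > ENSEMBLE_T for p in averaged)

print(f"Pearson r between members: {pearson(convnext, dinov2):.3f}")
print(f"False positives, convnext alone: {fp_single}")
print(f"False positives, averaged:       {fp_ensemble}")
```

In this toy data the two members agree closely overall, yet the two images that sit just above the single-model threshold fall below the ensemble threshold once averaged, so the false-positive count drops from 3 to 1.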
Why architecture diversity matters here
Same-architecture seed ensembles (e.g. ConvNeXtV2 seed=42 + ConvNeXtV2 seed=1337) produced no MCC gain: their predictions were too correlated. Same-paradigm cross-architecture ensembles (ConvNeXtV2 + SwinV2, both supervised IN-22k) gained only +0.001 MCC, within noise. The +0.0076 MCC gain only materialised once we paired a supervised backbone with a self-supervised backbone whose feature space was learned with a fundamentally different objective.
Inference
The full inference pipeline (letterbox preprocessing + ensemble averaging + threshold calibration) is implemented in base_miner/detectors/vit_detector.py of the SN72 miner repo, configured by ConvNextV2_DINOv2_ensemble.yaml (also included in this repository at the root).
Standalone usage:
import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF
from transformers import AutoImageProcessor, AutoModelForImageClassification

REPO = "ThaoTran7/streetvision-roadwork-ensemble"


class LetterboxTo224:
    """Resize to fit inside 224x224 (aspect ratio kept), then pad with black."""

    def __call__(self, img):
        img = img.convert("RGB")
        w, h = img.size
        s = min(224 / w, 224 / h)
        new_w, new_h = max(int(w * s), 1), max(int(h * s), 1)
        resized = TF.resize(img, [new_h, new_w])
        pl, pt = (224 - new_w) // 2, (224 - new_h) // 2
        # Padding order is [left, top, right, bottom].
        return TF.pad(resized, [pl, pt, 224 - new_w - pl, 224 - new_h - pt], fill=0)


def build_member(subfolder):
    model = AutoModelForImageClassification.from_pretrained(REPO, subfolder=subfolder).eval()
    # The processor is loaded only for its normalization statistics.
    proc = AutoImageProcessor.from_pretrained(REPO, subfolder=subfolder, use_fast=True)
    tf = transforms.Compose([
        LetterboxTo224(),
        transforms.ToTensor(),
        transforms.Normalize(mean=proc.image_mean, std=proc.image_std),
    ])
    return model, tf


m1, tf1 = build_member("convnextv2-base")
m2, tf2 = build_member("dinov2-base")


@torch.no_grad()
def predict(img):
    # Index 1 of the softmax output is read as the Roadwork class.
    p1 = torch.softmax(m1(pixel_values=tf1(img).unsqueeze(0)).logits, dim=-1)[0, 1].item()
    p2 = torch.softmax(m2(pixel_values=tf2(img).unsqueeze(0)).logits, dim=-1)[0, 1].item()
    p = (p1 + p2) / 2.0
    # calibration: map model-optimal 0.72 -> validator-effective 0.5
    if p <= 0.72:
        return p * 0.5 / 0.72
    return 0.5 + (p - 0.72) * 0.5 / (1 - 0.72)


print(predict(Image.open("test.jpg")))
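
To see what the letterbox step does to a non-square frame, the size arithmetic from `LetterboxTo224` can be run in isolation. This is a standard-library sketch that mirrors the same formulas; the helper name `letterbox_geometry` and the example frame sizes are illustrative.

```python
def letterbox_geometry(w: int, h: int, size: int = 224):
    """Return (new_w, new_h, (left, top, right, bottom)) for a w x h image."""
    s = min(size / w, size / h)
    new_w, new_h = max(int(w * s), 1), max(int(h * s), 1)
    left, top = (size - new_w) // 2, (size - new_h) // 2
    return new_w, new_h, (left, top, size - new_w - left, size - new_h - top)


for w, h in [(448, 224), (224, 896), (640, 480)]:
    new_w, new_h, pads = letterbox_geometry(w, h)
    # Resized content plus padding always fills the full 224x224 canvas.
    assert new_w + pads[0] + pads[2] == 224
    assert new_h + pads[1] + pads[3] == 224
    print((w, h), "->", (new_w, new_h), "pads", pads)
```

Because `int()` truncates, the resized long side can come out one pixel short of 224 for some input sizes; the right and bottom padding absorbs the difference, so the output is still 224×224.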
License
Apache 2.0. Base weights are subject to the licenses of facebook/convnextv2-base-22k-224 and facebook/dinov2-with-registers-base-imagenet1k-1-layer.