# YOLOv11s · VisDrone object detector

Small YOLOv11 object detector fine-tuned on the VisDrone2019 dataset,
targeting aerial/surveillance-angle traffic footage — the exact viewpoint
used by our telecom_videos set (fixed overhead/elevated cameras).
Part of the ktk-studio traffic-violation analytics stack.
## Summary

| Property | Value |
|---|---|
| Architecture | YOLOv11s (Ultralytics) |
| Input | 640×640 RGB |
| Output | `output0`, shape `[1, 14, N]`: 4 bbox coords + 10 class scores per candidate |
| Parameters | 9.5 M |
| Weights | `best.pt` (19 MB) / `best.onnx` (37 MB, opset 19) |
| Epochs trained | 67 (early-stopped from 80) |
| Best epoch | 52 |
| Val mAP50 | 0.377 |
| Val mAP50-95 | 0.219 |
| Val Precision | 0.502 |
| Val Recall | 0.388 |
## Classes

pedestrian, people, bicycle, car, van, truck, tricycle,
awning-tricycle, bus, motor — matching the VisDrone2019 taxonomy.
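Class indices in the raw output follow the order above (the same order the Ultralytics `VisDrone.yaml` uses). As a Python mapping for downstream code:

```python
# VisDrone2019 class index -> name, in VisDrone.yaml order
CLASS_NAMES = {
    0: "pedestrian", 1: "people", 2: "bicycle", 3: "car", 4: "van",
    5: "truck", 6: "tricycle", 7: "awning-tricycle", 8: "bus", 9: "motor",
}
```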
## Training

Dataset auto-downloaded via `data=VisDrone.yaml` (Ultralytics).

```bash
yolo detect train data=VisDrone.yaml model=yolo11s.pt \
  epochs=80 imgsz=640 batch=32 patience=15 device=0
```

Hardware: NVIDIA B200 180GB. Batch 32, ~35 GB VRAM. Training time ~50 min.
## Usage

### Ultralytics

```python
from ultralytics import YOLO

model = YOLO("best.pt")
r = model("traffic_camera_frame.jpg", conf=0.25)
r[0].show()
```
### ONNX Runtime (GPU)

```python
import cv2
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("best.onnx", providers=["CUDAExecutionProvider"])
img = cv2.imread("frame.jpg")

# letterbox to 640×640: resize with preserved aspect ratio, pad with gray (114)
scale = min(640 / img.shape[1], 640 / img.shape[0])
nw, nh = int(img.shape[1] * scale), int(img.shape[0] * scale)
resized = cv2.resize(img, (nw, nh))
canvas = np.full((640, 640, 3), 114, np.uint8)
pad_x, pad_y = (640 - nw) // 2, (640 - nh) // 2
canvas[pad_y:pad_y + nh, pad_x:pad_x + nw] = resized

# BGR -> RGB, [0, 1] float32, CHW, add batch dim
x = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
x = np.ascontiguousarray(x.transpose(2, 0, 1)[None])

y = sess.run(None, {"images": x})[0][0]  # (14, N)
# decode: y[:4] = cx,cy,w,h; y[4:14] = class scores
```
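The raw `(14, N)` head still needs score thresholding, a letterbox un-mapping, and NMS before the boxes are usable. A minimal pure-NumPy decode sketch (the `nms`/`decode` helpers are ours, not part of the export; `scale`, `pad_x`, `pad_y` are the letterbox values computed during preprocessing):

```python
import numpy as np

def nms(boxes, scores, iou_thres=0.45):
    """Greedy non-maximum suppression on xyxy boxes (no cv2 dependency)."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = boxes[order[1:]]
        # intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], rest[:, 0])
        yy1 = np.maximum(boxes[i, 1], rest[:, 1])
        xx2 = np.minimum(boxes[i, 2], rest[:, 2])
        yy2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou <= iou_thres]   # drop heavy overlaps
    return np.array(keep, dtype=int)

def decode(y, scale, pad_x, pad_y, conf_thres=0.25, iou_thres=0.45):
    """y: (14, N) raw head -> (boxes_xyxy, scores, class_ids) in original-image pixels."""
    preds = y.T                               # (N, 14)
    scores = preds[:, 4:].max(axis=1)
    cls_ids = preds[:, 4:].argmax(axis=1)
    m = scores > conf_thres
    preds, scores, cls_ids = preds[m], scores[m], cls_ids[m]
    cx, cy, w, h = preds[:, 0], preds[:, 1], preds[:, 2], preds[:, 3]
    # undo the letterbox: shift by padding, divide by resize scale
    boxes = np.stack([(cx - w / 2 - pad_x) / scale,
                      (cy - h / 2 - pad_y) / scale,
                      (cx + w / 2 - pad_x) / scale,
                      (cy + h / 2 - pad_y) / scale], axis=1)
    keep = nms(boxes, scores, iou_thres)
    return boxes[keep], scores[keep], cls_ids[keep]
```

A vectorized per-class NMS (e.g. offsetting boxes by class id) is a common refinement; the greedy class-agnostic form above keeps the sketch short.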
## Intended use

- Real-time object detection on traffic-camera / drone-style footage.
- Drop-in replacement for TrafficCamNet when the scene has van/bus/truck/people
  that need to be distinguished (TrafficCamNet collapses them into `car`).
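Conversely, if a downstream consumer still expects the coarser TrafficCamNet-style labels, the finer VisDrone classes can be collapsed with a small lookup. The mapping below is an illustrative assumption, not a shipped artifact; adjust it to whatever taxonomy your pipeline actually consumes:

```python
# VisDrone index -> coarse label (illustrative mapping, not part of the model)
VISDRONE = ["pedestrian", "people", "bicycle", "car", "van", "truck",
            "tricycle", "awning-tricycle", "bus", "motor"]
COARSE = {"pedestrian": "person", "people": "person",
          "bicycle": "bicycle", "motor": "bicycle",
          "car": "car", "van": "car", "truck": "car",
          "bus": "car", "tricycle": "car", "awning-tricycle": "car"}

def coarse_label(cls_id: int) -> str:
    """Map a VisDrone class index to its coarse label."""
    return COARSE[VISDRONE[cls_id]]
```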
## Notes

- Short training (67 epochs) leaves room for improvement; longer schedules with stronger augmentation can push mAP50 above 0.5.
- The VisDrone val set contains many tiny or occluded objects; a low mAP50-95 is typical for this domain and roughly matches published VisDrone baselines.
- In the production stack this model is served through Triton (ONNX Runtime backend) and paired with a light Python IoU tracker for per-object IDs.
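The per-object-ID step can be as simple as greedy IoU matching between consecutive frames. A minimal sketch of that idea, not the production tracker (class and method names are ours; track aging/deletion is omitted for brevity):

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between two sets of xyxy boxes, shapes (N, 4) and (M, 4)."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

class IoUTracker:
    def __init__(self, iou_thres=0.3):
        self.iou_thres = iou_thres
        self.next_id = 0
        self.tracks = {}          # track id -> last xyxy box

    def update(self, boxes):
        """boxes: (N, 4) xyxy detections for one frame -> list of track ids."""
        ids = [-1] * len(boxes)
        if self.tracks and len(boxes):
            tids = list(self.tracks)
            prev = np.array([self.tracks[t] for t in tids])
            iou = iou_matrix(np.asarray(boxes, float), prev)
            # greedy: repeatedly take the best remaining (detection, track) pair
            while iou.size and iou.max() > self.iou_thres:
                d, t = np.unravel_index(iou.argmax(), iou.shape)
                ids[d] = tids[t]
                iou[d, :] = -1    # detection d is consumed
                iou[:, t] = -1    # track t is consumed
        for d in range(len(boxes)):
            if ids[d] == -1:      # unmatched detection -> new track
                ids[d] = self.next_id
                self.next_id += 1
        self.tracks = {ids[d]: np.asarray(boxes[d], float)
                       for d in range(len(boxes))}
        return ids
```

A production version would also age out tracks that miss a few frames and optionally gate matches by class id; greedy IoU matching alone is enough for slow-moving traffic seen from a fixed camera.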
## License

AGPL-3.0 (inherits the Ultralytics YOLOv11 weights license).