Geo-trax: YOLOv8s Vehicle Detector for Drone BEV Imagery

This is the default detection model for Geo-trax, a comprehensive pipeline for extracting georeferenced vehicle trajectories from high-altitude drone (bird's-eye view) video footage. The model detects vehicles in aerial imagery and underpins the results reported in the associated publication.

🎬 This accelerated animation previews some of the capabilities of Geo-trax. Watch the full demonstration (~4 min) on YouTube.

Model Details

Property	Value
Architecture	YOLOv8s (HBB, horizontal bounding boxes)
Input resolution	1920 × 1920 px
Classes	6 trained (4 primary + 2 auxiliary; see below)
Parameters	~11 M
Framework	Ultralytics ≥ 8.4.64
Trained on	19,339 annotated aerial images (679,306 labeled instances); multi-stage, see publication
Validated on	Songdo Vision test set (1,084 images, 55,124 vehicle instances)

Classes and Detection Performance

Metrics reported on the Songdo Vision test split (1,084 images, 55,124 labeled vehicle instances). The Instances column is the per-class support in the test set. See Table 3 of the publication for full results.

ID	Label	Notes	Instances	Precision	Recall	mAP@50	mAP@50-95
0	Car	incl. vans	49,508	0.979	0.981	0.992	0.835
1	Bus		1,759	0.952	0.977	0.988	0.826
2	Truck		3,052	0.887	0.916	0.935	0.722
3	Motorcycle		805	0.827	0.866	0.888	0.463
4	Pedestrian	not evaluated	n/a	n/a	n/a	n/a	n/a
5	Bicycle	not evaluated	n/a	n/a	n/a	n/a	n/a
All			55,124	0.911	0.935	0.951	0.711

The model reaches 0.951 mAP@50 and 0.711 mAP@50-95 overall, with near-saturated accuracy on cars and buses (mAP@50 ≥ 0.988). Trucks and especially motorcycles are harder: motorcycles are small, sparse in the test set (805 instances), and the main driver of the lower mAP@50-95.

Evaluation Plots

Precision-recall curves and the normalized confusion matrix on the Songdo Vision test set:

_{Precision-Recall Curve}

_{Normalized Confusion Matrix}

Note on pedestrian and bicycle classes: The model was trained on pedestrian and bicycle instances; however, these classes are not evaluated and not recommended for use. They were underrepresented in the training data, are not annotated in the Songdo Vision dataset (making reliable evaluation impossible), and achieve poor detection performance in practice.

How to Use

With Geo-trax (recommended)

This model is the default in Geo-trax and downloads automatically on first use:

pip install geo-trax
geotrax extract video.mp4           # detect, track, and stabilize; auto-downloads the model
geotrax batch   video.mp4 --no-geo  # detect, track, and stabilize; skip georeferencing
geotrax batch   video.mp4           # full pipeline including georeferencing (requires orthophotos)

See the Geo-trax GitHub README for the full pipeline, configuration options, and georeferencing.

Direct Ultralytics inference

from ultralytics import YOLO
from huggingface_hub import hf_hub_download

weights = hf_hub_download(repo_id="rfonod/geo-trax", filename="geotrax_hbb_yolov8s_1920_v1.pt")
model = YOLO(weights)

results = model("drone_frame.jpg", imgsz=1920, conf=0.25, iou=0.45, classes=[0, 1, 2, 3])
results[0].show()

Tip: The model was trained and validated at 1920 px input resolution. Downscaling to 1280 px is possible with a small accuracy trade-off; going below 960 px significantly degrades detection of small vehicles (motorcycles, distant cars). Pass classes=[0, 1, 2, 3] to restrict inference to the four evaluated classes and suppress unreliable predictions.

Training Data

Training followed a multi-stage strategy starting from YOLOv8s weights pretrained on COCO as the initial foundation. Two successive stages were applied:

Stage 1 (BASE): The model was trained on a large, diverse collection drawn from eight public aerial and drone datasets (CARPK, PUCPR+, CyCAR, UAVDT, HARPY, RAI4VD, UIT-ADrone, and VisDrone) combined with the Songdo Vision dataset, totalling 19,339 training images with 679,306 annotations across 6 vehicle classes (car, bus, truck, motorcycle, pedestrian, bicycle).

Stage 2 (FINE): The BASE-trained model was subsequently fine-tuned on a curated, high-quality subset of 9,004 images with 321,368 annotations, emphasising accurate annotations and higher-resolution images, again combined with Songdo Vision, to yield the final weights released here.

Training set composition (annotations per class):

Stage	Images	Annotations	Car	Bus	Truck	Motorcycle	Pedestrian	Bicycle
BASE	19,339	679,306	561,666	15,587	28,830	44,512	24,239	4,472
FINE	9,004	321,368	266,745	8,047	14,305	30,925	1,260	86

Songdo Vision comprises 5,419 annotated drone frames (4,335 training / 1,084 test; 80/20 split) collected during a large-scale urban traffic monitoring experiment in Songdo, South Korea. It covers four primary vehicle classes captured at 140-150 m altitude by DJI Mavic 3 drones, contributing 217,311 training and 55,124 test instances to the totals above.

Training configuration:

Setting	Value
Initialization	YOLOv8s pretrained on COCO
Optimizer	SGD
Learning rate (initial / final factor)	0.01 / 0.01
Momentum	0.937
Weight decay	0.0005
Batch size	8
Early stopping	50-epoch patience
Input resolution	1920 × 1920 px (letterbox padding)
Mixed precision	AMP enabled
Augmentation	random scaling, translation, horizontal flip, mosaic, colour jitter, Gaussian/median blur, grayscale, CLAHE

See the publication for complete dataset statistics, training details, and ablation results.

Intended Use and Limitations

GSD assumption: The bundled Geo-trax config assumes a ground sampling distance (GSD) of ~0.027 m/px (DJI Mavic 3, 4K, 140-150 m altitude). Adjust this value in the config for different hardware or flight altitudes.
Supported classes: Car, bus, truck, and motorcycle (class IDs 0-3). The model was also trained on pedestrian and bicycle instances; however, these classes achieve poor detection performance and are not recommended for use (see the class table above). Geo-trax filters to the four primary classes by default; when using Ultralytics directly, pass classes=[0, 1, 2, 3] to suppress unreliable predictions.

Related datasets and resources

Songdo Traffic: the georeferenced vehicle-trajectory dataset this model helps produce via the Geo-trax pipeline: 10.5281/zenodo.13828384 · HF rfonod/songdo-traffic
Songdo Vision: the vehicle-detection (annotated image) dataset used to train and validate this model: 10.5281/zenodo.13828407 · HF rfonod/songdo-vision
Source video recordings (not open access): 10.5075/EPFL.20.500.14299/253923
Publication: Transportation Research Part C (2025): 10.1016/j.trc.2025.105205 · arXiv:2411.02136
Software: Geo-trax: github.com/rfonod/geo-trax · Zenodo 10.5281/zenodo.12119542 · demo video

Citation

If you use this model, please cite the associated publication:

@article{fonod2025advanced,
  title   = {Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery},
  author  = {Fonod, Robert and Cho, Haechan and Yeo, Hwasoo and Geroliminis, Nikolas},
  journal = {Transportation Research Part C: Emerging Technologies},
  volume  = {178},
  pages   = {105205},
  year    = {2025},
  doi     = {10.1016/j.trc.2025.105205}
}

If you additionally use the Geo-trax software, please also cite the specific version you used via its Zenodo record. For example, for version 1.0.0:

@software{fonod2026geo-trax,
  author  = {Fonod, Robert},
  title   = {Geo-trax: A Comprehensive Framework for Georeferenced Vehicle Trajectory Extraction from Drone Imagery},
  url     = {https://github.com/rfonod/geo-trax},
  doi     = {10.5281/zenodo.12119542},
  version = {1.0.0},
  year    = {2026}
}

License

This model is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license; see the LICENSE file for the full terms. The Geo-trax codebase is distributed separately under the MIT License.