Improving RT-DETRv4 Person and Head Detection on CrowdHuman with C-RADIOv4 Distillation

Abstract

Real-time person and head detection in dense crowds remains fundamentally difficult, as detectors must localize partially visible bodies under severe inter-person occlusion while preserving the fine-grained semantics needed to separate tiny heads in heavily cluttered scenes. Recent real-time detection transformers such as RT-DETRv4 address feature degradation in lightweight detectors by distilling knowledge from vision foundation models (VFMs), yet the impact of teacher choice on detection performance is underexplored. In this paper, we benchmark RT-DETRv4-S against YOLO26-S, YOLOv8-S, and RF-DETR-S on CrowdHuman visible-person and head detection, and investigate whether replacing the default DINOv3-Base teacher with C-RADIOv4-SO400M can further improve detection quality. Rather than scaling the DINOv3 teacher from the default Base model to a much larger DINOv3-7B variant, we study an alternative design in which the student is supervised by C-RADIOv4, an agglomerative vision backbone that already distills knowledge from DINOv3-7B together with complementary teachers such as SAM3 and SigLIP2. RT-DETRv4-S already surpasses all three baselines under both teachers. With C-RADIOv4, it achieves the best mAP on both visible-person (0.8410) and head detection (0.7881), improving over DINOv3 by +0.91 pp and +1.71 pp, respectively, and outperforming the Ultralytics YOLO26-S baseline by +2.47 pp and +2.91 pp.

For the detailed methodology, training protocol, and evaluation protocol, please refer to the Technical Report.

Evaluation and Results

This release focuses on compact visible-person and head detection on CrowdHuman. The study uses CrowdHuman visible-body and head annotations, trains on the training split, and evaluates on the validation split. RT-DETRv4-S is compared against YOLO26-S, YOLOv8-S, and RF-DETR-S, and the teacher study isolates the effect of replacing the default DINOv3-Base teacher with C-RADIOv4-SO400M while keeping the RT-DETRv4-S detector fixed.

Across the evaluated compact models, RT-DETRv4-S with C-RADIOv4-SO400M is the strongest configuration on both visible-person and head detection. The teacher swap improves visible-person mAP from 0.8319 to 0.8410 and head mAP from 0.7710 to 0.7881. Because RT-DETRv4 uses distillation only during training, the DINOv3-Base and C-RADIOv4 variants share the same inference architecture and deployment cost.
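The reported percentage-point (pp) gains follow directly from the mAP values above; a quick arithmetic check:

```python
# mAP values reported for RT-DETRv4-S under each teacher (Tables 1 and 2).
dinov3 = {"visible_person": 0.8319, "head": 0.7710}
cradiov4 = {"visible_person": 0.8410, "head": 0.7881}

# Teacher-swap improvement in percentage points.
for task in dinov3:
    delta_pp = (cradiov4[task] - dinov3[task]) * 100
    print(f"{task}: +{delta_pp:.2f} pp")
# visible_person: +0.91 pp
# head: +1.71 pp
```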

Table 1. Visible-person detection results on CrowdHuman validation. Best result in bold, second-best underlined.

| Model | mAP |
|---|---|
| **RT-DETRv4-S (C-RADIOv4-SO400M)** | **0.8410** |
| <u>RT-DETRv4-S (DINOv3 Base)</u> | <u>0.8319</u> |
| YOLOv8-S | 0.8263 |
| YOLO26-S | 0.8163 |
| RF-DETR-S | 0.8006 |

Table 2. Head detection results on CrowdHuman validation (ignore-region mode 1). Best in bold, second-best underlined.

| Model | mAP |
|---|---|
| **RT-DETRv4-S (C-RADIOv4-SO400M)** | **0.7881** |
| <u>RT-DETRv4-S (DINOv3 Base)</u> | <u>0.7710</u> |
| YOLO26-S | 0.7590 |
| YOLOv8-S | 0.7336 |
| RF-DETR-S | 0.6990 |

Table 3. Visible-person mAP of default COCO-pretrained checkpoints evaluated on CrowdHuman validation. Best in bold, second-best underlined.

| Model | mAP |
|---|---|
| **RF-DETR-S** | **0.6956** |
| <u>YOLOv8-S</u> | <u>0.6935</u> |
| RT-DETRv4-S | 0.6846 |
| YOLO26-S | 0.6715 |

Table 4. TensorRT latency on NVIDIA T4 together with reported model complexity.

| Model | Resolution | TensorRT latency on T4 (ms) | Params (M) | GFLOPs |
|---|---|---|---|---|
| RT-DETRv4-S [12] | 640 | 3.66 | 10 | 25 |
| YOLO26-S [16] | 640 | 2.5 | 9.5 | 20.7 |
| YOLOv8-S [8, 15] | 640 | 6.97 | 11 | 29 |
| RF-DETR-S [13, 14] | 512 | 3.50 | 32.1 | 59.8 |
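For intuition, the Table 4 latencies can be converted into a rough single-stream throughput (1000 ms divided by per-image latency); this ignores batching and pre/post-processing, so it is an upper bound rather than a measured FPS:

```python
# TensorRT T4 latencies from Table 4, in milliseconds per image.
latencies_ms = {
    "RT-DETRv4-S": 3.66,
    "YOLO26-S": 2.5,
    "YOLOv8-S": 6.97,
    "RF-DETR-S": 3.50,
}

# Implied single-stream throughput (images per second).
for model, ms in latencies_ms.items():
    print(f"{model}: {1000.0 / ms:.0f} FPS")
# RT-DETRv4-S: 273 FPS
# YOLO26-S: 400 FPS
# YOLOv8-S: 143 FPS
# RF-DETR-S: 286 FPS
```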

Qualitative Analysis

Qualitative comparisons use 2 × 3 panels, with the original image first and the five detectors ordered by descending mAP for the corresponding task. Visualizations use a confidence threshold of 0.40, and true-positive versus false-positive assignments are determined by greedy one-to-one matching at IoU 0.50. Dashed blue boxes denote ground truth, green boxes denote true positives, and red boxes denote false positives.
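The greedy one-to-one matching used for these visualizations can be sketched as follows. This is an illustrative implementation of the stated protocol (predictions visited in descending confidence, each ground-truth box matched at most once, IoU threshold 0.50), not the exact visualization code from this release:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_match(preds, scores, gts, iou_thr=0.5):
    """Label each prediction TP or FP by greedy one-to-one matching.

    Predictions are visited in descending confidence; each ground-truth
    box can be claimed by at most one prediction.
    """
    order = np.argsort(scores)[::-1]
    matched = set()
    labels = {}
    for i in order:
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(preds[i], gt)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            labels[i] = "TP"
        else:
            labels[i] = "FP"
    return [labels[i] for i in range(len(preds))]
```

For example, with one ground-truth box and two predictions, only the overlapping prediction is labeled a true positive: `greedy_match([[0, 0, 10, 10], [20, 20, 30, 30]], [0.9, 0.8], [[0, 0, 10, 10]])` returns `["TP", "FP"]`.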

Visible-person low-density example

Visible-person medium-density example

Visible-person high-density example

Head low-density example

Head medium-density example

Head high-density example

Model Inference

All model weights are available in this repository:

  • ckpts/rt-detrv4s-cradiov4-so400m/best_stg2.safetensors
  • ckpts/rt-detrv4s-dinov3b/best_stg2.safetensors
  • ckpts/rf-detrs/checkpoint_best_total.safetensors

Use the official repositories for inference:

  1. RT-DETRv4 for the two RT-DETRv4-S checkpoints.
  2. RF-DETR for the RF-DETR-S checkpoint.

Dependencies

The inference wrapper requires the following Python packages:

pip install torch safetensors pillow numpy supervision

For RT-DETRv4, also install the dependencies from the RT-DETRv4 repository. For RF-DETR, also install the dependencies from the RF-DETR repository (or pip install rfdetr).

Usage

Use scripts/inference.py to run inference. It handles checkpoint conversion automatically and runs inference through the official repository code. For RT-DETRv4 it also generates a minimal 2-class inference config automatically, so users do not need to hand-write one. The C-RADIOv4 teacher is not needed at inference time; only the trained student weights are used.

The RT-DETRv4 wrapper saves the converted checkpoint, the generated inference YAML, and the official torch_results.jpg / torch_results.mp4 output. The RF-DETR wrapper saves the converted checkpoint, an annotated image, and a JSON file with predictions.

Example RT-DETRv4 inference (C-RADIOv4 teacher):

python scripts/inference.py rtdetrv4 \
  --repo /path/to/RT-DETRv4 \
  --checkpoint ckpts/rt-detrv4s-cradiov4-so400m/best_stg2.safetensors \
  --input /path/to/image.jpg \
  --device cuda \
  --output-dir outputs/rtdetrv4_cradio

Example RT-DETRv4 inference (DINOv3-Base teacher):

python scripts/inference.py rtdetrv4 \
  --repo /path/to/RT-DETRv4 \
  --checkpoint ckpts/rt-detrv4s-dinov3b/best_stg2.safetensors \
  --input /path/to/image.jpg \
  --device cuda \
  --output-dir outputs/rtdetrv4_dinov3b

Example RF-DETR inference:

python scripts/inference.py rfdetr \
  --repo /path/to/rf-detr \
  --checkpoint ckpts/rf-detrs/checkpoint_best_total.safetensors \
  --input /path/to/image.jpg \
  --device cuda \
  --output-dir outputs/rfdetr

Standalone Checkpoint Conversion

This release also includes scripts/convert_release_checkpoint.py for users who only want checkpoint conversion without running inference.

# Convert RT-DETRv4 checkpoint
python scripts/convert_release_checkpoint.py \
  --framework rtdetrv4 \
  --input ckpts/rt-detrv4s-cradiov4-so400m/best_stg2.safetensors \
  --output converted/best_stg2.pth

# Convert RF-DETR checkpoint
python scripts/convert_release_checkpoint.py \
  --framework rfdetr \
  --input ckpts/rf-detrs/checkpoint_best_total.safetensors \
  --output converted/checkpoint_best_total.pth

Summary

This release packages the main checkpoints from the accompanying CrowdHuman study and highlights teacher selection as a key design choice for VFM-distilled detectors. With the RT-DETRv4-S student fixed, replacing DINOv3-Base with C-RADIOv4 improves visible-person mAP by +0.91 points and head mAP by +1.71 points, indicating that the gains come from a stronger transferred representation rather than from increased student capacity.

Among the compact models evaluated, the C-RADIOv4-distilled RT-DETRv4-S checkpoint is the strongest overall, reaching 0.8410 visible-person mAP and 0.7881 head mAP on CrowdHuman validation. Because the teacher is used only during training, these gains do not add inference-time cost, making the C-RADIOv4-based RT-DETRv4-S checkpoint the most practical release in this comparison.
