RF-DETR-Temporal
RF-DETR-Temporal extends Roboflow's RF-DETR from a single-image detector into a multi-frame, motion-aware detector. It stacks three consecutive frames and fuses them with a small temporal pre-embedding module placed in front of the unmodified pretrained DINOv2 patch embed, so existing RF-DETR weights load verbatim and training starts at exact single-frame parity, then learns temporal cues as a residual.
Its whole purpose is moving objects (e.g. smoke or fire) that are typically small, distant, and semi-transparent in surveillance video: precisely the regime where a single frame is weakest and inter-frame motion is the discriminative cue a still-image detector throws away. The 9-channel design exists to exploit that motion.
Derived from roboflow/rf-detr (Apache-2.0); significant changes were made. See Attribution & license.
Pipeline
The three input / fusion configurations. Only the front (the input and the orange TemporalPreEmbed block) differs; the patch embed → DINOv2 → LW-DETR decoder downstream is identical to upstream RF-DETR and loads its pretrained weights verbatim.
At initialisation R ≡ 0 and (for bgsub/bgsubcoh) the add-weight term vanishes on static
input, so the backbone receives exactly the current frame, identical to the single-frame
baseline. The temporal contribution is learned from there.
What's different from upstream
| | Upstream RF-DETR | RF-DETR-Temporal |
|---|---|---|
| Input | 1 frame, (B,3,H,W) | 3 stacked frames, (B,9,H,W), last = current |
| Multi-channel | widens the patch-embed conv (sums channels → averages frames) | TemporalPreEmbed reduces 9→3 before the unmodified 3ch patch embed |
| Pretrained patch embed | lossily widened | loaded unchanged; new module absorbed by strict=False |
| Init behaviour | - | exact single-frame parity; temporal learned as a residual |
| New config | - | in_channels, temporal_fusion ∈ {none, preembed, bgsub, bgsubcoh} |
| Training | PyTorch-Lightning stack | standalone DDP script + manifest loader + temporal-aware augmentation |
| Tooling | - | diagnostics/ suite (parity, motion-/size-stratified eval, …) |
The detector, decoder, heads, loss, and matcher are unchanged; the only model change is the new module inside the backbone plus two config fields and a one-line gate in the weight loader.
How it works (short)
- The pitfall. Naively widening the patch-embed Conv2d to 9 channels makes the embedding compute the temporal average of the frames, a low-pass / motion-blur op that destroys the change signal and feeds the backbone an out-of-distribution blurred image.
- The fix: TemporalPreEmbed. Reduce 9→3 channels with a small motion-aware module before the unmodified 3-channel patch embed, so DINOv2/RF-DETR weights load verbatim and the model starts at single-frame parity. Three fusion modes (preembed, bgsub, bgsubcoh) trade off how the temporal/motion signal is injected.
- The data lever. A size census showed the validation set was dominated by large instances with the small-object regime nearly empty, so the detection bottleneck was data, not architecture. A small-object copy-paste augmentation manufactures the missing small (optionally moving) targets.
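The averaging pitfall can be checked with a toy calculation. The sketch below is illustrative only: it looks at one pixel and one colour channel across three frames, and it assumes the widened kernel is the pretrained weight copied to every frame and divided by the frame count. The function names and the weight value are hypothetical, and the real widening code may differ in detail.

```python
def naive_widened_response(frames, w):
    """Response of a widened patch-embed weight at one pixel.

    Assumption: the pretrained weight `w` is tiled over the frames and
    divided by the number of frames (a common way to widen a conv).
    """
    widened = [w / len(frames)] * len(frames)
    return sum(wi * f for wi, f in zip(widened, frames))


def single_frame_response(frame, w):
    """Response of the original 3-channel weight to the current frame only."""
    return w * frame


# A small bright object arrives only in the current (last) frame.
frames = [0.0, 0.0, 0.9]
w = 2.0  # hypothetical pretrained weight

naive = naive_widened_response(frames, w)
mean_frame = sum(frames) / len(frames)

# The widened conv is exactly the original conv applied to the temporal mean.
assert abs(naive - single_frame_response(mean_frame, w)) < 1e-9

# The moving object is attenuated relative to the single-frame response.
assert naive < single_frame_response(frames[-1], w)
```

Under this assumption the widened embedding sees the mean of the three frames, so an object present in only one frame is attenuated and the frame-to-frame change is no longer recoverable downstream.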
The only architectural change is the front module. In preembed it is a zero-init residual on
motion (frame) differences: it adds nothing at initialisation (so the model starts exactly at the
single-frame baseline) and learns the temporal cue as a residual from there.
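A minimal sketch of the zero-init residual idea follows. It is not the repository's TemporalPreEmbed. It works on a single RGB pixel and stands in a per-pixel linear mix (a 3x6 weight matrix, like a 1x1 convolution) for whatever convolutions the real module uses. The names fuse_pixel and ZERO_INIT are hypothetical.

```python
def fuse_pixel(prev2, prev1, cur, weights):
    """Return cur + R(diffs) for one RGB pixel.

    prev2, prev1, cur: RGB triples for frames t-2, t-1, t (last = current).
    weights: 3x6 matrix mixing the six difference channels into an RGB
             residual (an assumed stand-in for the learned module).
    """
    # Motion signal: differences between the current frame and each earlier frame.
    diffs = [c - p for c, p in zip(cur, prev1)] + [c - p for c, p in zip(cur, prev2)]
    # Linear residual branch R.
    residual = [sum(w * d for w, d in zip(row, diffs)) for row in weights]
    # Residual connection: the current frame passes through unchanged.
    return [c + r for c, r in zip(cur, residual)]


# Zero initialisation: R is identically zero, so the output is the current frame.
ZERO_INIT = [[0.0] * 6 for _ in range(3)]

prev2, prev1, cur = [0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.9, 0.5, 0.3]
assert fuse_pixel(prev2, prev1, cur, ZERO_INIT) == cur

# With non-zero weights a static pixel still passes through untouched,
# because all frame differences are zero there.
learned = [[0.5 if i == j else 0.0 for j in range(6)] for i in range(3)]
assert fuse_pixel(cur, cur, cur, learned) == cur
```

The two assertions correspond to the two properties the design relies on: exact single-frame parity at initialisation, and no change to static content even after the residual branch has learned non-zero weights.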
Full details: docs/architecture.md · docs/temporal-fusion.md · docs/training.md · docs/findings.md
Quick start
pip install uv && uv sync --all-groups # PyTorch >=2.2,<3; transformers >=5,<6; Python >=3.10
# provide your data as two manifests, data_manifests/{train,valid}.txt, with one clip per line:
# /abs/frame0|/abs/frame1|/abs/frame2|<labels>
# where <labels> are YOLO "cls cx cy w h" boxes for the LAST (current) frame. See docs/training.md.
# train on 4 GPUs (coherence-gated motion add-weight + moving small-object augmentation)
DATA_DIR=data_manifests NUM_GPUS=4 RESOLUTION=952 \
TEMPORAL_FUSION=bgsubcoh AUG_SMALLOBJ_P=0.5 AUG_SMALLOBJ_MOTION=14 \
uv run --no-sync python train_temporal_base_v4.py
Env-var reference, data format, ONNX export and inference: docs/training.md.
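As an illustration of the manifest format in the Quick start, here is a hedged parsing sketch. It is not the repository's loader. It assumes that multiple boxes on one line are written as a flat, whitespace-separated run of "cls cx cy w h" groups and that an empty label field means a clip with no objects. Both are assumptions, and docs/training.md remains the authority. Clip and parse_manifest_line are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[int, float, float, float, float]  # cls, cx, cy, w, h (YOLO, normalised)


class Clip(NamedTuple):
    frames: Tuple[str, str, str]  # frame0, frame1, frame2 (last = current)
    boxes: List[Box]              # labels for the current frame only


def parse_manifest_line(line: str) -> Clip:
    """Parse '/abs/frame0|/abs/frame1|/abs/frame2|<labels>' into a Clip."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    f0, f1, f2, labels = parts

    # Assumption: boxes are a flat whitespace-separated list, five tokens each.
    tokens = labels.split()
    if len(tokens) % 5 != 0:
        raise ValueError("label field is not a multiple of 5 tokens")

    boxes: List[Box] = []
    for i in range(0, len(tokens), 5):
        cls, cx, cy, w, h = tokens[i:i + 5]
        boxes.append((int(cls), float(cx), float(cy), float(w), float(h)))
    return Clip((f0, f1, f2), boxes)


clip = parse_manifest_line("/d/a.jpg|/d/b.jpg|/d/c.jpg|0 0.5 0.5 0.1 0.2")
assert clip.frames[-1] == "/d/c.jpg"
assert clip.boxes == [(0, 0.5, 0.5, 0.1, 0.2)]
```

Paths that themselves contain a "|" character would break this sketch, so a real loader may need a stricter split.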
Results
Accuracy is class-averaged mAP@0.5. "Aggregate" is on the real validation set (dominated by large instances); "small-object" is on a deterministic synthetic small-object set, since the real data has almost no small instances (the data bottleneck; see below).
| Configuration | Aggregate mAP@0.5 | Synthetic small-object mAP@0.5 |
|---|---|---|
| single-frame baseline | 0.875 | - |
| naive 9-channel (temporal averaging) | ~0.892 | - |
| preembed (temporal) | 0.907 | ~0.06 |
| preembed + small-object augmentation | ~0.88 | ~0.80 |
| bgsub (plain motion add-weight) | 0.891 | - |
| bgsubcoh + moving-object augmentation | in progress | in progress |
Reading: the temporal fix recovers and exceeds the single-frame baseline; the plain motion add-weight raises motion-region attention ~3× but did not improve aggregate detection (an honest negative result); and the real lever for the moving/small regime was data: synthetic small-object augmentation lifts small-object mAP from ≈0.06 to ≈0.80. Full record, caveats, and the motion-stratified breakdown: docs/findings.md.
Speed
The temporal extension is near-zero overhead: the DINOv2 backbone and LW-DETR decoder are unchanged, and the fusion module adds just 438 parameters (0.001% of the model), a few convolutions at input resolution. Per-inference network latency therefore matches upstream RF-DETR Base (real-time class) on the same hardware; the only added runtime cost is decoding 3 frames instead of 1 per inference (I/O, not compute). Absolute FPS is not separately benchmarked here.
Repository layout
src/rfdetr/                              # upstream RF-DETR (Apache-2.0), minimally modified
└── models/backbone/temporal_fusion.py   # TemporalPreEmbed (9ch→3ch motion fusion)
train_temporal_base_v4.py                # DDP training entrypoint + dataset + augmentations
diagnostics/                             # probes + stratified evaluators
export_onnx.py                           # ONNX export (9-channel temporal model)
docs/                                    # detailed design / training / findings
Generated locally and git-ignored: runs/, onnx_exports/, data_manifests/, and all
*.pth/*.onnx/*.mp4 artifacts (see .gitignore).
Attribution & license
Derivative work of RF-DETR by Roboflow, licensed under
the Apache License 2.0, and itself released under Apache-2.0. The upstream LICENSE is retained
and all upstream source files keep their original license headers. Per Apache §4, this notice
states that significant changes were made: a new temporal pre-embedding module and fusion
modes, 9-channel input wiring, a small-object augmentation, and standalone training/diagnostics
tooling. The DINOv2-with-Registers backbone code is itself derived from HuggingFace Transformers
(see that file's header).

