DinoFlow: DINOv3 ViT-S/16 + correlation-augmented SDT optical-flow head
A compact optical-flow decoder on a frozen DINOv3 ViT-S/16 backbone, generalizing the
AnyDepth SDT recipe (arXiv:2601.02760) to two-frame flow. Only the small decoder is trained; the
DINOv3 encoder is frozen and run on the fly on both frames. Trained on the standard FlowNet/RAFT
C+T corpus (FlyingChairs → FlyingThings3D) from
blanchon/dinoflow-dataset.
Code: https://github.com/julien-blanchon/dinodepth (src/dinov3_dense).
Architecture
The depth SDT trunk, reused verbatim, with a flow front-end:
- The frozen DINOv3 backbone runs on both frames (siamese); a shared softmax WeightedFusion
collapses each frame's 4 tapped layers into a single feature grid at stride 16 (H/16 × W/16).
- A local correlation cost volume (radius 4 at stride 16, i.e. ±64 px; 9 × 9 = 81 neighbors) plus
the feature difference between the two grids form the motion signal.
- The AnyDepth trunk (SpatialDetailEnhancer → two learned DySample ×4 stages) upsamples back to
full resolution, and a final conv emits 2 channels (u, v) instead of single-channel disparity.
Single forward pass (no RAFT-style iterative refinement). Decoder: 6.88 M parameters.
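The figures quoted for the cost volume follow directly from the search radius and the feature stride. As a rough illustration, the sketch below computes them and runs a naive local correlation on a toy grid. It is plain Python for readability, not the repository's implementation: the function names, the zero cost outside the grid, and the division by the channel count are all assumptions.

```python
from itertools import product


def num_neighbors(radius):
    """Number of cells in a square search window of the given radius."""
    return (2 * radius + 1) ** 2


def pixel_range(radius, stride):
    """Largest displacement, in input pixels, that the window can represent."""
    return radius * stride


def local_correlation(f1, f2, radius):
    """Naive local cost volume (illustrative only).

    f1, f2: feature grids as nested lists indexed [y][x][channel].
    Returns (volume, offsets), where volume[y][x][k] is the cost for
    offsets[k] = (dy, dx).
    Assumed conventions: zero cost outside the grid, dot product / channels.
    """
    height, width, channels = len(f1), len(f1[0]), len(f1[0][0])
    offsets = list(product(range(-radius, radius + 1), repeat=2))  # (dy, dx)
    volume = []
    for y in range(height):
        row = []
        for x in range(width):
            costs = []
            for dy, dx in offsets:
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    dot = sum(a * b for a, b in zip(f1[y][x], f2[yy][xx]))
                    costs.append(dot / channels)
                else:
                    costs.append(0.0)
            row.append(costs)
        volume.append(row)
    return volume, offsets


print(num_neighbors(4), pixel_range(4, 16))  # 81 64
```

The same arithmetic explains the decoder tail: two ×4 upsampling stages give 4 × 4 = 16×, which undoes the stride-16 feature grid.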
Usage
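```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from dinov3_dense.head import FlowModel, FlowModelConfig

# Build the model around the frozen DINOv3 ViT-S/16 backbone.
model = FlowModel.from_pretrained(FlowModelConfig(backbone="vits16"))
# Load the trained decoder weights into the flow head only.
weights = hf_hub_download("blanchon/dinoflow-model", "flow-vits16.safetensors")
model.head.load_state_dict(load_file(weights))
model.eval()

# image1, image2: the two input frames; the output has 2 channels (u, v).
flow = model(image1, image2)
```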
Zero-shot benchmark
Full-split evaluation on Sintel (train, 1041 pairs/pass) and KITTI-2015 (train, 200 pairs) with the
standard EPE / Fl-all protocol (no alignment), via anyflow-benchmark. EPE in px, lower is better.
| Method (C+T) | Sintel-clean EPE | Sintel-final EPE | KITTI-15 EPE | KITTI-15 Fl-all |
|---|---|---|---|---|
| RAFT | 1.43 | 2.71 | 5.04 | 17.4% |
| FlowFormer | 1.01 | 2.40 | 4.09 | 14.7% |
| SEA-RAFT | 1.19 | 4.11 | 3.62 | 12.9% |
| DinoFlow ViT-S (ours) | 3.97 | 5.06 | 19.79 | 61.6% |
Pixel accuracy (fraction within threshold): Sintel-clean px3 0.81 / px5 0.87; Sintel-final px3 0.77.
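For reference, the reported metrics can be sketched as follows. EPE is the mean Euclidean distance between predicted and ground-truth flow vectors; Fl-all uses the usual KITTI-2015 outlier rule (error above 3 px and above 5% of the ground-truth magnitude); pxN is the fraction of pixels with error below N px. This is an illustrative plain-Python version with hypothetical function names, not the anyflow-benchmark code, and the strict `<` comparison for pxN is an assumption.

```python
import math


def endpoint_errors(pred, gt, valid=None):
    """Return (error, gt_magnitude) pairs for lists of (u, v) vectors.

    valid: optional list of booleans (e.g. a sparse ground-truth mask);
    pixels marked False are skipped.
    """
    if valid is None:
        valid = [True] * len(gt)
    return [
        (math.hypot(pu - gu, pv - gv), math.hypot(gu, gv))
        for (pu, pv), (gu, gv), ok in zip(pred, gt, valid)
        if ok
    ]


def epe(pred, gt, valid=None):
    """Mean end-point error in pixels."""
    errs = endpoint_errors(pred, gt, valid)
    return sum(e for e, _ in errs) / len(errs)


def fl_all(pred, gt, valid=None):
    """Fraction of outliers: error > 3 px AND error > 5% of GT magnitude."""
    errs = endpoint_errors(pred, gt, valid)
    return sum(1 for e, mag in errs if e > 3.0 and e > 0.05 * mag) / len(errs)


def px_accuracy(pred, gt, threshold, valid=None):
    """Fraction of pixels whose error is below the threshold (assumed strict)."""
    errs = endpoint_errors(pred, gt, valid)
    return sum(1 for e, _ in errs if e < threshold) / len(errs)


# Toy vectors, chosen only to exercise the definitions.
pred = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0), (100.0, 0.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (10.0, 0.0), (104.0, 0.0)]
print(epe(pred, gt), fl_all(pred, gt), px_accuracy(pred, gt, 3.0))  # 2.25 0.25 0.5
```

Note that the last toy vector has a 4 px error but is not an Fl-all outlier, because 4 px is under 5% of its 104 px ground-truth magnitude.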
Honest positioning. This is a deliberately lightweight, single-pass probe (a frozen backbone
with a tiny decoder and no iterative refinement), so it lands roughly at FlowNet level, well
behind the recurrent-refinement SOTA above. The weak spot is KITTI: its large automotive displacements
exceed the ±64 px local-correlation range and the GT is sparse LiDAR, the known failure mode of a lite
correlation head trained on synthetic C+T only. Results on Sintel (moderate motion, dense GT) are far stronger.
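One way to check the range argument on a given dataset is to measure how much of the ground truth lies outside the search window. The helper below is a hypothetical diagnostic, not part of the repository, and it assumes a square window, so a vector counts as out of range when either component exceeds radius × stride pixels.

```python
def fraction_out_of_range(flows, radius=4, stride=16):
    """Share of (u, v) ground-truth vectors that a square local window misses."""
    limit = radius * stride  # 64 px for radius 4 at stride 16
    missed = sum(1 for u, v in flows if abs(u) > limit or abs(v) > limit)
    return missed / len(flows)


# Toy values, chosen only to illustrate the computation.
toy_flows = [(5.0, -2.0), (70.0, 1.0), (-30.0, 64.0), (-120.0, 15.0)]
print(fraction_out_of_range(toy_flows))  # 0.5
```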
Training
- Frozen DINOv3 ViT-S/16, 4 tapped layers [2, 5, 8, 11], ImageNet-normalized input.
- 24 epochs on combined C+T at 512², global batch 48, AdamW lr 4e-4 (poly decay, 2-epoch warmup),
masked-L1 end-point loss with a 400 px flow cap, RAFT-style augmentation, bf16 autocast.
- 4×GH200, ~3 h. See the GitHub repo for the exact config and anyflow-train command.
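To make the schedule and loss in the list above more concrete, here is a minimal sketch. The card only states "poly decay, 2-epoch warmup" and "masked-L1 end-point loss with a 400 px flow cap", so the linear warmup shape, the decay exponent `power`, the per-epoch granularity, and the reading of the cap as a mask on ground-truth magnitude (as RAFT's training code does) are all assumptions. The exact settings are in the repo config.

```python
import math


def learning_rate(epoch, base_lr=4e-4, total_epochs=24, warmup_epochs=2, power=1.0):
    """Linear warmup followed by polynomial decay.

    `power` and the per-epoch stepping are guessed, not taken from the repo.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * (1.0 - progress) ** power


def masked_l1_loss(pred, gt, valid, max_flow=400.0):
    """Mean L1 error over valid pixels whose GT magnitude is below max_flow.

    pred, gt: lists of (u, v) vectors; valid: list of booleans.
    """
    total, count = 0.0, 0
    for (pu, pv), (gu, gv), ok in zip(pred, gt, valid):
        if ok and math.hypot(gu, gv) < max_flow:
            total += abs(pu - gu) + abs(pv - gv)
            count += 1
    return total / count if count else 0.0


# Learning rate at the start, at the end of warmup, and midway through decay.
print([learning_rate(e) for e in (0, 1, 13)])
```

Under these assumptions the rate reaches its 4e-4 peak at the end of warmup and falls back to half of it midway through the decay phase; pixels with ground-truth flow of 400 px or more simply do not contribute to the loss.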