AnchorDepth

Consistency-Anchored Self-Supervised Adaptation of Depth Pro on Consumer GPUs

AnchorDepth is a parameter-efficient self-supervised adaptation of the Depth Pro foundation model (Bochkovskii et al., Apple, 2024) for outdoor monocular depth estimation. It is trained on KITTI using a Monodepth2-style photometric loss combined with a consistency anchor that prevents the fine-tuned model from drifting away from the strong zero-shot baseline. The entire training pipeline fits in 12 GB of VRAM and trains in ~12 hours per configuration on a single RTX 4070 Ti.

Highlights

  • Improves over zero-shot Depth Pro on KITTI Eigen on 4 of 7 metrics: AbsRel (−1.6%), RMSElog (−3.3%), δ<1.25 (+0.12 pp), and δ<1.25³, while staying within 1–2% on the remaining three.
  • Wins on Cityscapes: improves over zero-shot on all 7 standard metrics (AbsRel −3.0%, RMSE −4.6%, δ<1.25 +1.54 pp).
  • Wins on Make3D: improves over zero-shot on all 5 standard metrics with double-digit gains (AbsRel −24.7%, SqRel −55.1%).
  • Consumer-GPU only: 34 M trainable parameters out of 966 M total (3.6%), trained on a single 12 GB GPU.

Quick Start

from huggingface_hub import hf_hub_download
import torch, depth_pro
from PIL import Image
from torchvision.transforms import Normalize, ToTensor

# Download model weights (~3.8 GB, cached after first call)
ckpt_path = hf_hub_download(repo_id="dariusan3/AnchorDepth",
                            filename="anchordepth.pt")

device = torch.device("cuda")
model, _ = depth_pro.create_model_and_transforms(device=device)
model.load_state_dict(torch.load(ckpt_path, map_location=device), strict=True)
model.eval()

# Predict depth for an image
img = Image.open("image.jpg").convert("RGB").resize((1536, 1536), Image.LANCZOS)
inp = Normalize([0.5]*3, [0.5]*3)(ToTensor()(img)).unsqueeze(0).to(device)

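with torch.no_grad(), torch.amp.autocast("cuda"):
    # The model returns canonical inverse depth and the predicted field of view (degrees)
    canonical_inv_depth, fov_deg = model(inp)
    # Focal length in pixels for the 1536-pixel-wide input
    f_px = 0.5 * 1536 / torch.tan(0.5 * torch.deg2rad(fov_deg.float()))
    # Rescale to metric inverse depth, clamp, then invert to get depth in metres
    depth = 1.0 / torch.clamp(canonical_inv_depth * (1536 / f_px), 1e-4, 1e4)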

depth_map_metres = depth.squeeze().cpu().float().numpy()

Dependencies: torch, depth_pro (Apple's reference implementation), PIL, torchvision, huggingface_hub. No LoRA library is required at inference: the LoRA adapters have been merged into the base weights.
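
The last lines of the snippet turn the network's canonical inverse depth into metres using the predicted field of view. The scalar sketch below walks through that arithmetic for a single pixel. The helper name `canonical_to_metric_depth` and the sample inputs are illustrative assumptions, not part of the released code.

```python
import math

def canonical_to_metric_depth(canonical_inv_depth, fov_deg, width=1536):
    """Scalar version of the Quick Start conversion (illustrative helper)."""
    # Pinhole model: focal length in pixels from the horizontal field of view.
    f_px = 0.5 * width / math.tan(0.5 * math.radians(fov_deg))
    # Rescale canonical inverse depth by width / f_px, then clamp as in the snippet.
    inv_depth = canonical_inv_depth * (width / f_px)
    inv_depth = min(max(inv_depth, 1e-4), 1e4)
    return f_px, 1.0 / inv_depth

# Assumed sample values: a 90 degree field of view and canonical inverse depth 0.05.
f_px, depth_m = canonical_to_metric_depth(0.05, 90.0)
print(f"f_px = {f_px:.1f} px, depth = {depth_m:.2f} m")
```

With a 90 degree field of view the focal length is half the image width (768 px), so the canonical inverse depth is doubled before inversion and the sample pixel lands at 10 m.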

Performance

KITTI Eigen (697 test images, median scaling)

| Method | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| Monodepth2 (ICCV'19) | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| MonoViT (3DV'22) | 0.099 | 0.708 | 4.372 | 0.175 | 0.900 | 0.967 | 0.984 |
| Depth Pro zero-shot | 0.0866 | 0.543 | 3.893 | 0.166 | 0.9253 | 0.9725 | 0.98494 |
| AnchorDepth (ours) | 0.0852 | 0.545 | 3.957 | 0.160 | 0.9265 | 0.9724 | 0.98499 |
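
Median scaling aligns each prediction to the ground truth by the ratio of their medians before the metrics are computed, so a global scale error does not count against the model. The following is a dependency-free sketch of that protocol and the seven metrics in the table. The function name `eval_depth`, the flat-list inputs, and the 80 m depth cap (common KITTI practice) are assumptions, not details taken from the AnchorDepth evaluation code.

```python
import math
from statistics import median

def eval_depth(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Median-scale `pred` to `gt`, then compute the seven standard metrics.

    `pred` and `gt` are flat lists of depths in metres over valid pixels;
    `gt` is assumed to be already filtered to the valid depth range.
    """
    scale = median(gt) / median(pred)
    pred = [min(max(p * scale, min_depth), max_depth) for p in pred]
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Threshold accuracy: share of pixels whose ratio error is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return (abs_rel, sq_rel, rmse, rmse_log, *deltas)

# A prediction that is wrong only by a global factor of 2 scores perfectly.
metrics = eval_depth(pred=[1.0, 2.0, 4.0], gt=[2.0, 4.0, 8.0])
```

Because of the scaling step, the numbers in this table measure relative depth structure, not absolute metric accuracy.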

Cityscapes (500 val images, zero-shot cross-domain)

| Method | AbsRel ↓ | RMSE ↓ | RMSElog ↓ | δ<1.25 ↑ |
|---|---|---|---|---|
| Monodepth2 | 0.129 | 6.876 | 0.187 | 0.849 |
| ManyDepth | 0.114 | 6.223 | 0.170 | 0.875 |
| Depth Pro zero-shot | 0.1119 | 6.636 | 0.196 | 0.8773 |
| AnchorDepth (ours) | 0.1085 | 6.331 | 0.1918 | 0.8927 |

Make3D (134 test images, zero-shot cross-domain)

| Method | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ |
|---|---|---|---|---|
| Monodepth2 | 0.322 | 3.589 | 7.417 | 0.163 |
| CADepth-Net | 0.312 | 3.086 | 7.066 | 0.159 |
| Depth Pro zero-shot | 0.2575 | 4.846 | 6.677 | 0.301 |
| AnchorDepth (ours) | 0.1940 | 2.175 | 5.293 | 0.2555 |
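
The relative changes quoted in the Highlights for the error metrics can be recomputed from the zero-shot and AnchorDepth rows of the three tables. The short sketch below does that for five of them; the helper name `rel_change_pct` is an illustrative choice.

```python
def rel_change_pct(baseline, ours):
    """Signed relative change of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# (Depth Pro zero-shot, AnchorDepth) pairs copied from the tables above.
rows = {
    "KITTI AbsRel": (0.0866, 0.0852),
    "Cityscapes AbsRel": (0.1119, 0.1085),
    "Cityscapes RMSE": (6.636, 6.331),
    "Make3D AbsRel": (0.2575, 0.1940),
    "Make3D SqRel": (4.846, 2.175),
}
changes = {name: round(rel_change_pct(b, o), 1) for name, (b, o) in rows.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Running it prints -1.6%, -3.0%, -4.6%, -24.7% and -55.1% in that order. Negative values are improvements, since all five are error metrics.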

Method

The training objective combines a Monodepth2-style photometric reconstruction loss with a consistency anchor:

$$L = L_{\text{photometric}} + \lambda \cdot \| d_{\text{pred}} - d_{\text{zero-shot}} \|_1$$

where $d_{\text{zero-shot}}$ is the pretrained Depth Pro prediction on the same image, precomputed offline and cached on disk. The anchor prevents the photometric gradient from corrupting the metric-depth structure that the foundation model already encodes.
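
For reference, the photometric term in the original Monodepth2 paper mixes SSIM and L1 errors per pixel and takes the minimum over the warped source frames $I_{t' \to t}$. Whether AnchorDepth keeps every detail of that recipe (the mixing weight $\alpha = 0.85$, which is unrelated to the LoRA scaling factor, as well as auto-masking and the smoothness term) is not stated here, so the block below should be read as the Monodepth2 formulation and not necessarily the exact loss used in training:

$$
pe(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1,
\qquad
L_{\text{photometric}} = \min_{t'} \, pe\bigl(I_t, I_{t' \to t}\bigr)
$$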

LoRA adapters (rank 8, α = 8) are inserted into all 96 attention Q/K/V/output projections of the two ViT-Large encoders in Depth Pro (2.36 M trainable parameters). The decoder, depth head and PoseNet (ResNet-18) are trained from scratch in parallel. Training uses bfloat16 mixed precision, gradient checkpointing on both encoders, and gradient accumulation for an effective batch size of 4.
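
The 96 adapted layers and the 2.36 M LoRA parameters can be reproduced with a short calculation. The layer layout used here is an assumption rather than something stated above: each ViT-Large block is taken to have hidden width 1024, one fused QKV linear layer (1024 to 3072) and one output projection (1024 to 1024), with 24 blocks per encoder.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed layout (hypothetical, see lead-in): fused QKV plus output projection per block.
HIDDEN, BLOCKS, ENCODERS, RANK = 1024, 24, 2, 8

per_block = lora_params(HIDDEN, 3 * HIDDEN, RANK) + lora_params(HIDDEN, HIDDEN, RANK)
adapted_layers = 2 * BLOCKS * ENCODERS       # two adapted linear layers per block
total = per_block * BLOCKS * ENCODERS

print(adapted_layers, total)
```

Under these assumptions the script reports 96 adapted layers and 2,359,296 parameters, which rounds to the 2.36 M quoted above.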

Limitations

  • Cross-domain transfer is benchmark-dependent. AnchorDepth was trained on KITTI. Performance on indoor scenes (NYU) was not evaluated.
  • PoseNet is randomly initialised. Replacing it with a precomputed cache from a multi-view foundation model (e.g. VGGT) is left as future work.
  • The FOV head is taken from Depth Pro unchanged. No retraining of the FOV head was performed; evaluation uses ground-truth camera intrinsics where available.

Citation

If you use AnchorDepth in your work, please cite:

@thesis{osadici2026anchordepth,
  title  = {AnchorDepth: Consistency-Anchored Self-Supervised Adaptation
            of Depth Pro on Consumer GPUs},
  author = {Osadici, Darius},
  year   = {2026},
  school = {Politehnica University of Timișoara},
  type   = {Bachelor's thesis}
}

Acknowledgements

  • Depth Pro (Apple, 2024): backbone foundation model. Bochkovskii et al., Depth Pro: Sharp Monocular Metric Depth in Less than a Second. https://github.com/apple/ml-depth-pro
  • Monodepth2 (Godard et al., ICCV 2019): photometric loss formulation.
  • LoRA (Hu et al., ICLR 2022): parameter-efficient fine-tuning.

License

This model inherits the Apple AMLR License from the Depth Pro backbone. Please refer to the Depth Pro repository for the full license terms.
