
# Benchmark Results

Performance benchmarks comparing awesome-depth-anything-3 (optimized fork) against the vanilla upstream implementation.

**Test environment:** Apple Silicon (M-series), PyTorch 2.9.0

**Models:** da3-small, da3-base, da3-large, da3-giant


## Quick Summary

| Feature | Improvement |
| --- | --- |
| Model loading (cached) | 200x faster (0.8 s → 0.005 s) |
| Inference (MPS, batch 4) | 1.14x faster |
| Cold load time | 1.7x faster |
| Memory efficiency | Adaptive batching prevents OOM |

## 1. Awesome vs Upstream Comparison

Direct comparison between this optimized fork and the original upstream repository.

### MPS (Apple Silicon GPU)

| Batch Size | Upstream | Awesome | Speedup | Notes |
| --- | --- | --- | --- | --- |
| 1 | 3.47 img/s | 3.50 img/s | 1.01x | Minimal overhead |
| 2 | 3.64 img/s | 3.83 img/s | 1.05x | Batching benefits |
| 4 | 3.32 img/s | 3.78 img/s | 1.14x | Best improvement |

### Model Loading Performance

| Metric | Upstream | Awesome | Speedup |
| --- | --- | --- | --- |
| Cold Load | 1.28 s | 0.77 s | 1.7x |
| Cached Load | N/A | 0.005 s | ~200x |

The model caching system is the standout feature: after the first load, subsequent loads are essentially instant.
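
A minimal sketch of the caching idea: keep loaded models in a process-level dict keyed by model name and device, so repeat loads skip disk I/O and weight initialization. The names (`load_model`, `_build_model`, `_MODEL_CACHE`) are illustrative, not the fork's actual API.

```python
from __future__ import annotations

import torch


def _build_model(name: str) -> torch.nn.Module:
    """Stand-in for the real checkpoint loader (disk I/O + weight init)."""
    return torch.nn.Identity()  # placeholder network


# Hypothetical process-level cache keyed by (model name, device).
_MODEL_CACHE: dict[tuple[str, str], torch.nn.Module] = {}


def load_model(name: str, device: str = "cpu") -> torch.nn.Module:
    key = (name, device)
    if key not in _MODEL_CACHE:  # cold path: disk load, ~0.8 s in the table above
        _MODEL_CACHE[key] = _build_model(name).to(device).eval()
    return _MODEL_CACHE[key]  # warm path: a dict lookup, ~5 ms
```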

### CPU

| Batch Size | Upstream | Awesome | Speedup |
| --- | --- | --- | --- |
| 1 | 0.27 img/s | 0.31 img/s | 1.13x |
| 2 | 0.24 img/s | 0.24 img/s | 1.00x |
| 4 | 0.17 img/s | 0.16 img/s | 0.95x |

**Note:** CPU performance is similar between versions, since the GPU-specific optimizations don't apply. The slight regression at batch 4 is within measurement noise.


## 2. Model Performance by Size

Throughput benchmarks on MPS (Apple Silicon) with 1280x720 input images.

| Model | Parameters | Batch 1 | Batch 4 | Best Config |
| --- | --- | --- | --- | --- |
| da3-small | ~25M | 22.2 img/s | 27.2 img/s | B=4, SDPA |
| da3-base | ~100M | 10.7 img/s | 11.6 img/s | B=4, SDPA |
| da3-large | ~335M | 3.8 img/s | 3.8 img/s | B=1-2 |
| da3-giant | ~1.1B | 1.6 img/s | 1.2 img/s | B=1 |

### Latency (single image)

| Model | MPS | CPU | MPS Speedup |
| --- | --- | --- | --- |
| da3-small | 45 ms | ~3,500 ms | ~78x |
| da3-base | 94 ms | ~7,000 ms | ~74x |
| da3-large | 265 ms | ~3,900 ms | ~15x |
| da3-giant | 618 ms | N/A | - |

## 3. Preprocessing Pipeline

### Strategy: Hybrid CPU/GPU

On Apple Silicon, CPU preprocessing is faster than GPU (Kornia) due to optimized OpenCV/Accelerate routines. The overhead of MPS kernel launches exceeds the benefit for image transforms.

| Resolution | CPU Time | GPU Time | Winner |
| --- | --- | --- | --- |
| 640x480 | 6.0 ms | N/A | CPU |
| 1920x1080 | 18.7 ms | N/A | CPU |
| 3840x2160 | 57.0 ms | N/A | CPU |

**Design decision:** GPU preprocessing is automatically disabled on MPS. The GPU is reserved for model inference, where it provides a 15-78x speedup.
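
A sketch of that routing policy; the helper name is hypothetical, as the fork makes this decision internally:

```python
import torch


def preprocess_device(inference_device: str) -> str:
    """Pick where image transforms should run for a given inference device."""
    if inference_device == "mps":
        # MPS kernel-launch overhead outweighs GPU transform gains,
        # so transforms stay on CPU (OpenCV/Accelerate).
        return "cpu"
    if inference_device == "cuda" and torch.cuda.is_available():
        # On CUDA, decoding/resizing on-GPU avoids a CPU-to-GPU copy.
        return "cuda"
    return "cpu"
```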

### CUDA (NVIDIA)

On CUDA, GPU preprocessing with NVJPEG provides significant benefits for JPEG decoding directly to GPU memory, eliminating CPU→GPU transfer overhead.
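
For illustration, torchvision's `decode_jpeg` can target NVJPEG by passing a CUDA device, so the decoded tensor lands directly in GPU memory; the file path is a placeholder.

```python
from torchvision.io import decode_jpeg, read_file

data = read_file("frame.jpg")             # raw JPEG bytes as a uint8 CPU tensor
image = decode_jpeg(data, device="cuda")  # NVJPEG decode straight into GPU memory
image = image.float().div_(255.0)         # normalize on-device, no host round trip
```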


## 4. Attention Mechanisms

Comparison between SDPA (Scaled Dot-Product Attention, which can dispatch to Flash Attention kernels) and a manual attention implementation.

### Per-Layer Performance

| Config | SDPA | Manual | SDPA Speedup |
| --- | --- | --- | --- |
| ViT-L 518px (MPS) | 2.21 ms | 1.86 ms | 0.8x |
| ViT-L 1024px (MPS) | 9.91 ms | 5.87 ms | 0.6x |
| ViT-L 518px (CPU) | 3.75 ms | 4.96 ms | 1.3x |
| ViT-L 1024px (CPU) | 11.73 ms | 16.85 ms | 1.4x |

**Insight:** On MPS, manual attention is faster for ViT because of dispatch overhead in MPS's SDPA implementation. On CPU, SDPA benefits from optimized BLAS operations.
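
The two code paths under comparison look roughly like this. Shapes are illustrative: 1370 tokens corresponds to ViT-L at 518px (37x37 patches plus a class token), with 16 heads of dimension 64.

```python
import torch
import torch.nn.functional as F

# (batch, heads, tokens, head_dim) for one ViT-L attention layer
q = k = v = torch.randn(1, 16, 1370, 64)

# SDPA: a single fused call that can dispatch to Flash/memory-efficient kernels
out_sdpa = F.scaled_dot_product_attention(q, k, v)

# Manual: explicit softmax(QK^T / sqrt(d)) V; materializes the full attention
# matrix but avoids the SDPA dispatch overhead observed on MPS
scale = q.shape[-1] ** -0.5
out_manual = ((q @ k.transpose(-2, -1)) * scale).softmax(dim=-1) @ v

torch.testing.assert_close(out_sdpa, out_manual, rtol=1e-4, atol=1e-4)
```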

### End-to-End Impact

| Model | SDPA | Manual | Best |
| --- | --- | --- | --- |
| da3-small | 21.8 img/s | 22.2 img/s | Manual |
| da3-base | 9.8 img/s | 10.7 img/s | Manual |
| da3-large | 3.8 img/s | 3.7 img/s | SDPA |
| da3-giant | 1.6 img/s | 1.6 img/s | Tie |

## 5. Adaptive Batching

The adaptive batching system dynamically adjusts batch size based on available GPU memory.

**Test:** 20 images with da3-large on MPS

| Strategy | Total Time | Throughput | Batches Used |
| --- | --- | --- | --- |
| Fixed B=1 | 5,612 ms | 3.6 img/s | [1, 1, 1, ...] |
| Fixed B=2 | 5,514 ms | 3.6 img/s | [2, 2, 2, ...] |
| Fixed B=4 | 8,305 ms | 2.4 img/s | [4, 4, 4, 4, 4] |
| Adaptive 85% | 5,637 ms | 3.5 img/s | [4, 4, 4, ...] |

**Recommendation:** For MPS with da3-large, a fixed batch size of 2 provides optimal throughput. Adaptive batching (sketched after this list) is more valuable for:

- Variable input sizes
- Unknown GPU memory constraints
- Preventing OOM errors on smaller GPUs
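
A minimal sketch of the sizing policy, assuming a rough per-image activation estimate and the 85% budget from the table above; the function and its heuristic are illustrative, not the fork's exact logic.

```python
import torch


def pick_batch_size(per_image_bytes: int, max_batch: int = 8,
                    budget_frac: float = 0.85) -> int:
    """Fit the batch into a fraction of currently free accelerator memory."""
    if torch.backends.mps.is_available():
        free = (torch.mps.recommended_max_memory()
                - torch.mps.current_allocated_memory())
    elif torch.cuda.is_available():
        free, _total = torch.cuda.mem_get_info()
    else:
        return 1  # CPU: batch 1 is optimal (see section 7)
    usable = int(free * budget_frac)
    return max(1, min(max_batch, usable // per_image_bytes))
```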

## 6. Cross-Device Comparison

### Inference Throughput (da3-large, batch=1)

```text
MPS (Apple Silicon)  ████████████████████████████████████████  3.7 img/s
CPU                  ███                                       0.3 img/s
```

MPS provides ~12x speedup over CPU for da3-large inference.

### Attention Layer (ViT-L 518px, SDPA)

MPS   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  2.40 ms
CPU   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  3.75 ms

## 7. Optimization Recommendations

### For Apple Silicon (MPS)

1. **Use model caching** - 200x faster subsequent loads.
2. **Batch size 2-4** for da3-small/base; **batch 1-2** for da3-large/giant.
3. **Let the CPU handle preprocessing** - it is faster than MPS for image transforms.
4. **SDPA vs. manual attention:** both are similar on MPS; SDPA is slightly better for the larger models.

### For NVIDIA CUDA

1. **Enable GPU preprocessing** with NVJPEG for JPEG inputs.
2. **Use SDPA (Flash Attention)** - significant speedup (see the sketch after this list).
3. **Larger batch sizes** benefit more from GPU parallelism.
4. **Adaptive batching** to maximize VRAM utilization.
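
As a sketch of recommendation 2, PyTorch lets you pin SDPA to the Flash Attention backend for a region of code; half precision and a CUDA device are assumed, since the Flash backend requires them.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(4, 16, 1370, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the Flash Attention kernel inside this block;
# unsupported shapes/dtypes raise instead of silently falling back.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```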

### For CPU-only

1. **Use the smallest viable model** (da3-small: 22x faster than da3-giant).
2. **Batch size 1 is optimal** (memory-bandwidth limited).
3. **SDPA** provides a 1.3-1.4x speedup on CPU.

## Running Benchmarks

```bash
# Quick benchmark (fewer iterations)
uv run python benchmarks/full_benchmark.py --quick

# Full benchmark on a specific device
uv run python benchmarks/full_benchmark.py --device mps
uv run python benchmarks/full_benchmark.py --device cuda
uv run python benchmarks/full_benchmark.py --device cpu

# Compare against upstream (requires the upstream repo)
uv run python benchmarks/comparative_benchmark.py --device all

# Skip specific tests
uv run python benchmarks/full_benchmark.py --skip-batching
```

## Methodology

- **Warmup:** 2 inference passes before timing
- **Runs:** 3-5 iterations per configuration
- **Synchronization:** `torch.mps.synchronize()` / `torch.cuda.synchronize()` for accurate GPU timing
- **Memory cleanup:** `gc.collect()` plus cache clearing between tests
- **Input:** synthetic 1280x720 RGB images (consistent across tests)
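
A condensed sketch of that timing loop; the `bench` helper is illustrative, and the actual harness lives in `benchmarks/`.

```python
import gc
import time

import torch


def bench(fn, device: str, warmup: int = 2, runs: int = 5) -> float:
    """Time fn() with warmup passes and explicit GPU synchronization."""
    sync = {"mps": torch.mps.synchronize,
            "cuda": torch.cuda.synchronize}.get(device, lambda: None)
    for _ in range(warmup):  # untimed warmup passes
        fn()
    sync()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        sync()  # wait for queued GPU work before stopping the clock
        times.append(time.perf_counter() - start)
    gc.collect()  # memory cleanup between configurations
    return min(times)
```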

*Benchmarks last updated: December 2024. Hardware: Apple Silicon (M-series) | Software: PyTorch 2.9.0*