Benchmark Results
Performance benchmarks comparing awesome-depth-anything-3 (optimized fork) against the vanilla upstream implementation.
Test Environment: Apple Silicon (M-series), PyTorch 2.9.0
Models: da3-small, da3-base, da3-large, da3-giant
Quick Summary
| Feature | Improvement |
|---|---|
| Model Loading (cached) | 200x faster (0.8s → 0.005s) |
| Inference (MPS, batch 4) | 1.14x faster |
| Cold Load Time | 1.7x faster |
| Memory Efficiency | Adaptive batching prevents OOM |
1. Awesome vs Upstream Comparison
Direct comparison between this optimized fork and the original upstream repository.
MPS (Apple Silicon GPU)
| Batch Size | Upstream | Awesome | Speedup | Notes |
|---|---|---|---|---|
| 1 | 3.47 img/s | 3.50 img/s | 1.01x | Minimal overhead |
| 2 | 3.64 img/s | 3.83 img/s | 1.05x | Batching benefits |
| 4 | 3.32 img/s | 3.78 img/s | 1.14x | Best improvement |
Model Loading Performance
| Metric | Upstream | Awesome | Speedup |
|---|---|---|---|
| Cold Load | 1.28s | 0.77s | 1.7x |
| Cached Load | N/A | 0.005s | ~200x |
The model caching system is the standout feature: after the first load, subsequent loads are essentially instant.
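For reference, here is a minimal sketch of how such an in-process cache can work; `load_model` and `_build_model` are illustrative names, not the fork's actual API:

```python
"""Sketch of an in-process model cache (illustrative, not the fork's actual API)."""
from functools import lru_cache

import torch.nn as nn


def _build_model(name: str) -> nn.Module:
    # Stand-in for the real checkpoint load, which dominates cold-load time.
    sizes = {"da3-small": 256, "da3-base": 512, "da3-large": 1024}
    dim = sizes.get(name, 256)
    return nn.Sequential(nn.Conv2d(3, dim, 3), nn.ReLU(), nn.Conv2d(dim, 1, 3))


@lru_cache(maxsize=4)
def load_model(name: str, device: str = "cpu") -> nn.Module:
    """First call per (name, device) pays full cost; repeat calls hit the cache."""
    return _build_model(name).to(device).eval()


if __name__ == "__main__":
    import time
    t0 = time.perf_counter(); load_model("da3-small"); t1 = time.perf_counter()
    load_model("da3-small"); t2 = time.perf_counter()
    print(f"cold: {t1 - t0:.4f}s, cached: {t2 - t1:.6f}s")
```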
CPU
| Batch Size | Upstream | Awesome | Speedup |
|---|---|---|---|
| 1 | 0.27 img/s | 0.31 img/s | 1.13x |
| 2 | 0.24 img/s | 0.24 img/s | 1.00x |
| 4 | 0.17 img/s | 0.16 img/s | 0.95x |
Note: CPU performance is similar between versions since GPU-specific optimizations don't apply. The slight regression at batch 4 is within measurement noise.
2. Model Performance by Size
Throughput benchmarks on MPS (Apple Silicon) with 1280x720 input images.
| Model | Parameters | Batch 1 | Batch 4 | Best Config |
|---|---|---|---|---|
| da3-small | ~25M | 22.2 img/s | 27.2 img/s | B=4 SDPA |
| da3-base | ~100M | 10.7 img/s | 11.6 img/s | B=4 SDPA |
| da3-large | ~335M | 3.8 img/s | 3.8 img/s | B=1-2 |
| da3-giant | ~1.1B | 1.6 img/s | 1.2 img/s | B=1 |
Latency (single image)
| Model | MPS | CPU | MPS Speedup |
|---|---|---|---|
| da3-small | 45 ms | ~3,500 ms | ~78x |
| da3-base | 94 ms | ~7,000 ms | ~74x |
| da3-large | 265 ms | ~3,900 ms | ~15x |
| da3-giant | 618 ms | N/A | - |
3. Preprocessing Pipeline
Strategy: Hybrid CPU/GPU
On Apple Silicon, CPU preprocessing is faster than GPU (Kornia) due to optimized OpenCV/Accelerate routines. The overhead of MPS kernel launches exceeds the benefit for image transforms.
| Resolution | CPU Time | GPU Time | Winner |
|---|---|---|---|
| 640x480 | 6.0 ms | N/A | CPU |
| 1920x1080 | 18.7 ms | N/A | CPU |
| 3840x2160 | 57.0 ms | N/A | CPU |
Design Decision: GPU preprocessing is automatically disabled on MPS. The GPU is reserved for model inference where it provides 15-78x speedup.
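A minimal sketch of this dispatch, assuming an OpenCV-based CPU path (`preprocess` is an illustrative name; the fork's actual pipeline may differ):

```python
import cv2
import numpy as np
import torch


def preprocess(image_bgr: np.ndarray, device: torch.device,
               size: tuple[int, int] = (518, 518)) -> torch.Tensor:
    """Resize + convert on CPU; upload to the model device only at the end.

    On MPS the kernel-launch overhead of GPU transforms outweighs their
    benefit, so only a CUDA target would warrant a GPU preprocessing path.
    Normalization to model statistics is omitted for brevity.
    """
    resized = cv2.resize(image_bgr, size, interpolation=cv2.INTER_AREA)  # dsize is (w, h)
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW
    return tensor.to(device)
```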
CUDA (NVIDIA)
On CUDA, GPU preprocessing with NVJPEG provides significant benefits for JPEG decoding directly to GPU memory, eliminating CPU→GPU transfer overhead.
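One concrete way to get this is torchvision's `decode_jpeg`, which dispatches to nvJPEG when given a CUDA device; whether the fork uses exactly this path is an assumption:

```python
import torch
from torchvision.io import read_file, decode_jpeg

raw = read_file("input.jpg")               # raw bytes as a CPU uint8 tensor
if torch.cuda.is_available():
    img = decode_jpeg(raw, device="cuda")  # nvJPEG: decoded pixels land on the GPU
else:
    img = decode_jpeg(raw)                 # CPU fallback (libjpeg)
print(img.shape, img.device)               # CHW uint8 on the chosen device
```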
4. Attention Mechanisms
Comparison between SDPA (Scaled Dot-Product Attention / Flash Attention) and a manual attention implementation.
Per-Layer Performance
| Config | SDPA | Manual | SDPA Speedup |
|---|---|---|---|
| ViT-L 518px (MPS) | 2.21 ms | 1.86 ms | 0.8x |
| ViT-L 1024px (MPS) | 9.91 ms | 5.87 ms | 0.6x |
| ViT-L 518px (CPU) | 3.75 ms | 4.96 ms | 1.3x |
| ViT-L 1024px (CPU) | 11.73 ms | 16.85 ms | 1.4x |
Insight: On MPS, manual attention is faster for ViT due to MPS's SDPA implementation overhead. On CPU, SDPA benefits from optimized BLAS operations.
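For clarity, the two paths being timed are mathematically equivalent; the shapes below roughly match ViT-L at 518px and are illustrative:

```python
import math

import torch
import torch.nn.functional as F


def manual_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Explicit softmax(Q K^T / sqrt(d)) V, as in the 'Manual' rows."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    weights = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return weights @ v


# Roughly ViT-L at 518px: 16 heads x 64 dims, 37x37 patches + 1 CLS token.
q = k = v = torch.randn(1, 16, 1370, 64)
out_sdpa = F.scaled_dot_product_attention(q, k, v)  # the 'SDPA' rows
out_manual = manual_attention(q, k, v)
print((out_sdpa - out_manual).abs().max())  # same math, different kernels
```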
End-to-End Impact
| Model | SDPA | Manual | Best |
|---|---|---|---|
| da3-small | 21.8 img/s | 22.2 img/s | Manual |
| da3-base | 9.8 img/s | 10.7 img/s | Manual |
| da3-large | 3.8 img/s | 3.7 img/s | SDPA |
| da3-giant | 1.6 img/s | 1.6 img/s | Tie |
5. Adaptive Batching
The adaptive batching system dynamically adjusts batch size based on available GPU memory.
Test: 20 images with da3-large on MPS
| Strategy | Total Time | Throughput | Batches Used |
|---|---|---|---|
| Fixed B=1 | 5,612 ms | 3.6 img/s | [1,1,1...] |
| Fixed B=2 | 5,514 ms | 3.6 img/s | [2,2,2...] |
| Fixed B=4 | 8,305 ms | 2.4 img/s | [4,4,4,4,4] |
| Adaptive 85% | 5,637 ms | 3.5 img/s | [4,4,4...] |
Recommendation: For MPS with da3-large, a fixed batch size of 2 gives the best throughput. Adaptive batching (sketched after this list) is more valuable for:
- Variable input sizes
- Unknown GPU memory constraints
- Preventing OOM errors on smaller GPUs
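A minimal sketch of memory-based batch sizing; the 85% budget mirrors the "Adaptive 85%" row above, but the per-image estimate, heuristics, and function names are illustrative rather than the fork's actual API:

```python
import torch


def pick_batch_size(per_image_bytes: int, max_batch: int = 8,
                    budget: float = 0.85) -> int:
    """Cap the batch so estimated activation memory stays within budget."""
    if torch.cuda.is_available():
        free, _total = torch.cuda.mem_get_info()
    elif torch.backends.mps.is_available():
        # MPS exposes no free-memory query; use the recommended working-set
        # size minus current allocations as a proxy (recent PyTorch only).
        free = torch.mps.recommended_max_memory() - torch.mps.current_allocated_memory()
    else:
        return 1  # CPU: batching gains little (see the CPU table above)
    usable = int(free * budget)
    return max(1, min(max_batch, usable // per_image_bytes))
```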
6. Cross-Device Comparison
Inference Throughput (da3-large, batch=1)
MPS (Apple Silicon) ████████████████████████████████████████ 3.7 img/s
CPU ███ 0.3 img/s
MPS provides ~12x speedup over CPU for da3-large inference.
Attention Layer (ViT-L 518px, SDPA)
MPS ████████████████████████ 2.40 ms
CPU ███████████████████████████████████████ 3.75 ms
7. Optimization Recommendations
For Apple Silicon (MPS)
- Use model caching - 200x faster subsequent loads
- Batch size 2-4 for da3-small/base, batch 1-2 for da3-large/giant
- Let CPU handle preprocessing - it's faster than MPS for image transforms
- SDPA vs Manual: Both are similar; SDPA slightly better for larger models
For NVIDIA CUDA
- Enable GPU preprocessing with NVJPEG for JPEG inputs
- Use SDPA (Flash Attention) - significant speedup
- Larger batch sizes benefit more from GPU parallelism
- Adaptive batching to maximize VRAM utilization
For CPU-only
- Use smallest viable model (da3-small: 22x faster than da3-giant)
- Batch size 1 is optimal (memory bandwidth limited)
- SDPA provides 1.3-1.4x speedup on CPU
Running Benchmarks
```bash
# Quick benchmark (fewer iterations)
uv run python benchmarks/full_benchmark.py --quick

# Full benchmark on a specific device
uv run python benchmarks/full_benchmark.py --device mps
uv run python benchmarks/full_benchmark.py --device cuda
uv run python benchmarks/full_benchmark.py --device cpu

# Compare against upstream (requires the upstream repo)
uv run python benchmarks/comparative_benchmark.py --device all

# Skip specific tests
uv run python benchmarks/full_benchmark.py --skip-batching
```
Methodology
- Warmup: 2 inference passes before timing
- Runs: 3-5 iterations per configuration
- Synchronization: `torch.mps.synchronize()` / `torch.cuda.synchronize()` for accurate GPU timing
- Memory cleanup: `gc.collect()` + cache clearing between tests
- Input: synthetic 1280x720 RGB images (consistent across tests)
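A sketch of what such a timing harness looks like (the benchmark scripts' internals may differ):

```python
import gc
import time

import torch


def time_inference(model, batch, device: str, warmup: int = 2, runs: int = 5) -> float:
    """Time one forward pass: warmup first, synchronize around the clock."""
    sync = (torch.cuda.synchronize if device == "cuda"
            else torch.mps.synchronize if device == "mps"
            else lambda: None)
    with torch.no_grad():
        for _ in range(warmup):   # warmup: compile kernels, fill caches
            model(batch)
        sync()
        times = []
        for _ in range(runs):
            t0 = time.perf_counter()
            model(batch)
            sync()                # don't stop the clock before GPU work finishes
            times.append(time.perf_counter() - t0)
    gc.collect()                  # cleanup between configurations
    return min(times)
```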
Benchmarks last updated: December 2024
Hardware: Apple Silicon (M-series) | Software: PyTorch 2.9.0