Benchmark Results
Performance benchmarks comparing awesome-depth-anything-3 (optimized fork) against the vanilla upstream implementation.
Test Environment: Apple Silicon (M-series), PyTorch 2.9.0
Models: da3-small, da3-base, da3-large, da3-giant
Quick Summary
| Feature | Improvement |
|---|---|
| Model Loading (cached) | 200x faster (0.8s → 0.005s) |
| Inference (MPS, batch 4) | 1.14x faster |
| Cold Load Time | 1.7x faster |
| Memory Efficiency | Adaptive batching prevents OOM |
1. Awesome vs Upstream Comparison
Direct comparison between this optimized fork and the original upstream repository.
MPS (Apple Silicon GPU)
| Batch Size | Upstream | Awesome | Speedup | Notes |
|---|---|---|---|---|
| 1 | 3.47 img/s | 3.50 img/s | 1.01x | Minimal overhead |
| 2 | 3.64 img/s | 3.83 img/s | 1.05x | Batching benefits |
| 4 | 3.32 img/s | 3.78 img/s | 1.14x | Best improvement |
Model Loading Performance
| Metric | Upstream | Awesome | Speedup |
|---|---|---|---|
| Cold Load | 1.28s | 0.77s | 1.7x |
| Cached Load | N/A | 0.005s | ~200x |
The model caching system is the standout feature: after the first load, subsequent loads are essentially instant.
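For reference, here is a minimal sketch of how such an in-process cache can work; `load_model` and `_build_model` are illustrative names, not the fork's actual API:

```python
"""Sketch of an in-process model cache (illustrative, not the fork's actual API)."""
from functools import lru_cache

import torch.nn as nn


def _build_model(name: str) -> nn.Module:
    # Stand-in for the real checkpoint load, which dominates cold-load time.
    sizes = {"da3-small": 256, "da3-base": 512, "da3-large": 1024}
    dim = sizes.get(name, 256)
    return nn.Sequential(nn.Conv2d(3, dim, 3), nn.ReLU(), nn.Conv2d(dim, 1, 3))


@lru_cache(maxsize=4)
def load_model(name: str, device: str = "cpu") -> nn.Module:
    """First call per (name, device) pays full cost; repeat calls hit the cache."""
    return _build_model(name).to(device).eval()


if __name__ == "__main__":
    import time
    t0 = time.perf_counter(); load_model("da3-small"); t1 = time.perf_counter()
    load_model("da3-small"); t2 = time.perf_counter()
    print(f"cold: {t1 - t0:.4f}s, cached: {t2 - t1:.6f}s")
```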
CPU
| Batch Size | Upstream | Awesome | Speedup |
|---|---|---|---|
| 1 | 0.27 img/s | 0.31 img/s | 1.13x |
| 2 | 0.24 img/s | 0.24 img/s | 1.00x |
| 4 | 0.17 img/s | 0.16 img/s | 0.95x |
Note: CPU performance is similar between versions since GPU-specific optimizations don't apply. The slight regression at batch 4 is within measurement noise.
2. Model Performance by Size
Throughput benchmarks on MPS (Apple Silicon) with 1280x720 input images.
| Model | Parameters | Batch 1 | Batch 4 | Best Config |
|---|---|---|---|---|
| da3-small | ~25M | 22.2 img/s | 27.2 img/s | B=4 SDPA |
| da3-base | ~100M | 10.7 img/s | 11.6 img/s | B=4 SDPA |
| da3-large | ~335M | 3.8 img/s | 3.8 img/s | B=1-2 |
| da3-giant | ~1.1B | 1.6 img/s | 1.2 img/s | B=1 |
Latency (single image)
| Model | MPS | CPU | MPS Speedup |
|---|---|---|---|
| da3-small | 45 ms | ~3,500 ms | ~78x |
| da3-base | 94 ms | ~7,000 ms | ~74x |
| da3-large | 265 ms | ~3,900 ms | ~15x |
| da3-giant | 618 ms | N/A | - |
3. Preprocessing Pipeline
Strategy: Hybrid CPU/GPU
On Apple Silicon, CPU preprocessing is faster than GPU (Kornia) due to optimized OpenCV/Accelerate routines. The overhead of MPS kernel launches exceeds the benefit for image transforms.
| Resolution | CPU Time | GPU Time | Winner |
|---|---|---|---|
| 640x480 | 6.0 ms | N/A | CPU |
| 1920x1080 | 18.7 ms | N/A | CPU |
| 3840x2160 | 57.0 ms | N/A | CPU |
Design Decision: GPU preprocessing is automatically disabled on MPS. The GPU is reserved for model inference where it provides 15-78x speedup.
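A minimal sketch of this dispatch, assuming an OpenCV-based CPU path (`preprocess` is an illustrative name; the fork's actual pipeline may differ):

```python
import cv2
import numpy as np
import torch


def preprocess(image_bgr: np.ndarray, device: torch.device,
               size: tuple[int, int] = (518, 518)) -> torch.Tensor:
    """Resize + convert on CPU; upload to the model device only at the end.

    On MPS the kernel-launch overhead of GPU transforms outweighs their
    benefit, so only a CUDA target would warrant a GPU preprocessing path.
    Normalization to model statistics is omitted for brevity.
    """
    resized = cv2.resize(image_bgr, size, interpolation=cv2.INTER_AREA)  # dsize is (w, h)
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW
    return tensor.to(device)
```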
CUDA (NVIDIA)
On CUDA, GPU preprocessing with NVJPEG provides significant benefits for JPEG decoding directly to GPU memory, eliminating CPU→GPU transfer overhead.
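One concrete way to get this is torchvision's `decode_jpeg`, which dispatches to nvJPEG when given a CUDA device; whether the fork uses exactly this path is an assumption:

```python
import torch
from torchvision.io import read_file, decode_jpeg

raw = read_file("input.jpg")               # raw bytes as a CPU uint8 tensor
if torch.cuda.is_available():
    img = decode_jpeg(raw, device="cuda")  # nvJPEG: decoded pixels land on the GPU
else:
    img = decode_jpeg(raw)                 # CPU fallback (libjpeg)
print(img.shape, img.device)               # CHW uint8 on the chosen device
```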
4. Attention Mechanisms
Comparison between SDPA (Scaled Dot-Product Attention / Flash Attention) and a manual attention implementation.
Per-Layer Performance
| Config | SDPA | Manual | SDPA Speedup |
|---|---|---|---|
| ViT-L 518px (MPS) | 2.21 ms | 1.86 ms | 0.8x |
| ViT-L 1024px (MPS) | 9.91 ms | 5.87 ms | 0.6x |
| ViT-L 518px (CPU) | 3.75 ms | 4.96 ms | 1.3x |
| ViT-L 1024px (CPU) | 11.73 ms | 16.85 ms | 1.4x |
Insight: On MPS, manual attention is faster for ViT due to MPS's SDPA implementation overhead. On CPU, SDPA benefits from optimized BLAS operations.
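For clarity, the two paths being timed are mathematically equivalent; the shapes below roughly match ViT-L at 518px and are illustrative:

```python
import math

import torch
import torch.nn.functional as F


def manual_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Explicit softmax(Q K^T / sqrt(d)) V, as in the 'Manual' rows."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    weights = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return weights @ v


# Roughly ViT-L at 518px: 16 heads x 64 dims, 37x37 patches + 1 CLS token.
q = k = v = torch.randn(1, 16, 1370, 64)
out_sdpa = F.scaled_dot_product_attention(q, k, v)  # the 'SDPA' rows
out_manual = manual_attention(q, k, v)
print((out_sdpa - out_manual).abs().max())  # same math, different kernels
```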
End-to-End Impact
| Model | SDPA | Manual | Best |
|---|---|---|---|
| da3-small | 21.8 img/s | 22.2 img/s | Manual |
| da3-base | 9.8 img/s | 10.7 img/s | Manual |
| da3-large | 3.8 img/s | 3.7 img/s | SDPA |
| da3-giant | 1.6 img/s | 1.6 img/s | Tie |
5. Adaptive Batching
The adaptive batching system dynamically adjusts batch size based on available GPU memory.
Test: 20 images with da3-large on MPS
| Strategy | Total Time | Throughput | Batches Used |
|---|---|---|---|
| Fixed B=1 | 5,612 ms | 3.6 img/s | [1,1,1...] |
| Fixed B=2 | 5,514 ms | 3.6 img/s | [2,2,2...] |
| Fixed B=4 | 8,305 ms | 2.4 img/s | [4,4,4,4,4] |
| Adaptive 85% | 5,637 ms | 3.5 img/s | [4,4,4...] |
Recommendation: For MPS with da3-large, a fixed batch size of 2 gives the best throughput. Adaptive batching (sketched after this list) is more valuable for:
- Variable input sizes
- Unknown GPU memory constraints
- Preventing OOM errors on smaller GPUs
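A minimal sketch of memory-based batch sizing; the 85% budget mirrors the "Adaptive 85%" row above, but the per-image estimate, heuristics, and function names are illustrative rather than the fork's actual API:

```python
import torch


def pick_batch_size(per_image_bytes: int, max_batch: int = 8,
                    budget: float = 0.85) -> int:
    """Cap the batch so estimated activation memory stays within budget."""
    if torch.cuda.is_available():
        free, _total = torch.cuda.mem_get_info()
    elif torch.backends.mps.is_available():
        # MPS exposes no free-memory query; use the recommended working-set
        # size minus current allocations as a proxy (recent PyTorch only).
        free = torch.mps.recommended_max_memory() - torch.mps.current_allocated_memory()
    else:
        return 1  # CPU: batching gains little (see the CPU table above)
    usable = int(free * budget)
    return max(1, min(max_batch, usable // per_image_bytes))
```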
6. Cross-Device Comparison
Inference Throughput (da3-large, batch=1)
MPS (Apple Silicon) ████████████████████████████████████████ 3.7 img/s
CPU ███ 0.3 img/s
MPS provides ~12x speedup over CPU for da3-large inference.
Attention Layer (ViT-L 518px, SDPA)
MPS ████████████████████████ 2.40 ms
CPU ███████████████████████████████████████ 3.75 ms
7. Optimization Recommendations
For Apple Silicon (MPS)
- Use model caching - 200x faster subsequent loads
- Batch size 2-4 for da3-small/base, batch 1-2 for da3-large/giant
- Let CPU handle preprocessing - it's faster than MPS for image transforms
- SDPA vs Manual: Both are similar; SDPA slightly better for larger models
For NVIDIA CUDA
- Enable GPU preprocessing with NVJPEG for JPEG inputs
- Use SDPA (Flash Attention) - significant speedup
- Larger batch sizes benefit more from GPU parallelism
- Adaptive batching to maximize VRAM utilization
For CPU-only
- Use smallest viable model (da3-small: 22x faster than da3-giant)
- Batch size 1 is optimal (memory bandwidth limited)
- SDPA provides 1.3-1.4x speedup on CPU
Running Benchmarks
```bash
# Quick benchmark (fewer iterations)
uv run python benchmarks/full_benchmark.py --quick

# Full benchmark on a specific device
uv run python benchmarks/full_benchmark.py --device mps
uv run python benchmarks/full_benchmark.py --device cuda
uv run python benchmarks/full_benchmark.py --device cpu

# Compare against upstream (requires the upstream repo)
uv run python benchmarks/comparative_benchmark.py --device all

# Skip specific tests
uv run python benchmarks/full_benchmark.py --skip-batching
```
Methodology
- Warmup: 2 inference passes before timing
- Runs: 3-5 iterations per configuration
- Synchronization: `torch.mps.synchronize()` / `torch.cuda.synchronize()` for accurate GPU timing
- Memory cleanup: `gc.collect()` + cache clearing between tests
- Input: synthetic 1280x720 RGB images (consistent across tests)
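A sketch of what such a timing harness looks like (the benchmark scripts' internals may differ):

```python
import gc
import time

import torch


def time_inference(model, batch, device: str, warmup: int = 2, runs: int = 5) -> float:
    """Time one forward pass: warmup first, synchronize around the clock."""
    sync = (torch.cuda.synchronize if device == "cuda"
            else torch.mps.synchronize if device == "mps"
            else lambda: None)
    with torch.no_grad():
        for _ in range(warmup):   # warmup: compile kernels, fill caches
            model(batch)
        sync()
        times = []
        for _ in range(runs):
            t0 = time.perf_counter()
            model(batch)
            sync()                # don't stop the clock before GPU work finishes
            times.append(time.perf_counter() - t0)
    gc.collect()                  # cleanup between configurations
    return min(times)
```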
Benchmarks last updated: December 2024
Hardware: Apple Silicon (M-series) | Software: PyTorch 2.9.0