Reference View Selection Strategy
Overview
Reference view selection is a component in multi-view depth estimation. When processing multiple input views, the model needs to determine which view should serve as the primary reference frame for depth prediction, defining the world coordinate system.
Different reference views lead to different reconstruction results. This is a known consideration in multi-view geometry and was analyzed in PI3. The choice of reference view can affect the quality and consistency of depth predictions across the scene.
Our Simple Solution: Automatic Reference View Selection
DA3 provides a simple approach to address this through automatic reference view selection based on class tokens. Instead of relying on heuristics or manual selection, the model analyzes the class token features from all input views and intelligently selects the most suitable reference frame.
Available Strategies
1. saddle_balanced (Recommended, Default)
Philosophy:
Select a view that achieves balance across multiple feature metrics. This strategy looks for a "middle ground" view that is neither too similar nor too different from other views, making it a stable reference point.
How it works:
- Extracts and normalizes class tokens from all views
- Computes three complementary metrics for each view:
- Similarity score: Average cosine similarity with other views
- Feature norm: L2 norm of the original features
- Feature variance: Variance across feature dimensions
- Normalizes each metric to [0, 1] range
- Selects the view whose normalized metrics are closest to 0.5 (the midpoint of the [0, 1] range) across all three metrics
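The steps above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the function names are hypothetical, and three details are assumptions the text does not specify (variance is taken over the normalized token, the population variance is used, and the three distances to 0.5 are combined by summing absolute differences).

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def variance(vec):
    # Population variance across feature dimensions (assumption).
    mean = sum(vec) / len(vec)
    return sum((x - mean) ** 2 for x in vec) / len(vec)


def min_max(values, eps=1e-8):
    # Rescale a list of per-view scores to the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def saddle_balanced(cls_tokens):
    """Return the index of the view whose metrics sit closest to 0.5.

    `cls_tokens` is a list of per-view feature vectors; needs >= 2 views.
    """
    n = len(cls_tokens)
    norms = [l2_norm(t) for t in cls_tokens]
    unit = [[x / norm for x in t] for t, norm in zip(cls_tokens, norms)]

    def cos(i, j):
        return sum(a * b for a, b in zip(unit[i], unit[j]))

    # Metric 1: average cosine similarity with the other views.
    sim = [sum(cos(i, j) for j in range(n) if j != i) / (n - 1) for i in range(n)]
    # Metric 2 is `norms`; metric 3: variance of the normalized token (assumption).
    var = [variance(u) for u in unit]

    metrics = [min_max(sim), min_max(norms), min_max(var)]
    # Sum of absolute distances to 0.5 over the three metrics (assumption).
    balance = [sum(abs(m[i] - 0.5) for m in metrics) for i in range(n)]
    return min(range(n), key=lambda i: balance[i])
```

With three toy 2-D tokens such as `[[1.0, 0.0], [2.0, 2.0], [0.0, 3.0]]`, the sketch picks the second view, whose norm lies between those of the other two.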
2. saddle_sim_range
Philosophy:
Select a view with the largest similarity range to other views. This identifies "saddle point" views that are highly similar to some views but dissimilar to others, making them information-rich anchor points.
How it works:
- Computes pairwise cosine similarity between all views
- For each view, calculates the range (max - min) of similarities to other views
- Selects the view with the maximum similarity range
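A minimal sketch of these three steps, again in plain Python. The function names are hypothetical and the code only illustrates the description above; it is not taken from the library.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def saddle_sim_range(cls_tokens):
    """Return the index of the view with the widest similarity range.

    `cls_tokens` is a list of per-view feature vectors; needs >= 2 views.
    """
    n = len(cls_tokens)
    ranges = []
    for i in range(n):
        # Similarities of view i to every other view (self-similarity excluded).
        sims = [cosine(cls_tokens[i], cls_tokens[j]) for j in range(n) if j != i]
        ranges.append(max(sims) - min(sims))
    return max(range(n), key=lambda i: ranges[i])
```

For the toy tokens `[[0, 1], [1, 1], [1, 0], [-1, 0]]`, the third view is selected: it is fairly similar to `[1, 1]` and opposite to `[-1, 0]`, so its similarities span the widest range.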
3. first (Not Recommended)
Philosophy:
Always use the first view in the input sequence as the reference.
How it works: Simply returns index 0.
When to use:
- Not recommended in general
- Only use when you have manually pre-sorted your views and know the first view is optimal
- Debugging or baseline comparisons
4. middle
Philosophy:
Select the view in the middle of the input sequence.
How it works:
Returns the view at index S // 2 where S is the number of views.
When to use:
- Only recommended when input images are temporally ordered
- Video sequences (e.g., DA3-LONG setting)
- Sequential captures where the middle frame likely has the most stable viewpoint
Specific use case: DA3-LONG
In video-based depth estimation scenarios (like DA3-LONG), where inputs are consecutive frames, middle is often the best choice because the middle frame tends to have the most overlap with the other frames.
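The overlap argument can be made concrete with a small sketch. It assumes that overlap between two frames shrinks as their distance in the sequence grows, which is typical for video but not guaranteed; the helper names are hypothetical.

```python
def middle_index(num_views):
    # Index used by the middle strategy, as described above.
    return num_views // 2


def worst_case_gap(ref_index, num_views):
    """Largest index distance between the reference and any other frame."""
    return max(ref_index, num_views - 1 - ref_index)


num_frames = 9
gaps = [worst_case_gap(i, num_frames) for i in range(num_frames)]
# The middle frame minimizes the worst-case distance to the other frames.
best = min(range(num_frames), key=lambda i: gaps[i])
```

For 9 frames the middle index is 4 and its farthest frame is 4 steps away, whereas the first frame is 8 steps from the last one.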
Usage
Python API
from depth_anything_3 import DepthAnything3

model = DepthAnything3.from_pretrained("depth-anything/DA3NESTED-GIANT-LARGE")

# Use default (saddle_balanced)
prediction = model.inference(
    images,
    ref_view_strategy="saddle_balanced"
)

# For video sequences, consider using middle
prediction = model.inference(
    video_frames,
    ref_view_strategy="middle"  # Good for temporal sequences
)

# For complex scenes with wide baselines
prediction = model.inference(
    images,
    ref_view_strategy="saddle_sim_range"
)
Command Line Interface
# Default (saddle_balanced)
da3 auto input/ --export-dir output/
# Explicitly specify strategy
da3 auto input/ --ref-view-strategy saddle_balanced
# For video processing
da3 video input.mp4 --ref-view-strategy middle
# For wide-baseline multi-view
da3 images captures/ --ref-view-strategy saddle_sim_range
When Selection Is Applied
Reference view selection is applied when:
- Number of views S ≥ 3
Recommendations
Quick Guide
| Scenario | Recommended Strategy | Rationale |
|---|---|---|
| Default / Unknown | saddle_balanced | Robust, balanced, works well across diverse scenarios |
| Video frames | middle | Temporal coherence, stable middle frame |
| Wide-baseline multi-view | saddle_sim_range | Maximizes information coverage |
| Pre-sorted inputs | first | Use only if you've manually optimized ordering |
| One or two views | first | Used automatically (no reordering needed for S ≤ 2) |
Best Practices
- Start with defaults: saddle_balanced works well in most cases
- Consider your input type: Use middle for videos, saddle_balanced for photos
- Experiment if needed: Try different strategies if results are suboptimal
- Monitor performance: Check glb quality and consistency across views.
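One way to experiment is to run the same inputs through each strategy and compare the results side by side. The helper below is a hypothetical convenience wrapper, not part of the library; it only assumes a callable that accepts the images and a ref_view_strategy keyword, as model.inference does in the usage examples above.

```python
STRATEGIES = ("saddle_balanced", "saddle_sim_range", "middle", "first")


def run_all_strategies(run_inference, images, strategies=STRATEGIES):
    """Call run_inference once per strategy and collect results by name.

    `run_inference` is any callable with the signature
    run_inference(images, ref_view_strategy=...).
    """
    return {
        name: run_inference(images, ref_view_strategy=name)
        for name in strategies
    }
```

With a loaded model, this could be called as `run_all_strategies(model.inference, images)`, after which each prediction can be exported and inspected separately.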
Technical Details
Selection Threshold
The reference view selection is only triggered when:
num_views >= 3 # At least 3 views required
For 1-2 views, no reordering is performed (equivalent to using first).
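The threshold behaves like a guard in front of whichever strategy is configured. The sketch below illustrates that behavior under the rule stated above; `pick_reference` and `selector` are hypothetical names, not library functions.

```python
MIN_VIEWS = 3  # threshold stated above


def pick_reference(views, selector):
    """Apply `selector` only when there are enough views.

    `selector` is any callable mapping the list of views to an index.
    With fewer than MIN_VIEWS views, index 0 is kept (same as first).
    """
    if len(views) < MIN_VIEWS:
        return 0
    return selector(views)


def middle(views):
    return len(views) // 2
```

For example, `pick_reference(["a", "b"], middle)` keeps index 0, while `pick_reference(["a", "b", "c", "d", "e"], middle)` returns 2.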
Implementation
The selection happens at layer alt_start - 1 in the vision transformer, before the first global attention layer. This ensures the selected reference view influences the entire depth prediction pipeline.
FAQ
Q: Why is this feature provided?
A: The model can handle any view order, but this feature provides automatic optimization for reference view selection, which can help improve depth prediction quality in multi-view scenarios.
Q: Does this add computational cost?
A: The overhead is negligible: the selection only computes a few statistics on the class tokens of the input views.
Q: Can I manually specify which view to use as reference?
A: Not directly through this parameter. You can pre-sort your input images to place your preferred reference view first and use ref_view_strategy="first".
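A small sketch of the pre-sorting workaround: move the preferred view to the front, and keep the permutation so per-view outputs can be mapped back to the original order afterwards. Both helpers are hypothetical, and whether the model returns per-view outputs in input order is an assumption worth verifying.

```python
def move_to_front(items, ref_index):
    """Return (reordered_items, order) with items[ref_index] placed first.

    `order[k]` is the original index of the item now at position k.
    """
    order = [ref_index] + [i for i in range(len(items)) if i != ref_index]
    return [items[i] for i in order], order


def restore_order(outputs, order):
    """Undo move_to_front on a list of per-view outputs."""
    restored = [None] * len(outputs)
    for position, original_index in enumerate(order):
        restored[original_index] = outputs[position]
    return restored
```

The reordered list can then be passed to model.inference with ref_view_strategy="first", as described in the answer above.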
Q: What happens if I don't specify this parameter?
A: The default saddle_balanced strategy is used automatically.
Q: Is this feature used in the DA3 paper benchmarks?
A: No, the paper used first as the default strategy for all multi-view experiments. The current default has been updated to saddle_balanced for better robustness.