# 📐 Reference View Selection Strategy

## 📖 Overview

Reference view selection is a key component of multi-view depth estimation. When processing multiple input views, the model must decide which view serves as the primary reference frame for depth prediction; that view defines the world coordinate system.

Different reference views lead to different reconstruction results. This is a known consideration in multi-view geometry and was analyzed in [PI3](https://arxiv.org/abs/2507.13347). The choice of reference view can affect the quality and consistency of depth predictions across the scene.


## 🚀 Our Simple Solution: Automatic Reference View Selection

DA3 provides a simple approach to address this through **automatic reference view selection** based on **class tokens**. Instead of relying on heuristics or manual selection, the model analyzes the class token features from all input views and intelligently selects the most suitable reference frame.

---

## 🎨 Available Strategies

### 1. ⚖️ `saddle_balanced` (Recommended, Default)

**Philosophy:**  
Select a view that achieves balance across multiple feature metrics. This strategy looks for a "middle ground" view that is neither too similar nor too different from other views, making it a stable reference point.

**How it works:**
1. Extracts and normalizes class tokens from all views
2. Computes three complementary metrics for each view:
   - **Similarity score**: Average cosine similarity with other views
   - **Feature norm**: L2 norm of the original features  
   - **Feature variance**: Variance across feature dimensions
3. Normalizes each metric to the [0, 1] range
4. Selects the view whose three normalized metrics are closest to 0.5 (the midpoint of the normalized range)

### 2. 🎢 `saddle_sim_range`

**Philosophy:**  
Select a view with the largest similarity range to other views. This identifies "saddle point" views that are highly similar to some views but dissimilar to others, making them information-rich anchor points.

**How it works:**
1. Computes pairwise cosine similarity between all views
2. For each view, calculates the range (max - min) of similarities to other views
3. Selects the view with the maximum similarity range
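
As a rough sketch of these three steps, the snippet below works on plain Python lists. The function name `select_saddle_sim_range` and the input layout (one class-token vector per view) are assumptions for illustration; the real code operates on batched model features and may differ in detail.

```python
import math

def select_saddle_sim_range(class_tokens):
    """Hypothetical sketch: index of the view with the largest (max - min)
    cosine similarity to the other views. Needs >= 2 non-zero vectors."""
    n = len(class_tokens)

    # L2-normalize so that a dot product equals cosine similarity.
    unit = []
    for token in class_tokens:
        norm = math.sqrt(sum(x * x for x in token))
        unit.append([x / norm for x in token])

    def sim_range(i):
        sims = [
            sum(a * b for a, b in zip(unit[i], unit[j]))
            for j in range(n)
            if j != i  # compare against the other views only
        ]
        return max(sims) - min(sims)

    return max(range(n), key=sim_range)

# View 1 is almost identical to view 2 but orthogonal to view 0,
# so it has the widest similarity range.
tokens = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.1]]
print(select_saddle_sim_range(tokens))  # prints 1
```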

---

### 3. 1️⃣ `first` (Not Recommended)

**Philosophy:**  
Always use the first view in the input sequence as the reference.

**How it works:**
Simply returns index 0.

**When to use:**
- ⛔ **Not recommended** in general
- 🔧 Only use when you have manually pre-sorted your views and know the first view is optimal
- 🐛 Debugging or baseline comparisons

---

### 4. ⏸️ `middle`

**Philosophy:**  
Select the view in the middle of the input sequence.

**How it works:**
Returns the view at index `S // 2` where S is the number of views.

**When to use:**
- ⏱️ **Only recommended when input images are temporally ordered**
- 🎬 Video sequences (e.g., **DA3-LONG** setting)
- 📹 Sequential captures where the middle frame likely has the most stable viewpoint

**Specific use case: DA3-LONG** 🎬  
In video-based depth estimation scenarios (like DA3-LONG), where the inputs are consecutive frames, `middle` is often the **optimal choice** because the middle frame tends to overlap most with the other frames.
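
A small worked example may make the overlap argument concrete. It assumes, purely for illustration, that overlap between two frames shrinks as their distance in the sequence grows; under that assumption the best reference is the one that minimizes the largest frame gap. The helper `max_frame_gap` is hypothetical and not part of DA3.

```python
def max_frame_gap(ref_idx, num_frames):
    """Largest index distance between the reference frame and any other frame."""
    return max(ref_idx, num_frames - 1 - ref_idx)

num_frames = 9
middle_idx = num_frames // 2  # same rule as the `middle` strategy: S // 2

print(max_frame_gap(0, num_frames))           # prints 8 (first frame as reference)
print(max_frame_gap(middle_idx, num_frames))  # prints 4 (middle frame as reference)
```

With nine frames, the middle reference is at most four frames away from any other frame, while the first frame is up to eight frames away from the last one.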


## 💻 Usage

### 🐍 Python API

```python
from depth_anything_3 import DepthAnything3

model = DepthAnything3.from_pretrained("depth-anything/DA3NESTED-GIANT-LARGE")

# Use default (saddle_balanced)
prediction = model.inference(
    images,
    ref_view_strategy="saddle_balanced"
)

# For video sequences, consider using middle
prediction = model.inference(
    video_frames,
    ref_view_strategy="middle"  # Good for temporal sequences
)

# For complex scenes with wide baselines
prediction = model.inference(
    images,
    ref_view_strategy="saddle_sim_range"
)
```

### 🖥️ Command Line Interface

```bash
# Default (saddle_balanced)
da3 auto input/ --export-dir output/

# Explicitly specify strategy
da3 auto input/ --ref-view-strategy saddle_balanced

# For video processing
da3 video input.mp4 --ref-view-strategy middle

# For wide-baseline multi-view
da3 images captures/ --ref-view-strategy saddle_sim_range
```

---

### 🎯 When Selection Is Applied

Reference view selection is applied when:
- 3️⃣ Number of views S ≥ 3

---

## 💡 Recommendations

### 📋 Quick Guide

| Scenario | Recommended Strategy | Rationale |
|----------|---------------------|-----------|
| **Default / Unknown** | `saddle_balanced` | Robust, balanced, works well across diverse scenarios |
| **Video frames** | `middle` | Temporal coherence, stable middle frame |
| **Wide-baseline multi-view** | `saddle_sim_range` | Maximizes information coverage |
| **Pre-sorted inputs** | `first` | Use only if you've manually optimized ordering |
| **One or two views** | `first` | Automatically used (no reordering needed for S ≤ 2) |

### ✨ Best Practices

1. 🎯 **Start with defaults**: `saddle_balanced` works well in most cases
2. 🎬 **Consider your input type**: Use `middle` for videos, `saddle_balanced` for photos
3. 🔬 **Experiment if needed**: Try different strategies if results are suboptimal
4. 📊 **Monitor performance**: Check `glb` quality and consistency across views

---

## 🔧 Technical Details

### 🎚️ Selection Threshold

The reference view selection is only triggered when:
```python
num_views >= 3  # At least 3 views required
```

For 1-2 views, no reordering is performed (equivalent to using `first`).
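
To show how the threshold interacts with the strategy names, here is a hypothetical dispatcher. The function `resolve_reference_index` and its `feature_selector` callback are invented for this sketch and are not DA3 APIs; only the strategy names, the `S // 2` rule, and the three-view threshold come from this document.

```python
def resolve_reference_index(num_views, strategy="saddle_balanced", feature_selector=None):
    """Hypothetical sketch: map a strategy name to a reference view index.

    feature_selector: assumed callable that takes the strategy name and
    returns an index computed from class-token features.
    """
    # Below the threshold nothing is reordered, which is equivalent to `first`.
    if num_views < 3 or strategy == "first":
        return 0
    if strategy == "middle":
        return num_views // 2
    if strategy in ("saddle_balanced", "saddle_sim_range"):
        if feature_selector is None:
            raise ValueError("feature-based strategies need a feature_selector")
        return feature_selector(strategy)
    raise ValueError(f"unknown strategy: {strategy!r}")

print(resolve_reference_index(2, "middle"))  # prints 0 (threshold not reached)
print(resolve_reference_index(5, "middle"))  # prints 2
```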

### ⚙️ Implementation

The selection happens at layer `alt_start - 1` in the vision transformer, before the first global attention layer. This ensures the selected reference view influences the entire depth prediction pipeline.

---

## ❓ FAQ

**Q: 🤔 Why is this feature provided?**  
A: The model can handle any view order, but this feature provides automatic optimization for reference view selection, which can help improve depth prediction quality in multi-view scenarios.

**Q: ⏱️ Does this add computational cost?**  
A: The overhead is negligible: selection only compares the class tokens of the input views, one token per view.

**Q: 🎮 Can I manually specify which view to use as reference?**  
A: Not directly through this parameter. You can pre-sort your input images to place your preferred reference view first and use `ref_view_strategy="first"`.
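
One way to do the pre-sorting is sketched below. The helper `move_to_front` and the file names are hypothetical; the only DA3-specific part is the `ref_view_strategy="first"` argument described above, shown here as a comment so the snippet runs without the model.

```python
def move_to_front(items, ref_idx):
    """Return a new list with items[ref_idx] first; the rest keep their order."""
    if not 0 <= ref_idx < len(items):
        raise IndexError(f"ref_idx {ref_idx} out of range for {len(items)} items")
    return [items[ref_idx]] + items[:ref_idx] + items[ref_idx + 1:]

image_paths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]  # hypothetical inputs
ordered = move_to_front(image_paths, 2)
print(ordered)  # prints ['c.jpg', 'a.jpg', 'b.jpg', 'd.jpg']

# Then pass the reordered inputs with the `first` strategy:
# prediction = model.inference(ordered, ref_view_strategy="first")
```

Any per-view outputs would then presumably follow the reordered sequence, so keep the permutation if you need to map results back to the original order.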

**Q: ⚙️ What happens if I don't specify this parameter?**  
A: The default `saddle_balanced` strategy is used automatically.

**Q: 📊 Is this feature used in the DA3 paper benchmarks?**  
A: No, the paper used `first` as the default strategy for all multi-view experiments. The current default has been updated to `saddle_balanced` for better robustness.