Document the 9 head architectures and arena protocol
Browse files
README.md
CHANGED
|
@@ -18,3 +18,27 @@ A systematic study of monocular depth head architectures operating on frozen vis
|
|
| 18 |
Standard practice treats the backbone and depth decoder as a joint system. Recent universal encoders produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on depth data. Under this regime, the head is the only variable.
|
| 19 |
|
| 20 |
This repository contains an arena framework for rapid comparison of depth head candidates and a collection of architectures spanning conventional decoders through novel minimal-parameter designs. All heads consume the same spatial feature tensor and produce per-pixel depth maps. The reference backbone is [EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) (86M parameters, frozen), but the framework is backbone-agnostic — the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 18 |
Standard practice treats the backbone and depth decoder as a joint system. Recent universal encoders produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on depth data. Under this regime, the head is the only variable.
|
| 19 |
|
| 20 |
This repository contains an arena framework for rapid comparison of depth head candidates and a collection of architectures spanning conventional decoders through novel minimal-parameter designs. All heads consume the same spatial feature tensor and produce per-pixel depth maps. The reference backbone is [EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) (86M parameters, frozen), but the framework is backbone-agnostic — the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.
|
| 21 |
+
|
| 22 |
+
## Heads
|
| 23 |
+
|
| 24 |
+
Nine architectures, all consuming a `[B, 768, H, W]` spatial feature tensor and producing `[B, 1, H_out, W_out]` metric depth in meters over the 0.001–10 range. Each head lives in its own folder under `heads/` with a single `head.py` implementation.
|
| 25 |
+
|
| 26 |
+
| Name | Architecture | Parameters |
|
| 27 |
+
|------|-------------|------------|
|
| 28 |
+
| `linear_probe` | BatchNorm + 1×1 conv → 256 depth bins, weighted-sum decode. The EUPE paper baseline. | ~199K |
|
| 29 |
+
| `cofiber_linear` | Cofiber decomposition + shared 1×1 conv per scale → 256-bin decode | ~197K |
|
| 30 |
+
| `cofiber_threshold` | Cofiber decomposition + per-scale LayerNorm + prototype prediction → 256-bin decode | ~202K |
|
| 31 |
+
| `wavelet` | Haar wavelet decomposition + per-subband prediction → 256-bin decode | ~590K |
|
| 32 |
+
| `log_linear` | Single 1×1 conv predicting log-depth, exponentiated and clamped | 769 |
|
| 33 |
+
| `ordinal_regression` | K independent threshold classifiers, depth = sum of positive predictions × bin width | ~49K |
|
| 34 |
+
| `multiscale_gradient` | Per-scale depth gradient prediction on cofiber bands, integrated for absolute depth | ~6K |
|
| 35 |
+
| `harmonic` | Cofiber edge detection + boundary depth prediction + Jacobi Laplace solve at non-edge locations | 770 |
|
| 36 |
+
| `renormalization` | Depth from per-scale cofiber energy weighted sum (one weight per scale) | 6 |
|
| 37 |
+
|
| 38 |
+
## Arena Framework
|
| 39 |
+
|
| 40 |
+
`arena.py` runs any head by name against cached NYU Depth V2 backbone features. The arena pre-extracts features once, then each candidate trains and evaluates without touching the backbone again. Training is SILog loss at 416×416 resolution against the indoor depth label space; evaluation reports root mean squared error (RMSE) on the test split.
|
| 41 |
+
|
| 42 |
+
## Status
|
| 43 |
+
|
| 44 |
+
Heads are implemented and importable through the `heads/` registry. The arena screening sweep across all 9 heads has not yet been run on a fresh NYU Depth V2 cache; results will be published here when available.
|