phanerozoic commited on
Commit
270650f
·
verified ·
1 Parent(s): a2860cd

Document the 9 head architectures and arena protocol

Browse files
Files changed (1) hide show
  1. README.md +24 -0
README.md CHANGED
@@ -18,3 +18,27 @@ A systematic study of monocular depth head architectures operating on frozen vis
18
  Standard practice treats the backbone and depth decoder as a joint system. Recent universal encoders produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on depth data. Under this regime, the head is the only variable.
19
 
20
  This repository contains an arena framework for rapid comparison of depth head candidates and a collection of architectures spanning conventional decoders through novel minimal-parameter designs. All heads consume the same spatial feature tensor and produce per-pixel depth maps. The reference backbone is [EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) (86M parameters, frozen), but the framework is backbone-agnostic — the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
18
  Standard practice treats the backbone and depth decoder as a joint system. Recent universal encoders produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on depth data. Under this regime, the head is the only variable.
19
 
20
  This repository contains an arena framework for rapid comparison of depth head candidates and a collection of architectures spanning conventional decoders through novel minimal-parameter designs. All heads consume the same spatial feature tensor and produce per-pixel depth maps. The reference backbone is [EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) (86M parameters, frozen), but the framework is backbone-agnostic — the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.
21
+
22
+ ## Heads
23
+
24
+ Nine architectures, all consuming a `[B, 768, H, W]` spatial feature tensor and producing `[B, 1, H_out, W_out]` metric depth in meters over the 0.001–10 range. Each head lives in its own folder under `heads/` with a single `head.py` implementation.
25
+
26
+ | Name | Architecture | Parameters |
27
+ |------|-------------|------------|
28
+ | `linear_probe` | BatchNorm + 1×1 conv → 256 depth bins, weighted-sum decode. The EUPE paper baseline. | ~199K |
29
+ | `cofiber_linear` | Cofiber decomposition + shared 1×1 conv per scale → 256-bin decode | ~197K |
30
+ | `cofiber_threshold` | Cofiber decomposition + per-scale LayerNorm + prototype prediction → 256-bin decode | ~202K |
31
+ | `wavelet` | Haar wavelet decomposition + per-subband prediction → 256-bin decode | ~590K |
32
+ | `log_linear` | Single 1×1 conv predicting log-depth, exponentiated and clamped | 769 |
33
+ | `ordinal_regression` | K independent threshold classifiers, depth = sum of positive predictions × bin width | ~49K |
34
+ | `multiscale_gradient` | Per-scale depth gradient prediction on cofiber bands, integrated for absolute depth | ~6K |
35
+ | `harmonic` | Cofiber edge detection + boundary depth prediction + Jacobi Laplace solve at non-edge locations | 770 |
36
+ | `renormalization` | Depth from per-scale cofiber energy weighted sum (one weight per scale) | 6 |
37
+
38
+ ## Arena Framework
39
+
40
+ `arena.py` runs any head by name against cached NYU Depth V2 backbone features. The arena pre-extracts features once, then each candidate trains and evaluates without touching the backbone again. Training is SILog loss at 416×416 resolution against the indoor depth label space; evaluation reports root mean squared error (RMSE) on the test split.
41
+
42
+ ## Status
43
+
44
+ Heads are implemented and importable through the `heads/` registry. The arena screening sweep across all 9 heads has not yet been run on a fresh NYU Depth V2 cache; results will be published here when available.