Depth Heads

A systematic study of monocular depth head architectures operating on frozen vision transformer features. Given a dense spatial feature grid from any frozen backbone, what is the most parameter-efficient architecture for per-pixel metric depth prediction?

Standard practice treats the backbone and depth decoder as a joint system. Recent universal encoders produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on depth data. Under this regime, the head is the only variable.

This repository contains an arena framework for rapid comparison of depth head candidates, together with a collection of architectures ranging from conventional decoders to novel minimal-parameter designs. All heads consume the same spatial feature tensor and produce per-pixel depth maps. The reference backbone is EUPE-ViT-B (86M parameters, frozen), but the framework is backbone-agnostic: the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.
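As a quick illustration of what a stride-16 feature grid means for head inputs, the sketch below computes the grid shape for a given image size. The function name `feature_grid_shape` is hypothetical and not part of this repository; the channel count of 768 matches the [B, 768, H, W] tensor described in the Heads section.

```python
def feature_grid_shape(height, width, stride=16, channels=768):
    """Return (C, H, W) of the spatial feature grid for one image.

    Hypothetical helper for illustration only. Assumes the backbone maps
    each non-overlapping stride x stride patch to one feature vector.
    """
    if height % stride or width % stride:
        raise ValueError("image size must be a multiple of the stride")
    return (channels, height // stride, width // stride)


# A 416x416 input (the arena training resolution) gives a 26x26 grid.
print(feature_grid_shape(416, 416))  # (768, 26, 26)
```

Any backbone whose output can be reshaped to this layout should, in principle, be usable with the same heads.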

Heads

The collection comprises nine architectures. Each consumes a [B, 768, H, W] spatial feature tensor and produces a [B, 1, H_out, W_out] metric depth map in meters, covering the range 0.001 to 10 m. Each head lives in its own folder under heads/ with a single head.py implementation.

Name	Architecture	Parameters
`linear_probe`	BatchNorm + 1×1 conv → 256 depth bins, weighted-sum decode. The EUPE paper baseline.	~199K
`cofiber_linear`	Cofiber decomposition + shared 1×1 conv per scale → 256-bin decode	~197K
`cofiber_threshold`	Cofiber decomposition + per-scale LayerNorm + prototype prediction → 256-bin decode	~202K
`wavelet`	Haar wavelet decomposition + per-subband prediction → 256-bin decode	~590K
`log_linear`	Single 1×1 conv predicting log-depth, exponentiated and clamped	769
`ordinal_regression`	K independent threshold classifiers, depth = sum of positive predictions × bin width	~49K
`multiscale_gradient`	Per-scale depth gradient prediction on cofiber bands, integrated for absolute depth	~6K
`harmonic`	Cofiber edge detection + boundary depth prediction + Jacobi Laplace solve at non-edge locations	770
`renormalization`	Depth from per-scale cofiber energy weighted sum (one weight per scale)	6
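The decode rules named in the table can be sketched per pixel in plain Python. This is a minimal illustration, not the code in heads/: the uniform bin spacing, the softmax over bin logits, the zero-logit decision threshold, and all function names are assumptions made for the example.

```python
import math

D_MIN, D_MAX = 0.001, 10.0  # depth range in meters, as stated above


def bin_centers(n_bins=256, d_min=D_MIN, d_max=D_MAX):
    """Centers of uniform-width bins over [d_min, d_max] (assumed spacing)."""
    width = (d_max - d_min) / n_bins
    return [d_min + (i + 0.5) * width for i in range(n_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum_decode(logits, centers):
    """Expected depth under the softmax distribution over bins."""
    return sum(p * c for p, c in zip(softmax(logits), centers))


def log_linear_decode(features, weights, bias, d_min=D_MIN, d_max=D_MAX):
    """1x1 conv at one pixel: dot product plus bias gives log-depth,
    which is exponentiated and clamped to the valid range."""
    log_depth = sum(w * f for w, f in zip(weights, features)) + bias
    return min(max(math.exp(log_depth), d_min), d_max)


def ordinal_decode(threshold_logits, d_min=D_MIN, d_max=D_MAX):
    """Count the thresholds predicted positive (logit > 0, an assumed
    decision rule) and multiply by the bin width."""
    width = (d_max - d_min) / len(threshold_logits)
    positives = sum(1 for z in threshold_logits if z > 0)
    return positives * width


# The log-linear head has one weight per channel plus one bias.
weights, bias = [0.0] * 768, math.log(2.0)
print(len(weights) + 1)  # 769
```

With uniform bin logits, the weighted-sum decode returns the midpoint of the depth range, which is a convenient sanity check when wiring up a new head.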

Arena Framework

arena.py runs any head by name against cached NYU Depth V2 backbone features. The arena extracts backbone features once and caches them; each candidate head then trains and evaluates on the cache without touching the backbone again. Training uses the scale-invariant logarithmic (SILog) loss at 416×416 resolution against the indoor depth label space; evaluation reports root mean squared error (RMSE) on the test split.
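For reference, the training loss and evaluation metric can be written out over flat lists of per-pixel depths. This is a sketch of the commonly used SILog formulation, not necessarily the exact variant in arena.py: the weighting `lam=0.85`, the `eps` floor, and the convention that ground-truth values of 0 mark invalid pixels are all assumptions.

```python
import math


def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss: sqrt(mean(d^2) - lam * mean(d)^2),
    where d = log(pred) - log(target) over valid pixels (target > 0)."""
    diffs = [
        math.log(max(p, eps)) - math.log(t)
        for p, t in zip(pred, target)
        if t > 0
    ]
    if not diffs:
        return 0.0
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


def rmse(pred, target):
    """Root mean squared error in meters over valid pixels (target > 0)."""
    errs = [(p - t) ** 2 for p, t in zip(pred, target) if t > 0]
    if not errs:
        raise ValueError("no valid pixels")
    return math.sqrt(sum(errs) / len(errs))


gt = [1.0, 2.0, 4.0]
print(silog_loss(gt, gt))                  # 0.0 for a perfect prediction
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

With `lam=1.0` the loss ignores a global scale error entirely; values below 1 reintroduce a penalty on the mean log error, which matters when the head must predict metric depth.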

Status

Heads are implemented and importable through the heads/ registry. The arena screening sweep across all nine heads has not yet been run on a fresh NYU Depth V2 cache; results will be published here when available.
