arxiv:2606.29600

One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models

Published on Jun 28 · Submitted by Xiaohao on Jun 30

Abstract

A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes make this ambiguity measurable: the same ray can pass through foreground glass and observe the background, turning the supervised target into a convention of annotation, data, and training rather than a scene-intrinsic truth. A learned predictor exposes this convention as its depth-layer preference. We introduce MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models exhibit diverse layer preferences under standard RGB input, showing that the same layered geometry can be resolved differently across models. We further find that Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. These results suggest that depth foundation models may express complementary geometric hypotheses that standard RGB inference leaves unexpressed. We invite the community to rethink depth supervision and evaluation through an ambiguity-aware lens, where multiple valid 3D interpretations are treated as geometric structure to be measured, preserved, and expressed.

Community

Paper submitter

One Scene, Two Depths studies a simple but overlooked question in monocular depth foundation models: under layered visibility, when one visual ray contains multiple visible and geometrically valid depths, which depth does the model choose?

Our key view is that single-depth prediction under ambiguity exposes a model’s depth-layer preference, rather than an unbiased scene-intrinsic truth. The label itself can become a convention shaped by sensors, annotation, datasets, training mixtures, and evaluation metrics.

We introduce MultiDepth-3k (MD-3k), a real-world transparent-scene benchmark with sparse two-layer ordinal annotations, to measure whether a model reports the transparent foreground or the visible background. We further propose Laplacian Visual Prompting (LVP), a training-free spectral input transformation that queries the same frozen model differently.
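As a rough illustration of what a training-free Laplacian input transformation could look like, the sketch below filters each color channel with a discrete Laplacian before the image would be passed to a frozen depth model. The 3x3 four-neighbour kernel, the edge padding, and the function name `laplacian_prompt` are assumptions made for this example; the post does not specify the exact operator or any rescaling that LVP uses.

```python
import numpy as np

# Assumed 3x3 four-neighbour Laplacian kernel; the paper's operator may differ.
LAPLACIAN_KERNEL = np.array(
    [[0.0, 1.0, 0.0],
     [1.0, -4.0, 1.0],
     [0.0, 1.0, 0.0]],
    dtype=np.float32,
)


def laplacian_prompt(rgb: np.ndarray) -> np.ndarray:
    """Filter an HxWx3 image with the Laplacian kernel, channel by channel.

    Hypothetical helper: returns a float32 array with the same shape as the
    input. Borders are handled by replicating edge pixels.
    """
    img = rgb.astype(np.float32)
    height, width = img.shape[:2]
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    # Accumulate the nine shifted, weighted copies of the padded image.
    for dy in range(3):
        for dx in range(3):
            weight = LAPLACIAN_KERNEL[dy, dx]
            if weight != 0.0:
                out += weight * padded[dy:dy + height, dx:dx + width, :]
    return out
```

Under this reading, the same frozen model would be run twice, once on the RGB image and once on the filtered image, and the two depth maps compared. How the filtered values are mapped back into the model's expected input range is left open here.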

A key finding is that some frozen single-output depth models can express complementary depth hypotheses under RGB vs. LVP inputs. On MD-3k, the strongest RGB/LVP pair reaches 75.5% ML-SRA, above the strict 56.4% ceiling that applies when a single depth hypothesis is duplicated for both layers. The same pair reaches 52.2% on Reverse cases, where by construction one depth map cannot satisfy both valid layer relations.
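To make the Reverse-case argument concrete, here is a minimal sketch of a two-layer ordinal score. It assumes each annotation is a point pair with one "which point is closer" label per layer, that a pair counts as correct only when the first depth map matches the layer-1 label and the second matches the layer-2 label, and that smaller values mean closer. This is an illustrative reading of ML-SRA, not the benchmark's official definition, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]                      # (row, col) pixel location
Annotation = Tuple[Point, Point, int, int]   # (a, b, layer-1 label, layer-2 label)
DepthMap = Sequence[Sequence[float]]


def closer_point(depth: DepthMap, a: Point, b: Point) -> int:
    """Return 0 if point a is closer than point b, else 1.

    Assumes smaller depth values mean closer; ties count as b being closer.
    """
    return 0 if depth[a[0]][a[1]] < depth[b[0]][b[1]] else 1


def two_layer_accuracy(
    depth_1: DepthMap, depth_2: DepthMap, annotations: List[Annotation]
) -> float:
    """Fraction of pairs where depth_1 matches layer 1 and depth_2 matches layer 2."""
    hits = 0
    for a, b, label_1, label_2 in annotations:
        ok_1 = closer_point(depth_1, a, b) == label_1
        ok_2 = closer_point(depth_2, a, b) == label_2
        hits += int(ok_1 and ok_2)
    return hits / len(annotations)


# One Reverse case: a is closer in layer 1, b is closer in layer 2.
reverse_case = [((0, 0), (0, 1), 0, 1)]
foreground_like = [[1.0, 3.0]]   # a closer than b
background_like = [[5.0, 4.0]]   # b closer than a

complementary = two_layer_accuracy(foreground_like, background_like, reverse_case)
duplicated = two_layer_accuracy(foreground_like, foreground_like, reverse_case)
```

In this toy case the complementary pair scores 1.0 while the duplicated map scores 0.0: a single map yields one ordering for the pair, so it cannot agree with two opposite labels at once.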

The broader implication is that single-depth prediction may be an incomplete interface for learned 3D world models: standard RGB inference may reveal only one preferred slice of richer multi-layer geometric knowledge.
