Abstract
SurGe improves 3D reconstruction accuracy by introducing a point map normal metric and combining point gradient matching loss with Neighborhood Attention Decoder for better local surface geometry estimation.
Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.
Community
We improve local accuracy in feedforward 3D reconstruction. Current point map models struggle with bending and oscillating artifacts on thin structures (chair legs, street lamps, etc.). These artifacts are easy to spot visually, but pointwise metrics like AbsRel do not capture them well.
We use a Neighborhood Attention Decoder (NAD). Like DPT-style heads, it decodes point maps progressively across scales, but it replaces conv-based local mixing with neighborhood attention and window-matched RoPE in ViT-like blocks.
This gives content-dependent local mixing without full self-attention at pixel resolution. In practice, it helps with thin structures and discontinuities, while also avoiding the patch artifacts we see with plain ViT/MLP decoders.
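To make the idea of content-dependent local mixing concrete, here is a minimal NumPy sketch of single-head 2D neighborhood attention. It is an illustration under assumptions, not the NAD implementation: the function name `neighborhood_attention`, the window size, and the choice to clamp windows at the image border are all hypothetical, and the sketch omits the learned projections, multiple heads, progressive upsampling, and window-matched RoPE mentioned above.

```python
import numpy as np

def neighborhood_attention(q, k, v, window=3):
    """Single-head neighborhood attention on (H, W, C) feature maps.

    Each query pixel attends only to a window x window block of keys
    around it. Near the border the block is shifted inward so every
    query still sees exactly window * window neighbors (an assumption
    of this sketch).
    """
    H, W, C = q.shape
    assert window <= H and window <= W, "window must fit inside the map"
    r = window // 2
    out = np.empty(v.shape, dtype=float)
    for i in range(H):
        # Clamp the window start so the block stays fully in bounds.
        i0 = min(max(i - r, 0), H - window)
        for j in range(W):
            j0 = min(max(j - r, 0), W - window)
            kn = k[i0:i0 + window, j0:j0 + window].reshape(-1, C)
            vn = v[i0:i0 + window, j0:j0 + window].reshape(window * window, -1)
            # Scaled dot-product scores over the local neighborhood only.
            logits = kn @ q[i, j] / np.sqrt(C)
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()
            out[i, j] = weights @ vn
    return out
```

The cost per pixel depends on the window size rather than on the image size, which is why this kind of mixing stays affordable at high resolution where full self-attention would not.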
We also reformulate scale-invariant gradient matching for point maps. This family of losses worked best for us when the main global error is relative. Our version keeps the pairwise scale-invariant behavior, but is directly applicable to 3D points instead of scalar depth only.
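A rough sketch of what a depth-normalized finite-difference loss on point maps could look like is given below. The exact formulation is not stated here, so everything specific is an assumption: the normalization by the sum of the two neighbors' depths, the L1 penalty, the step sizes, and the names `normalized_point_gradients` and `point_gradient_matching_loss` are all hypothetical. The sketch also assumes positive depths in the z channel.

```python
import numpy as np

def normalized_point_gradients(points, step=1, eps=1e-8):
    """Depth-normalized 3D finite differences of an (H, W, 3) point map.

    Each difference between two neighboring points is divided by the
    sum of their depths (assumed normalization), so scaling the whole
    point map by a positive constant leaves the result unchanged up to eps.
    """
    z = points[..., 2]
    gx = (points[:, step:] - points[:, :-step]) / (z[:, step:] + z[:, :-step] + eps)[..., None]
    gy = (points[step:, :] - points[:-step, :]) / (z[step:, :] + z[:-step, :] + eps)[..., None]
    return gx, gy

def point_gradient_matching_loss(pred, gt, steps=(1, 2, 4)):
    """Mean L1 gap between normalized gradients, averaged over step sizes."""
    total = 0.0
    for h in steps:
        pgx, pgy = normalized_point_gradients(pred, h)
        ggx, ggy = normalized_point_gradients(gt, h)
        total += np.abs(pgx - ggx).mean() + np.abs(pgy - ggy).mean()
    return float(total / len(steps))
```

Because each pair of neighbors is normalized by its own depths, a prediction that differs from the ground truth only by a global scale incurs (almost) no penalty, while local bending or oscillation changes the normalized differences and is penalized.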
For evaluation, we suggest a point map normal mean angular error as a complementary metric alongside global and local AbsRel. We compute normals from neighboring predicted 3D points and report the angular difference to the GT. Empirically, this matches our qualitative impression better.
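As a minimal sketch of such a metric, the code below derives normals from forward differences between neighboring 3D points and averages the angle to the normals derived the same way from the ground truth. The neighbor stencil, the validity-mask handling, and the function names are assumptions made for illustration; the evaluation protocol actually used may differ.

```python
import numpy as np

def normals_from_point_map(points, eps=1e-8):
    """Unit normals from an (H, W, 3) point map, shape (H-1, W-1, 3).

    Uses the cross product of forward differences to the right and
    lower neighbor (assumed stencil).
    """
    dx = points[:-1, 1:] - points[:-1, :-1]   # step along image x
    dy = points[1:, :-1] - points[:-1, :-1]   # step along image y
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, eps)

def normal_mean_angular_error(pred, gt, mask=None):
    """Mean angle in degrees between predicted and GT point map normals."""
    n_pred = normals_from_point_map(pred)
    n_gt = normals_from_point_map(gt)
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    if mask is not None:
        # A normal is valid only if all three contributing pixels are valid.
        valid = mask[:-1, :-1] & mask[:-1, 1:] & mask[1:, :-1]
        ang = ang[valid]
    return float(ang.mean())
```

For example, a flat ground-truth plane compared against a prediction tilted by 45 degrees yields an error of 45 degrees, even if the pointwise depth error of that prediction is small near the tilt axis.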
On zero-shot monocular geometry benchmarks, SurGe gets the best average rank for global point map AbsRel among SotA methods. More importantly, it improves local point map and point map normal metrics, suggesting better local surface geometry, which is consistent with what we see qualitatively.