arxiv:2605.31577

SurGe: Improved Surface Geometry in Point Maps

Published on May 29

Submitted by Karim Knaebel on Jun 1

RWTH Computer Vision Group


Authors:

Karim Knaebel,

Gonzalo Martin Garcia,

Abstract

SurGe improves 3D reconstruction accuracy by introducing a point map normal metric and by combining a point gradient matching loss with a Neighborhood Attention Decoder for better local surface geometry estimation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.


Community

karimknaebel

Paper author Paper submitter about 17 hours ago

We improve local accuracy in feedforward 3D reconstruction. Current point map models produce bending and oscillating artifacts on thin structures (chair legs, street lamps, etc.). These are easy to spot visually, but not well captured by pointwise metrics like AbsRel.

We use a Neighborhood Attention Decoder (NAD). Like DPT-style heads, it decodes point maps progressively across scales, but it replaces conv-based local mixing with neighborhood attention and window-matched RoPE in ViT-like blocks.

This gives content-dependent local mixing without full self-attention at pixel resolution. In practice, it helps with thin structures and discontinuities, while also avoiding the patch artifacts we see with plain ViT/MLP decoders.
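
As a rough illustration of this kind of local mixing, here is a minimal NumPy sketch of single-head neighborhood attention. It is not the NAD implementation: the function name, the default 3x3 window, and the border handling (shifting the window so it stays inside the map) are assumptions, and the sketch leaves out the learned projections, multiple heads, and window-matched RoPE.

```python
import numpy as np

def neighborhood_attention(q, k, v, window=3):
    """Hypothetical helper: q, k are (H, W, C), v is (H, W, Cv), with H, W >= window.

    Each pixel attends only to a window x window block of keys and values.
    ASSUMPTION: near the border the block is shifted to stay inside the map,
    so every pixel always sees exactly window * window neighbors.
    """
    H, W, C = q.shape
    r = window // 2
    out = np.zeros(v.shape, dtype=float)
    for i in range(H):
        i0 = min(max(i - r, 0), H - window)  # top row of the block
        for j in range(W):
            j0 = min(max(j - r, 0), W - window)  # left column of the block
            kn = k[i0:i0 + window, j0:j0 + window].reshape(-1, C)
            vn = v[i0:i0 + window, j0:j0 + window].reshape(-1, v.shape[-1])
            logits = kn @ q[i, j] / np.sqrt(C)  # scaled dot-product scores
            weights = np.exp(logits - logits.max())  # numerically stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ vn  # content-dependent local average
    return out
```

The weights depend on the query and keys, so the mixing is content-dependent, but the cost per pixel is fixed by the window size rather than by the image size. The explicit loops are for readability only.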

We also reformulate scale-invariant gradient matching for point maps. This family of losses worked best for us when the main global error is relative. Our version keeps the pairwise scale-invariant behavior, but applies directly to 3D points rather than only to scalar depth.
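
To make the idea concrete, the following NumPy sketch shows one way a scale-invariant gradient matching loss on point maps could look. The exact formulation in the paper may differ: normalizing each 3D finite difference by the summed depths of the two pixels, the L1 penalty, the step sizes, and both function names are assumptions, modeled on common scale-invariant gradient losses for depth.

```python
import numpy as np

def normalized_point_gradients(points, step=1, eps=1e-8):
    """Hypothetical helper: points is an (H, W, 3) point map with depth in z.

    Returns horizontal and vertical 3D finite differences, each divided by the
    summed depth of the two pixels involved (ASSUMED normalization).
    """
    z = np.abs(points[..., 2])
    dx = points[:, step:] - points[:, :-step]
    dy = points[step:, :] - points[:-step, :]
    gx = dx / (z[:, step:] + z[:, :-step] + eps)[..., None]
    gy = dy / (z[step:, :] + z[:-step, :] + eps)[..., None]
    return gx, gy

def point_gradient_matching_loss(pred, gt, steps=(1, 2, 4)):
    """Mean L1 distance between normalized gradients, averaged over step sizes."""
    total = 0.0
    for s in steps:
        pgx, pgy = normalized_point_gradients(pred, s)
        ggx, ggy = normalized_point_gradients(gt, s)
        total += np.abs(pgx - ggx).mean() + np.abs(pgy - ggy).mean()
    return total / len(steps)
```

Because numerator and denominator scale together, multiplying a prediction by a global factor leaves the loss essentially unchanged, which is the pairwise scale-invariant behavior described above.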

For evaluation, we suggest a point map normal mean angular error as a complementary metric alongside global and local AbsRel. We compute normals from neighboring predicted 3D points and report the angular difference to the ground-truth normals. Empirically, this tracks our qualitative impression more closely than AbsRel alone.
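
A minimal NumPy sketch of such a metric is given below. It assumes normals are taken as the cross product of forward differences to the right and lower neighbors, and that ground-truth normals are computed the same way from a ground-truth point map. The function names, the choice of neighbors, and the validity threshold are assumptions, not the paper's exact protocol.

```python
import numpy as np

def normals_from_point_map(points, eps=1e-12):
    """Hypothetical helper: points is (H, W, 3). Returns unit normals of shape
    (H-1, W-1, 3) and a boolean mask marking non-degenerate normals."""
    dx = points[:-1, 1:] - points[:-1, :-1]  # difference to the right neighbor
    dy = points[1:, :-1] - points[:-1, :-1]  # difference to the lower neighbor
    n = np.cross(dx, dy)
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    valid = length[..., 0] > eps
    return n / np.maximum(length, eps), valid

def normal_mean_angular_error(pred, gt):
    """Mean angle in degrees between normals induced by pred and gt point maps."""
    n_pred, v_pred = normals_from_point_map(pred)
    n_gt, v_gt = normals_from_point_map(gt)
    valid = v_pred & v_gt
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos[valid])).mean())
```

For example, comparing a plane tilted by 45 degrees against a fronto-parallel plane yields an error of 45 degrees, even if the pointwise depth error of the two maps is small near their intersection.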

On zero-shot monocular geometry benchmarks, SurGe gets the best average rank for global point map AbsRel among SotA methods. More importantly, it improves local point map and point map normal metrics, suggesting better local surface geometry. This matches what we see qualitatively.