Title: RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction
URL Source: https://arxiv.org/html/2312.01215
Published Time: Thu, 02 May 2024 20:19:14 GMT
Robin Bruneau 1,2, Yvain Quéau 3, Jean Mélou 1, François Bernard Lauze 2, Jean-Denis Durou 1, Lilian Calvet 4
1 IRIT, UMR CNRS 5505, Toulouse, France
2 DIKU, Copenhagen, Denmark
3 Normandie Univ, UNICAEN, ENSICAEN, CNRS, GREYC, Caen, France
4 OR-X, Balgrist Hospital, University of Zurich, Zürich, Switzerland
Abstract
This paper introduces a versatile paradigm for integrating multi-view reflectance (optional) and normal maps acquired through photometric stereo. Our approach employs a pixel-wise joint re-parameterization of reflectance and normal, considering them as a vector of radiances rendered under simulated, varying illumination. This re-parameterization enables the seamless integration of reflectance and normal maps as input data in neural volume rendering-based 3D reconstruction while preserving a single optimization objective. In contrast, recent multi-view photometric stereo (MVPS) methods depend on multiple, potentially conflicting objectives. Despite its apparent simplicity, our proposed approach outperforms state-of-the-art approaches in MVPS benchmarks across F-score, Chamfer distance, and mean angular error metrics. Notably, it significantly improves the detailed 3D reconstruction of areas with high curvature or low visibility.
1 Introduction
Automatic 3D reconstruction is pivotal in various fields, such as archaeological and cultural heritage (virtual reconstruction), medical imaging (surgical planning), virtual and augmented reality, games and film production.
Multi-view stereo (MVS) [5], which retrieves the geometry of a scene seen from multiple viewpoints, is the best-known 3D reconstruction solution. Coupled with neural volumetric rendering (NVR) techniques [23], it effectively handles complex structures and self-occlusions. However, non-Lambertian scenes remain a challenge, because the underlying brightness consistency assumption breaks down. The problem is also ill-posed in certain configurations, e.g., poorly textured scenes [26] or degenerate viewpoint configurations with limited baselines. Moreover, despite recent efforts in this direction [14], recovering the thinnest geometric details remains difficult under fixed illumination. In such a setting, estimating the reflectance of the scene also remains a challenge.
Figure 1: One image from DiLiGenT-MV’s Buddha dataset[13], and 3D reconstruction results from several recent MVPS methods: [27, 12, 28] and ours. The latter provides the fine details closest to the ground truth (GT), while being remarkably simpler.
On the other hand, photometric stereo (PS) [25], which relies on a collection of images acquired under varying lighting, excels in the recovery of high-frequency details in the form of normal maps. It is also the only photographic technique that can estimate reflectance. With the recent advent of deep learning techniques [8], PS has gained enough maturity to handle non-Lambertian surfaces and complex illumination. Yet, its reconstruction of the geometry's low frequencies remains suboptimal.
Given these complementary characteristics, the integration of MVS and PS seems natural. This integration, known as multi-view photometric stereo (MVPS), aims to reconstruct geometry from multiple views and illumination conditions. Recent MVPS solutions jointly solve MVS and PS within a multi-objective optimization, potentially losing the thinnest details due to the possible incompatibility of these objectives – see Fig.1. In this work, we explore a simpler route for solving MVPS by decoupling the two problems.
We start with the observation that recent PS techniques deliver exceptionally high-quality reflectance and normal maps, which we use as input data. To accurately reconstruct the surface reflectance and geometry, we need to fuse these maps, a challenging task within a single-objective optimization due to their inhomogeneity. Our method provides a solution to this problem by combining NVR with a simple and effective pixel-wise re-parameterization.
In this method, the input reflectance and normal for each pixel are merged into a vector of radiances simulated under arbitrary, varying illumination. We then adapt an NVR pipeline to optimize the consistency of these simulations with respect to the scene reflectance and geometry, modeled as the zero level set of a trained signed distance function (SDF). Coupled with a state-of-the-art PS method such as [8] for obtaining the input reflectance and normals, this approach yields an MVPS pipeline reaching an unprecedented level of fine detail, as illustrated in Fig. 1. Besides being the first to exploit reflectance as a prior, our proposed MVPS paradigm is extremely versatile: it is compatible with any existing or future PS method, whether calibrated or uncalibrated, deep learning-based or a classic optimization procedure.
The rest of this work is organized as follows. Sect.2 discusses state-of-the-art MVPS methods. The proposed 3D reconstruction from reflectance and normals is detailed in Sect.3. Sect.4 then sketches a proposal for an MVPS algorithm based on this approach. Sect.5 extensively evaluates this algorithm, before our conclusions are drawn in Sect.6.
2 Related work
Classical methods
The first paper to deal with MVPS is by Hernandez et al. [6]. To avoid having to arbitrate the conflicts between the different normal maps, a 3D mesh is iteratively deformed, starting from the visual hull until the images recomputed using the Lambertian model match the original images, while penalizing the discrepancy between the PS normals and those of the 3D mesh. No prior knowledge of camera poses or illumination is required. Under the same assumptions, Park et al. [20, 21] start from a 3D mesh obtained by SfM (structure-from-motion) and MVS. Simultaneous estimation of reflectance, normals and illumination is achieved by uncalibrated PS, using the normals from the 3D mesh to remove the ambiguity, and estimating the details of the relief through 2D displacement maps.
MVPS is solved for the first time with an SDF representation of the surface by Logothetis et al. [15]. Therein, illumination is represented as near point light sources, which are assumed calibrated, as are the camera poses. Thanks to a voxel-based implementation, the surface details are better rendered than with the method of Park et al. [21].
Li et al. [13] refine a 3D mesh obtained by propagating the SfM points according to [18], and estimate the BRDF using a calibrated setup. The creation of the public dataset "DiLiGenT-MV" numerically validates the improved results, in comparison with those of [21].
Deep learning-based methods
Kaya et al. [11] proposed a solution to MVPS based on neural radiance fields (NeRFs) [17]. For each viewpoint, a normal map is obtained using a pre-trained PS network, before a NeRF is adapted to account for the input surface normals from PS in the color function. The recovered geometry nonetheless remains perfectible, according to [10]. Therein, the authors propose learning an SDF whose zero level set best explains the pixel depth and normal maps obtained by a pre-trained MVS [22] or PS network [7], respectively. To manage conflicting objectives in the proposed multi-objective optimization and get the best out of the MVS and PS predictions, both networks are modified to output uncertainty measures on the depth and normal predictions. The SDF optimization is then carried out while accounting for the inferred uncertainties.
PS-NeRF [27] solves MVPS by jointly estimating the geometry, material and illumination. To this end, the authors propose to regularize the gradient of a UNISURF[19] using the normal maps from PS, while relying on multi-layer perceptrons (MLPs) to explicitly model surface normals, BRDF, illumination, and visibility. These MLPs are optimized based on a shadow-aware differentiable rendering layer. A similar track is followed in[2], where NeRFs are combined with a physically-based differentiable renderer.
Figure 2: Overview of the proposed MVPS pipeline. The reflectance and normal maps provided for each view by PS are fused, by combining volume rendering with a pixel-wise re-parameterization of the inputs using physically-based rendering.
Such NeRF-based approaches provide undeniably better 3D reconstructions than classical methods, yet they remain computationally intensive. Recently, Zhao et al. [28] proposed a fast deep learning-based solution to MVPS. Aggregated shading patterns are matched across viewpoints so as to predict pixel depths and normal maps.
In [12], the authors complement [10] with an NVR loss term, in order to benefit from the reliability of NVR in reconstructing objects with diverse material types. This, however, results in a multi-objective optimization comprising three loss terms (besides the eikonal term). Similar to [10], the uncertainty-based hyper-parameter tuning does not completely eliminate conflicting objectives, which may induce a loss of fine-scale details. In contrast, we propose a single-objective optimization based on an ad hoc re-parameterization, which leads to the seamless integration of PS results into standard NVR pipelines. This is detailed in the next section.
3 Proposed approach
Our aim is to infer a surface whose geometric and photometric properties are consistent with the per-view PS results. To do so, we resort to a volume rendering framework coupled with a re-parameterization of the inputs, as illustrated in Fig.2 and detailed in the rest of this section.
3.1 Overview
Input data
From the $N$ image sets captured under fixed viewpoint and varying illumination, PS provides $N$ reflectance and normal maps, out of which we extract a batch of $m$ posed reflectance and normal values $\{r_k \in \mathbb{R}, \mathbf{n}_k \in \mathbb{S}^2\}_{k=1\dots m}$. Here, the normal vectors are expressed in world coordinates using the known camera poses. The input reflectance is, without loss of generality, represented by a scalar (albedo). Let us emphasize that this assumption does not imply that the observed scene must be Lambertian, but rather that we use only the diffuse component of the estimated reflectance. Using other reflectance components (specularity, roughness, etc.), if available, would represent a straightforward extension to more evolved physically-based rendering (PBR) models. Yet, we leave such an extension as a perspective for now, since few PS methods reliably provide such data. Also, if the PS method provides no reflectance, one can set $r_k \equiv 1$ and use the proposed framework for multi-view normal integration.
Surface parameterization
Our aim is to infer a 3D model of a scene, which consists of both a geometric map $f: \mathbb{R}^3 \to \mathbb{R}$ and a photometric one $\rho: \mathbb{R}^3 \to \mathbb{R}$. Therein, $f$ associates a 3D point with its signed distance to the surface, which is thus given by the zero level set of $f$: $\mathcal{S} = \{\mathbf{x} \in \mathbb{R}^3 \,|\, f(\mathbf{x}) = 0\}$. Regarding $\rho$, it encodes the reflectance associated with a 3D point. For input consistency, $\rho$ is considered a scalar function (albedo), though more advanced PBR models could again be incorporated.
Objective function
Our method builds upon a re-parameterization $\mathbf{v}: \mathbb{S}^2 \times \mathbb{R} \to \mathbb{R}^n$ which combines a surface normal $\mathbf{n}_k \in \mathbb{S}^2$ and a reflectance value $r_k \in \mathbb{R}$ into a vector $\mathbf{v}(\mathbf{n}_k, r_k) \in \mathbb{R}^n$ of $n$ radiance values that are simulated by physically-based rendering, using an arbitrary image formation model under varying illumination. Given this re-parameterization, the 3D reconstruction problem amounts to minimizing the difference between a batch of $m$ intensity vectors simulated either from the input data or from volume rendering with the same PBR model, along with a regularization on the SDF:
$$\min_{f,\rho} \; \sum_{k=1}^{m} \left\| \mathbf{v}(\mathbf{n}_k, r_k) - \tilde{\mathbf{v}}_k(f,\rho) \right\|_1 + \lambda \, \mathcal{L}_{\text{reg}}(f). \tag{1}$$
Here, $\{(\mathbf{n}_k, r_k)\}_{k=1\dots m}$ stands for the batch of input reflectance and normal values, $\mathbf{v}(\mathbf{n}_k, r_k)$ for the $k$-th intensity vector simulated from the input data, $\tilde{\mathbf{v}}_k(f,\rho)$ for the corresponding one simulated by volume rendering, and $\lambda > 0$ is a tunable hyper-parameter balancing the data fidelity with the regularizer $\mathcal{L}_{\text{reg}}$. The actual optimization can then be carried out seamlessly by resorting to a volume rendering-based 3D reconstruction pipeline such as NeuS [23], given that both $\tilde{\mathbf{v}}_k(f,\rho)$ and $\mathbf{v}(\mathbf{n}_k, r_k)$ correspond to pixel intensities. Let us now detail how we simulate the latter intensities $\mathbf{v}(\mathbf{n}_k, r_k)$ from the input reflectance and normal data.
3.2 Reflectance and normal re-parameterization
The input reflectance values $\{r_k \in \mathbb{R}\}_k$ and normals $\{\mathbf{n}_k \in \mathbb{S}^2\}_k$ constitute inhomogeneous quantities: the former are photometric scalars, and the latter geometric vectors lying on the three-dimensional unit sphere. Directly optimizing their consistency with the scene normal $\frac{\nabla f}{\|\nabla f\|}$ and albedo $\rho$ would lead to multiple objectives balanced by hyper-parameters.
Instead, we propose to jointly re-parameterize the reflectance and normal data into a set of vectors $\{\mathbf{v}(\mathbf{n}_k, r_k) \in \mathbb{R}^n\}_k$ of homogeneous quantities, namely radiance values simulated using a PBR model under varying illumination. In order to enforce the bijectivity of this re-parameterization, we choose as PBR model the linear Lambertian one, under pixel-wise varying illumination represented by $n = 3$ arbitrary illumination vectors $\mathbf{l}_{k,1}, \mathbf{l}_{k,2}, \mathbf{l}_{k,3} \in \mathbb{R}^3$:
$$\mathbf{v}(\mathbf{n}_k, r_k) = r_k \left[ \mathbf{n}_k^\top \mathbf{l}_{k,1}, \; \mathbf{n}_k^\top \mathbf{l}_{k,2}, \; \mathbf{n}_k^\top \mathbf{l}_{k,3} \right]^\top = r_k \, \mathsf{L}_k \, \mathbf{n}_k, \tag{2}$$
with $\mathsf{L}_k = \left[ \mathbf{l}_{k,1}, \mathbf{l}_{k,2}, \mathbf{l}_{k,3} \right]^\top$ the arbitrary per-pixel illumination matrix.
For the re-parameterization to be bijective, the reflectance $r_k$ must be non-null (a basic assumption in photographic 3D vision), and $\mathsf{L}_k$ must be non-singular, i.e., the lighting directions must be chosen linearly independent. Then, the original reflectance and normal can be retrieved from the simulated intensities by $r_k = \|\mathsf{L}_k^{-1} \mathbf{v}(\mathbf{n}_k, r_k)\|$ and $\mathbf{n}_k = \frac{\mathsf{L}_k^{-1} \mathbf{v}(\mathbf{n}_k, r_k)}{\|\mathsf{L}_k^{-1} \mathbf{v}(\mathbf{n}_k, r_k)\|}$.
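To make the re-parameterization concrete, here is a minimal NumPy sketch of Eq. (2) and its inversion. The function names are illustrative, and the lighting matrix below is an arbitrary non-singular choice rather than an optimized triplet:

```python
import numpy as np

def simulate_radiance(n, r, L):
    """Eq. (2): render (n_k, r_k) as three Lambertian radiance values."""
    return r * (L @ n)          # v = r * L n, shape (3,)

def invert_radiance(v, L):
    """Recover (n_k, r_k) from the simulated intensities (L non-singular)."""
    u = np.linalg.solve(L, v)   # L^{-1} v = r * n
    r = np.linalg.norm(u)       # |r * n| = r, since |n| = 1 and r > 0
    return u / r, r

# three linearly independent lighting directions (arbitrary choice)
L = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
n = np.array([0.0, 0.6, 0.8])   # unit normal
r = 0.5                         # non-null albedo
v = simulate_radiance(n, r, L)
n_rec, r_rec = invert_radiance(v, L)
```

The round trip recovers the inputs exactly, which is the bijectivity property the text relies on.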
Considering $n > 3$ illumination vectors and resorting to the pseudo-inverse operator might induce more robustness, but at the price of losing bijectivity and thus not entirely relying on the PS inputs. We leave this as possible future work, which might be particularly interesting when the PS inputs are uncertain, or when considering more evolved PBR models involving additional reflectance clues such as roughness, anisotropy or specularity.
In practice, each arbitrary triplet of light directions $\mathbf{l}_{k,1}, \mathbf{l}_{k,2}, \mathbf{l}_{k,3}$ can be chosen to minimize the uncertainty on the normal estimate. To this end, the illumination triplet proposed in [4] can be considered. Therein, the authors show that the optimal configuration for three images consists of vectors equally spaced in tilt by 120 degrees, with a constant slant of 54.74 degrees (with respect to $\mathbf{n}_k$).
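Such a triplet can be built by rotating around the input normal. The following NumPy sketch is our own illustration of the configuration described in [4] (the function name and the tangent-basis construction are not from the paper):

```python
import numpy as np

def optimal_triplet(n, slant_deg=54.74):
    """Three unit lighting directions around normal n: tilts 120 degrees
    apart, constant slant with respect to n (configuration of [4])."""
    n = n / np.linalg.norm(n)
    # build any tangent basis (t1, t2) orthogonal to n
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n, a); t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    s, c = np.sin(np.deg2rad(slant_deg)), np.cos(np.deg2rad(slant_deg))
    tilts = np.deg2rad([0.0, 120.0, 240.0])
    return np.stack([s * (np.cos(t) * t1 + np.sin(t) * t2) + c * n for t in tilts])

n0 = np.array([0.0, 0.0, 1.0])
L_opt = optimal_triplet(n0)   # rows are l_{k,1}, l_{k,2}, l_{k,3}
```

By construction every direction makes the same angle with the normal, so all three simulated shadings are equal and well away from the self-shadowing regime, and the resulting matrix is non-singular as required for bijectivity.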
Let us remark that with the above linear model, it is possible to simulate negative radiance values when one of the dot products between the normal and the lighting vectors is negative, which corresponds to self-shadowing. While negative radiance values are obviously not physically plausible, this is not a problem for the proposed re-parameterization, as long as it remains consistent with the NVR strategy, which we now detail.
3.3 Volume rendering-based 3D reconstruction
We now turn our attention to deriving the volume rendering function $\tilde{\mathbf{v}}_k$ arising in Eq. (1). The role of this function is to simulate, from the scene geometry $f$ and albedo $\rho$, an intensity vector $\tilde{\mathbf{v}}_k$ which will be compared with the vector $\mathbf{v}_k$ simulated from the inputs as described in the previous paragraph.
Our solution largely takes inspiration from the NeuS method [23], which was initially proposed as a solution to the single-light multi-view 3D surface reconstruction problem. Therein, the rendering function follows a volume rendering scheme which accumulates the colors along the ray corresponding to the $k$-th pixel. Denoting by $\mathbf{o}_k \in \mathbb{R}^3$ the camera center for this observation, and by $\mathbf{d}_k$ the corresponding viewing direction, this ray is written $\{\mathbf{x}_k(t) = \mathbf{o}_k + t \, \mathbf{d}_k \,|\, t \geq 0\}$. By extending the NeuS volume renderer to the multi-illumination scenario, each coefficient $\tilde{v}_{k,l}$ of $\tilde{\mathbf{v}}_k$ is then given, $\forall l \in \{1,2,3\}$, by:
$$\tilde{v}_{k,l} = \int_{t_n}^{t_f} w(t, f(\mathbf{x}_k(t))) \, c_l(\mathbf{x}_k(t)) \, \mathrm{d}t, \tag{3}$$
where $t_n, t_f$ stand for the range bounds over which the colors are accumulated. The weight function $w$ is constructed from the SDF $f$ in order to ensure that it is both occlusion-aware and locally maximal on the zero level set; see [23] for details. As for the functions $c_l: \mathbb{R}^3 \to \mathbb{R}$, they represent the scene's apparent color. In the original NeuS framework, this color depends not only on the 3D location, but also on the viewing direction $\mathbf{d}_k$, and it is directly optimized along with the SDF $f$. Our case, where the albedo is optimized in lieu of the apparent color, and the illumination varies with the data index $k$ and the illumination index $l$, is however slightly different.
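To illustrate the accumulation in Eq. (3), here is a simplified NumPy discretization. The weights below are a stand-in for the exact NeuS construction: a logistic CDF of the SDF with a fixed sharpness (here 50) replaces NeuS's learned scale, so this is a sketch of the scheme, not the authors' implementation:

```python
import numpy as np

def render_ray(ts, sdf_vals, colors, sharp=50.0):
    """Discretized Eq. (3): accumulate colors along one ray with
    occlusion-aware weights built from the SDF (NeuS-style alphas)."""
    sig = 1.0 / (1.0 + np.exp(-sharp * sdf_vals))      # logistic CDF of f
    # alpha_i = max((Phi(f_i) - Phi(f_{i+1})) / Phi(f_i), 0)
    alphas = np.clip((sig[:-1] - sig[1:]) / np.maximum(sig[:-1], 1e-6), 0.0, 1.0)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]  # transmittance
    w = trans * alphas                                  # locally maximal at f = 0
    return (w * colors[:-1]).sum(), w

# ray through a sphere of radius 0.5 centered at the origin, from z = -2 along +z
ts = np.linspace(0.0, 4.0, 256)
z = -2.0 + ts
sdf = np.abs(z) - 0.5               # true SDF restricted to this ray (x = y = 0)
colors = np.ones_like(ts)           # constant apparent color
v, w = render_ray(ts, sdf, colors)
```

With a constant color of 1, the rendered value is close to 1 and the weights peak where the ray first crosses the zero level set (here near $t = 1.5$), which is the occlusion-aware behavior the text describes.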
As a major difference with this prototypical NVR-based 3D reconstruction method, we optimize the SDF $f$ and the surface albedo, i.e., the scene's intrinsic color $\rho$, rather than its apparent color $c_l$. The dependency upon the viewing direction must thus be removed, in order to ensure consistency with the Lambertian model used for simulating the inputs. More importantly, contrarily to NeuS where the illumination is fixed, each input datum $v_{k,l} := r_k \, \mathbf{n}_k^\top \mathbf{l}_{k,l}$ is simulated under a different, arbitrary illumination $\mathbf{l}_{k,l}$. For the NVR to produce simulations $\tilde{v}_{k,l}$ matching this input set of intensities, it is necessary to explicitly write the dependency of the apparent color $c_l$ upon the scene's geometry $f$, reflectance $\rho$ and illumination $\mathbf{l}_{k,l}$. Our volume renderer is then still given by Eq. (3), but the color of each 3D point must be replaced by:
$$c_l(\mathbf{x}_k(t)) = \rho(\mathbf{x}_k(t)) \, \nabla f(\mathbf{x}_k(t))^\top \mathbf{l}_{k,l}, \tag{4}$$
where the illumination vectors $\mathbf{l}_{k,l}$ are the same as those in Eq. (2).
Let us remark that the scalar product above corresponds, up to a normalization by $\|\nabla f(\mathbf{x}_k(t))\|$, to the shading. Yet, we do not need to apply this normalization, because the regularization term $\mathcal{L}_{\text{reg}}(f)$ in (1) will take care of ensuring the unit length of $\nabla f$. Indeed, as in the original NeuS framework, the SDF is regularized using an eikonal term:
$$\mathcal{L}_{\text{reg}}(f) = \frac{\sum_{k=1}^{m} \int_{t_n}^{t_f} \left( \|\nabla f(\mathbf{x}_k(t))\|^2 - 1 \right)^2 \mathrm{d}t}{m \left( t_f - t_n \right)}. \tag{5}$$
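For concreteness, a minimal NumPy version of Eq. (5), evaluated on a batch of sampled gradients (a Monte Carlo stand-in for the normalized ray integrals above):

```python
import numpy as np

def eikonal_loss(grad_f):
    """Eq. (5): mean squared deviation of |grad f|^2 from 1 over samples."""
    sq_norms = np.sum(grad_f**2, axis=-1)
    return np.mean((sq_norms - 1.0)**2)

# gradients of a true SDF (a sphere): grad f(x) = x / |x|, unit length everywhere
pts = np.random.default_rng(0).normal(size=(1000, 3))
grads = pts / np.linalg.norm(pts, axis=1, keepdims=True)
loss_true = eikonal_loss(grads)        # ~0: a valid SDF satisfies the eikonal eq.
loss_off = eikonal_loss(2.0 * grads)   # |grad f| = 2 gives (4 - 1)^2 = 9
```

The term vanishes exactly when $\|\nabla f\| \equiv 1$, which is why the normalization of the scalar product in Eq. (4) can be omitted.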
Similarly to the original NeuS, an additional regularization based on object masks can also be utilized for supervision, if such masks are provided.
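As an illustration, the discretized eikonal penalty of Eq. (5) reduces to a mean over ray samples of the squared deviation of the gradient norm from one; a minimal NumPy sketch, where the array layout is an assumption of the sketch:

```python
import numpy as np

def eikonal_loss(grads):
    """Discretized eikonal term of Eq. (5).

    grads: (m, n_samples, 3) array holding the SDF gradients
           grad f(x_k(t)) at n_samples points along each of m rays.
    Returns the mean of (||grad f||^2 - 1)^2, i.e. a Monte-Carlo
    approximation of the normalized integral.
    """
    sq_norm = np.sum(grads ** 2, axis=-1)   # squared gradient norm per sample
    return np.mean((sq_norm - 1.0) ** 2)
```

In practice the gradients come from automatic differentiation of the SDF network; the penalty is zero exactly when every sampled gradient has unit length.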
Plugging (4) into (3) yields the definition of our volume renderer accounting for the varying, arbitrary illumination vectors $\mathbf{l}_{k,l}$. Next, plugging (2), (3) and (5) into (1), we obtain our objective function, which ensures the consistency between the simulations obtained from the input and those obtained by volume rendering. It should be emphasized that, besides the eikonal regularization (which is standard and only serves to ensure the unit-length constraint of the normal), our strategy leads to a single-objective optimization formulation for NVR-based 3D surface reconstruction from reflectance and normal data.
The discretization of the variational problem (1) is then achieved exactly as in the original NeuS work [23]. It is based on representing $f$ and $\rho$ by MLPs and hierarchically sampling points along the rays.
4 Application to MVPS
We present a standalone MVPS pipeline built on top of the proposed reflectance and normal-based 3D reconstruction method. Our MVPS pipeline includes the following steps:
1. Compute the reflectance and normal maps for each viewpoint through PS;
2. Select a batch of the most reliable inputs $\{r_k\}$ and $\{\mathbf{n}_k\}$;
3. Scale the reflectance values $\{r_k\}$ across the entire image collection;
4. Simulate the radiance values following Eq. (2), using a pixel-wise optimal lighting triplet $\mathsf{L}_k$;
5. Optimize the loss in Eq. (1) over the SDF $f$ and albedo $\rho$;
6. Reconstruct the surface from the SDF.
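The six steps above can be sketched as pseudocode; every helper name below is hypothetical and merely stands in for the method used at the corresponding step:

```python
# Pseudocode sketch of the MVPS pipeline (Steps 1-6). All helper
# functions are hypothetical placeholders, not actual APIs.
def run_mvps_pipeline(multi_light_images, cameras):
    # 1. Per-view PS: reflectance map r_k and normal map n_k.
    r, n, uncertainty = photometric_stereo(multi_light_images)
    # 2. Keep only reliable pixels (angular deviation <= 15 degrees).
    mask = uncertainty <= 15.0
    # 3. Resolve the per-view scale ambiguity of the reflectance maps.
    r = scale_reflectance_maps(r, cameras)
    # 4. Simulate radiances under a pixel-wise optimal lighting triplet.
    radiances = simulate_radiance(r, n)
    # 5. Jointly optimize the SDF f and albedo rho (NeuS-style NVR).
    f, rho = optimize_nvr(radiances, mask, cameras)
    # 6. Extract the zero level set with marching cubes.
    return marching_cubes(f)
```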
Step 1: PS-based reflectance and normal estimation
Any PS method is suitable for obtaining the inputs for each viewpoint. However, not all PS methods actually provide reflectance clues, and not all of them can simultaneously handle non-Lambertian surfaces and unknown, complex illumination. CNN-PS [7], for instance, provides only normals, and only for calibrated illumination. For these reasons, we base our MVPS pipeline on the recent transformer-based method SDM-UniPS [8], which exhibits remarkable performance in recovering intricate surface normal maps, even when images are captured under unknown, spatially-varying lighting conditions in uncontrolled environments. As advised by the author of [8], when the number of images is too large for the method to be applied, one can simply take the median of the results over sufficiently many $N_{\text{trials}}$ random trials, each trial involving the random selection of a small number of images.
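The median-over-trials aggregation described above can be sketched as follows; the array layout is an assumption of the sketch, and the component-wise median is re-normalized to keep unit-length normals:

```python
import numpy as np

def aggregate_trials(normal_trials):
    """Aggregate per-trial PS normal maps by a pixel-wise median.

    normal_trials: (n_trials, H, W, 3) unit normals estimated from
    random subsets of the input images (assumed to share a common
    camera frame).  Returns a (H, W, 3) map of unit normals.
    """
    n = np.median(normal_trials, axis=0)   # component-wise median
    n /= np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-12)
    return n                               # re-normalized to unit length
```

The same median reduction applies to the per-trial reflectance maps, which are scalar-valued and need no re-normalization.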
Step 2: Uncertainty evaluation
To prevent poorly estimated normals from corrupting the 3D reconstruction, we discard the least reliable ones. To this end, we use as uncertainty measure the average absolute angular deviation of the normals computed over the $N_{\text{trials}}$ random trials of Step 1. Pixels associated with an uncertainty measure higher than a threshold ($\tau = 15^\circ$ in our experiments) are excluded from the optimization. Advanced uncertainty metrics, as proposed by Kaya et al. [10], could further refine this process.
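A minimal sketch of this uncertainty filtering, assuming the deviation is measured with respect to the per-pixel median normal of Step 1 (an assumption of the sketch):

```python
import numpy as np

def uncertainty_mask(normal_trials, median_normal, tau_deg=15.0):
    """Mean absolute angular deviation across trials, thresholded.

    normal_trials: (n_trials, H, W, 3) unit normals per trial.
    median_normal: (H, W, 3) aggregated unit normals.
    Returns a boolean (H, W) mask of pixels kept for optimization.
    """
    cos = np.clip(np.sum(normal_trials * median_normal, axis=-1), -1.0, 1.0)
    deviation = np.degrees(np.arccos(cos)).mean(axis=0)  # per-pixel average
    return deviation <= tau_deg
```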
Step 3: Reflectance maps scaling
The individual reflectance maps computed by PS need to be appropriately scaled. This is because, in an uncalibrated setting, the reflectance estimate is relative to both the camera's response and the incident lighting intensity. Consequently, each reflectance map is estimated only up to a scale factor. To estimate this scale factor, the complete pipeline is first run without using the reflectance maps. This provides pairs of homologous points that are subsequently used to scale the reflectance maps. Concretely, given a pair of neighboring viewpoints, the ratios of corresponding reflectance values between the two viewpoints are stored, and their median is used to adjust each reflectance map's scale factor. This operation is repeated across the entire viewpoint collection. Note that, if the camera's response and the illumination were known, i.e., if a calibrated PS method were used in Step 1, then the reflectance would be determined without scale ambiguity and this step could be skipped.
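The median-ratio scaling between two neighboring viewpoints can be sketched as follows, assuming the homologous reflectance samples have already been paired:

```python
import numpy as np

def pairwise_scale(refl_a, refl_b):
    """Scale factor aligning reflectance map b onto map a.

    refl_a, refl_b: 1-D arrays of reflectance values at homologous
    points seen from two neighboring viewpoints.  The median of the
    per-point ratios is robust to outlier correspondences.
    """
    valid = refl_b > 1e-6                 # guard against divisions by zero
    return np.median(refl_a[valid] / refl_b[valid])
```

Chaining these pairwise factors across the viewpoint collection brings all reflectance maps to a common scale.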
Step 4: Radiance simulation
To simulate the radiance values, we choose as lighting triplet the one which is optimal relative to the normal $\mathbf{n}_k$ [4]. The actual formula is provided in the supplementary material.
Step 5: Optimization
The actual optimization of the loss function is carried out using a straightforward adaptation of the NeuS architecture [23], in which the viewing direction is removed from the network's input in order to turn radiance into albedo. In all our experiments, we let the optimization run for a total of 300k iterations, with a batch size of 512 pixels. To ensure that the networks better exploit our MVPS data, each iteration is trained not only on a random view, but on all rendered images of this view under varying illumination. The backward pass is then applied only after the loss has been computed over all pixels for all illumination conditions. In terms of computation time, our approach is comparable with the original NeuS framework, requiring in our tests from 8 to 16 hours on a standard GPU for the 3D reconstruction of each dataset of DiLiGenT-MV [13].
Step 6: Surface reconstruction
Once the SDF is estimated, we extract its zero level set using the marching cubes algorithm [16].
5 Experimental results
5.1 Experimental setup
Evaluation datasets
We used the DiLiGenT-MV benchmark dataset [13] to perform all our experiments, statistical evaluations, and ablations. It includes five real-world objects with complex reflectance properties and surface profiles, making it an ideal choice for evaluating the proposed method. Each object is imaged from 20 calibrated viewpoints using the classical turntable MVPS acquisition setup [6]. For each view, 96 images are acquired under different illuminations. Given the large volume of images, which is impractical for transformer-based methods, our implementation of Step 1 (PS) employs SDM-UniPS [8] with only 10 input images. To this end, we computed each $r_k$ and $\mathbf{n}_k$ as the median of the computed reflectances and normals over $N_{\text{trials}} = 100$ random trials, each involving the random selection of 10 images from the 96 available in the DiLiGenT-MV dataset.
Evaluation scores
We performed our quantitative evaluations using the F-score and Chamfer distance (CD) to measure the accuracy of the reconstructed vertices. We also measured the mean angular error (MAE) of the imaged meshes, to evaluate the accuracy of the reconstructed normals with respect to the ground truth normals provided in DiLiGenT-MV. We report both the results averaged over all mesh vertices, and those on vertices clustered in two particularly interesting classes, namely high curvature and low visibility areas, as illustrated in Fig. 3.
Figure 3: High curvature (left) and low visibility (right) areas, on the Buddha and Reading datasets.
To identify the high curvature areas, we used the VCGLib library [1] and the 3D mesh processing software Meshlab [3], taking the absolute value of the curvature to merge the convex and concave zones and retaining the vertices whose curvature is higher than 1.6. To segment the low visibility areas, we summed the boolean visibility of each vertex over all views. Low visibility then corresponds to vertices visible in fewer than 5 of the 20 viewpoints of DiLiGenT-MV.
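The low-visibility segmentation reduces to thresholding a per-vertex visibility count; a minimal sketch, assuming the per-view visibility of each vertex is available as a boolean matrix:

```python
import numpy as np

def low_visibility_vertices(visibility, max_views=5):
    """Select vertices seen in fewer than `max_views` viewpoints.

    visibility: boolean (n_vertices, n_views) matrix where entry
    (i, j) is True iff vertex i is visible in view j
    (n_views = 20 for DiLiGenT-MV).
    """
    seen_in = visibility.sum(axis=1)   # number of views per vertex
    return seen_in < max_views         # boolean selection mask
```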
(Figure panels: reconstructed meshes and error maps from PS-NeRF, Kaya23, MVPSNet, Ours, and GT, for Buddha, Cow, Pot2 and Reading.)
Figure 4: Reconstructed 3D mesh and corresponding angular error of four objects from the DiLiGenT-MV benchmark.
5.2 Baseline comparisons
We first provide in Fig. 4 a qualitative comparison of our results on four objects, and compare them with the three most recent methods from the literature, namely PS-NeRF [27], Kaya23 [12] and MVPSNet [28]. In comparison with these state-of-the-art deep learning-based methods, the recovered geometry is overall more satisfactory.
This is confirmed quantitatively when evaluating Chamfer distances and MAE, provided in Tables 1 and 2. Therein, besides the aforementioned methods, we also report the results of the Kaya22 method [10] and those of the non-deep-learning methods Park16 [21] and Li19 [13] (which is not fully automatic). From the tables, it can be seen that our method outperforms the other fully automated standalone ones, and is competitive with the semi-automated one. On average, our method reports a Chamfer distance 17.4% better than the second best score, obtained by MVPSNet [28]. Regarding MAE, our score is similar to that of Kaya23 [12], with a small average difference of 0.2 degrees. The superiority of our approach can also be observed in the F-scores reported in Fig. 5.
Table 1: Chamfer distance (lower is better) averaged over all vertices. Best results. Second best. Since † requires manual effort, it is not ranked.
Table 2: Normal MAE (lower is better) averaged over all views. For reference, the mono-view PS results from SDM-UniPS[8] (*) are also provided, although it does not provide a full 3D reconstruction and thus its Chamfer distance cannot be evaluated.
Figure 5: F-score (higher is better) as a function of the distance error threshold, in comparison with other state-of-the-art methods (a), and disabling individual components of our method (b).
5.3 High curvature and low visibility areas
To highlight the level of detail in the 3D reconstructions, Figs. 1 and 10 provide further qualitative comparisons focusing on one small part of each object. Ours is the only method achieving a high-fidelity reconstruction of the ear, the knot and the navel of Buddha, and of the spout of Pot2. To quantify this gain, we also report in Table 3 the average CD and MAE over all datasets, taking into account only the high curvature and low visibility areas. It is worth noticing that the CD error of PS-NeRF and MVPSNet on high curvature areas increases by 36% and 96%, respectively, in comparison with that averaged over the entire set of vertices. Ours, on the contrary, increases by only 4%. Similarly, on low visibility areas their errors increase by 78% and 81%, and that of Kaya23 by 46%, while ours increases by only 13%.
Table 3: Chamfer distance and normal MAE (lower is better) on high curvature and low visibility areas.
Figure 6: Qualitative comparison between our results and state-of-the-art ones, on parts of the meshes representing fine details.
5.4 Ablation study
Lastly, we conducted an ablation study to quantify the impact of several parts of our pipeline. More precisely, we quantify in Fig. 5b and Table 4 the impact of providing PS-estimated reflectance maps, in comparison with providing only normals ("W/o reflectance"). We also evaluate that of the pixel-wise optimal lighting triplet, in comparison with using the same arbitrary one for all pixels in one view ("W/o optimal lighting"). Lastly, we evaluate the impact of discarding the least reliable inputs, in comparison with using all of them ("W/o uncertainty"). The feature that most influences the accuracy of the 3D reconstruction is the use of reflectance. The other two features also positively impact the reconstruction, but to a lesser extent.
Table 4: Chamfer distance (lower is better) averaged over all vertices, while disabling individual features of the pipeline (reflectance estimation, optimal lighting, and uncertainty evaluation).
5.5 Limitations
Our approach heavily relies on the quality of the PS normal maps. In our experiments, we used SDM-UniPS [8], which generally yields high-quality results. Yet, it occasionally yields corrupted normals, leading to inconsistencies across viewpoints that may result in reconstruction errors (cf. supplementary material). This could be handled in the future by replacing the PS method with a more robust one. A second limitation, shared with PS-NeRF, is the computation time, which falls within the range of 8 to 16 hours for one object of DiLiGenT-MV. Fortunately, NeuS2 [24], a significantly faster version of NeuS, should allow us to reduce the computation time to around ten minutes.
6 Conclusion
We have introduced a neural volumetric rendering method for 3D surface reconstruction based on reflectance and normal maps, and applied it to multi-view photometric stereo. The proposed method relies on a joint re-parameterization of reflectance and normal as a vector of radiances rendered under simulated, varying illumination. It involves a single objective optimization, and it is highly flexible since any existing or future PS method can be used for constructing the input reflectance and normal maps. Coupled with a state-of-the-art uncalibrated PS method, our method reaches unprecedented results on the public dataset DiLiGenT-MV in terms of F-score, Chamfer distance and mean angular error metrics. Notably, it provides exceptionally high quality results in areas with high curvature or low visibility. Its main limitation for now is its computational cost, which we plan to reduce by adapting recent developments within the NeuS2 framework[24]. Using reflectance uncertainty in addition to that of normal maps offers room for improvement.
Acknowledgements.
This work was supported by the Danish project PHYLORAMA, the ALICIA-Vision project, the IMG project (ANR-20-CE38-0007), the OR-X and associated funding by the University of Zurich and University Hospital Balgrist.
Appendix
This supplementary material provides technicalities and detailed analysis of the experiments. We provide the reader with explicit formulations of the evaluation metrics in Section A. We then share additional implementation details in Section B. In Section C, we present additional quantitative and qualitative results. In Section D, we illustrate some limitations of our method.
Appendix A Evaluation
Metrics.
All quantitative evaluations were carried out using the Chamfer distance, F-score and mean angular error (MAE) between the reconstructed mesh $\mathcal{P}$ and the ground truth one $\mathcal{G}$. For a reconstructed point $\hat{\mathbf{x}} \in \mathcal{P}$, its distance to the ground truth is defined as follows:
$$d_{\hat{\mathbf{x}} \rightarrow \mathcal{G}} = \min_{\mathbf{x} \in \mathcal{G}} \|\hat{\mathbf{x}} - \mathbf{x}\|, \qquad (6)$$
and vice versa for a ground truth point 𝐱∈𝒢 𝐱 𝒢\mathbf{x}\in\mathcal{G}bold_x ∈ caligraphic_G and its distance to the reconstructed mesh.
The distance measures are accumulated over the entire meshes to define the Chamfer distance
$$CD = \frac{1}{2}\left(\frac{1}{|\mathcal{P}|}\sum_{\hat{\mathbf{x}} \in \mathcal{P}} d_{\hat{\mathbf{x}} \rightarrow \mathcal{G}} + \frac{1}{|\mathcal{G}|}\sum_{\mathbf{x} \in \mathcal{G}} d_{\mathbf{x} \rightarrow \mathcal{P}}\right) \qquad (7)$$
and the F-score
$$F(\epsilon) = \frac{2\,P(\epsilon)\,R(\epsilon)}{P(\epsilon)+R(\epsilon)}, \qquad (8)$$
where
$$P(\epsilon) = \frac{1}{|\mathcal{P}|}\sum_{\hat{\mathbf{x}} \in \mathcal{P}} \left[d_{\hat{\mathbf{x}} \rightarrow \mathcal{G}} < \epsilon\right]$$
and
$$R(\epsilon) = \frac{1}{|\mathcal{G}|}\sum_{\mathbf{x} \in \mathcal{G}} \left[d_{\mathbf{x} \rightarrow \mathcal{P}} < \epsilon\right]$$
are precision and recall measures, respectively, $[\,\cdot\,]$ is the Iverson bracket, and $\epsilon$ is the distance threshold.
The mesh segmentations into low visibility and high curvature areas are performed on the ground truth meshes. Because the geometry of the reconstruction differs from that of the ground truth, the segmentation procedure yields different areas when applied to the reconstruction. For this reason, the reported results for low visibility and high curvature areas only consider the Chamfer distance term indicating the average distances between the ground truth vertices and their nearest neighbors in the reconstructed mesh.
For the MAE computation, the reconstructed and ground truth meshes are projected onto image planes and the normals are computed at each pixel. The MAE over all the pixels M 𝑀 M italic_M is written as
$$MAE = \frac{1}{|M|}\sum_{k \in M} \cos^{-1}\left(\hat{\mathbf{n}}_k^\top \mathbf{n}_k\right). \qquad (11)$$
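The Chamfer distance of Eq. (7) and the F-score of Eq. (8) can be sketched in a few lines of NumPy; this brute-force nearest-neighbor version is for illustration only, whereas the actual evaluation operates on upsampled meshes:

```python
import numpy as np

def chamfer_and_fscore(P, G, eps):
    """Chamfer distance (Eq. 7) and F-score (Eq. 8) between a
    reconstructed point set P and a ground truth set G, given as
    (n, 3) and (m, 3) arrays.  Suited to small point sets only.
    """
    d = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=-1)
    d_p2g = d.min(axis=1)    # distance of each point of P to G
    d_g2p = d.min(axis=0)    # distance of each point of G to P
    cd = 0.5 * (d_p2g.mean() + d_g2p.mean())
    precision = np.mean(d_p2g < eps)
    recall = np.mean(d_g2p < eps)
    f = 2.0 * precision * recall / max(precision + recall, 1e-12)
    return cd, f
```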
DiLiGenT-MV dataset.
All the state-of-the-art methods were evaluated from the meshes kindly provided by their authors. For all evaluated meshes, we eliminated all internal vertices. A mesh upsampling of both the estimated and ground truth meshes was then performed in order to achieve a point density of 0.1 mm. The computations of Chamfer distance and F-score were restricted to distances under 5 mm in order to mitigate the impact of outliers (inspired by the DTU evaluation [9]). We observed a few defects in the ground truth meshes of the DiLiGenT-MV dataset in concave areas. Notably, such imperfections are clearly visible at the back of Bear's head (Fig. 7) and in the spout's inner area of Pot2 (Fig. 8). Although these areas represent a small number of vertices, they were discarded in all evaluations so as to avoid penalizing methods which faithfully reconstruct them.
Figure 7: Rear view of the 3D heatmaps representing errors for the Bear dataset in terms of Chamfer distance. (a) The ground truth from DiLiGenT-MV lacks any vertices in the rectangular aperture. For that reason, any method which faithfully reconstructs this area is penalized (area shown in red). This area is thus discarded in all evaluations, providing heatmaps such as (b).
Figure 8: Cross-section of Pot2's spout delivered by (a) the ground truth of the DiLiGenT-MV dataset and (b) our reconstruction method. Our method shows a deeper reconstruction of the internal wall of the spout. This area is thus discarded in all evaluations to avoid penalizing methods that faithfully reconstruct it.
Manual efforts in[13].
Li19 [13] is mentioned as requiring manual efforts. Indeed, the authors manually establish point correspondences in textureless areas. See [13] for details.
Appendix B Implementation details
We recall that to simulate the radiance values in Step 4 described in Section 4 of the main paper, we choose as lighting triplet the one which is optimal relative to the normal $\mathbf{n}_k$. Following [4], this optimal triplet is equally spaced in tilt, 120 degrees apart, with a slant angle of 54.74 degrees. Concretely, the expression of $\mathsf{L}_k$ as a function of $\mathbf{n}_k$ is written:
$$\mathsf{L}_k = \mathsf{R}_k \mathsf{L}_{\text{canonic}} \qquad (12)$$
where $\mathsf{R}_k = \mathsf{U}$ with $[\mathsf{U}, \Sigma, \mathsf{U}] = \operatorname{SVD}(\mathbf{n}_k \mathbf{n}_k^\top)$ and
$$\mathsf{L}_{\text{canonic}} = \begin{bmatrix} \sin(\phi) & \sin(\phi)\cos(\theta) & \sin(\phi)\cos(2\theta) \\ 0 & \sin(\phi)\sin(\theta) & \sin(\phi)\sin(2\theta) \\ \cos(\phi) & \cos(\phi) & \cos(\phi) \end{bmatrix}$$
with $\theta = \frac{120\pi}{180}$ and $\phi = \frac{54.74\pi}{180}$.
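A sketch of this construction follows. The paper derives the rotation $\mathsf{R}_k$ from the SVD of $\mathbf{n}_k\mathbf{n}_k^\top$; since the SVD is only defined up to column ordering and sign, the sketch instead builds an explicit rotation taking the $z$-axis (around which the canonical triplet is centered) onto $\mathbf{n}_k$, which is the intended geometric effect:

```python
import numpy as np

def optimal_lighting_triplet(n_k):
    """Pixel-wise optimal lighting triplet of Eq. (12): the canonical
    triplet (tilts 120 deg apart, slant 54.74 deg about the z-axis)
    rotated so that its mean direction aligns with the normal n_k.
    Columns of the returned (3, 3) matrix are the three lights."""
    theta, phi = np.radians(120.0), np.radians(54.74)
    L_canonic = np.array([
        [np.sin(phi), np.sin(phi)*np.cos(theta), np.sin(phi)*np.cos(2*theta)],
        [0.0,         np.sin(phi)*np.sin(theta), np.sin(phi)*np.sin(2*theta)],
        [np.cos(phi), np.cos(phi),               np.cos(phi)],
    ])
    n = n_k / np.linalg.norm(n_k)
    # Complete n into an orthonormal basis: any rotation with R @ e_z = n.
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, a); u /= np.linalg.norm(u)
    v = np.cross(n, u)
    R = np.column_stack([u, v, n])
    return R @ L_canonic
```

By construction, every light of the triplet makes the same 54.74-degree angle with the normal.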
Appendix C Additional Results
In this section, we extend the experiments of the main paper by providing further statistical analysis and qualitative comparisons.
Comparison with mono-illumination NeuS.
We propose an additional comparison of our method against the multi-view mono-illumination 3D reconstruction method NeuS [23]. While NeuS is not, in theory, directly applicable to multi-view multi-light acquisition settings, it may become feasible under certain conditions. This feasibility hinges on factors such as the number, spatial distribution and types of lighting conditions, and the object's material properties. One can leverage a heuristic initially proposed in [13] and later employed for obtaining pixel depths using MVS in [10, 12]. This heuristic approximates input images captured under mono-illumination for each viewpoint by taking the median of pixel intensities obtained under varying illumination. See, e.g., [13] for detailed information.
Figure 9: Qualitative comparison of Buddha and Reading between mono-illumination NeuS and our method, for different number of input viewpoints.
**Chamfer distance (visibility 1–5) ↓**

| Methods | Bear | Buddha | Cow | Pot2 | Reading | Average |
|---|---|---|---|---|---|---|
| Park16 | 1.07 | 0.75 | 0.41 | 0.47 | 0.70 | 0.68 |
| Li19† | 0.63 | 1.03 | 0.37 | 0.54 | 0.81 | 0.67 |
| NeuS | 0.58 | 0.52 | 0.17 | 0.32 | 0.54 | 0.42 |
| Kaya22 | 0.48 | 0.51 | 0.32 | 0.50 | 0.70 | 0.50 |
| PS-NeRF | 0.48 | 0.62 | 0.30 | 0.66 | 0.64 | 0.54 |
| Kaya23 | 0.46 | 0.35 | 0.39 | 0.42 | 0.44 | 0.41 |
| MVPSNet | 0.43 | 0.68 | 0.27 | 0.49 | 0.57 | 0.49 |
| Ours | 0.23 | 0.27 | 0.19 | 0.19 | 0.43 | 0.26 |

**Chamfer distance (high curvature) ↓**

| Methods | Bear | Buddha | Cow | Pot2 | Reading | Average |
|---|---|---|---|---|---|---|
| Park16 | 1.64 | 0.58 | 0.98 | 0.56 | 0.65 | 0.88 |
| Li19† | 0.59 | 0.65 | 0.38 | 0.34 | 0.57 | 0.51 |
| NeuS | 0.28 | 0.46 | 0.21 | 0.39 | 0.38 | 0.35 |
| Kaya22 | 0.33 | 0.43 | 0.31 | 0.41 | 0.45 | 0.38 |
| PS-NeRF | 0.42 | 0.50 | 0.42 | 0.44 | 0.44 | 0.45 |
| Kaya23 | 0.33 | 0.29 | 0.19 | 0.30 | 0.33 | 0.29 |
| MVPSNet | 0.56 | 0.58 | 0.52 | 0.47 | 0.54 | 0.53 |
| Ours | 0.22 | 0.23 | 0.26 | 0.23 | 0.25 | 0.24 |
Table 5: Chamfer distance on (a) low visibility and (b) high curvature areas. Best results. Second best results.
A qualitative comparison between the results of mono-illumination NeuS using this heuristic and those of our method is provided in Fig. 9. As can be seen, our proposed approach provides a much finer level of detail. In particular, mono-illumination NeuS requires a high number of viewpoints, with a drastic decline in reconstruction quality when using 5 viewpoints. On the contrary, our method shows stable results, only losing some fine details over concave areas. Moreover, even with all viewpoints used, mono-illumination NeuS fails to reliably reconstruct the low visibility and high curvature areas. In addition to Fig. 9 (right), this can be observed in the quantitative evaluation provided in Table 5, where mono-illumination NeuS shows a reconstruction error 62% higher than ours on low visibility areas and 46% higher than ours on high curvature areas.
Photometric stereo method.
Our method can be employed with any PS method. To illustrate this flexibility, we evaluate the reconstruction accuracy on the Buddha dataset while taking as input the normal maps from CNN-PS [7], used in Kaya22-23 [10, 12], and SDPS-Net, used in PS-NeRF [27], in addition to the results obtained with the normal maps from SDM-UniPS [8] reported in the main paper. The results are reported in Table 6. As expected, we observe that the choice of a particular PS technique influences the final outcome, yet our framework consistently improves the results in comparison with previous works, including those based on multi-objective optimizations [10, 12].
Table 6: Results of our method with different input normals, namely CNN-PS (used in Kaya22-23), SDPS-Net (used in PS-NeRF) and SDM-UniPS. High curvature corresponds to the results averaged over all the vertices whose absolute curvature is higher than 3.3. Our method performs best irrespective of the PS method used.
Ablation.
We complete our ablation study with qualitative results on the ear and the knot of Buddha shown in Fig.10.
Figure 10: Qualitative comparison on the knot and the ear of Buddha between our results and those without the use of reflectance and optimal lighting, disabled individually. Our method exhibits better results in both cases.
Additional benchmarking.
We provide in Fig. 13 a qualitative comparison of the angular error maps on the five objects of DiLiGenT-MV, for our method and state-of-the-art ones, namely Park16 [20], Li19 [13], Kaya22 [11], PS-NeRF [27], Kaya23 [12] and MVPSNet [28], as well as SDM-UniPS [8], although the latter does not provide a full 3D reconstruction. The recovered geometry is overall more accurate with our method. Interestingly, our recovered normals surpass the PS ones, especially in concave areas where inter-reflections bias the single-viewpoint reconstruction. Lastly, we provide further quantitative comparisons, namely precision and recall in Fig. 11, and MAE on low visibility and high curvature areas in Table 7. Our proposed approach consistently yields the most accurate reconstructions.
Figure 11: (a) Precision (higher is better) and (b) recall (higher is better) as functions of the distance error threshold, in comparison with other state-of-the-art methods.
Appendix D Limitations
The reconstructions obtained through the proposed method still exhibit a few poorly reconstructed areas, as illustrated in Fig. 12, particularly Reading's neck and Bear's right ear. The suboptimal reconstruction of Reading's neck can be attributed, in part, to inaccuracies in the normal estimates from SDM-UniPS. However, the underlying causes of these discrepancies have yet to be systematically identified.
Figure 12: Regions in Bear (a) and Reading (b) where our method exhibits limitations.
(Per-object error-map panels; columns: Park16, Li19†, Kaya22, PS-NeRF, Kaya23, MVPSNet, SDM, Ours.)
Figure 13: Normal angular error comparison over all DiLiGenT-MV dataset between state-of-the-art methods and ours.
**Normal MAE (visibility 1–5) ↓**

| Methods | Bear | Buddha | Cow | Pot2 | Reading | Average |
|---|---|---|---|---|---|---|
| Park16 | 38.5 | 29.3 | 34.6 | 25.2 | 20.6 | 29.6 |
| Li19† | 41.1 | 33.7 | 29.4 | 39.0 | 23.3 | 33.3 |
| Kaya22 | 32.0 | 27.6 | 40.5 | 40.0 | 18.4 | 31.7 |
| PS-NeRF | 19.4 | 19.6 | 27.4 | 32.2 | 21.1 | 24.0 |
| Kaya23 | 19.4 | 17.6 | 24.0 | 28.1 | 14.6 | 20.7 |
| MVPSNet | 31.1 | 29.6 | 30.4 | 35.3 | 18.1 | 28.9 |
| SDM | 12.9 | 14.4 | 28.5 | 25.7 | 16.9 | 19.7 |
| Ours | 13.0 | 14.1 | 26.8 | 21.5 | 13.5 | 17.8 |

**Normal MAE (high curvature) ↓**

| Methods | Bear | Buddha | Cow | Pot2 | Reading | Average |
|---|---|---|---|---|---|---|
| Park16 | 31.7 | 26.2 | 39.5 | 23.3 | 24.1 | 29.0 |
| Li19† | 26.5 | 26.4 | 30.6 | 23.4 | 24.1 | 26.2 |
| Kaya22 | 20.2 | 29.1 | 35.9 | 32.8 | 21.8 | 28.0 |
| PS-NeRF | 21.2 | 28.0 | 27.9 | 23.9 | 28.3 | 25.8 |
| Kaya23 | 24.1 | 24.2 | 21.6 | 28.5 | 19.3 | 23.6 |
| MVPSNet | 18.7 | 26.3 | 27.7 | 23.3 | 23.5 | 23.9 |
| SDM | 21.6 | 21.0 | 23.4 | 28.2 | 24.7 | 23.8 |
| Ours | 18.4 | 24.2 | 28.0 | 24.9 | 19.9 | 23.1 |
Table 7: Normal MAE on (a) low visibility and (b) high curvature areas. Best results. Second best results.
References
- [1] VCGLib. https://github.com/cnr-isti-vclab/vcglib.
- Asthana et al. [2022] Meghna Asthana, William Smith, and Patrik Huber. Neural apparent BRDF fields for multiview photometric stereo. In Proceedings of the 19th ACM SIGGRAPH European Conference on Visual Media Production, pages 1–10, 2022.
- Cignoni et al. [2008] Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepiane, Fabio Ganovelli, Guido Ranzuglia, et al. Meshlab: an open-source mesh processing tool. In Proceedings of the Eurographics Italian Chapter Conference, pages 129–136, 2008.
- Drbohlav and Chantler [2005] Ondrej Drbohlav and Mike Chantler. On optimal light configurations in photometric stereo. In Proceedings of the 10th IEEE International Conference on Computer Vision, pages 1707–1712, 2005.
- Furukawa et al. [2015] Yasutaka Furukawa, Carlos Hernández, et al. Multi-view stereo: A tutorial. Foundations and Trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015.
- Hernández et al. [2008] Carlos Hernández, George Vogiatzis, and Roberto Cipolla. Multiview Photometric Stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):548–554, 2008.
- Ikehata [2018] Satoshi Ikehata. CNN-PS: CNN-based photometric stereo for general non-convex surfaces. In Proceedings of the European Conference on Computer Vision, pages 3–18, 2018.
- Ikehata [2023] Satoshi Ikehata. Scalable, Detailed and Mask-Free Universal Photometric Stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13198–13207, 2023.
- Jensen et al. [2014] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014.
- Kaya et al. [2022a] Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, and Luc Van Gool. Uncertainty-aware deep multi-view photometric stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12601–12611, 2022a.
- Kaya et al. [2022b] Berk Kaya, Suryansh Kumar, Francesco Sarno, Vittorio Ferrari, and Luc Van Gool. Neural radiance fields approach to deep multi-view photometric stereo. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1965–1977, 2022b.
- Kaya et al. [2023] Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, and Luc Van Gool. Multi-View Photometric Stereo Revisited. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3126–3135, 2023.
- Li et al. [2020] Min Li, Zhenglong Zhou, Zhe Wu, Boxin Shi, Changyu Diao, and Ping Tan. Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials. IEEE Transactions on Image Processing, 29:4159–4173, 2020.
- Li et al. [2023] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-Fidelity Neural Surface Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2023.
- Logothetis et al. [2019] Fotios Logothetis, Roberto Mecca, and Roberto Cipolla. A differential volumetric approach to multi-view photometric stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1052–1061, 2019.
- Lorensen and Cline [1998] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field, pages 347–353. 1998.
- Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM, 65(1):99–106, 2021.
- Nehab et al. [2005] Diego Nehab, Szymon Rusinkiewicz, James Davis, and Ravi Ramamoorthi. Efficiently combining positions and normals for precise 3D geometry. ACM Transactions on Graphics, 24(3):536–543, 2005.
- Oechsle et al. [2021] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5589–5599, 2021.
- Park et al. [2013] Jaesik Park, Sudipta N Sinha, Yasuyuki Matsushita, Yu-Wing Tai, and In So Kweon. Multiview photometric stereo using planar mesh parameterization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1161–1168, 2013.
- Park et al. [2016] Jaesik Park, Sudipta N Sinha, Yasuyuki Matsushita, Yu-Wing Tai, and In So Kweon. Robust multiview photometric stereo using planar mesh parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1591–1604, 2016.
- Wang et al. [2021a] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. Patchmatchnet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14194–14203, 2021a.
- Wang et al. [2021b] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In Proceedings of the Conference on Neural Information Processing Systems, 2021b.
- Wang et al. [2023] Yiming Wang, Qin Han, Marc Habermann, Kostas Daniilidis, Christian Theobalt, and Lingjie Liu. NeuS2: Fast learning of neural implicit surfaces for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3295–3306, 2023.
- Woodham [1980] Robert J Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139–144, 1980.
- Xu et al. [2022] Qingshan Xu, Weihang Kong, Wenbing Tao, and Marc Pollefeys. Multi-scale geometric consistency guided and planar prior assisted multi-view stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4945–4963, 2022.
- Yang et al. [2022] Wenqi Yang, Guanying Chen, Chaofeng Chen, Zhenfang Chen, and Kwan-Yee K Wong. PS-NeRF: Neural Inverse Rendering for Multi-view Photometric Stereo. In Proceedings of the European Conference on Computer Vision, pages 266–284, 2022.
- Zhao et al. [2023] Dongxu Zhao, Daniel Lichy, Pierre-Nicolas Perrin, Jan-Michael Frahm, and Soumyadip Sengupta. MVPSNet: Fast Generalizable Multi-view Photometric Stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12525–12536, 2023.