Computación y Sistemas

On-line ISSN 2007-9737, Print ISSN 1405-5546

Comp. y Sist. vol. 24 no. 1, Ciudad de México, Jan./Mar. 2020. Epub 27-Sep-2021

https://doi.org/10.13053/cys-24-1-3004 

Articles

Depth Map Denoising and Inpainting Using Object Shape Priors

Andrés Díaz1  * 

Eduardo Caicedo1 

Lina Paz2 

Pedro Piniés2 

1 Universidad del Valle, Colombia. andres.a.diaz@correounivalle.edu.co, eduardo.caicedo@correounivalle.edu.co

2 Intel Corporation, USA. paz.linapaz@gmail.com, peternac@gmail.com


Abstract

We present a system that improves the quality of noisy and incomplete depth maps captured with inexpensive range sensors. We use a model-based approach that measures the discrepancy between a model hypothesis and the observed depth data. We represent the model hypothesis as a 3D level-set embedding function and the observed data as a point cloud coming from a segmented region associated with the object of interest. The discrepancy between the model and the observed data defines an objective function that is minimized to obtain pose, scale and shape. The variation in shape of the object of interest is mapped with Gaussian Process Latent Variable Models (GPLVM) and the object pose is estimated using Lie algebra. The integration of a synthetic depth map, obtained from the optimal model, and the observed depth map is carried out with variational techniques. As a consequence, we work in the observed space (depth space) rather than in a high-dimensional volumetric space.

Keywords: Shape prior; 3D level-set embedding function; Levenberg-Marquardt; Lie algebra; depth integration; variational techniques; Gaussian process latent variable models; denoising and inpainting

1 Introduction

High-quality depth information is important for many applications, such as 3D scene reconstruction [35], visual odometry [14], action recognition [4,26], scene understanding [21,37], and 3D segmentation [9,11], among others. However, specular or light-absorbing materials, out-of-range regions and occlusions produce corrupted depth maps with missing information. These problems are more common in depth sensors manufactured with smaller sizes and lower power requirements at the expense of data quality. The most challenging situation arises when unknown regions become large and irregular, considerably affecting the results of the applications.

To overcome these drawbacks, we consider the outstanding performance of variational techniques applied to the fusion of color images [17] and propose to use them for merging two sources of depth data: one that is noisy and incomplete, coming from the depth camera, and another of high quality, coming from an optimal 3D model. This model (shape prior) is the one that best fits the depth data associated with it in the scene. In this way, we improve the quality of a noisy and incomplete depth map, especially in the region of the object of interest.

The main steps of our system are the following. First, we estimate the optimal model. Second, we place the model using the optimized pose, shape and scale, and read the depth buffer from the current camera pose. Third, we integrate the two depth maps using variational techniques. These steps are intended to serve as a pre-processing stage for the aforementioned applications, especially for 3D scene reconstruction.

We can outline three main contributions: 1. We propose an approach to enhance depth maps by integrating depth measurements and shape prior data using a novel formulation of an objective function with a variational approach; 2. The object pose optimization is solved using the Lie algebra se(3) instead of the group of rigid transformations in 3D space, SE(3); 3. We quantify the accuracy of the pose, shape and scale optimizations. Moreover, we compute the completeness and accuracy of the inpainted and denoised depth maps.

This work is structured as follows. In Section 2, similar works are presented. In Section 3, the main processes carried out by the system are described: model alignment, truncated signed distance function (TSDF) estimation, model compression, dimension reduction, estimation of the optimal pose and scale, search for the optimal latent variable, and integration of the optimal model with a novel variational method. In Section 4, experiments for optimizing pose, scale and shape with synthetic and real data are carried out. Moreover, experiments for quantifying the accuracy of the enhanced depth maps are described. Finally, the conclusions are presented in Section 5.

2 Related Work

Techniques for depth recovery include morphological filters [32], Laplace filters [23,33], Markov Random Fields [27], multilateral filters [13,22], non-linear diffusion and variational frameworks [7,10,28] and learning-based methods [12,19,34]. Our system presents a novel approach motivated by outstanding works in object shape priors, scene shape priors and variational techniques. Next, we describe some details about these systems and we compare them with ours.

The system [30] is the first one that back-projects depth images and evaluates the resulting point cloud in a 3D level-set embedding function that represents an object model implicitly, instead of projecting the model to the image plane and measuring the discrepancy between expected image cues and the observed ones. Moreover, no point correspondence is required, unlike Iterative Closest Point (ICP), since the alignment consists of evaluating the closeness of the points to the zero level of the embedding function. The authors in [18] remove point cloud artifacts such as noisy points, missing data and outliers using a learned shape prior of an object of interest. They use the discrete cosine transform (DCT) for compressing the SDF values and GPLVM for dimension reduction.

The system [1] learns a semantic prior comprised of a mean shape for a category (common aspects of the shape of a category) and a set of weighted anchor points for instances of the category (specific details). The augmented reconstruction consists of matching anchor points (HOG features), warping by anchor points (an extension of thin-plate spline transformations [31]) and refinement. The system [6] combines shape priors and live dense reconstruction using a monocular camera. It uses photo-consistency and a variational approach for building depth maps [25] that are fused to create a dense reconstruction, PTAM [16] for tracking the camera, a part-based object detector [8], and GPLVM to represent shape priors. The energy function, besides depending on the evaluation of the 3D points on the embedding function as in [30] and [18], depends on the matching between the 2D object segmentation, defined by the foreground and background, and the projected 3D SDF.

The system [29] estimates depth maps using a monocular camera in workspaces with large planar structures like floors, walls or ceilings. Good depth data is propagated to an interior pixel (inpainting) from the closest valid pixels along the main 8 star directions by using a non-local high-order regularization term, in a variational approach, that favours solutions with affine surfaces (prior). The energy is minimized directly with the primal-dual algorithm. The system [5] includes a term, besides the data term and the regularization term, that depends on three scene priors: planarity of homogeneous color regions (using superpixels), the repeating geometry primitives of the scene (data-driven 3D primitives learned from RGBD data), and the Manhattan structure of indoor rooms (layout estimation and classification of box-like structures).

Our system is not based on scene shape priors like [5,29] but on object shape priors like [1,6,18,30]. However, it uses neither anchor points as in [1] nor a depth- and intensity-based energy function as in [6]; instead, it uses GPLVM as in [6,18] and a depth-based energy function as in [18,30].

We use a variational technique, as is done in [5,29] for merging scene shape priors, but we exploit the idea of integrating two depth maps, one coming from the sensor and the other one coming from the object shape prior, in a similar way as is done for aerial color images in [17]. The latter system transforms a large point cloud into a common orthographic aerial view. This noisy color image, with undefined areas due to occlusions and non-stationary objects, is subjected to denoising and inpainting by integrating redundant observations of the same scene using variational techniques. It uses the primal-dual algorithm for minimizing the proposed energy. Finally, our approach implies that, in future work, the enhanced depth maps will be fused instead of directly including the shape prior as a volumetric structure in the general reconstruction of the scene, as is done in [6].

3 Methodology

The main pipeline of the system is shown in fig. 1. The set of reference models (a predefined database of cars) is aligned using ICP to obtain models with the same scale, translation and rotation. A volumetric grid with TSDF values is computed and a compression is done using the DCT, keeping only the low-frequency components of the reference models. A continuous, nonlinear, probabilistic and lower dimensional latent shape space that captures the prior knowledge on the 3D shapes that an object can take is found with Gaussian Process Latent Variable Models (GPLVM).

Fig. 1 Main pipeline of the proposed system. The final result is a denoised and inpainted depth map 

Scaled conjugate gradient (SCG) is employed for learning the mapping between the latent variables and the low-frequency components (coefficients). Next, the optimal pose, scale and shape are computed and the coefficients associated with the latent variable that best fits the depth data are estimated. The 3D level-set embedding function encoded in the coefficients is computed with the inverse discrete cosine transform (IDCT). The optimal model is used for creating a synthetic depth map by reading the depth buffer of the explicitly represented model seen from the estimated camera pose.

The synthetic depth map is merged with the depth data coming from the sensor using a novel variational formulation. In the next sections, the main modules of the system are explained.

3.1 Alignment and TSDF Estimation of the 3D Models

We downloaded 49 models of cars from Turbosquid, a vast online catalog of 3D models. We align each model w.r.t. a base model (see fig. 2(h), a model chosen arbitrarily) using just the vertices of the triangular mesh, read from an .obj file. Initially, the point cloud of the base car is translated from its center of gravity to the origin. Then, it is scaled to fit the volumetric grid, and the average distance of each point to the new origin is computed (see [3] for more details):

$$ s_{base} = \frac{\sum_{i=1}^{N_{base}} \sqrt{X_i^2 + Y_i^2 + Z_i^2}}{N_{base}}, \quad (1) $$

Fig. 2 9 of the 49 3D models of instances of the class car (object of interest), used for estimating the latent space 

where $N_{base}$ is the number of points of the base model. The same process (translation and average distance computation) is carried out for the point cloud of model $i$ to obtain $s_i$. The scale is computed as:

$$ s = \frac{s_{base}}{s_i}. \quad (2) $$

This scale is applied to model $i$ in order to get models with the same size. Then, the models are manually oriented in such a way that their frontal sides point in the same direction, standing over a parallel supporting plane. This initial alignment is refined using ICP, implemented in CloudCompare.
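For illustration, the following is a minimal sketch of the centering and scale normalization of eqs. (1)-(2), assuming each model is available as an (N, 3) NumPy array of mesh vertices; function and variable names are ours, not taken from the authors' implementation.

```python
import numpy as np

def mean_radius(points):
    """Average Euclidean distance of the centered points to the origin (eq. (1))."""
    centered = points - points.mean(axis=0)     # move the center of gravity to the origin
    return np.linalg.norm(centered, axis=1).mean(), centered

def normalize_to_base(base_points, model_points):
    """Scale model i so that its mean radius matches the base model (eq. (2))."""
    s_base, _ = mean_radius(base_points)
    s_i, centered_i = mean_radius(model_points)
    return (s_base / s_i) * centered_i          # centered and rescaled vertices
```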

Once the models have been aligned, model $i$ is loaded in OpenGL using the vertices and facets (just geometry data), creating a continuous surface. The virtual camera is moved in such a way that circular and vertical scans at different longitudes (changes of 10° in azimuth and elevation) are performed.

The optical axis of the virtual camera always points to the origin, which coincides with the center of gravity of the loaded model. Depth images, obtained by reading the depth buffer of OpenGL, are fused into a volumetric structure, storing a TSDF value in each voxel. We have $N_m = 49$ 3D models of cars with different shapes (see 9 of the 49 models in fig. 2). Each model is a 3D level-set embedding function defined by a volumetric grid of $N_g \times N_g \times N_g$ voxels, with $N_g = 128$. The matrix $M \in \mathbb{R}^{N_m \times N_g^3}$ stores these models as row vectors. However, searching for a model that best fits the data in this high-dimensional space is too complex, so we compress the data using the DCT. Then, we find a lower dimensional space and the mapping parameters for passing from this latent space to the space of DCT coefficients.
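As a rough illustration of this volumetric fusion step, the sketch below integrates one depth map into a TSDF grid by averaging truncated signed distances per voxel. The pinhole projection with intrinsics K, the world-to-camera transform T_cw, the grid extent, voxel size and truncation distance are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def integrate_depth(tsdf, weights, depth, K, T_cw, grid_min, voxel_size, trunc=0.05):
    """Average the truncated signed distance of one depth map into the voxel grid."""
    ng = tsdf.shape[0]
    ii, jj, kk = np.meshgrid(np.arange(ng), np.arange(ng), np.arange(ng), indexing='ij')
    pts_w = grid_min + voxel_size * np.stack([ii, jj, kk], axis=-1).reshape(-1, 3)
    pts_c = (T_cw[:3, :3] @ pts_w.T + T_cw[:3, 3:4]).T          # world -> camera frame
    z = pts_c[:, 2]
    in_front = z > 1e-6
    u = np.zeros(z.shape, dtype=int)
    v = np.zeros(z.shape, dtype=int)
    u[in_front] = np.round(K[0, 0] * pts_c[in_front, 0] / z[in_front] + K[0, 2]).astype(int)
    v[in_front] = np.round(K[1, 1] * pts_c[in_front, 1] / z[in_front] + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    sdf = d - z                                                 # signed distance along the ray
    valid &= (d > 0) & (sdf > -trunc)
    flat_t, flat_w = tsdf.reshape(-1), weights.reshape(-1)      # views into the 3D arrays
    new_t = np.clip(sdf[valid] / trunc, -1.0, 1.0)
    flat_t[valid] = (flat_t[valid] * flat_w[valid] + new_t) / (flat_w[valid] + 1.0)
    flat_w[valid] += 1.0
```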

3.2 Compression of the Embedding Function

The DCT transforms a time-domain signal into a frequency-domain signal. It is used for compression of audio (e.g., MP3) and images (e.g., JPEG), where small high-frequency components can be discarded. We use this technique here for compressing the TSDF values of the volumetric grid. Let $\phi(u,v,w) \in \mathbb{R}^{N_g \times N_g \times N_g}$ be the volumetric grid that stores the TSDF values of a 3D model. The two-dimensional DCT for slices $\phi_i$ in direction $u$ of $\phi$, that is, $c_i(v,w) = \mathrm{DCT}(\phi_i(v,w))$, is given by:

$$ c_i(v,w) = \alpha(v,w) \sum_{y=0}^{N_g-1} \cos\!\left(\frac{\pi(2y+1)v}{2N_g}\right) \sum_{z=0}^{N_g-1} \phi_i(y,z)\,\cos\!\left(\frac{\pi(2z+1)w}{2N_g}\right), \quad (3) $$

where $\alpha(v,w) = \alpha_v(v)\,\alpha_w(w)$, with:

$$ \alpha_v(v) = \alpha_w(w) = \begin{cases} \sqrt{\tfrac{1}{N_g}} & \text{if } v,w = 0, \\ \sqrt{\tfrac{2}{N_g}} & \text{if } v,w \neq 0. \end{cases} \quad (4) $$

Since we have Ng = 128 slices, the size of ϕi is (Ng × Ng). Equation (3) can be expressed as matrix and array multiplications:

$$ c_i = \alpha \,.\!*\, \left(A\,\phi_i\,A^T\right), \quad (5) $$

where $A \in \mathbb{R}^{N_c \times N_g}$, with $N_c$ the number of DCT coefficients (desired low-frequency components), and $.\!*$ denotes the element-wise product. Each row of $A$ stores the cosine $\cos\!\left(\frac{\pi(2x+1)u}{2N_g}\right)$ for a common $u$, with $u = 1, 2, \ldots, N_c$.

We extend it to the three-dimensional DCT and apply the property of separability (see [15] for more details), computing the 3D-DCT in two steps: 2D transformations over the slices $\phi_i$ along dimension $u$ and a final 1D operation. Figure 3 shows the 3D models obtained using 10, 25, 35, 50, 75 and 100 coefficients of the model visualized in fig. 2(b). Here we can see that the information about the general shape is stored in the low-frequency components, while the fine details are stored in the high-frequency components.

Fig. 3 3D model of fig. 2(b) using different number of coefficients in each dimension (degree of compression) 

Note that for 50, 75 and 100 coefficients there are no significant changes in appearance. We consider that 25 × 25 × 25 coefficients are enough for our purposes (as in [18]).
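The following is a minimal sketch of this compression step, assuming SciPy's dctn/idctn (type-II, orthonormal) as a stand-in for eqs. (3)-(5); only the low-frequency block of coefficients is kept.

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_tsdf(phi, nc=25):
    """Keep the nc^3 low-frequency DCT coefficients of an (Ng, Ng, Ng) TSDF grid."""
    coeffs = dctn(phi, type=2, norm='ortho')
    return coeffs[:nc, :nc, :nc]

def decompress_tsdf(coeffs, ng=128):
    """Reconstruct an approximate (ng, ng, ng) TSDF grid from the kept coefficients."""
    full = np.zeros((ng, ng, ng))
    nc = coeffs.shape[0]
    full[:nc, :nc, :nc] = coeffs
    return idctn(full, type=2, norm='ortho')

# Example: phi_hat = decompress_tsdf(compress_tsdf(phi)) approximates phi with 25^3 coefficients.
```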

3.3 Learning the Latent Space

The Latent Variable Model (LVM) is used for dimensionality reduction, to capture the shape variance as a low dimensional latent shape space, such that the resulting latent variables have fewer dimensions than the original observed data. Let $Y \in \mathbb{R}^{N_m \times N_c^3}$ be the matrix that stores in row $Y_j \in \mathbb{R}^{1 \times N_c^3}$ the $N_c = 25$ low-frequency coefficients in each dimension of the volumetric grid, for each model $j$, with $j = 1, \ldots, N_m$.

Let $\chi \in \mathbb{R}^{N_m \times N_x}$, with $N_x \ll N_c^3$, be the matrix that stores in each row the latent variable $\chi_j \in \mathbb{R}^{1 \times N_x}$ that represents the coefficients of model $j$. Now, we learn from the coefficients the parameters $(\theta_1, \theta_2, \theta_3, \theta_4)$ that map any latent variable $\chi_j$ to the corresponding row vector of coefficients $Y_j$, where the relationship is non-linear.

Since the mapping $\chi_j \rightarrow Y_j$ is modelled using a Gaussian process, we can define areas of high probability, that is, areas where there is high certainty of getting a valid shape. Moreover, this technique allows mapping coefficients from a continuous latent space (not just the ones used for learning), enabling a continuous search for the latent variable that best fits the data. We define the vector $W \in \mathbb{R}^{1 \times (N_m N_x + 4)}$ as:

$$ W = \left[\chi_{1,1} \cdots \chi_{1,N_m} \;\cdots\; \chi_{N_x,1} \cdots \chi_{N_x,N_m} \;\; \theta_1 \;\; \theta_2 \;\; \theta_3 \;\; \theta_4\right]. \quad (6) $$

It is the vector of parameters that solves the following optimization problem:

$$ \min_W E(W) = \frac{N_m}{2}\ln(2\pi) + \frac{1}{2}\ln|K| + \frac{1}{2}\mathrm{tr}\!\left(K^{-1}S\right), \quad (7) $$

where $N_m$ is the number of models, $K$ is the covariance matrix, which is estimated with a kernel function, and $S = D^{-1}YY^T$, with $D$ the number of coefficients of a model, $N_c^3$. The components of $K$, denoted as $k_{ij} = K(\chi_i, \chi_j)$, for a latent variable with two dimensions ($N_x = 2$) and a Radial Basis Function kernel, are:

$$ k_{ij} = \theta_1 \exp\!\left(-\frac{\theta_2}{2}\left[(\chi_{1,i}-\chi_{1,j})^2 + (\chi_{2,i}-\chi_{2,j})^2\right]\right) + \theta_3 + \delta_{ij}\theta_4, \quad (8) $$

where $\delta_{ij}$ is the Kronecker delta function. We initialize the latent variables of the vector of parameters $W$ with the estimation obtained with Dual Probabilistic PCA. Then, we use the scaled conjugate gradient (SCG) algorithm for refining the initial estimation. It combines the model trust region approach, known from the Levenberg-Marquardt algorithm, with the conjugate gradient approach.

The pseudocode and more details about SCG can be found in [24]. The parameters for the kernel are set according to [20]: θ1 = 1, θ2 = 1, θ3 = exp(-1), and θ4 = exp(-1). With these values we can set the initial kernel and continue the iterative process until convergence.

The resulting latent space is employed in Section 4.1 for shape optimization. Besides the mean value, each point $\chi_p$ in the latent space has a variance, which is computed as:

$$ \sigma^2 = K(\chi_p, \chi_p) - K(\chi, \chi_p)\,K^{-1}\,K(\chi, \chi_p)^T. \quad (9) $$

The parameters for mapping converged to θ1 = 10.4195, θ2 = 1.6784, θ3 = 84.5378, and θ4 = 1.1679.
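For reference, the sketch below implements the RBF kernel of eq. (8) and the GP predictive mean and variance (the mean is used later in eq. (16); the variance is eq. (9)), assuming the latent variables, the coefficient matrix and the hyperparameters have already been learned; names are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, theta):
    """k(a, b) = th1 * exp(-th2/2 * ||a - b||^2) + th3; the delta_ij * th4 term is added separately."""
    th1, th2, th3, _ = theta
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return th1 * np.exp(-0.5 * th2 * sq) + th3

def gp_predict(chi_p, chi, Y, theta):
    """Predictive mean v(chi_p) and variance sigma^2 at a latent point chi_p (shape (2,))."""
    th4 = theta[3]
    K = rbf_kernel(chi, chi, theta) + th4 * np.eye(len(chi))   # covariance of the training latents
    k_p = rbf_kernel(chi_p[None, :], chi, theta)               # 1 x Nm row kernel for chi_p
    K_inv = np.linalg.inv(K)
    mean = k_p @ K_inv @ Y                                     # coefficients associated with chi_p
    var = rbf_kernel(chi_p[None, :], chi_p[None, :], theta) + th4 - k_p @ K_inv @ k_p.T
    return mean.ravel(), var.item()
```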

3.4 Shape Prior Estimation: Shape Optimization

We define an energy that depends on the sum of squared residuals:

$$ E(\Phi) = \frac{1}{2}\|\Phi\|^2, \quad (10) $$

where the residual vector Φ is defined with the Geman-McClure function, which is robust to outliers:

$$ \Phi = \frac{\phi(X_q)^2}{\phi(X_q)^2 + \sigma}, \quad (11) $$

where $\sigma$ is a constant parameter and $X_q = [X_q\; Y_q\; Z_q]$ is one of the $N_p$ 3D points that are evaluated (by 3D interpolation) in the embedding function $\phi$. Points that are back-projected outside the volumetric grid are assigned a large value. The derivative of the energy w.r.t. the latent variable $\chi_p$ is computed through the chain rule:

$$ \frac{\partial E}{\partial \chi_p} = \frac{\partial E}{\partial \Phi} \cdot \frac{\partial \Phi}{\partial \chi_p}. \quad (12) $$

The first term on the right is the residual vector Φ. The variations of the residual vector Φ due to changes in the latent variable χp corresponds to the Jacobian:

$$ J = \frac{\partial \Phi}{\partial \chi_p} = \frac{\partial \Phi}{\partial \phi} \cdot \frac{\partial \phi}{\partial \chi_p}, \quad (13) $$

where:

$$ \frac{\partial \Phi}{\partial \phi} = \frac{2\sigma\,\phi(X_q)}{\left(\phi(X_q)^2 + \sigma\right)^2}. \quad (14) $$

To define the second term on the right of eq. (13), we express the embedding function $\phi$ in terms of the latent variable:

$$ \phi = \mathrm{IDCT}\!\left(v(\chi_p)\right). \quad (15) $$

The function IDCT(·) is the inverse discrete cosine transform applied to the mean v of the coefficients associated with the latent variable χp through a Gaussian process:

$$ v(\chi_p) = K(\chi, \chi_p)\,K^{-1}\,Y. \quad (16) $$

Considering that the derivative of the IDCT of a variable is the IDCT of the derivative of the variable, we have:

$$ \frac{\partial \phi}{\partial \chi_p} = \frac{\partial\, \mathrm{IDCT}(v)}{\partial \chi_p} = \mathrm{IDCT}\!\left(\frac{\partial v}{\partial \chi_p}\right), \quad (17) $$

where:

$$ \frac{\partial v}{\partial \chi_p} = \frac{\partial K(\chi, \chi_p)}{\partial \chi_p}\,K^{-1}\,Y, \quad (18) $$

$K(\chi, \chi_p) \in \mathbb{R}^{1 \times N_m}$ is the row kernel for the latent variable $\chi_p$. It depends on the squared Euclidean distance between the latent variable $\chi_p$ and each one of the $N_m$ latent variables used for learning the mapping parameters. Each element of the row kernel for $\chi_p$, with $N_x = 2$, is found using eq. (8), replacing $\chi_j$ by $\chi_p$. The derivative of $K(\chi, \chi_p)$ w.r.t. each component of the latent variable is computed straightforwardly. The latent variable is updated by addition, $\chi_p^{k+1} = \chi_p^k + \delta\chi_p$, where $\delta\chi_p$ is found with Levenberg-Marquardt:

$$ \delta\chi_p = -\left(J^TJ + \alpha\,\mathrm{diag}\!\left(J^TJ\right)\right)^{-1} J^T\Phi. \quad (19) $$

We start with α = 1e-4. After each iteration, the new energy is compared with the previous one. If the energy has decreased, α is reduced by a factor of 10 and the parameters are updated. If the energy has increased, the parameters are not updated and α is increased by a factor of 10.
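A minimal sketch of this damped update loop (eq. (19) plus the schedule for α) is shown below; residual_and_jacobian is a placeholder of ours for a routine that evaluates Φ and J at the current latent variable.

```python
import numpy as np

def lm_optimize(chi, residual_and_jacobian, n_iters=15, alpha=1e-4):
    """Levenberg-Marquardt iterations with multiplicative damping adjustment."""
    phi, J = residual_and_jacobian(chi)
    energy = 0.5 * phi @ phi
    for _ in range(n_iters):
        H = J.T @ J
        delta = -np.linalg.solve(H + alpha * np.diag(np.diag(H)), J.T @ phi)  # eq. (19)
        phi_new, J_new = residual_and_jacobian(chi + delta)
        energy_new = 0.5 * phi_new @ phi_new
        if energy_new < energy:                  # accept the step and trust the model more
            chi, phi, J, energy = chi + delta, phi_new, J_new, energy_new
            alpha /= 10.0
        else:                                    # reject the step and increase damping
            alpha *= 10.0
    return chi
```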

3.5 Shape Prior Estimation: Pose and Scale Optimization

We now differentiate the energy of eq. (10) w.r.t. the pose and scale, optimizing them in a separate and alternating way. The pose is minimally parametrized with a vector $\zeta = [w_x, w_y, w_z, v_x, v_y, v_z]$, where the first three elements define the axis of rotation and its norm defines the magnitude of the rotation. The remaining elements define the translation after carrying out the rotation. The derivative of the energy w.r.t. $\zeta$ is:

$$ \frac{\partial E}{\partial \zeta} = \frac{\partial E}{\partial \Phi} \cdot \frac{\partial \Phi}{\partial \zeta}. \quad (20) $$

Again, the first term on the right is the residual Φ. The variations of the residual vector Φ due to changes in pose ζ correspond to the Jacobian for this problem:

$$ J = \frac{\partial \Phi}{\partial \zeta} = \frac{\partial \Phi}{\partial \phi} \cdot \frac{\partial \phi}{\partial \zeta}. \quad (21) $$

The first term on the right was defined in equation (14). The second term on the right is:

$$ \frac{\partial \phi}{\partial \zeta} = \frac{\partial \phi}{\partial X_q} \cdot \frac{\partial X_q}{\partial \zeta}, \quad (22) $$

where $\frac{\partial \phi}{\partial X_q}$ is computed numerically, using the central difference formula:

$$ \frac{\partial f(x)}{\partial x} = \frac{f(x+h) - f(x-h)}{2h}. \quad (23) $$

Since a point $X_q = [X_q\; Y_q\; Z_q]$ has three coordinates (voxel coordinates referenced to the initial coordinate system $q$), we get:

$$ \frac{\partial \phi}{\partial X_q} = \left[\frac{\partial \phi}{\partial X_q}\;\; \frac{\partial \phi}{\partial Y_q}\;\; \frac{\partial \phi}{\partial Z_q}\;\; 1\right], \quad (24) $$

with $h = 10^{-8}$ for all the components. To define $\frac{\partial X_q}{\partial \zeta}$ we first express a point $X_q$ in terms of $X_w$. We have that:

$$ X_w = M_{wo}\,S\,M_{oq}\,X_q, \quad (25) $$

where $M_{wo}$ is the transformation matrix between the object and the world system, and $M_{oq}$ is the transformation matrix between the initial and the object system; it corresponds to a translation to the center of the volumetric grid. Figure 4 shows the initial coordinate system $q$, the object system $o$, and the world system $w$. Since the point cloud is referenced to the world coordinate system, the 3D model is also referenced to this system and then aligned with the point cloud.

Fig. 4 Initial coordinate system q, object coordinate system o and world coordinate system w

$S$ scales the points $X_o$ and is defined as:

$$ S = \begin{bmatrix} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & s & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \quad (26) $$

Solving for the point Xq, we get:

$$ X_q = M_{oq}^{-1}\,S^{-1}\,M_{ow}\,X_w. \quad (27) $$

The transformation matrix $M_{ow}$ is updated as:

$$ M_{ow}^{k+1} = \Delta M_{ow}\,M_{ow}^{k}, \quad (28) $$

where the incremental change in the transformation is expressed using the exponential mapping and the Lie algebra:

$$ \Delta M_{ow} = \exp\!\left([\Delta\zeta]_\times\right). \quad (29) $$

The operator $[\cdot]_\times$ transforms a six-element vector into the matrix:

$$ [\Delta\zeta]_\times = \begin{bmatrix} 0 & -w_z & w_y & v_x \\ w_z & 0 & -w_x & v_y \\ -w_y & w_x & 0 & v_z \\ 0 & 0 & 0 & 0 \end{bmatrix}. \quad (30) $$

$\Delta\zeta$ is computed in a similar way as $\delta\chi_p$ in eq. (19). Replacing $M_{ow}^{k+1}$ of eq. (28) and $\Delta M_{ow}$ of eq. (29) into eq. (27), we get:

$$ X_q = M_{oq}^{-1}\,S^{-1}\exp\!\left([\Delta\zeta]_\times\right)M_{ow}^{k}\,X_w, \quad (31) $$

where $M_{ow}^{k}X_w$ is the 3D point $X_o^k$ prior to updating the transformation matrix $M_{ow}$. Now, $\frac{\partial X_q}{\partial \zeta}$, considering the scale constant, is:

$$ \frac{\partial X_q}{\partial \zeta} = M_{oq}^{-1}\,S^{-1}\,\frac{\partial \exp\!\left([\Delta\zeta]_\times\right)}{\partial \Delta\zeta}\,X_o^k, \quad (32) $$

where the derivative of the exponential w.r.t the first element of the vector ζ is:

$$ \frac{\partial \exp\!\left([\Delta\zeta]_\times\right)}{\partial w_x} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \quad (33) $$

This derivative is the generator for rotations about the x-axis in the Lie algebra. The same process is carried out for the remaining five parameters. Multiplying by the homogeneous form of $X_o$ and organizing, we get a 4 × 6 matrix:

$$ \frac{\partial \exp\!\left([\Delta\zeta]_\times\right)}{\partial \Delta\zeta}\,X_o^k = \begin{bmatrix} 0 & X_{oz} & -X_{oy} & 1 & 0 & 0 \\ -X_{oz} & 0 & X_{ox} & 0 & 1 & 0 \\ X_{oy} & -X_{ox} & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}. \quad (34) $$
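For illustration, here is a minimal sketch of this se(3) machinery: the hat operator of eq. (30), the incremental update of eqs. (28)-(29) via the matrix exponential, and the 4 × 6 point Jacobian of eq. (34); the names and the use of scipy.linalg.expm are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

def hat(zeta):
    """[dzeta]_x for zeta = [wx, wy, wz, vx, vy, vz] (eq. (30))."""
    wx, wy, wz, vx, vy, vz = zeta
    return np.array([[0.0, -wz,  wy, vx],
                     [ wz, 0.0, -wx, vy],
                     [-wy,  wx, 0.0, vz],
                     [0.0, 0.0, 0.0, 0.0]])

def update_pose(M_ow, delta_zeta):
    """M_ow^{k+1} = exp([dzeta]_x) M_ow^k (eqs. (28)-(29))."""
    return expm(hat(delta_zeta)) @ M_ow

def point_jacobian(X_o):
    """Derivative of exp([dzeta]_x) X_o w.r.t. dzeta at dzeta = 0: a 4 x 6 matrix (eq. (34))."""
    x, y, z = X_o[:3]
    return np.array([[0.0,   z,  -y, 1.0, 0.0, 0.0],
                     [ -z, 0.0,   x, 0.0, 1.0, 0.0],
                     [  y,  -x, 0.0, 0.0, 0.0, 1.0],
                     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
```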

Following a similar process for scale optimization, we consider the pose parameters constant and differentiate $X_q$ of eq. (27) w.r.t. the scale parameter:

$$ \frac{\partial X_q}{\partial s} = -M_{oq}^{-1}\,S^{-1}\,\frac{\partial S}{\partial s}\,S^{-1}\,X_o^k, \quad (35) $$

where $\frac{\partial S}{\partial s}$ is easily computed.

3.6 Inpainting and Denoising

TV-based methods are well suited for tasks like depth data integration, as was shown in [36]. In our context, an energy defined by a TV-based regularization together with a data term that measures the discrepancy between two sources of data (the depths coming from the model $D_m(\mu)$ and from the sensor $D_s(\mu)$) and the sought solution $D_f(\mu)$ is implemented:

$$ E(D_f) = \left\|\nabla D_f(\mu)\right\|_1 + \lambda \sum_{k=1}^{2}\left\| w_k(\mu)\left(D_f(\mu) - D_k(\mu)\right)\right\|_\epsilon, \quad (36) $$

where λ defines the balance between the regularizer term and the data term, D1(μ) = Ds(μ), and D2(μ) = Dm(μ). We apply the robust Huber norm over the data term where:

$$ \|x\|_\epsilon = \begin{cases} \dfrac{\|x\|_2^2}{2\epsilon} & \text{if } \|x\|_2 \leq \epsilon, \\ \|x\|_1 - \dfrac{\epsilon}{2} & \text{otherwise.} \end{cases} \quad (37) $$

The $L_2^2$ norm acts on small values, which is appropriate for handling Gaussian noise and yields smooth regions, while the $L_1$ norm allows discontinuities at depth edges (larger errors). Missing data in the depth maps (undefined pixels) defines the inpainting domain:

$$ w_k(\mu) \in \{0, 1\}, \quad (38) $$

where $w_k(\mu) = 0$ corresponds to pure inpainting at location $\mu$. In consequence, the regularizer term favours smooth solutions and the data term favours solutions similar to the depth sources. The energy defined in eq. (36) is minimized using a first-order primal-dual algorithm.

3.6.1 Primal-Dual Algorithm

The regulariser and the data term of eq. (36) can be written in a more general form:

$$ \min_y F(Ay) + \sum_{k=1}^{2} G_k(y), \quad (39) $$

In our case, $A$ is the gradient operator, $F(Ay) = \|Ay\|_1$ and $G_k(y) = \lambda\left\|\varpi_k\left(y - \varphi_k\right)\right\|_\epsilon$; $y$, $\varphi_k$ and $\varpi_k$ are row-wise vector versions of the sought solution $D_f(\mu)$, the depth sources $D_k(\mu)$, and the matrix $w_k$ that defines the inpainting domain, respectively. We apply Legendre-Fenchel transformations in this way:

$$ F(Ay) = \max_{\|\varrho\|_2 \leq 1} \langle Ay, \varrho\rangle - F^*(\varrho), \quad (40) $$

$$ G_k(y) = \max_{|r_k| \leq \lambda\varpi_k} \langle y - \varphi_k, r_k\rangle - G_k^*(r_k), \quad (41) $$

where $\varrho$ and $r_k$ are dual variables associated with the primal variables $y$ and $\varphi_k$, respectively, and $\langle\cdot,\cdot\rangle$ denotes the dot product. Replacing (40) and (41) in (39) we get the primal-dual formulation of this nonlinear problem. It is a generic saddle-point problem:

$$ \min_y \max_{\substack{|r_k| \leq \lambda\varpi_k \\ \|\varrho\|_2 \leq 1}} \langle Ay, \varrho\rangle - F^*(\varrho) + \sum_{k=1}^{2}\left[\langle y - \varphi_k, r_k\rangle - G_k^*(r_k)\right], \quad (42) $$

where $F^*(\varrho)$ and $G_k^*(r_k)$ are the convex conjugates of $F(Ay)$ and $G_k(y)$, respectively, and are defined as:

$$ F^*(\varrho) = \delta_\varrho(\varrho), \quad (43) $$

$$ G_k^*(r_k) = \delta_{r_k}(r_k) + \frac{\epsilon}{2}\|r_k\|_2^2, \quad (44) $$

where $\delta_\varrho$ and $\delta_{r_k}$ are indicator functions of the convex sets, defined as:

$$ \delta_\varrho(\varrho) = \begin{cases} 0 & \text{if } \|\varrho\|_2 \leq 1, \\ \infty & \text{otherwise,} \end{cases} \quad (45) $$

$$ \delta_{r_k}(r_k) = \begin{cases} 0 & \text{if } |r_k| \leq \lambda\varpi_k, \\ \infty & \text{otherwise.} \end{cases} \quad (46) $$

Following Algorithm 1 of [2] and the parameter setting of [17], we set the primal and dual time steps to τ = 0.05 and σ = 1/(8τ), respectively. The Huber norm parameter is ϵ = 0.1 and the balance is λ = 1.2. We set the initial primal variable as $\bar{y}^0 = \varphi_s$ since it is the most informative depth source. The dual variables $r_k$ and $\varrho$ are initialized with zeros.

Based on [2], the iterative optimization alternates gradient ascent over the dual variables and gradient descent over the primal variable, projecting the results onto the constraints and extrapolating the primal variable, as summarized next:

$$ \begin{aligned} \varrho^{n+1} &= \mathrm{proj}_\varrho\!\left(\varrho^{n} + \sigma A\bar{y}^{n}\right), \\ r_k^{n+1} &= \mathrm{proj}_{r_k}\!\left(\frac{r_k^{n} + \sigma\left(\bar{y}^{n} - \varphi_k\right)}{1 + \sigma\epsilon}\right), \quad k = 1, 2, \\ y^{n+1} &= y^{n} - \tau\left(A^*\varrho^{n+1} + \sum_{k=1}^{2} r_k^{n+1}\right), \\ \bar{y}^{n+1} &= y^{n+1} + \Phi\left(y^{n+1} - y^{n}\right), \end{aligned} \quad (47) $$

where $A^*$ is the adjoint of the gradient operator and corresponds to the negative divergence operator, $\Phi = 1$, and $\mathrm{proj}_\varrho$ and $\mathrm{proj}_{r_k}$ are projections of the dual variables $\varrho$ and $r_k$, respectively, onto convex sets. They are defined for each element of the vectors as:

$$ \mathrm{proj}_\varrho(\tilde{\varrho}) = \frac{\tilde{\varrho}}{\max\!\left(1, \|\tilde{\varrho}\|\right)}, \quad (48) $$

$$ \mathrm{proj}_{r_k}(\tilde{r}_k) = \begin{cases} \tilde{r}_k & \text{if } |\tilde{r}_k| < \lambda\varpi_k, \\ \lambda\varpi_k & \text{if } \tilde{r}_k > \lambda\varpi_k, \\ -\lambda\varpi_k & \text{if } \tilde{r}_k < -\lambda\varpi_k. \end{cases} \quad (49) $$
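A minimal sketch of this iteration on dense NumPy depth images is given below. The forward-difference gradient, its negative-divergence adjoint and the handling of missing pixels (set to zero and masked by w_k = 0) are our assumptions; the parameter values follow the text.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary conditions."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    """Discrete divergence; -div is the adjoint of grad."""
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]; dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]; dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]; dy[1:-1, :] = py[1:-1, :] - py[:-2, :]; dy[-1, :] = -py[-2, :]
    return dx + dy

def fuse_depths(D_s, D_m, w_s, w_m, n_iters=200, lam=1.2, eps=0.1, tau=0.05):
    """Primal-dual merging of sensor and model depth maps (eq. (47))."""
    sigma = 1.0 / (8.0 * tau)
    sources = [(np.nan_to_num(D_s), w_s), (np.nan_to_num(D_m), w_m)]  # missing pixels masked by w_k = 0
    y = sources[0][0].copy()                        # initialize with the sensor depth
    y_bar = y.copy()
    px = np.zeros_like(y); py = np.zeros_like(y)
    r = [np.zeros_like(y), np.zeros_like(y)]
    for _ in range(n_iters):
        gx, gy = grad(y_bar)                        # dual ascent on rho, then project (eq. (48))
        px, py = px + sigma * gx, py + sigma * gy
        norm = np.maximum(1.0, np.sqrt(px ** 2 + py ** 2))
        px, py = px / norm, py / norm
        for k, (D_k, w_k) in enumerate(sources):    # dual ascent on r_k, then project (eq. (49))
            rk = (r[k] + sigma * (y_bar - D_k)) / (1.0 + sigma * eps)
            r[k] = np.clip(rk, -lam * w_k, lam * w_k)
        y_new = y - tau * (-div(px, py) + r[0] + r[1])   # primal descent; A* rho = -div(rho)
        y_bar = 2.0 * y_new - y                     # extrapolation with Phi = 1
        y = y_new
    return y
```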

4 Results

For testing our optimization algorithms when one of the three variables is unknown (shape, pose and scale) we use a down-sampled point cloud coming from the vertices of the triangles computed with the marching cubes algorithm for the compressed version of the base model (fig. 2(h)).

When shape, pose and scale are unknown, we test our algorithms with data coming from the Kinect 1.0 and with synthetic data obtained from a 3D scene. For these experiments we also carry out depth integration using our proposed variational formulation and quantify the completeness and accuracy of the enhanced depth maps.

4.1 Results in Shape Optimization

The goal is to refine the shape associated with an initial latent variable in order to reduce the residual computed by evaluating the current embedding function in the point cloud (eq. (10)). In this experiment, the pose and scale are considered known. Figure 5(a) shows an explicit representation of the model for the initial latent variable.

Fig. 5 Evolution of the shape in the optimization process. The search is done in the latent space and the model is obtained with the inverse DCT. It takes 15 iterations for converging. The model is drawn in green and the point cloud in red 

Figures 5(b) and 5(c) show the model for iterations 1 and 8, respectively. Figure 6 shows the path of the iterative search in the latent space. It starts with a latent variable $\chi_p = [-0.2723\;\; 2.3048]$ and in iteration 15 $\chi_p$ has evolved to $[0.500114\;\; 1.63515]$, with a ground truth of $\chi_p = [0.3895\;\; 1.6162]$. Note that after iteration 8 ($\chi_p = [0.582774\;\; 1.62765]$) the energy diminishes slowly.

Fig. 6 Latent Space with the 49 latent variables (in black) used for learning the mapping parameters. Low variance regions are drawn in blue while high variance regions are drawn in red. More likely valid shapes are obtained in blue areas. The green point is the ground truth in latent variable. The search path of the latent variable that best fits to data is drawn in orange. The resulting latent variable is drawn in dark orange 

The Euclidean distance between the estimated latent variable and the ground truth decreased from 0.9551 in the initialization to 0.1122 in iteration 15, getting a reduction in error of 88.25%.

4.2 Results in Pose Optimization

In this experiment the shape and scale are considered known and an initial pose is refined in order to align the embedding function with the point cloud. The ground truth in position is the origin and in orientation is the identity matrix. The initial position is $t_{wo} = [-0.2\;\; 0.2\;\; 0.2]$, with a Euclidean distance to the origin of 0.3464. The initial orientation corresponds to a rotation of $\alpha_z = -30°$ around the z-axis w.r.t. the ground truth.

Figure 7(a) shows the initial pose, while fig. 7(b) and 7(c) show the pose for iterations 20 and 50, respectively. The evolution of the Euclidean distance from the model to the origin and the evolution of the model orientation w.r.t. the ground truth are presented in fig. 8(a) and 8(b), respectively. The orientation Rwo is represented with rotations over the fixed axis x, y, and z (in this order).

Fig. 7 Evolution of the pose of the embedding function in the optimization process. The model is drawn in green and the point cloud in red 

Fig. 8 Evolution of the position and orientation with respect to the ground truth. It takes 59 iterations for converging 

The final position is $t_{wo} = [0.0032\;\; -0.0104\;\; 0.0035]$, with a Euclidean distance to the origin of 0.0114. The final orientation is $\alpha_x = -0.0017°$, $\alpha_y = -0.0663°$, and $\alpha_z = -0.0209°$. It converges at iteration 59, achieving a reduction in position error of 96.70% and in rotation around the z-axis of 99.93%.

4.3 Results in Scale Optimization

For scale optimization, the shape and pose are considered known. Recall that the same scale is used in the three dimensions. The initial value of the scale is s = 0.4 and the ground truth is s = 1.0. Figure 9(a) shows the model with the initial scale (s = 0.4), while figs. 9(b) and 9(c) show the model with the scale of iterations 1 (s = 1.5855) and 10 (s = 1.0053), respectively. The error for iteration 10 is 0.53% of the ground truth. The reduction in error is 99.12%. Figure 10 shows how the scale gets close to the ground truth (s = 1.0).

Fig. 9 Evolution of the scale of the embedding function in the optimization process. The model is drawn in green and the point cloud in red 

Fig. 10 Evolution of the scale. It takes 10 iterations for converging 

4.4 Results in Pose, Scale and Shape, with Real Data

Now, we use a point cloud coming from a depth map taken with the Microsoft Kinect V1. The Kinect was moved by hand in front of a toy car while VGA images were captured at 30 fps. We selected one depth and one intensity image (see figs. 14(a) and 14(b)) and manually segmented the toy car, getting the mask of fig. 11(a). The resulting 3D points $X_r$ associated with the segmented car (red points in fig. 11(b)) are referenced to the camera coordinate system, where the z-axis is perpendicular to the image plane.

Fig. 11 a) Mask of the segmented car. b) Point cloud associated to the segmented region (lateral side of the toy car) 

Then, we compute 3D points referenced to the world coordinate system, $X_w = M_{wr}X_r$, where $M_{wr}$ is a rotation of 90° around the y-axis followed by a rotation of -90° around the new z-axis. As was done for model alignment in Section 3.1, the point cloud of the toy car referenced to the world system $X_w$ is translated from its center of gravity to the origin, $X_{wc} = X_w - \bar{X}_w$, and the average distance of each point to the new origin, $s_{pc}$, is computed using eq. (1). The same process is carried out for the vertices of the 3D model to obtain $s_m$. The initial scale is computed with eq. (2).

The initial position of the car is defined as the center of gravity of the point cloud referenced to the world coordinate system, adding 4% of the x-component (z-axis in the camera coordinate system) to itself since the data belongs just to a side of the whole car:

$$ t_{wo} = \left[1.04\,\bar{X}_{wx}\;\; \bar{X}_{wy}\;\; \bar{X}_{wz}\right]^T. \quad (50) $$

As in [6], we assume that the toy car is over a flat surface, so computing the normal of this surface allows us to estimate an initial orientation of the car. This process begins by calculating 3D points for pixels in a region below the segmented area which belong to the supporting plane. Then, RANSAC is applied: three points are randomly selected and the parameters that define a plane are computed.

The points that fit the computed plane (within a threshold) are counted. The random selection process was repeated 100 times and we chose the parameters that generated the largest number of inliers (points under the threshold when the plane is evaluated), as sketched below.
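A minimal sketch of this plane-fitting step, assuming an (N, 3) NumPy point cloud; the distance threshold value is an illustrative assumption.

```python
import numpy as np

def ransac_plane(points, n_hypotheses=100, threshold=0.01, rng=np.random.default_rng(0)):
    """Fit a plane n.x + d = 0 to an (N, 3) point cloud; keep the hypothesis with most inliers."""
    best_n, best_d, best_inliers = None, None, -1
    for _ in range(n_hypotheses):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:                     # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -n @ p1
        inliers = np.sum(np.abs(points @ n + d) < threshold)
        if inliers > best_inliers:
            best_n, best_d, best_inliers = n, d, inliers
    return best_n, best_d
```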

The resulting normal $n$, shown in fig. 11(b), is used for computing two angles α and β that move the embedding function to the supporting plane:

$$ \alpha = \arccos\!\left(\frac{n_z}{\sqrt{n_x^2 + n_z^2}}\right), \quad \beta = \arccos\!\left(\sqrt{n_x^2 + n_z^2}\right). \quad (51) $$

The transformation consists of a rotation of -α around the y-axis followed by a rotation of β around the new x-axis. The third angle γ, needed for totally defining the orientation of the embedding function, is the rotation around the unit normal (the new z-axis). This angle is found through exhaustive search with γ between 10° and 360° and an increment of 10°; the selected γ is the one that produces the minimum energy when eq. (10) is evaluated. Finally, the initial transformation matrix $M_{wo}$ is:

$$ M_{wo} = \begin{bmatrix} R_{wo} & t_{wo} \\ 0_{1\times 3} & 1 \end{bmatrix}, \quad (52) $$

with $R_{wo} = \mathrm{Rot}(Y, -\alpha)\,\mathrm{Rot}(X', \beta)\,\mathrm{Rot}(Z'', \gamma)$. Summarizing, the initial position is $t_{wo} = [0.8459\;\; -0.1524\;\; -0.2598]$, and the initial orientation corresponds to α = 20.0017°, β = 2.18° and γ = -20°.

For comparing this orientation with the final one, we use rotations over the fixed axes x, y and z (in this order), with $\alpha_x = 9.1136°$, $\alpha_y = -18.0095°$, and $\alpha_z = -21.0621°$. The initial scale is s = 0.1571, and the initial shape is the one associated with the reference model employed in the model alignment process, $\chi = [0.3895\;\; 1.6162]$. Figures 12(a) and 12(b) show the initial conditions for the model.

Fig. 12 Evolution of the pose, scale, and shape of the embedding function in the alternating optimization process, for real data using the Kinect. The point cloud is drawn in red 

For this test, we carry out two cycles of the sequence: 20 iterations for pose and 5 iterations for scale. At the end of this sequence, 50 iterations have been done and very close pose and scale estimations are obtained (see fig. 12(d)). With these estimations we can perform an exhaustive search over the $N_m = 49$ models of cars used for learning the latent space. The latent variable whose 3D level set produces the minimum energy, $\chi = [0.4560\;\; 2.4675]$, is used as the initial value in the following refinement process. Finally, we carry out three cycles of the sequence: 5 iterations for shape, 10 iterations for pose and 5 iterations for scale, getting a refinement in pose and scale for a more approximate shape (see fig. 12(f)).

The final scale is s = 0.1602, and the final latent variable is $\chi = [0.5858\;\; 2.6147]$. The final position is $t_{wo} = [0.8499\;\; -0.1377\;\; -0.2530]$ and the final orientation is $\alpha_x = 5.7633°$, $\alpha_y = -9.0734°$ and $\alpha_z = -26.0871°$.

Figure 13 shows the evolution of the energy, drawing in red, green, and blue the pose, scale and shape optimization, respectively. The synthetic depth map obtained by reading the depth buffer of OpenGL from the current camera pose is presented in fig. 14(c).

Fig. 13 Evolution of the energy for pose, scale, and shape optimization, in alternating way, using Levenberg Marquardt and the Kinect V1 

Fig. 14 Sought depth map Df(μ) for iteration (a) 10, (b) 40, (c) 100 and (e) 200 of the primal-dual algorithm 

4.4.1 Results in Inpainting and Denoising

Since the main goal is to reduce noise and complete missing data in the depth map, especially in the region of the car, we obtain depth data of the optimal 3D model seen from the real camera pose. Figure 14(b) shows the depth map coming from the sensor while fig. 14(c) shows the depth map coming from the optimal 3D model. Note that the former image has missing data due to possibly dark, very reflective or translucent objects, out-of-range objects or occlusions. The latter image has defined data just for the 3D model. $w_s(\mu)$ and $w_m(\mu)$ define the inpainting domain, where 1 represents defined data and 0 missing data (see eq. (38)).

Figure 14(d) shows the sought solution obtained with the optimization process at iteration 200. Note that the filled areas are smooth and undetectable, achieving a realistic appearance. The inpainting at the top and right side of the depth map is not as good since there is no data in either $D_s$ or $D_m$ and the missing data covers large areas.

We analyzed the performance of the inpainting and denoising algorithm using a slice around the x-axis (see fig. 15). Note that missing data stays mainly in the image borders, in the laptop region and in the car region (inside the vertical lines in fig. 15). It is 22.81% of the total amount of data in the slice.

Fig. 15 Depth map denoising and inpainting using shape priors and variational techniques along a slice. The vertical lines encompass the car 

The solution remains similar to the sensor data, but it fills missing data, integrates depth data coming from the model, and is smooth while preserving discontinuities. Evaluating the completeness of the car, we can state that the algorithm fills 33.28% of the missing data. This estimation was done for pixels belonging to the segmented area of the whole car.

4.5 Results in Pose, Scale and Shape with Synthetic Data

For this experiment we have created in Blender a 3D scene composed of a floor, a desk, a car, a mug and a lamp. A depth map (see fig. 17(f)) and an RGB image have been rendered from the virtual camera. The depth data is in the range [1.2045 8.9610] m. We added Gaussian white noise with an SNR of 5 dB to the RGB image (see fig. 17(a)) and 30 dB to the depth map. Moreover, we simulated missing data in the depth maps with rectangles of 40 × 20 pixels containing NaNs, randomly placed in the depth maps. The percentages of missing data that we used were 23.65% and 40.02% (see figs. 17(b) and 17(g), respectively).

Fig. 17 The first figure in the first and second rows are the RGB image with noise and the ground truth in depth. For the remaining figures, the first row contains images for the depth map with missing data of 23.65% while the second row is for the depth map with missing data of 40.02%. From left to right: depth map with noise and missing data, depth map coming from the optimal model, our results, results using [28]

The process for estimating the optimal shape prior (initial estimation and the iterative refinement) and for getting a depth map of this model is similar to the one carried out previously for real data. Figures 16(a) and 16(b) show the aligned models with the point clouds for 23.65% and 40.02% of missing data, respectively.

Fig. 16 Results of the alignment process between a 3D level-set and the point cloud. a) For the depth map with missing data of 23.65% and b) for the depth map with missing data of 40.02%

Our algorithm performs 500 iterations for the depth map with 23.65% of missing data and 700 iterations for the one with 40.02%. We compare the results with the ones obtained with the outstanding system of Pertuz and Kamarainen [28], employing their publicly available code.

It uses anisotropic diffusion and a region-based approach for inpainting sparse depth maps. Table 1 shows that our algorithm outperforms [28] for both percentages of missing data. In fig. 17 we can see that the inpainted and denoised depth map obtained with our algorithm is smoother and preserves discontinuities better than [28]. We did not use higher percentages of missing data since they affect the alignment of the model, producing incorrect estimations of the optimal model.

Table 1 Comparison of RMSE[m] of our algorithm and [28] 

Missing data | Ours   | [28]
23.65%       | 0.1498 | 0.1600
40.02%       | 0.1869 | 0.2188

4.5.1 Time

Our system for inpainting and denoising runs on a Hewlett-Packard laptop with an Intel Core i7 2.2 GHz processor, 11.7 GB of RAM, an NVIDIA GeForce 755M graphics processor and Ubuntu 14.04 as the operating system. OpenCV is used for processing images, Eigen3 for matrix operations, OpenGL for drawing the 3D model and getting depth information from it, and CUDA 7.5 for speeding up the algorithm that merges the depth data coming from the sensor and from the optimal model.

The algorithm for estimating the optimal model was implemented in Matlab. Table 2 shows the average processing time for the main processes. The GPU implementation of the algorithm for estimating the optimal model is left for future work.

Table 2 Processing time for inpainting, denoising and merging 

Process                 | Time [ms]
First line of eq. (47)  | 6.501
Second line of eq. (47) | 5.373
Third line of eq. (47)  | 3.247
Remaining processes     | 0.573
Total per iteration     | 15.694

The algorithm takes 7.847s (500 iterations) for the depth map with 23.65% of missing data and 10.96s (700 iterations) for the depth map with 40.02% of missing data.

5 Conclusion

We have developed a system that optimizes a 3D model, represented as a 3D level-set embedding function, w.r.t. pose, scale and shape, and that uses depth data of the optimal model for denoising and inpainting a raw depth map coming from a depth sensor, achieving significant results in completeness, especially in the area associated with the object of interest, which has a high percentage of missing data.

First, the embedding function was successfully compressed from 128³ to 25³ DCT coefficients (a compression ratio of 134.22) and reduced from 25³ to 2 dimensions (a bi-dimensional latent space) using GPLVM, for a more efficient search of the optimal model.

Second, the optimization performed well: the shape optimization converged in 15 iterations with a reduction in error of 88.25% w.r.t. the initial estimation; the pose optimization converged in 59 iterations, achieving a final error in position of 1.14 cm and less than one tenth of a degree in rotation around each axis (a reduction in position error of 96.70% and in rotation error around the z-axis of 99.93%), validating our proposed technique based on Lie algebra; the scale optimization converged in 10 iterations, getting a final error of 0.53% of the ground truth (a reduction in scale error of 99.12%).

Third, the inpainting and denoising process using variational techniques achieved a significant increase in the completeness of the depth map coming from the Kinect 1.0, filling 32.28% of the missing data of the segmented region of the car, smoothing the data while preserving discontinuities. Moreover, the accuracy of the enhanced depth maps obtained from synthetic depth maps with 23.65% and 40.02% of missing data is higher than the one obtained with the outstanding algorithm for depth recovery [28].

For future work, we will use shape priors and depth integration with variational techniques for enhancing the quality of depth maps built with a monocular camera, and we will fuse them into a volumetric structure to obtain an improved dense reconstruction.

Acknowledgment

The authors would like to thank Fundación CEIBA for the financial support that has made the development of this work possible.

References

1. Bao, S. Y., Chandraker, M., Lin, Y., & Savarese, S. (2013). Dense object reconstruction with semantic priors. Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '13, IEEE Computer Society, Washington, DC, USA, pp. 1264-1271.

2. Chambolle, A. & Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, Vol. 1, No. 40, pp. 120-145.

3. Chaouch, M. & Verroust-Blondet, A. (2008). A novel method for alignment of 3D models. 2008 IEEE International Conference on Shape Modeling and Applications, pp. 187-195.

4. Chen, C., Jafari, R., & Kehtarnavaz, N. (2015). Improving human action recognition using fusion of depth camera and inertial sensors. IEEE Transactions on Human-Machine Systems, Vol. 45, No. 1, pp. 51-61.

5. Concha, A., Hussain, W., Montano, L., & Civera, J. (2015). Incorporating scene priors to dense monocular mapping. Autonomous Robots, Vol. 39, No. 3, pp. 279-292.

6. Dame, A., Prisacariu, V. A., Ren, C. Y., & Reid, I. (2013). Dense reconstruction using 3D object shape priors. 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1288-1295.

7. Drozdov, G., Shapiro, Y., & Gilboa, G. (2016). Robust recovery of heavily degraded depth measurements. 2016 Fourth International Conference on 3D Vision (3DV), pp. 56-65.

8. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, pp. 1627-1645.

9. Feng, J., Price, B., Cohen, S., & Chang, S. F. (2016). Interactive segmentation on RGBD images via cue selection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 156-164.

10. Ferstl, D., Reinbacher, C., Ranftl, R., Ruether, M., & Bischof, H. (2013). Image guided depth upsampling using anisotropic total generalized variation. 2013 IEEE International Conference on Computer Vision, pp. 993-1000.

11. Fu, H., Xu, D., & Lin, S. (2017). Object-based multiple foreground segmentation in RGBD video. IEEE Transactions on Image Processing, Vol. 26, No. 3, pp. 1418-1427.

12. Hornácek, M., Rhemann, C., Gelautz, M., & Rother, C. (2013). Depth super resolution by rigid body self-similarity in 3D. 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1123-1130.

13. Jung, S. W. (2013). Enhancement of image and depth map using adaptive joint trilateral filter. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 23, No. 2, pp. 258-269.

14. Kerl, C., Sturm, J., & Cremers, D. (2013). Robust odometry estimation for RGB-D cameras. 2013 IEEE International Conference on Robotics and Automation, pp. 3748-3754.

15. Khayam, S. A. (2003). The Discrete Cosine Transform (DCT): Theory and Application. Department of Electrical & Computing Engineering.

16. Klein, G. & Murray, D. W. (2007). Parallel tracking and mapping for small AR workspaces. Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality.

17. Kluckner, S., Pock, T., & Bischof, H. (2010). Exploiting redundancy for aerial image fusion using convex optimization. In Goesele, M., Roth, S., Kuijper, A., Schiele, B., & Schindler, K., editors, Pattern Recognition: 32nd DAGM Symposium. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 303-312.

18. Krenzin, J. & Hellwich, O. (2016). Reduction of point cloud artifacts using shape priors estimated with the Gaussian process latent variable model. In Rosenhahn, B. & Andres, B., editors, Pattern Recognition: 38th German Conference, GCPR 2016, Hannover, Germany, September 12-15, 2016, Proceedings. Springer International Publishing, Cham, pp. 273-284.

19. Kwon, H., Tai, Y.-W., & Lin, S. (2015). Data-driven depth map refinement via multi-scale sparse representation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 159-167.

20. Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J. Mach. Learn. Res., Vol. 6, pp. 1783-1816.

21. Lin, D., Fidler, S., & Urtasun, R. (2013). Holistic scene understanding for 3D object detection with RGBD cameras. 2013 IEEE International Conference on Computer Vision, pp. 1417-1424.

22. Liu, M. Y., Tuzel, O., & Taguchi, Y. (2013). Joint geodesic upsampling of depth images. 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 169-176.

23. Liu, S., Wang, Y., Wang, H., & Pan, C. (2013). Kinect depth inpainting via graph Laplacian with TV21 regularization. 2013 2nd IAPR Asian Conference on Pattern Recognition, pp. 251-255.

24. Moller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, Vol. 6, No. 4, pp. 525-533.

25. Newcombe, R. A., Lovegrove, S. J., & Davison, A. J. (2012). DTAM: Dense tracking and mapping in real-time.

26. Ni, B., Moulin, P., Yang, X., & Yan, S. (2015). Motion part regularization: Improving action recognition via trajectory group selection. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3698-3706.

27. Park, J., Kim, H., Tai, Y.-W., Brown, M. S., & Kweon, I. (2011). High quality depth map upsampling for 3D-ToF cameras. 2011 International Conference on Computer Vision, pp. 1623-1630.

28. Pertuz, S. & Kamarainen, J. (2017). Region-based depth recovery for highly sparse depth maps.

29. Piniés, P., Paz, L. M., & Newman, P. (2015). Dense mono reconstruction: Living with the pain of the plain plane. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA.

30. Ren, C. Y. & Reid, I. (2012). A unified energy minimization framework for model fitting in depth. Proceedings of the 12th International Conference on Computer Vision - Volume 2, ECCV'12, Springer-Verlag, Berlin, Heidelberg, pp. 72-82.

31. Rohr, K., Stiehl, H. S., Sprengel, R., Beil, W., Buzug, T. M., Weese, J., & Kuhn, M. H. (1996). Point-based elastic registration of medical image data using approximating thin-plate splines. In Höhne, K. H. & Kikinis, R., editors, Visualization in Biomedical Computing: 4th International Conference, VBC'96, Hamburg, Germany, September 22-25, 1996, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 297-306.

32. Teng, Y., Zhang, Y., Chen, Y., & Ti, C. (2016). Adaptive morphological filtering method for structural fusion restoration of hyperspectral images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 9, No. 2, pp. 655-667.

33. Ti, C., Xu, G., Guan, Y., & Teng, Y. (2017). Depth recovery for Kinect sensor using contour-guided adaptive morphology filter. IEEE Sensors Journal, Vol. 17, No. 14, pp. 4534-4543.

34. Tosic, I. & Drewes, S. (2014). Learning joint intensity-depth sparse representations. IEEE Transactions on Image Processing, Vol. 23, No. 5, pp. 2122-2132.

35. Whelan, T., Kaess, M., Johannsson, H., Fallon, M., Leonard, J. J., & Mcdonald, J. (2015). Real-time large-scale dense RGB-D SLAM with volumetric fusion. Int. J. Rob. Res., Vol. 34, No. 4-5, pp. 598-626.

36. Zach, C., Pock, T., & Bischof, H. (2007). A globally optimal algorithm for robust TV-L1 range image integration. IEEE 11th International Conference on Computer Vision, pp. 1-8.

37. Zhu, H., Weibel, J. B., & Lu, S. (2016). Discriminative multi-modal feature fusion for RGBD indoor scene recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2969-2976.

Received: September 02, 2018; Accepted: November 11, 2019

* Corresponding author is Andres Diaz. andres.a.diaz@correounivalle.edu.co

This is an open-access article distributed under the terms of the Creative Commons Attribution License.