\(\textbf{Pano3D}\)

A Holistic Benchmark and a Solid Baseline for 360^o Depth Estimation

Georgios Albanis¹ Nikolaos Zioulis^{1, 2} Petros Drakoulis¹ Vasileios Gkitsas¹ Vladimiros Sterzentsenko¹ Federico Alvarez² Dimitrios Zarpalas¹ Petros Daras¹

¹ Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece ² Universidad Politécnica de Madrid (UPM), Madrid, Spain

OmniCV @ CVPR 2021

Paper
Code
Data
Web Demo

\(\textbf{Pano3D}\) is a new benchmark for depth estimation from spherical panoramas. Its goal is to drive progress for this task in a consistent and holistic manner. To achieve that we generate a new dataset and integrate evaluation metrics that capture not only depth performance, but also secondary traits like boundary preservation and smoothness. Moreover, \(\textbf{Pano3D}\) takes a step beyond typical intra-dataset evaluation schemes to inter-dataset performance assessment. By disentangling generalization to three different axes, \(\textbf{Pano3D}\) facilitates proper extrapolation assessment under different out-of-training data conditions. Relying on the \(\textbf{Pano3D}\) holistic benchmark for 360 depth estimation we perform an extended analysis and derive a solid baseline for the task.

Depth estimation performance evaluation

For evaluating depth from spherical panoramas we show that without proper weighting the metrics favour performance in distorted areas. Apart from the direct depth performance metrics, the \(\textbf{Pano3D}\) benchmark also includes implementations for metrics measuring depth boundary preservation and depth smoothness.. Finally, it also includes aggregated 3D metrics that consolidate boundary and smoothness errors in different ways that are more appropriate for different downstream tasks (e.g. view synthesis or 3D reconstruction).

Generalization capacity assessment

Most benchmarks focus on intra-dataset performance assessment, using a dataset’s train and test splits. Even though careful selection of the test samples can guarantee the quality of the evaluation, the dataset generation process may sometimes be biased due to inherent data collection reasons (e.g. same camera types, restricted availability of capture targets). To overcome such issues and take a step towards measuring progress in in-the-wild settings, we decompose generalization into three different axes: i) target depth distribution, ii) scene context, and iii) varying camera domain.

Data Generation

Using Matterport3D (M3D) [1] for training and the \(\textbf{Pano3D}\) GibsonV2 (GV2) [2] splits for testing, the benchmark delivers a zero-shot cross-dataset transfer evaluation that can be applied to different generalization settings on-demand. In addition, \(\textbf{Pano3D}\) offers renders in two resolutions (\(1024 \times 512\) and \(512 \times 256\)). We further release a big part of GibsonV2 that has not been used in the \(\textbf{Pano3D}\) testing splits, and can be used as additional training data.

GV2 Tiny split

GV2 Medium split

GV2 Full+ split

GV2 Filmic

M3D

Download

To download the \(\textbf{Pano3D}\) dataset we follow a two-step process:

Access to the \(\textbf{Pano3D}\) dataset requires agreement with the terms and conditions for each of the 3D datasets that were used to create (i.e. render) it, and more specifically, Matterport3D and GibsonV2. Therefore, in order to grant you access to this dataset, we need you to first fill request form.
Then, you need to perform a request for access to the respective Zenodo repositories, where the data are hosted (more information can be found in our download page). Due to data-size limitations, the dataset is split into six (6) repositories, which respectively contain the color image, depth and normal map renders for each image. The repositories are split into the two resolutions, with each subgroup of 3 repositories containing the entire Matterport3D dataset renders, the entire GibsonV2 test split renders, and the remainder of GibsonV2 which is used as additional training data. Therefore, a separate request for access needs to be made to each repository in order to download the corresponding data.

Note that only completing one step of the two (i.e. only filling out the form, or only requesting access from the Zenodo repositories) will not be enough to get access to the data. We will do our best to contact you in such cases and notify you to complete all steps as needed, but our mails may be lost (e.g. spam filters/folders). The only exception to this, is if you have already filled in the form and need access to another Zenodo repository (for example you need extra dataset/splits which are hosted on different Zenodo repositories), then you only need to fill in the Zenodo request but please, make sure to mention that the form has already been filled in so that we can verify it.
Each volume is a multi-part archive and is broken down in several .7z files (2GB or 4GB each) for more convenient downloading on low bandwidth connections. You need all the .7z archives of each volume in order to extract the containing files.

Searching for a solid baseline

Architecture

The \(\textbf{Pano3D}\) baseline search relies on single-pass autoencoder architectures supervised by a weighted combination of different loss functions, each focusing on a specific depth map trait. For the autoencoders we use a simple convolution decoder and focus our search on the encoder part, and specifically:

A standard ResNet-152 encoder [3] with \(110M\) parameters

A standard DenseNet-161 encoder [4] with \(55M\) parameters

A neural architecture search encoder, PNAS [5] with \(99M\) parameters

In addition, our search also considers architectures with encoder-decoder skip connections:

A traditional UNet [6] with \(27M\) parameters

A customized ResNet-152 autoencoder with UNet-like skip connections starting from the first residual block and \(112M\) parameters

Losses

For the baseline search, we build upon prior literature regarding depth regression losses [14] and consider standard supporting losses, as well as a recently presented globalized loss:

A \(L_1\) depth error, which is the better performing direct objective [7], supported by:
- A multi-scale gradient maching term [8], that aims at preserving boundaries (\(L_{grad}\))
- A surface orientation error term [9], that aims at minimizing the cosine distance between normals (\(L_{cosine}\))
A combined objective (\(L_{comb})\) for direct depth performance, boundary preservation and surface smoothness,
The above combined error is further supported by a global virtual normal loss (VNL) (\(L_{vnl}\)) [10]

Metrics

Direct [11, 12]	Boundary
\(\textit{(w)RMSE}\) \(\textit{(w)RMSLE}\) \(\textit{AbsRel}\) \(\textit{SqRel}\) \((w)\delta_{\{1.05, 1.0, 1.25, 1.25^2, 1.25^3\}}\)	\(\textit{prec}_{\{0.25, 0.5, 1.0\}}\) [13] \(\textit{rec}_{\{0.25, 0.5, 1.0\}}\) [13] \(\textit{dbe}^{\{acc, comp\}}\) [14]
Geometrical	Smoothness [9]
\(\textit{c2c}\) [15] \(\textit{m2m}\) [16]	\(\textit{RMSE}^o\) \(\alpha_{\{11.25^o, 22.50^o, 30.00^o\}}\)

Indicators

To holistically assess the different models, we employ performance indicators that aggregate metrics across the different dimensions:

\(\Large i_d = \frac{1}{(1 - \delta_{1.25}) \times RMSE}\),
\(\Large i_b = \frac{1}{(1 - \frac{(F_{0.25} + F_{0.5} + F_{1.0})}{3}) \times dbe^{acc}}\), where \(F_{t}=2\frac{prec_{t} \times rec_{t}}{prec_{t} + rec_{t}}\),
\(\Large i_s = \frac{1}{(1 - \frac{(\alpha_{11.25^o} + \alpha_{22.5^o} + \alpha_{30^o})}{3}) \times RMSE^o}\),

For the direct depth performance, it is observed that generally the combined loss leads to improved performance (with a notable exception being the ResNet_skip model). Interestingly the VNL only improves the performance of the UNet model, which can be attributed to the combined effect of a smaller receptive field than the other models (inferior global context capture), and of the localised prediction effect that the skip connections introduce.

As expected, the smoothness objective hurts boundary preservation while the gradient matching term boosts, with UNet also benefiting from the VNL loss in terms of preserving boundaries.

Opposite from the boundary, the gradient matching term hurts smoothness when used in isolation, but the smoothness term helps balance its effect, and even benefits from their combined use. Again, UNet is the only architecture to consistently benefit from VNL across all indicators

Best Models per Architecture

From the above analysis, we define the best performing models of each architecture, with the only conflicting choice being the ResNet_skip selected model where a balanced performer was chosen. Qualitative results follow with the images on the left allowing for a transition between the input color image and the normal maps from the predicted depth, accompanied by Poisson 3D reconstruction [17] of the estimated depth maps.

\(\color{#E3D10A}{ResNet}\)	\(\color{#800080}{DenseNet}\)

\(\color{#00FFFF}{PNAS}\)	\(\color{#FF00FF}{ResNet_{skip}}\)

\(\color{#FFA600}{UNet}\)

Studying generalization

Three types of distribution shift

Going beyond zero-shot cross-dataset transfer [23] and generalized distribution shift, we identify three types of distribution shift [24,25,26] covariate, prior and concept, and design an experimental framework to study the impact of each type in isolation or combined. This way, we can investigate the apparent stackability of various shifts, measure the effectivenes of common mitigation techniques, such as pretraining and photometric augmentation, and unfold their performance envelope in a granular and precisely controlled way gushing valuable insight.

Targeted-shift dataset creation

Exploiting data selection and rendering synthesis, we crafted five test-sets that, in conjunction with the original M3D train-set, exhibit targeted distribution shift along one or more of the aforementioned axes.

Covariate shift concerns the input domain. In our case, this corresponds to the color of the input images. By rendering the original M3D test-set with the Blender™'s filmic-view transform, we create a split that is identical to the original but differs only in the pixel-color statistics, transpiring a cinematic sense.

Prior shift concerns the output domain. In our case, this corresponds to the depth distribution of the scenes involved and as a consequence, the respective predictions of the models to be output. After analyzing the GV2 dataset, we identified a data-split that exhibits significantly different depth distribution compared to the original M3D train-set. By rendering it with the standard color transfer function, the same we used to render the original M3D train-set, we create a split that is kindred to that, apart from their depth statistics.

Concept shift refers to the relationship between the input and output distributions. In our case, this translates to the type of the buildings presented in our data. While the original M3D train-set is comprised of residential buildings, we identified a GV2 data-split that despite exhibiting similar depth-statistics, it includes industrial and other non-residential premises (i.e. garages, commercial stores and under-construction sites) as well. By rendering this split in the usual manner and using it to test a model trained on the original M3D train-set, we can closely emulate a concept drift scenario.

Lastly, by manipulating rendering and data-splitting the way we showcased, we created two more test-sets exhibiting combined shift of either covariate & prior or covariate & concept, enabling the study of their theorized stackability.

All the aforementioned data-spits can be found in our download page.

Benchmarking

Using our composite performance indicators, we put to the test all five data-splits, evaluating the baseline UNet model. Initially, we evaluate the performance drop each distribution shift bears, in regards to all three quality indicators analyzed. Then, we apply two commonly exercised methods for addressing distribution-shift, pretraining and photometric augmentation, and decompose their apparent beneficiary behaviour in an attempt to deeply understand and reason the underlying workings of their success.

Performance across distribution shifts

Distribution shift of any kind degrades performance.
Concept shift is the most challenging to mitigate.
Shifts are, evidently, stackable.

Photometric augmentations

Significant performance boost when covariate shift manifests.
Boost is higher when covariate shift is dealt in isolation but lower when it is mixed with others.

Pretrained initialization

We employ the PNAS model pretrained on ImageNet.
Pretraining boosts direct performance on the original test-set.
It mitigates covariate shift in regard to all quality traits, but only in isolation, as combinations proved too hard.
Evidently, the model only benefits from higher quality parameter initialization rather than gaining generalization capacity.

Refining Depth Estimates

Taking into account the developments for depth refinement, \(\textbf{Pano3D}\) also includes an analysis of a recent work using displacement fields [18], which is properly adapted to the spherical domain, to periodic displacement fields.
We use a specialized guided stacked hourglass architecture as a refinement module that is trained using a pretrained depth model. Apart from the dual (guided) input encoder path, the guided stacked hourglass model exchanges information between the color and depth features using Adaptive Instance Normalization (AdaIn) [19].

The periodic displacement fields consistently improve the boundary preservation performance of all models apart from the UNet one, which nonetheless, already exhibits the best boundary preservation performance. Qualitative samples overlaying the detected boundaries for selected models are illustrated below:

Boundary preservation qualitative comparison, from left to right: \(\textbf{i)}\) \(\color{#00cc99}{GT depth}\), \(\textbf{ii)}\) \(\color{#FFA600}{ UNet}\), \(\textbf{iii)}\) \(\color{#00FFFF}{PNAS}\), \(\textbf{iv)}\) \(\color{#E3D10A}{ResNet}\), and \(\textbf{v)}\) \(\color{#FF00FF}{ResNet_{skip}}\).

Comparisons

Overall, the \(\textbf{Pano3D}\) baseline search shows that skip connections offer higher boundary preservation performance, naturally at the expense of smoothness, but their direct depth estimation performance does not suffer from this. The following comparison between the UNet and PNAS architecture (used in [20]) shows this different, with the advantage figure on the right (similar to [21]) illustrating the areas where each model performers better than the other

\(\color{#E3D10A}{ResNet}\) \(vs\) \(\color{#FF00FF}{ResNet_{skip}}\)

Qualitative comparison between \(\color{#E3D10A}{ResNet}\) \(and\) \(\color{#FF00FF}{ResNet_{skip}}\). It is apparent that the addition of skip connections allows \(\color{#FF00FF}{ResNet_{skip}}\) to capture finer-grained details.

\(\color{#FFA600}{UNet}\) \(vs\) \(\color{#00FFFF}{PNAS}\)

Qualitative comparison between \(\color{#FFA600}{UNet}\) and \(\color{#00FFFF}{PNAS}\). Apparently, \(\color{#00FFFF}{PNAS}\) provides smoother results while it is clear that \(\color{#FFA600}{UNet}\) is able to capture finer-grained details.

In-the-wild Results

The \(\textbf{Pano3D}\) baseline is a solid panorama depth estimation model that is positioned favourably against the state-of-the-art, with the following samples showing the BiFuse [22] predictions, compared to the UNet ones, when applied to in-the-wild panoramas acquired via both 360 cameras and stitched mobile phone captures.