# A Holistic Benchmark and a Solid Baseline for 360° Depth Estimation

1 Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece     2 Universidad Politécnica de Madrid (UPM), Madrid, Spain

$\textbf{Pano3D}$ is a new benchmark for depth estimation from spherical panoramas. Its goal is to drive progress for this task in a consistent and holistic manner. To achieve that, we generate a new dataset and integrate evaluation metrics that capture not only direct depth performance, but also secondary traits like boundary preservation and smoothness. Moreover, $\textbf{Pano3D}$ takes a step beyond typical intra-dataset evaluation schemes towards inter-dataset performance assessment. By disentangling generalization into three different axes, $\textbf{Pano3D}$ facilitates proper extrapolation assessment under different out-of-training-data conditions. Relying on the $\textbf{Pano3D}$ holistic benchmark for 360° depth estimation, we perform an extended analysis and derive a solid baseline for the task.

## Depth estimation performance evaluation

For evaluating depth from spherical panoramas, we show that without proper spherical weighting the metrics favour performance in the distorted areas. Apart from the direct depth performance metrics, the $\textbf{Pano3D}$ benchmark also includes implementations of metrics measuring depth boundary preservation and depth smoothness. Finally, it also includes aggregated 3D metrics that consolidate boundary and smoothness errors in different ways, each more appropriate for a different downstream task (e.g. view synthesis or 3D reconstruction).
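Such weighting amounts to scoring each equirectangular pixel by its actual area on the sphere. A minimal sketch follows; the names and structure are illustrative, not the benchmark's actual implementation:

```python
# A minimal sketch of latitude-based spherical weighting for equirectangular
# depth metrics; the function names here are illustrative assumptions.
import numpy as np

def spherical_weights(height: int, width: int) -> np.ndarray:
    # Latitude of each pixel-row center, from ~+pi/2 (top) to ~-pi/2 (bottom).
    latitude = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    # cos(latitude) is proportional to the solid angle a pixel covers, so the
    # heavily oversampled polar regions contribute less to the overall score.
    return np.repeat(np.cos(latitude)[:, None], width, axis=1)

def weighted_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    w = spherical_weights(*gt.shape)
    return float(np.sqrt((w * (pred - gt) ** 2).sum() / w.sum()))
```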

## Generalization capacity assessment

Most benchmarks focus on intra-dataset performance assessment, using a dataset’s train and test splits. Even though careful selection of the test samples can guarantee the quality of the evaluation, the dataset generation process may be biased due to inherent data collection constraints (e.g. the same camera types, restricted availability of capture targets). To overcome such issues and take a step towards measuring progress in in-the-wild settings, we decompose generalization into three different axes: i) target depth distribution, ii) scene context, and iii) varying camera domain.

## Data Generation

Using Matterport3D [1] for training and the $\textbf{Pano3D}$ GibsonV2 [2] splits for testing, the benchmark delivers a zero-shot cross-dataset transfer evaluation that can be applied to different generalization settings on demand. In addition, $\textbf{Pano3D}$ offers renders in two resolutions ($1024 \times 512$ and $512 \times 256$). We further release a large part of GibsonV2 that is not used in the $\textbf{Pano3D}$ test splits and can serve as additional training data.

The released splits comprise the GV2 Tiny, GV2 Medium, GV2 Full+, and GV2 Filmic splits, as well as M3D.

Downloading the $\textbf{Pano3D}$ dataset follows a two-step process:

1. Access to the $\textbf{Pano3D}$ dataset requires agreement with the terms and conditions of each of the 3D datasets that were used to create (i.e. render) it, namely Matterport3D and GibsonV2. Therefore, in order to grant you access to this dataset, we need you to first fill in the request form.
2. Then, you need to request access to the respective Zenodo repositories where the data are hosted (more information can be found in our download page). Due to data-size limitations, the dataset is split into six (6) repositories, each containing the color image, depth map, and normal map renders of its subset. The repositories are split between the two resolutions, with each subgroup of three (3) repositories containing the entire Matterport3D dataset renders, the entire GibsonV2 test split renders, and the remainder of GibsonV2, which can be used as additional training data. Therefore, a separate access request needs to be made to each repository in order to download the corresponding data.
Note that completing only one of the two steps (i.e. only filling out the form, or only requesting access from the Zenodo repositories) is not enough to get access to the data. We will do our best to contact you in such cases and notify you to complete all steps as needed, but our mails may be lost (e.g. in spam filters/folders). The only exception is if you have already filled in the form and need access to another Zenodo repository (for example, you need extra datasets/splits that are hosted on different Zenodo repositories); in that case, you only need to file the Zenodo request, but please make sure to mention that the form has already been filled in so that we can verify it.
Each volume is a multi-part archive broken down into several .7z files (2GB or 4GB each) for more convenient downloading on low-bandwidth connections. You need all the .7z archives of a volume in order to extract the files it contains.
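For reference, a minimal extraction sketch using the 7-Zip CLI follows; the volume name below is a placeholder, not an actual file name. Pointing 7z at the first part makes it consume the remaining parts automatically, provided they all reside in the same directory:

```python
# Extract a multi-part .7z volume by invoking the 7-Zip CLI on its first part.
import subprocess

first_part = "pano3d_volume.7z.001"  # placeholder name, substitute your own
subprocess.run(["7z", "x", first_part], check=True)
```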

## Searching for a solid baseline

### Architecture

The $\textbf{Pano3D}$ baseline search relies on single-pass autoencoder architectures supervised by a weighted combination of different loss functions, each focusing on a specific depth map trait. For the autoencoders we use a simple convolutional decoder and focus our search on the encoder part (a minimal sketch of this pattern follows the list below), and specifically:

• A standard ResNet-152 encoder [3] with $110M$ parameters
• A standard DenseNet-161 encoder [4] with $55M$ parameters
• A neural architecture search encoder, PNAS [5] with $99M$ parameters
In addition, our search also considers architectures with encoder-decoder skip connections:
• A traditional UNet [6] with $27M$ parameters
• A customized ResNet-152 autoencoder with UNet-like skip connections starting from the first residual block, totalling $112M$ parameters
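The following sketch illustrates the shared encoder/decoder pattern; the layer choices, channel widths, and class name are our assumptions for illustration, not the exact baseline configuration:

```python
# A minimal sketch (assumed configuration): torchvision ResNet-152 encoder,
# plain convolutional decoder, optional UNet-like skip from the first block.
import torch
import torch.nn as nn
import torchvision  # >= 0.13 for the `weights` argument

class ResNetDepth(nn.Module):
    def __init__(self, skip: bool = False):
        super().__init__()
        b = torchvision.models.resnet152(weights=None)
        self.stem = nn.Sequential(b.conv1, b.bn1, b.relu, b.maxpool)  # 1/4
        self.enc1 = b.layer1                                          # 1/4, 256ch
        self.deep = nn.Sequential(b.layer2, b.layer3, b.layer4)       # 1/32, 2048ch
        self.skip = skip

        def up(cin, cout):  # upsample + conv block of the simple decoder
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.ReLU(inplace=True))

        self.dec_deep = nn.Sequential(up(2048, 512), up(512, 256), up(256, 128))
        fuse_in = 128 + (256 if skip else 0)  # skip concatenates enc1 features
        self.dec_top = nn.Sequential(up(fuse_in, 64), up(64, 32),
                                     nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, x):  # x: [B, 3, H, W] equirectangular color image
        f1 = self.enc1(self.stem(x))
        f = self.dec_deep(self.deep(f1))
        if self.skip:
            f = torch.cat([f, f1], dim=1)  # UNet-like skip connection
        return self.dec_top(f)  # [B, 1, H, W] depth map
```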

### Losses

For the baseline search, we build upon prior literature regarding depth regression losses [14] and consider standard supporting losses, as well as a recently presented globalized loss (a sketch of the combined objective follows the list):
• An $L_1$ depth error, which is the best-performing direct objective [7], supported by:
• A multi-scale gradient matching term ($L_{grad}$) [8], which aims at preserving boundaries,
• A surface orientation error term ($L_{cosine}$) [9], which minimizes the cosine distance between normals,
• A combined objective ($L_{comb}$) for direct depth performance, boundary preservation, and surface smoothness,
• The above combined error, further supported by a global virtual normal loss (VNL, $L_{vnl}$) [10].
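A minimal sketch of the combined objective follows; the blending weights and the multi-scale construction are illustrative assumptions, and the virtual normal term is taken as a precomputed input since its point-sampling logic [10] exceeds a short sketch:

```python
# A minimal sketch of L_comb + L_vnl; the weights w_* are assumed values.
import torch
import torch.nn.functional as F

def gradient_loss(pred, gt, scales=4):
    # Multi-scale depth-gradient matching, aimed at boundary preservation.
    loss = 0.0
    for s in range(scales):
        p, g = pred[..., ::2**s, ::2**s], gt[..., ::2**s, ::2**s]
        d = p - g
        loss = loss + (d[..., :, 1:] - d[..., :, :-1]).abs().mean() \
                    + (d[..., 1:, :] - d[..., :-1, :]).abs().mean()
    return loss / scales

def cosine_loss(pred_normals, gt_normals):
    # Cosine distance between surface normals derived from the depth maps.
    return (1.0 - F.cosine_similarity(pred_normals, gt_normals, dim=1)).mean()

def combined_loss(pred, gt, pred_normals, gt_normals, vnl):
    w_grad, w_cos, w_vnl = 1.0, 0.5, 0.5  # assumed blending weights
    return (F.l1_loss(pred, gt)
            + w_grad * gradient_loss(pred, gt)
            + w_cos * cosine_loss(pred_normals, gt_normals)
            + w_vnl * vnl)  # vnl: precomputed virtual normal loss term
```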

### Metrics

| Category | Metrics |
| --- | --- |
| Direct [11, 12] | $\textit{(w)RMSE}$, $\textit{(w)RMSLE}$, $\textit{AbsRel}$, $\textit{SqRel}$, $(w)\delta_{\{1.05, 1.1, 1.25, 1.25^2, 1.25^3\}}$ |
| Boundary | $\textit{prec}_{\{0.25, 0.5, 1.0\}}$ [13], $\textit{rec}_{\{0.25, 0.5, 1.0\}}$ [13], $\textit{dbe}^{\{acc, comp\}}$ [14] |
| Geometrical | $\textit{c2c}$ [15], $\textit{m2m}$ [16] |
| Smoothness [9] | $\textit{RMSE}^o$, $\alpha_{\{11.25^o, 22.50^o, 30.00^o\}}$ |

### Indicators

To holistically assess the different models, we employ performance indicators that aggregate metrics across the different dimensions (a short computation sketch follows the list):
• $\Large i_d = \frac{1}{(1 - \delta_{1.25}) \times RMSE}$,
• $\Large i_b = \frac{1}{(1 - \frac{(F_{0.25} + F_{0.5} + F_{1.0})}{3}) \times dbe^{acc}}$, where $F_{t}=2\frac{prec_{t} \times rec_{t}}{prec_{t} + rec_{t}}$,
• $\Large i_s = \frac{1}{(1 - \frac{(\alpha_{11.25^o} + \alpha_{22.5^o} + \alpha_{30^o})}{3}) \times RMSE^o}$.
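These indicators reduce to a few lines of arithmetic over already-computed metric values; a minimal sketch, with argument names chosen here for illustration:

```python
# Indicators over precomputed metric values (argument names are illustrative).
def i_d(delta_125: float, rmse: float) -> float:
    return 1.0 / ((1.0 - delta_125) * rmse)

def i_b(prec: dict, rec: dict, dbe_acc: float) -> float:
    # prec/rec are keyed by the boundary threshold t in {0.25, 0.5, 1.0}.
    f = [2 * prec[t] * rec[t] / (prec[t] + rec[t]) for t in (0.25, 0.5, 1.0)]
    return 1.0 / ((1.0 - sum(f) / 3.0) * dbe_acc)

def i_s(alphas: tuple, rmse_deg: float) -> float:
    # alphas: the (alpha_11.25, alpha_22.5, alpha_30) accuracies in [0, 1].
    return 1.0 / ((1.0 - sum(alphas) / 3.0) * rmse_deg)
```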

### Best Models per Architecture

From the above analysis, we identify the best performing model of each architecture, with the only conflicting choice being the $ResNet_{skip}$ selection, where a balanced performer was chosen. Qualitative results follow, with the images on the left allowing a transition between the input color image and the normal maps derived from the predicted depth, accompanied by Poisson 3D reconstructions [17] of the estimated depth maps.

$\color{#E3D10A}{ResNet}$, $\color{#800080}{DenseNet}$, $\color{#00FFFF}{PNAS}$, $\color{#FF00FF}{ResNet_{skip}}$, $\color{#FFA600}{UNet}$

## Refining Depth Estimates

Taking into account recent developments in depth refinement, $\textbf{Pano3D}$ also includes an analysis of a recent work using displacement fields [18], adapted to the spherical domain via periodic displacement fields.
We use a specialized guided stacked hourglass architecture as a refinement module, trained on top of a pretrained depth model. Apart from the dual (guided) input encoder path, the guided stacked hourglass model exchanges information between the color and depth features using Adaptive Instance Normalization (AdaIN) [19].
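A minimal sketch of the AdaIN exchange follows (its placement within the hourglass is an assumption here): the depth features are whitened with their own instance statistics and re-styled with those of the aligned color features.

```python
# AdaIN feature exchange: re-normalize depth features with color statistics.
import torch

def adain(depth_feat: torch.Tensor, color_feat: torch.Tensor, eps: float = 1e-5):
    # Per-channel instance statistics over the spatial dimensions.
    d_mean = depth_feat.mean(dim=(2, 3), keepdim=True)
    d_std = depth_feat.std(dim=(2, 3), keepdim=True) + eps
    c_mean = color_feat.mean(dim=(2, 3), keepdim=True)
    c_std = color_feat.std(dim=(2, 3), keepdim=True) + eps
    # Whiten the depth features, then re-style them with the color statistics.
    return c_std * (depth_feat - d_mean) / d_std + c_mean
```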

The periodic displacement fields consistently improve the boundary preservation performance of all models apart from UNet, which nonetheless already exhibits the best boundary preservation performance. Qualitative samples overlaying the detected boundaries for selected models are illustrated below:

Boundary preservation qualitative comparison, from left to right: $\textbf{i)}$ $\color{#00cc99}{GT depth}$, $\textbf{ii)}$ $\color{#FFA600}{ UNet}$, $\textbf{iii)}$ $\color{#00FFFF}{PNAS}$, $\textbf{iv)}$ $\color{#E3D10A}{ResNet}$, and $\textbf{v)}$ $\color{#FF00FF}{ResNet_{skip}}$.

## Comparisons
Overall, the $\textbf{Pano3D}$ baseline search shows that skip connections offer higher boundary preservation performance, naturally at the expense of smoothness, while their direct depth estimation performance does not suffer from this. The following comparison between the UNet and PNAS architectures (the latter used in [20]) shows this difference, with the advantage figure on the right (similar to [21]) illustrating the areas where each model performs better than the other.

### $\color{#E3D10A}{ResNet}$ $vs$ $\color{#FF00FF}{ResNet_{skip}}$

Qualitative comparison between $\color{#E3D10A}{ResNet}$ and $\color{#FF00FF}{ResNet_{skip}}$. It is apparent that the addition of skip connections allows $\color{#FF00FF}{ResNet_{skip}}$ to capture finer-grained details.

### $\color{#FFA600}{UNet}$ $vs$ $\color{#00FFFF}{PNAS}$

Qualitative comparison between $\color{#FFA600}{UNet}$ and $\color{#00FFFF}{PNAS}$. $\color{#00FFFF}{PNAS}$ provides smoother results, while $\color{#FFA600}{UNet}$ captures finer-grained details.

## In-the-wild Results

The $\textbf{Pano3D}$ baseline is a solid panorama depth estimation model that is positioned favourably against the state of the art, with the following samples showing the BiFuse [22] predictions compared to the UNet ones, when applied to in-the-wild panoramas acquired via both 360° cameras and stitched mobile phone captures.

### $\color{#FFA600}{UNet}$ $vs$ BiFuse

Qualitative comparison between $\color{#FFA600}{UNet}$ (left column) and $\color{#fa8ef9}{BiFuse}$ [22] (right column).

## Acknowledgements
This project has received funding from the European Union’s Horizon 2020 research and innovation programme ATLANTIS under grant agreement No 951900.
## References