Instead of relying on active depth sensors, we develop a professional end-to-end volumetric video production pipeline that achieves high-accuracy human body reconstruction using only passive cameras. While current volumetric video approaches estimate depth maps using traditional stereo matching techniques, we introduce and optimize a deep learning-based multi-view stereo network, Vis-MVSNet, for depth map estimation in the context of professional volumetric video reconstruction. To fine-tune the network and adapt it to this context, we create a 3D human body dataset captured in our volumetric studio, called the HBR dataset.
Furthermore, we propose a novel depth map post-processing approach, comprising filtering and fusion, that takes into account photometric confidence, cross-view geometric consistency, foreground masks, and camera viewing frustums. We show that our method can generate high levels of geometric detail for reconstructed human bodies.
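As an illustration of the filtering stage, the following minimal sketch combines three of the cues named above (photometric confidence, cross-view geometric consistency, and foreground masks) into a per-pixel validity test. It is not the pipeline's actual implementation: the function name `filter_depth_map`, the threshold values, and the assumption that the number of geometrically consistent views per pixel has already been computed are all hypothetical, and the viewing-frustum test and the fusion step are omitted.

```python
import numpy as np


def filter_depth_map(depth, confidence, foreground, n_consistent,
                     min_confidence=0.5, min_views=2):
    """Zero out depth values that fail any of the per-pixel checks.

    depth        : (H, W) float array, estimated depth (0 means no estimate)
    confidence   : (H, W) float array, photometric confidence in [0, 1]
    foreground   : (H, W) array, nonzero where the pixel belongs to the subject
    n_consistent : (H, W) int array, number of other views whose depth agrees
                   after reprojection (assumed to be precomputed)
    min_confidence, min_views : illustrative thresholds, not values from the paper
    """
    keep = (
        (depth > 0)
        & (confidence >= min_confidence)
        & foreground.astype(bool)
        & (n_consistent >= min_views)
    )
    # Rejected pixels are set to 0 so later stages can skip them.
    return np.where(keep, depth, 0.0), keep


# Toy 2x2 example: only the top-left pixel passes every check.
depth = np.array([[1.0, 2.0], [3.0, 4.0]])
confidence = np.array([[0.9, 0.2], [0.8, 0.7]])
foreground = np.array([[1, 1], [0, 1]])
n_consistent = np.array([[3, 3], [3, 1]])

filtered, keep = filter_depth_map(depth, confidence, foreground, n_consistent)
print(filtered)
```

In the toy example each rejected pixel fails a different test (low confidence, background, too few consistent views), which shows how the cues act as independent vetoes before any fusion takes place.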