Introduction and Motivation
Relighting of portrait images is a common task with many applications, e.g. personal use, changing image media, applying correct illumination in AR environments, and avatar creation. Another potential use is correcting the lighting conditions for security checks, for example at border gates at the airport. Here, changing daylight, the positioning of the gates, and the random direction from which a subject approaches the camera lead to different illumination in each recorded image. The differing illumination conditions in the source and target images impair the detection rate. Our goal is the relighting of portrait images for security applications; we therefore focus on images recorded in moderately controlled environments, which greatly reduces interference from the outside world.
Because no large-scale ground-truth dataset of portrait images with different controlled illumination settings is available, we captured our own dataset to train our model. Our light stage setup consists of a circular grid of 70 lamps surrounding the recorded person. For each participant, we create a "one-light-on-at-a-time" image series in three different light colors. Each lighting condition is captured with a multi-view setup of seven cameras, spaced at approximately 15-degree angles to each other. Thus, in approximately 5 min, 220 images are recorded by each of the seven cameras, resulting in 1540 images per participant.
Inspired by recent deep image relighting approaches, we use a basic U-Net structure as the base network (see below). The target light is injected into the encoded part of the network by a second, small network. Here, the geometry and image information is separated from the shading and light information. This disentanglement follows the main goal of the relighting task: image details should stay the same while only the lighting changes.
For further disentanglement, we propose to assume a multiplicative image formation model. Instead of letting the network predict the target image directly, the task is to predict a multiplication matrix that transforms the input image into the output image by pixel-wise multiplication. This approach makes the model more robust and forces the network to concentrate on the differences created by the different lighting situations rather than on facial details.
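The pixel-wise multiplication described above can be illustrated with a minimal NumPy sketch. The function names `apply_mask` and `ideal_mask`, the value range [0, 1], and the small constant `eps` are assumptions made for this illustration and are not taken from our implementation.

```python
import numpy as np

def apply_mask(image, mask):
    """Relight an image by pixel-wise multiplication with a mask.

    image, mask: float arrays of identical shape, image values in [0, 1]
    (the value range is an assumption of this sketch).
    """
    return np.clip(image * mask, 0.0, 1.0)

def ideal_mask(source, target, eps=1e-4):
    """Mask that maps `source` to `target` under the multiplicative model.

    `eps` (hypothetical) avoids division by zero in very dark pixels.
    """
    return target / np.maximum(source, eps)

# Toy example: the target is the source under half the light intensity.
rng = np.random.default_rng(0)
source = rng.uniform(0.2, 0.8, size=(4, 4, 3))
target = 0.5 * source
mask = ideal_mask(source, target)
relit = apply_mask(source, mask)
```

In this toy case the mask is constant, which hints at why predicting the mask can be easier than predicting the full image: the mask carries only the lighting change and no facial detail.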
Furthermore, the use of a multiplicative image formation model allows the incorporation of additional constraints to guide the training. One example is the assumption that shading varies smoothly. Hence, we use a variation loss that enforces local smoothness of the multiplicative mask and reduces the introduction of noise, thereby reducing artefacts in the final image.
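One common way to express such a smoothness constraint is a total-variation-style penalty on neighbouring mask values. The following sketch assumes an anisotropic L1 formulation; the exact form and weighting of the variation loss used in training are not specified here, and the name `variation_loss` is chosen for illustration only.

```python
import numpy as np

def variation_loss(mask):
    """Mean absolute difference between vertically and horizontally
    neighbouring mask values (assumed anisotropic L1 variant).

    mask: float array of shape (H, W) or (H, W, C).
    """
    dy = np.abs(mask[1:, :] - mask[:-1, :])  # vertical neighbours
    dx = np.abs(mask[:, 1:] - mask[:, :-1])  # horizontal neighbours
    return float(dy.mean() + dx.mean())

# A smooth horizontal ramp is penalized less than the same ramp with noise.
rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0.5, 1.5, 32), (32, 1))
noisy = smooth + rng.normal(0.0, 0.2, size=smooth.shape)
```

A constant mask yields a loss of zero, so the term only penalizes local fluctuations and not the overall brightness change.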
One of the main advantages of our problem formulation is that the operation is invertible, which can be exploited during training by introducing a second, inverse pass, thereby implicitly increasing the number of available training pairs.
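A rough sketch of how a forward pass and an inverse pass could be combined into one training objective is given below. The callable `predict_mask(image, light)` is a hypothetical stand-in for the network, the L1 reconstruction error is an assumption, and the actual training procedure may differ.

```python
import numpy as np

def two_pass_loss(predict_mask, src, src_light, tgt, tgt_light):
    """L1 reconstruction error of a forward and an inverse relighting pass.

    predict_mask: hypothetical callable (image, light) -> mask with the
                  same shape as the image.
    """
    forward = src * predict_mask(src, tgt_light)   # source -> target light
    inverse = tgt * predict_mask(tgt, src_light)   # target -> source light
    return float(np.abs(forward - tgt).mean() + np.abs(inverse - src).mean())

# Toy setting: an image is albedo times a scalar light intensity, and an
# oracle "network" that knows the albedo returns the exact mask.
rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 0.8, size=(8, 8))
src, tgt = 0.5 * albedo, 1.0 * albedo

def oracle(image, light):
    return light * albedo / image

loss = two_pass_loss(oracle, src, 0.5, tgt, 1.0)
```

Each recorded pair is used in both directions, so a single pair contributes two supervised relighting examples.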
During recording, movements cannot be completely avoided due to the time passing between recordings. Hence, the training pairs are not perfectly aligned, which makes generalization difficult. We propose a motion-compensated loss inspired by classical block matching. This loss compares windows of the generated image with the ground truth within a search window and calculates the loss at the best-fitting position.
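The block-matching idea can be sketched as follows. The block size, the search radius, the L1 error, and the name `motion_compensated_l1` are assumptions for illustration; this plain NumPy version is also not differentiable and only demonstrates the matching logic, not a training-ready loss.

```python
import numpy as np

def motion_compensated_l1(pred, gt, block=8, search=2):
    """Average, over non-overlapping blocks of `pred`, of the smallest
    L1 error found among shifted positions in `gt`.

    block:  block size in pixels (assumed value).
    search: maximum shift in pixels in each direction (assumed value).
    """
    h, w = pred.shape[:2]
    losses = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = pred[y:y + block, x:x + block]
            best = np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    # Skip candidate positions outside the image.
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    err = np.abs(patch - gt[yy:yy + block, xx:xx + block]).mean()
                    best = min(best, err)
            losses.append(best)
    return float(np.mean(losses))

# A one-pixel horizontal shift is largely absorbed by the search window.
rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 1.0, size=(32, 32))
pred = np.roll(gt, 1, axis=1)
plain = float(np.abs(pred - gt).mean())
compensated = motion_compensated_l1(pred, gt)
```

Because the zero shift is always among the candidates, the compensated error of each block can never exceed its unshifted error.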
F. Schreiber, A. Hilsmann, P. Eisert, Model-Based Deep Portrait Relighting, 19th ACM SIGGRAPH European Conference on Visual Media Production, London, UK, Dec. 2022, doi: 10.1145/3565516.3565526