Neural Face Modelling and Animation

We present a practical framework for the automatic creation of animatable human face models from calibrated multi-view data. Using deep neural networks, we combine classical computer graphics models with image-based animation techniques. From captured multi-view video footage, we learn a compact latent representation of facial expressions by training a variational auto-encoder on textured mesh sequences.



We capture face geometry with a simple linear model that represents rigid motion as well as large-scale deformation of the face. Fine details and the appearance of complex face areas (e.g. mouth, eyes) are captured mainly in texture space.
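The geometry model described above can be illustrated with a minimal sketch. It assumes a blendshape-style formulation in which a mean shape is deformed by a weighted sum of basis shapes and then posed with a rigid transform; the function names (`deform`, `rigid_transform`, `pose_face`) and the data layout are hypothetical and not taken from the published method.

```python
import math


def rotation_z(angle):
    """Return a 3x3 rotation matrix about the z axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def deform(mean_shape, basis, weights):
    """Large-scale deformation: vertex_i = mean_i + sum_k weights[k] * basis[k][i].

    mean_shape: list of [x, y, z] vertices
    basis:      list of K displacement fields, each shaped like mean_shape
    weights:    list of K deformation weights (assumed parameterisation)
    """
    return [
        [
            vertex[c] + sum(w * basis[k][i][c] for k, w in enumerate(weights))
            for c in range(3)
        ]
        for i, vertex in enumerate(mean_shape)
    ]


def rigid_transform(shape, rotation, translation):
    """Apply rotation (3x3) and translation (3-vector) to every vertex."""
    return [
        [
            sum(rotation[r][c] * v[c] for c in range(3)) + translation[r]
            for r in range(3)
        ]
        for v in shape
    ]


def pose_face(mean_shape, basis, weights, rotation, translation):
    """Deform the mean shape first, then apply the rigid head motion."""
    return rigid_transform(deform(mean_shape, basis, weights), rotation, translation)
```

Because the mesh topology stays constant, only the vertex positions change per frame, which is why a small set of weights plus a rigid pose is enough to describe the geometry in this sketch.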

Our facial performance capture process outputs textured mesh sequences with constant topology. From these sequences, we learn a latent representation of facial expressions with a variational auto-encoder (VAE). By training with an additional adversarial (GAN) loss, we force the texture decoder to produce highly detailed textures that are almost indistinguishable from the original ones. The VAE then serves as a neural face model that synthesises consistent face geometry and texture from a low-dimensional expression vector. Instead of training one neural model for the whole face, we train multiple local models for different areas, such as the eyes and mouth.
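As a rough illustration of the training objective, the sketch below shows the two standard VAE ingredients (the reparameterisation trick and the KL divergence to a unit Gaussian prior) and one possible way of combining them with reconstruction and adversarial terms. The source does not state the exact loss terms or their weighting, so `vae_loss`, `kl_weight` and `adversarial_weight` are assumptions made purely for illustration.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian posterior."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def vae_loss(geometry_error, texture_error, adversarial_error, mu, log_var,
             kl_weight=1.0, adversarial_weight=1.0):
    """Hypothetical combined objective (weights are illustrative assumptions).

    geometry_error / texture_error: reconstruction errors of the two decoders
    adversarial_error: generator-side GAN term for the texture decoder
    """
    return (
        geometry_error
        + texture_error
        + adversarial_weight * adversarial_error
        + kl_weight * kl_divergence(mu, log_var)
    )
```

In a sketch like this, the KL term keeps the expression vectors compact and well distributed, while the adversarial term is what pushes the decoded textures towards sharper detail than a plain reconstruction error would.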


Based on our neural face model, we develop novel animation approaches, for instance an example-based method for visual speech synthesis. Example-based animation approaches use short samples of captured motion (e.g. talking or changes of the facial expression) to create new facial performances by concatenating or looping them. Our method uses short sequences of speech (i.e. dynamic visemes) from a database to synthesise visual speech for arbitrary words that have not been captured before. Our neural face model improves both the memory efficiency and the visual quality of the generated facial animations. Instead of working with the original texture and mesh sequences, we can store motion samples as sequences of latent expression vectors. This reduces memory requirements by a large margin and eases the concatenation of dynamic visemes, as linear interpolation in latent space yields realistic and artefact-free transitions.
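The concatenation of dynamic visemes in latent space can be sketched as follows. This is a simplified illustration, assuming each viseme is stored as a list of latent expression vectors and that a fixed number of linearly interpolated frames bridges the gap between two samples; `concatenate_visemes` and `blend_frames` are hypothetical names, and the published method may blend differently.

```python
def lerp(a, b, t):
    """Linear interpolation between two latent vectors, t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def concatenate_visemes(first, second, blend_frames):
    """Join two latent sequences with an interpolated transition.

    first, second: lists of latent expression vectors (one per frame)
    blend_frames:  number of in-between frames to insert (assumed parameter)
    """
    if not first or not second:
        raise ValueError("both viseme sequences must contain at least one frame")
    start, end = first[-1], second[0]
    # Evenly spaced interpolation weights strictly between 0 and 1.
    transition = [
        lerp(start, end, (i + 1) / (blend_frames + 1))
        for i in range(blend_frames)
    ]
    return list(first) + transition + list(second)


# Usage: two short visemes with 2-dimensional latent vectors.
viseme_a = [[0.0, 0.0], [1.0, 0.0]]
viseme_b = [[1.0, 4.0], [2.0, 4.0]]
animation = concatenate_visemes(viseme_a, viseme_b, blend_frames=3)
```

Each frame of the resulting sequence would then be passed through the decoder to obtain geometry and texture, so only the low-dimensional vectors need to be stored in the database.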




W. Paier, A. Hilsmann, P. Eisert
Neural Face Models for Example-Based Visual Speech Synthesis,
Proc. of the 17th ACM SIGGRAPH Europ. Conf. on Visual Media Production (CVMP 2020), London, UK, Dec. 2020. Best Paper Award.

W. Paier, A. Hilsmann, P. Eisert
Interactive Facial Animation with Deep Neural Networks,
IET Computer Vision, Special Issue on Computer Vision for the Creative Industries, May 2020, doi: 10.1049/iet-cvi.2019.0790

Related Projects

This work is funded by the projects Content4All and INVICTUS.