Sparse keypoint correspondences are needed for many Computer Vision applications like 3D reconstruction, image-based rendering, tracking, augmented reality or object detection. The task is particularly challenging, if the input data contains dynamic content. The most common approach to address this problem is to employ a keypoint detector and descriptor like SIFT. However, depending on factors like baseline, texturization, scene depth and accuracy of nearest neighbor calculation, the distribution and reliability of the results vary significantly (second illustration in the image below). One typical approach to improve results is to recover the underlying global model, e.g. the fundamental matrix, by applying a RANSAC method on the initial matching output and using it to remove outliers. However, this method can only capture global rigid motion and has high sensitivity to noise (third illustration in the image below).
Therefore, our approach tackles the problem of sparse tracking of dynamic image content by applying a local model. Basic descriptor matching is used to identify match candidates and a selection of initial correspondences. For the points associated this initialisation, a dynamic Delaunay triangulation introduces a natural neighborhood and affine transformations between the triangles. Based on affine similarities to its neighborhood in the current selection, each match (candidate) can be evaluated. An iterative optimization procedure maximizes the number of selected correspondences while retaining local affinity between the neighbors. For all changes on the selection, an extended version of the dynamic Delaunay triangulation enables fast and localized recalculations of the affinities. The optimization results in a decreased number of incorrect correspondences and a significant increase in the total number of matches (fourth illustration in the image above). While not being as fast and simple as outlier rejection using epipolar geometry, the runtime of our approach is still by far governed by the keypoint extraction and basic matching process. The only assumption we make is local smoothness of the scene content between the images.