Semantic Metadata Generation

With the volume of digital image and video data in archives exploding, and new content constantly being produced and digitized, keeping that content easily accessible is becoming increasingly difficult. Screening and annotating all content manually is highly uneconomical. Moreover, current image and video search engines rely predominantly on metadata for retrieval (such as manually assigned tags, the file name, the file description or, in the case of a web page, the text surrounding an image or video), so text descriptions that misrepresent the visual content lead to inaccurate results. Alternative approaches that rely on visual content understanding for the automatic annotation and retrieval of images and videos are therefore needed to address such inconsistencies.

Automatic analysis and metadata generation for images and videos proceeds in the following steps (cf. Fig. 1): feature extraction, training and classification. For each category (e.g. “beach”) to be learned by the semantic annotation system, a training set of images or video frames is required that contains positive and negative samples of the category.
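
As an illustration, here is a minimal Python sketch of such a per-category training set; the directory layout and the helper name are assumptions made for this example only:

    from pathlib import Path

    def load_training_set(category_dir):
        """Collect (image path, label) pairs: 1 for positives, 0 for negatives."""
        positives = [(p, 1) for p in Path(category_dir, "positive").glob("*.jpg")]
        negatives = [(p, 0) for p in Path(category_dir, "negative").glob("*.jpg")]
        return positives + negatives

    # e.g. samples = load_training_set("training/beach")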

In the first step, local features are extracted from the images or video frames of the training set. These local features can be one of, or a combination of, the variants of the SIFT descriptor (such as OpponentSIFT, HSV-SIFT and RGB-SIFT), and they are extracted on a dense grid for each image or video frame. The result of the feature extraction process is a histogram of gradient directions, aligned to the dominant gradient direction, for each sampled point. All histograms computed from the training set are then quantized with a clustering algorithm into a predefined number of clusters, typically 4000; these clusters constitute the visual words of the codebook. Finally, a 4000-dimensional histogram is computed for each image or video frame by assigning the gradient histogram of each point on the dense grid to its nearest visual word in the codebook, i.e. by nearest-neighbor assignment.
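
The following is a hedged sketch of this stage using OpenCV's SIFT implementation and scikit-learn's MiniBatchKMeans; the grid step, the keypoint size, the choice of k-means as the clustering algorithm and the helper names are assumptions for illustration, not details of the original system:

    import cv2
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    GRID_STEP = 8         # dense sampling interval in pixels (assumed)
    CODEBOOK_SIZE = 4000  # number of visual words, as in the text

    def dense_sift(gray_image):
        """Compute SIFT descriptors on a regular grid over a grayscale image."""
        h, w = gray_image.shape
        keypoints = [cv2.KeyPoint(float(x), float(y), float(GRID_STEP))
                     for y in range(0, h, GRID_STEP)
                     for x in range(0, w, GRID_STEP)]
        sift = cv2.SIFT_create()
        _, descriptors = sift.compute(gray_image, keypoints)
        return descriptors  # one 128-bin gradient histogram per grid point

    def build_codebook(descriptor_sets):
        """Cluster all training descriptors into CODEBOOK_SIZE visual words."""
        all_descriptors = np.vstack(descriptor_sets)
        return MiniBatchKMeans(n_clusters=CODEBOOK_SIZE,
                               random_state=0).fit(all_descriptors)

    def bow_histogram(descriptors, codebook):
        """Assign each descriptor to its nearest visual word and count occurrences."""
        words = codebook.predict(descriptors)  # nearest-neighbor assignment
        hist = np.bincount(words, minlength=CODEBOOK_SIZE).astype(np.float64)
        return hist / max(hist.sum(), 1.0)     # L1-normalized 4000-dim histogram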

In the next step, the 4000-dimensional histograms computed from the training set are used to train a support vector machine (SVM) classifier per category. The trained SVMs can then classify the images or video frames of a test set, based on their features, as belonging to a learned category or not. The complete workflow of the semantic annotation system is shown in Fig. 1.
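
A corresponding sketch of the training and classification stage, again with scikit-learn, is given below; the chi-squared kernel is a common choice for bag-of-words histograms but is an assumption here, as the text does not name a kernel:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import chi2_kernel

    def train_category_svm(histograms, labels):
        """Train a binary SVM for one category.

        histograms: (n_samples, 4000) array of bag-of-words histograms
        labels:     1 for positive samples of the category, 0 for negatives
        """
        svm = SVC(kernel=chi2_kernel)  # callable kernel over histogram pairs
        svm.fit(np.asarray(histograms), np.asarray(labels))
        return svm

    def annotate(svm, histogram):
        """Classify one test image/frame histogram: category member or not."""
        return bool(svm.predict(histogram.reshape(1, -1))[0])

Combined with the helpers sketched above, one SVM per category would be trained on the histograms of that category's positive and negative training samples and then applied to the histograms of unseen images or frames.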

This technology can be useful for:

  • Augmenting broadcaster production and archiving workflows in a semi-automatic process, by aiding archivists/documentalists in the generation of video metadata
  • Automatic image/video annotation for efficient and reliable archiving
  • Semantic image/video search and retrieval

Publications

[1] Eugene Mbanya, Sebastian Gerke, and Patrick Ndjiki-Nya:
Spatial Codebooks for Image Categorization,
Proceedings of the 2011 ACM International Conference on Multimedia Retrieval (ICMR 2011), Trento, Italy, April 17-20, 2011.

[2] Eugene Mbanya, Sebastian Gerke, Christian Hentschel, and Patrick Ndjiki-Nya:
Sample Selection, Category Specific Features and Reasoning,
Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (ImageCLEF - Photo Annotation 2011), Amsterdam, Netherlands, September 2011.

[3] Jan Nandzik, Berenike Litz, Aenne Löhden, Andreas Heß, Iuliu Konya, Doris Baum, André Bergholz, Dirk Schönfuß, Christian Fey, Johannes Osterhoff, Jörg Waitelonis, Harald Sack, Ralf Köhler, and Patrick Ndjiki-Nya:
CONTENTUS – Technologies for Next Generation Multimedia Libraries, AIEMPro ’10 Special Issue, Springer Multimedia Tools and Applications, 2011.

[4] T. Becker, C. Burghart, K. Nazemi, Patrick Ndjiki-Nya, T. Riegel, and Ralf Schäfer:
Basistechnologien für das Internet der Dienste,
in: Internet der Dienste, Lutz Heuser and Wolfgang Wahlster (Eds.), acatech diskutiert, Springer Verlag, ISBN 978-3-642-21506-3, 2011.

[6] Eugene Mbanya, Christian Hentschel, Sebastian Gerke, Mohan Liu, Andreas Nürnberger, and Patrick Ndjiki-Nya:
Augmenting Bag-of-Words - Category Specific Features and Concept Reasoning,
Conference on Multilingual and Multimodal Information Access Evaluation (ImageCLEF - Photo Annotation ’10), Padua, Italy, September 2010.

[7] Jan Nandzik, Berenike Litz, Aenne Löhden, Andreas Heß, Iuliu Konya, Doris Baum, André Bergholz, Dirk Schönfuß, Christian Fey, Johannes Osterhoff, Jörg Waitelonis, Harald Sack, Ralf Köhler, and Patrick Ndjiki-Nya:
CONTENTUS – Technologies for Next Generation Multimedia Libraries,
International Workshop on Automated Information Extraction in Media Production (AIEMPro '10), Florence, Italy, October 2010.

Invited Talks

[1] Ralf Schäfer:
Theseus Project - Semantic and Image Processing Technologies and its Applications,
Theseus/ImageCLEF Workshop on Visual Information Retrieval Evaluation, Corfu, Greece, September 29, 2009, Invited Keynote.

[2] Ralf Schäfer:
Semantic Annotation of Images and Videos,
DGA Workshop on Multimedia Information Processing, ENSTA, Paris, France, July 1-7, 2011.