Image Processing Multi-Frame Motion-Compensated Prediction

1. Long-Term Memory Motion-Compensated Prediction
Long-term memory motion-compensated prediction extends the spatial displacement vector used in block-based hybrid video coding by a variable time delay, permitting the use of more frames than just the previously decoded one for motion-compensated prediction. The long-term memory covers several seconds of decoded frames at the encoder and decoder. In most cases, the use of multiple frames for motion compensation provides significantly improved prediction gain. The variable time delay has to be transmitted as side information, which requires additional bit-rate that may become prohibitive when the long-term memory grows too large. Therefore, we control the bit-rate of the motion information by employing rate-constrained motion estimation. Simulation results are obtained by integrating long-term memory prediction into an H.263 codec. Reconstruction PSNR improvements of up to 1.5 dB for the Foreman sequence and 0.9 dB for the Mother-Daughter sequence are demonstrated in comparison to the TMN-10 H.263 coder. The PSNR improvements correspond to bit-rate savings of up to 23 % and 17 %, respectively. Mathematical inequalities are used to speed up motion estimation while achieving the full prediction gain.
Long-term memory prediction can also be beneficially applied to transmission over error-prone channels. We present a framework that incorporates an estimated error into rate-constrained motion estimation and mode decision. Experimental results with a Rayleigh fading channel show that long-term memory prediction significantly outperforms the single-frame prediction H.263-based anchor. When a feedback channel is available, the decoder can inform the encoder about successful or unsuccessful transmission events by sending positive (ACK) or negative (NACK) acknowledgments. This information is used to update the error estimates at the encoder. Similar concepts, such as the ACK and NACK modes known from the H.263 standard, are unified into a general framework providing superior transmission performance.
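The rate-constrained search over a long-term memory can be illustrated with a minimal sketch. It is not the H.263 integration used for the results reported here: the function names, the SAD distortion measure, and the Exp-Golomb bit-count model for the displacement and the time delay are all assumptions chosen for illustration. For each candidate (dx, dy, delay), the sketch minimizes the Lagrangian cost J = D + lambda * R.

```python
def exp_golomb_bits(v):
    """Length in bits of the unsigned Exp-Golomb code for v >= 0 (assumed rate model)."""
    return 2 * (v + 1).bit_length() - 1

def signed_bits(v):
    """Bits for a signed value, mapped 0, 1, -1, 2, -2, ... -> 0, 1, 2, 3, 4, ..."""
    return exp_golomb_bits(2 * v - 1 if v > 0 else -2 * v)

def block_sad(cur, ref, x, y, dx, dy, size):
    """Sum of absolute differences between the current block and a displaced reference block."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[y + j][x + i] - ref[y + dy + j][x + dx + i])
    return total

def search_long_term_memory(cur, memory, x, y, size, radius, lam):
    """Return (cost, dx, dy, delay) minimizing SAD + lam * rate.

    memory[0] is the most recently decoded frame; larger indices are older frames.
    """
    h, w = len(cur), len(cur[0])
    best = None
    for delay, ref in enumerate(memory):
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                # Skip candidates that would read outside the reference frame.
                if x + dx < 0 or y + dy < 0 or x + dx + size > w or y + dy + size > h:
                    continue
                rate = signed_bits(dx) + signed_bits(dy) + exp_golomb_bits(delay)
                cost = block_sad(cur, ref, x, y, dx, dy, size) + lam * rate
                if best is None or cost < best[0]:
                    best = (cost, dx, dy, delay)
    return best

# Toy data: the older frame contains the content of the current block, shifted
# by one pixel; the most recent frame is flat grey and predicts poorly.
old = [[x + 16 * y for x in range(16)] for y in range(16)]
recent = [[128] * 16 for _ in range(16)]
cur = [[old[y][min(x + 1, 15)] for x in range(16)] for y in range(16)]

unconstrained = search_long_term_memory(cur, [recent, old], 4, 4, 4, 2, lam=0)
constrained = search_long_term_memory(cur, [recent, old], 4, 4, 4, 2, lam=1000)
```

With lam=0 the search picks the perfect match in the older frame. With a very large multiplier the side-information rate dominates, and the cheapest vector into the most recent frame wins, which is the trade-off that rate-constrained motion estimation controls.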
The long-term memory prediction scheme has been proposed to ITU-T/SG16/Q15. Various submissions have been made to that group, which decided at its February 1999 meeting in Monterey, CA, USA, to adopt the feature. The name Enhanced Reference Picture Selection has been coined for the scheme, since an existing annex named Reference Picture Selection already uses several frames for increased error resilience. Enhanced Reference Picture Selection is planned to be included as an integral part of ITU-T Recommendation H.263 as Annex U. The latest version of draft Annex U is available for download.
2. Affine Multi-Frame Motion-Compensated Prediction
Multi-frame affine prediction extends motion compensation from the previous frame to several past decoded frames and warped versions thereof. Affine motion parameters describe the warping. In contrast to translational motion compensation, the affine motion parameters must be assigned to large image segments to obtain a rate-distortion efficient motion representation. These large image segments usually cannot be chosen so as to partition the image uniformly. Hence, encoding proceeds in four steps: (i) estimation of several affine motion parameter sets between the current and previous frames, (ii) generation of the multi-frame buffer consisting of past decoded frames and affine warped frames, (iii) multi-frame block-based hybrid video encoding, and (iv) determination of the efficient number of motion models using Lagrangian optimization techniques. A significant improvement in coding efficiency can be observed when comparing the multi-frame affine motion coder to the TMN-10 coder, the rate-distortion optimized test model of the H.263 standard. At a fixed quality of 34 dB PSNR, the proposed coder achieves a 24 % bit-rate reduction over a set of 8 different test sequences. The bit-rate savings within this set of test sequences vary from 35 % to 15 %, which correspond to PSNR gains of 3 dB and 0.8 dB, respectively. It is shown that both concepts, affine motion and long-term memory prediction, contribute to the overall gain.
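Step (ii), building the multi-frame buffer, can be sketched as follows. This is an illustration under stated assumptions, not the coder described here: the six-parameter form (a1..a6), the bilinear interpolation, the border clamping, and the names `warp_affine` and `build_reference_buffer` are all hypothetical choices.

```python
import math

def warp_affine(frame, params):
    """Warp a frame with the affine model sx = a1*x + a2*y + a3, sy = a4*x + a5*y + a6.

    (sx, sy) is the source position sampled for output pixel (x, y). Samples are
    bilinearly interpolated, and source coordinates are clamped to the frame.
    """
    a1, a2, a3, a4, a5, a6 = params
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = min(max(a1 * x + a2 * y + a3, 0.0), w - 1.0)
            sy = min(max(a4 * x + a5 * y + a6, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            fx, fy = sx - x0, sy - y0
            top = (1 - fx) * frame[y0][x0] + fx * frame[y0][x1]
            bottom = (1 - fx) * frame[y1][x0] + fx * frame[y1][x1]
            out[y][x] = (1 - fy) * top + fy * bottom
    return out

def build_reference_buffer(decoded_frames, affine_sets):
    """Past decoded frames followed by warped frames.

    affine_sets is a list of (frame_index, params) pairs; which decoded frame
    each parameter set applies to is an assumption of this sketch.
    """
    buffer = list(decoded_frames)
    for frame_index, params in affine_sets:
        buffer.append(warp_affine(decoded_frames[frame_index], params))
    return buffer

# Toy usage: one decoded frame plus a half-pixel horizontal shift of it.
frame = [[x + 10 * y for x in range(4)] for y in range(4)]
identity = warp_affine(frame, (1, 0, 0, 0, 1, 0))
half_pel = warp_affine(frame, (1, 0, 0.5, 0, 1, 0))
buffer = build_reference_buffer([frame], [(0, (1, 0, 0.5, 0, 1, 0))])
```

Block-based motion estimation in step (iii) can then treat every entry of `buffer` as an ordinary reference frame, so the affine model only has to be signaled once per warped frame rather than per block.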
3. Multi-Hypothesis Motion-Compensated Prediction
Multi-hypothesis prediction extends motion compensation with one prediction signal to the linear superposition of several motion-compensated prediction signals, which increases coding efficiency. The multiple hypotheses in this paper are blocks in past decoded frames. These blocks are referenced by individual motion vectors and picture reference parameters, incorporating long-term memory motion-compensated prediction. In this work, we employ at most two hypotheses, similar to B-frames; however, both are obtained from the past. Because of the increased rate for the motion vectors, rate-constrained coder control is used. On this basis, we demonstrate the efficiency of multi-hypothesis prediction in combination with variable block size and long-term memory, and present bit-rate savings of up to 32 %. It turns out that the use of multiple reference frames enhances the efficiency of multi-hypothesis prediction.
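The superposition of two hypotheses can be illustrated with a short sketch. Equal weights, the SSD error measure, and the names `fetch_block`, `superimpose`, and `predict` are assumptions made for illustration; the text only states that the prediction signals are combined linearly.

```python
def fetch_block(frame, x, y, size):
    """Cut a size x size block with top-left corner (x, y) out of a frame."""
    return [row[x:x + size] for row in frame[y:y + size]]

def superimpose(hypotheses, weights=None):
    """Linear superposition of equally sized blocks (equal weights by default)."""
    n = len(hypotheses)
    if weights is None:
        weights = [1.0 / n] * n
    rows, cols = len(hypotheses[0]), len(hypotheses[0][0])
    return [[sum(w * h[j][i] for w, h in zip(weights, hypotheses))
             for i in range(cols)] for j in range(rows)]

def predict(memory, x, y, size, vectors):
    """vectors is a list of (ref_index, dx, dy), one entry per hypothesis."""
    blocks = [fetch_block(memory[r], x + dx, y + dy, size) for r, dx, dy in vectors]
    return superimpose(blocks)

def ssd(a, b):
    """Sum of squared differences between two blocks."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Toy data: two past frames, each a noisy version of the current block.
target = [[100, 110], [120, 130]]
past_a = [[104, 108], [122, 126]]
past_b = [[98, 114], [116, 132]]

single_error = ssd(target, predict([past_a, past_b], 0, 0, 2, [(0, 0, 0)]))
double_error = ssd(target, predict([past_a, past_b], 0, 0, 2, [(0, 0, 0), (1, 0, 0)]))
```

In this toy case the errors of the two hypotheses partly cancel, so the averaged prediction has a much smaller squared error than either block alone. The second motion vector and picture reference parameter still have to be paid for in rate, which is why a rate-constrained coder control decides when the extra hypothesis is worthwhile.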
Rate-Constrained Video Compression Using a 3-D Head Model
We show that traditional waveform coding and 3-D model-based coding are not competing alternatives but should be combined to support and complement each other. Both approaches are combined such that the generality of waveform coding and the efficiency of 3-D model-based coding are available where needed. The combination is achieved by providing the block-based video coder with a second reference frame for prediction, which is synthesized by the model-based coder. The model-based coder uses a parameterized 3-D head model specifying the shape and color of a person. We therefore restrict our investigations to typical videotelephony scenarios that show head-and-shoulder scenes. Motion and deformation of the 3-D head model constitute facial expressions, which are represented by facial animation parameters (FAPs) based on the MPEG-4 standard. An intensity gradient-based approach that exploits the 3-D model information is used to estimate the FAPs as well as illumination parameters that describe changes of the brightness in the scene. Model failures and objects that are not known at the decoder are handled by standard block-based motion-compensated prediction, which is not restricted to a particular scene content but results in lower coding efficiency. A Lagrangian approach is employed to determine the most efficient prediction for each block from either the synthesized model frame or the previous decoded frame. Experiments on five video sequences show that bit-rate savings of about 35 % are achieved at equal average PSNR when comparing the model-aided codec to TMN-10, the state-of-the-art test model of the H.263 standard. This corresponds to a gain of 2-3 dB in PSNR when encoding at the same average bit-rate.
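The per-block choice between the two reference frames can be sketched as a Lagrangian decision. The sketch below is only an illustration: the SSD distortion, the bit counts, and the candidate names `model_frame` and `previous_frame` are hypothetical and do not come from the codec described here.

```python
def ssd(a, b):
    """Sum of squared differences between two blocks."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def choose_reference(cur_block, candidates, lam):
    """Pick the candidate minimizing J = D + lam * R.

    candidates maps a name to (prediction_block, rate_bits), where rate_bits is
    the assumed number of bits needed to signal and code that choice.
    """
    best_name, best_cost = None, None
    for name, (prediction, rate_bits) in candidates.items():
        cost = ssd(cur_block, prediction) + lam * rate_bits
        if best_cost is None or cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost

# Toy block: the synthesized model frame predicts it better but is assumed to
# cost more bits here; the previous decoded frame is cheaper but less accurate.
cur_block = [[100, 102], [104, 106]]
candidates = {
    "model_frame": ([[100, 102], [104, 107]], 12),
    "previous_frame": ([[98, 101], [105, 109]], 6),
}

low_lambda_choice = choose_reference(cur_block, candidates, lam=1)
high_lambda_choice = choose_reference(cur_block, candidates, lam=5)
```

A small multiplier favors the more accurate prediction, while a larger one shifts the decision toward the cheaper candidate, so the same rule covers both high-quality and low-rate operating points.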
The figure shows frame 150 of the Akiyo sequence, coded with TMN-10 and with the MBBC at the same bit-rate. Left image: TMN-10 (31.08 dB PSNR, 720 bits); right image: MBBC (33.19 dB PSNR, 725 bits). A QuickTime movie of the sequence (5.6 Mb) is also available. My colleague Peter Eisert has also created a web page about the project.
Image Communication
Fraunhofer Institute for Telecommunications
Heinrich-Hertz-Institut
Image Processing
Einsteinufer 37
10587 Berlin
Germany
