Hierarchical Prediction Structures


The increased flexibility of H.264/AVC in comparison to older video coding standards such as MPEG-2 Video, H.263, or MPEG-4 Visual is one of the main reasons for its improved coding efficiency. The flexibility at the macroblock level, which is significantly increased by the support of various intra prediction modes and of many more macroblock partitioning modes for motion-compensated prediction than in any older video coding standard, has been well studied. However, H.264/AVC also supports much more flexibility at the picture/sequence level. In contrast to older video coding standards, the coding order and the display order of pictures are completely decoupled. Furthermore, any picture can be marked as a reference picture and used for motion-compensated prediction of following pictures, independent of the coding types of the corresponding slices. The behavior of the decoded picture buffer (DPB), which can hold up to 16 pictures, can be adaptively controlled by memory management control operation (MMCO) commands, and the reference pictures of the DPB that are used for motion-compensated prediction of another picture can be arbitrarily selected via reference picture list modification (RPLM) commands.

These features of H.264/AVC allow the selection of arbitrary coding/prediction structures that are not possible with older video coding standards. During the development of the first Working Draft for the SVC standard, it turned out that the coding efficiency of H.264/AVC can be increased in comparison to the classical “IPPP…” or “IBBP…” coding when prediction structures with so-called hierarchical B pictures are used.

A typical hierarchical prediction structure with 4 dyadic hierarchy stages is depicted in Figure 1(a). The first picture of a video sequence is intra-coded as an IDR picture; so-called key pictures (red in Figure 1) are coded at regular (or even irregular) intervals. Here, a picture is called a key picture when all previously coded pictures precede it in display order. As illustrated in Figure 1(a), a key picture and all pictures that are temporally located between the key picture and the previous key picture (the IDR picture at the beginning of a video sequence is also a key picture) are considered to form a group of pictures (GOP). The key pictures are either intra-coded (e.g., in order to enable random access) or inter-coded using previous (key) pictures as references for motion-compensated prediction. The remaining pictures of a GOP are hierarchically predicted as illustrated in Figure 1(a).
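The dyadic structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name dyadic_gop, the tuple layout, and the choice to code each temporal level completely before the next one are hypothetical and not taken from the standard, which permits other coding orders as well. Pictures are identified by their display index, and the GOP size is assumed to be a power of two.

```python
def dyadic_gop(gop_size, first=0):
    """List one GOP as (display_index, temporal_level, reference_indices)
    tuples in one possible coding order.

    Assumptions (illustrative only): gop_size is a power of two, 'first'
    is the display index of the previous key picture, and the new key
    picture is inter-coded from the previous key picture.
    """
    if gop_size < 1 or gop_size & (gop_size - 1) != 0:
        raise ValueError("gop_size must be a power of two")

    key = first + gop_size
    # The key picture is coded first; it is temporal level 0.
    pictures = [(key, 0, (first,))]

    step = gop_size
    level = 1
    while step > 1:
        half = step // 2
        # Each picture of this level lies midway between its two
        # nearest pictures of the coarser levels.
        for left in range(first, key, step):
            pictures.append((left + half, level, (left, left + step)))
        step = half
        level += 1
    return pictures


if __name__ == "__main__":
    for display_index, level, refs in dyadic_gop(8):
        print(f"picture {display_index}: level {level}, references {refs}")
```

For a GOP of 8 pictures this yields the 4 hierarchy stages of the dyadic case: the key picture at level 0, the middle picture at level 1, two pictures at level 2, and four pictures at level 3.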

Such a hierarchical prediction structure can also be employed to support several temporal scalability levels. For this purpose, it has to be ensured that all pictures are predicted using only pictures of a coarser or the same temporal level as references (cf. Figure 1). Then the sequence of key pictures represents the coarsest supported temporal resolution, and this temporal resolution can be refined by adding the sub-sequences of the next temporal levels. The dyadic hierarchy depicted in Figure 1(a), in which key pictures are only predicted from other key pictures and the non-key pictures are predicted using only the nearest pictures of the lower temporal level from the past and the future, can always be used for providing temporal scalability.
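As a rough illustration of how the temporal levels map to extractable sub-sequences in the dyadic case, the following Python sketch assigns a temporal level to each display index and lists the pictures that remain when all levels above a chosen maximum are dropped. The function names temporal_level and sub_sequence are hypothetical, and the sketch assumes key pictures at every multiple of a power-of-two GOP size.

```python
def temporal_level(index, gop_size):
    """Temporal level of a display index in a dyadic hierarchy.

    Assumes gop_size is a power of two and key pictures (level 0) sit
    at every multiple of gop_size.
    """
    num_levels = gop_size.bit_length() - 1      # log2(gop_size)
    pos = index % gop_size
    if pos == 0:
        return 0
    # Number of trailing zero bits of the position inside the GOP.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return num_levels - trailing_zeros


def sub_sequence(num_pictures, gop_size, max_level):
    """Display indices that remain when levels above max_level are dropped."""
    return [i for i in range(num_pictures)
            if temporal_level(i, gop_size) <= max_level]


if __name__ == "__main__":
    for max_level in range(4):
        print(max_level, sub_sequence(17, 8, max_level))
```

With a GOP size of 8, keeping only level 0 leaves the key pictures, and each additional level doubles the number of pictures per GOP until the full temporal resolution is reached.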

The use of hierarchical coding structures is of course not restricted to the dyadic case. Figure 1(b) illustrates an example of a non-dyadic hierarchical prediction structure with 3 temporal levels and a GOP size of 9 pictures. Furthermore, the delay can be controlled by restricting the prediction from pictures that follow the current picture in display order and by adjusting the coding order of the pictures. As an example, Figure 1(c) shows a hierarchical prediction structure with a structural delay of zero, which can also be used for delay-critical applications.
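The link between coding order and delay can be made concrete with a small Python sketch. The measure used here is an assumption chosen for illustration, not a definition taken from the standard: for each coding position, it counts how many picture intervals the encoder has to wait for a picture that is coded earlier than its display position would allow. The function name structural_delay is hypothetical.

```python
def structural_delay(coding_order):
    """Delay in picture intervals implied by a coding order.

    coding_order lists display indices in the order the pictures are
    coded and is assumed to be a permutation of 0..n-1. The picture
    coded at position k cannot be coded before it has been captured,
    so the encoder waits coding_order[k] - k intervals for it.
    """
    return max(display_index - position
               for position, display_index in enumerate(coding_order))


if __name__ == "__main__":
    dyadic = [0, 8, 4, 2, 6, 1, 3, 5, 7]     # IDR picture plus one dyadic GOP of 8
    low_delay = list(range(9))               # coding order equals display order
    print(structural_delay(dyadic))
    print(structural_delay(low_delay))
```

Under this measure, a structure whose coding order equals its display order, and which therefore uses no references to future pictures, has a delay of zero, while the dyadic GOP of 8 pictures requires waiting for the next key picture.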

In the examples of Figure 1, key pictures are either intra-coded or predicted using only previous key pictures as references, and the non-key pictures are predicted using only the nearest pictures of the lower temporal level from the past and the future as references. It should be noted that the maximum number of reference pictures used to code a picture is 2. However, the coding efficiency can be further improved when more pictures are included in the corresponding reference picture lists and the multiple reference picture concept of H.264/AVC is used for motion-compensated prediction. When temporal scalability is to be supported, the reference picture lists have to be constructed (by appropriate reference picture re-ordering commands) so that they include only pictures that belong to a coarser or the same temporal level as the current picture; otherwise, a more general selection of the reference pictures is possible.
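The temporal-level restriction on the reference picture lists can be sketched as a simple filter over the pictures currently held in the DPB. The Python code below is a simplified illustration and not the list initialization or modification process of H.264/AVC: the Pic type, the build_ref_lists function, and the max_refs parameter are hypothetical, and pictures are ordered only by their distance in display order.

```python
from typing import NamedTuple


class Pic(NamedTuple):
    index: int   # display index (stand-in for the picture order count)
    level: int   # temporal level, 0 = coarsest


def build_ref_lists(current, dpb, max_refs=4, temporal_scalable=True):
    """Return (list0, list1): past and future references, nearest first.

    With temporal_scalable=True, only pictures of a coarser or the same
    temporal level as the current picture are allowed as references.
    """
    candidates = [
        p for p in dpb
        if p.index != current.index
        and (not temporal_scalable or p.level <= current.level)
    ]
    past = sorted((p for p in candidates if p.index < current.index),
                  key=lambda p: current.index - p.index)
    future = sorted((p for p in candidates if p.index > current.index),
                    key=lambda p: p.index - current.index)
    return past[:max_refs], future[:max_refs]


if __name__ == "__main__":
    dpb = [Pic(0, 0), Pic(8, 0), Pic(4, 1), Pic(2, 2), Pic(1, 3)]
    current = Pic(6, 2)
    print(build_ref_lists(current, dpb))
    print(build_ref_lists(current, dpb, temporal_scalable=False))
```

In this example, the level-3 picture with display index 1 is excluded from the lists of the level-2 picture when temporal scalability is required, and it becomes available again when the restriction is lifted.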

Figure 2 illustrates the coding efficiency improvements for hierarchical prediction structures. It shows the bit-rate savings for different prediction structures (including “IBPBP…” and “IBBP…”) relative to simple “IPPP” coding at a constant video quality of 34 dB. In these simulations, hierarchical prediction structures with GOPs of 8 or more pictures provide on average 10% bit-rate savings relative to the commonly used “IBBP…” coding structure.