Hierarchical Prediction Structures

Hierarchical prediction structures provide:

  • Improved coding efficiency
  • Temporal scalability

A video bit stream is called temporally scalable when parts of the stream can be removed such that the resulting substream forms another valid bit stream for some target decoder and represents the source content at a frame rate lower than that of the complete original bit stream (see Figure 1).
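The definition above can be illustrated with a minimal sketch. It assumes that every picture carries a temporal level tag (0 being the coarsest level) and that a substream is formed by dropping all pictures above a chosen level. The Picture class, the level labelling and the 60 Hz frame rate are hypothetical values chosen for illustration only; they are not taken from the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    display_index: int   # position in display order
    temporal_level: int  # hypothetical tag, 0 = coarsest level


def extract_substream(pictures, max_level):
    """Keep only pictures whose temporal level does not exceed max_level."""
    return [p for p in pictures if p.temporal_level <= max_level]


# Eight pictures with an assumed dyadic level labelling.
levels = [0, 3, 2, 3, 1, 3, 2, 3]
stream = [Picture(i, lvl) for i, lvl in enumerate(levels)]
full_rate = 60.0  # assumed frame rate of the complete stream, in Hz

half = extract_substream(stream, 2)     # drops the finest level
quarter = extract_substream(stream, 1)  # drops the two finest levels

# The frame rate scales with the fraction of pictures that remain.
half_rate = full_rate * len(half) / len(stream)
quarter_rate = full_rate * len(quarter) / len(stream)
```

In this sketch each removed level halves the frame rate, which is what a dyadic labelling implies; other labellings would give other ratios.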

In contrast to older video coding standards, the coding order and the display order of pictures are completely decoupled in H.264/MPEG-4 AVC. Furthermore, any picture can be marked as a reference picture and used for motion-compensated prediction of following pictures, independent of the coding types of the corresponding slices.

These features allow the selection of arbitrary coding/prediction structures, which is not possible with older video coding standards. During the development of the first Working Draft for the SVC standard, it turned out that the coding efficiency of H.264/MPEG-4 AVC can be increased when prediction structures with so-called hierarchical B pictures are used.

A typical hierarchical prediction structure with 4 dyadic hierarchy stages is depicted in Figure 2(a). So-called key pictures (red in Figure 2) are coded at regular (or even irregular) intervals. As illustrated, a key picture and all pictures that are temporally located between it and the previous key picture are considered to form a group of pictures (GOP). The key pictures are either intra-coded (e.g., in order to enable random access) or inter-coded using previous (key) pictures as references for motion-compensated prediction. The remaining pictures of a GOP are hierarchically predicted as illustrated in Figure 2(a).
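The dyadic structure can be made concrete with a short sketch. It assumes that four dyadic stages correspond to a GOP of 8 pictures with temporal levels 0 to 3, that display positions inside a GOP run from 1 to the GOP size, and that the key picture sits at the last position. The function names are hypothetical, and the coding order shown is only one of several valid orders.

```python
def temporal_level(pos: int, gop_size: int) -> int:
    """Temporal level of the picture at display position pos (1..gop_size).

    pos == gop_size is the key picture (level 0).
    gop_size is assumed to be a power of two.
    """
    stages = gop_size.bit_length() - 1              # log2(gop_size)
    trailing_zeros = (pos & -pos).bit_length() - 1  # largest power of 2 dividing pos
    return stages - trailing_zeros


def coding_order(gop_size: int) -> list[int]:
    """One valid coding order: coarser levels first, display order within a level."""
    return sorted(range(1, gop_size + 1),
                  key=lambda pos: (temporal_level(pos, gop_size), pos))


gop_levels = [temporal_level(pos, 8) for pos in range(1, 9)]
order = coding_order(8)
```

Coding the coarser levels first guarantees that every picture's references are available before the picture itself is coded, which is why coding order and display order differ here.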

Such a hierarchical prediction structure can also be employed to support several temporal scalability levels. To this end, it has to be ensured that all pictures are predicted using only pictures of a coarser or the same temporal level as references.
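The restriction on reference pictures can be checked mechanically. The following sketch assumes a dyadic GOP of 8 in which each non-key picture references its two nearest neighbours of a coarser level; the level table, the reference table and the function name are illustrative assumptions rather than a transcription of Figure 2(a).

```python
def satisfies_temporal_constraint(levels, references):
    """True if every picture references only pictures of the same or a coarser level.

    levels:     dict mapping picture id -> temporal level (0 = coarsest)
    references: dict mapping picture id -> list of reference picture ids
    """
    return all(levels[ref] <= levels[pic]
               for pic, refs in references.items()
               for ref in refs)


# Assumed dyadic GOP of 8; picture 0 is the previous key picture.
levels = {0: 0, 8: 0, 4: 1, 2: 2, 6: 2, 1: 3, 3: 3, 5: 3, 7: 3}
references = {8: [0], 4: [0, 8], 2: [0, 4], 6: [4, 8],
              1: [0, 2], 3: [2, 4], 5: [4, 6], 7: [6, 8]}

valid = satisfies_temporal_constraint(levels, references)

# Letting picture 4 (level 1) reference picture 3 (level 3) breaks the rule:
# dropping level 3 would remove a picture that level 1 still needs.
broken = dict(references)
broken[4] = [0, 3]
invalid = satisfies_temporal_constraint(levels, broken)
```

When the check passes, removing all pictures above any chosen level leaves every remaining picture with its references intact, so the substream stays decodable.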

Figure 2(b) illustrates an example of a non-dyadic hierarchical prediction structure with 3 temporal levels and a GOP size of 9 pictures. Furthermore, by restricting the prediction from pictures that follow the current picture in display order and adjusting the coding order of pictures, the delay can be controlled. As an example, Figure 2(c) shows a hierarchical prediction structure with a structural delay of zero, which can also be used for delay-critical applications.
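How prediction from future pictures drives the delay can be sketched as follows. The sketch uses an assumed working definition: the structural delay of a structure is the largest number of pictures, in display order, that any picture has to wait for because of its direct or indirect references. Both reference tables are hypothetical illustrations and do not reproduce the exact structures of Figure 2.

```python
def structural_delay(references):
    """Largest look-ahead, in pictures, implied by direct and indirect references.

    references: dict mapping display index -> display indices of direct references
    """
    memo = {}

    def latest_needed(pic):
        # Latest display index that must be decoded before pic can be output.
        if pic not in memo:
            memo[pic] = max([pic] + [latest_needed(r)
                                     for r in references.get(pic, [])])
        return memo[pic]

    return max(latest_needed(pic) - pic for pic in references)


# Assumed dyadic GOP of 8 with prediction from both directions.
dyadic = {8: [0], 4: [0, 8], 2: [0, 4], 6: [4, 8],
          1: [0, 2], 3: [2, 4], 5: [4, 6], 7: [6, 8]}

# Assumed hierarchical structure that references only earlier pictures.
low_delay = {1: [0], 2: [0], 3: [2], 4: [0],
             5: [4], 6: [4], 7: [6], 8: [0]}

delay_dyadic = structural_delay(dyadic)
delay_low = structural_delay(low_delay)
```

Under this definition the second structure has zero delay because no picture depends on a picture that follows it in display order, while it still keeps a hierarchy of reference pictures.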

Figure 3 illustrates the coding efficiency improvements for hierarchical prediction structures. It shows the bit-rate savings for different prediction structures (including “IBPBP…” and “IBBP…”) relative to simple “IPPP” coding for a constant video quality of 34 dB. For these simulations, hierarchical prediction structures with GOPs of 8 or more pictures provide on average 10 % bit-rate savings relative to the often-used “IBBP…” coding structure.