To reach a high compression rate, images are described by the way in which they differ from each other. If an image has not changed at all compared with the one before it, nothing needs to be encoded afresh; only the information that nothing has changed is needed. The first image is a so-called intra-coded picture, an I-frame. To analyse the differences between images, the encoder could compare every pixel with the same pixel of the next image. Moving parts within the image are analysed and described by motion vectors, which are used to predict the new position of moving objects in the following images. Prediction errors are corrected by comparing the predicted image with the actual picture and encoding the remaining difference. The image is divided into macroblocks, each with its own motion vector. In this way inter-frame and intra-frame compression are combined. The GOP concept is not specific to MPEG.
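The motion-vector search described above can be illustrated with a small block-matching sketch. It is not how any particular encoder works: the block size, the search range, the sum of absolute differences as the matching criterion, and the names make_frame, sad and find_motion_vector are all assumptions chosen for illustration. Frames are plain lists of greyscale rows.

```python
def make_frame(width, height, obj_x, obj_y, obj_size=2):
    """Build a dark greyscale frame (list of rows) with one bright square object."""
    frame = [[0] * width for _ in range(height)]
    for y in range(obj_y, obj_y + obj_size):
        for x in range(obj_x, obj_x + obj_size):
            frame[y][x] = 255
    return frame


def sad(prev, cur, px, py, cx, cy, size):
    """Sum of absolute differences between the block at (px, py) in prev
    and the block at (cx, cy) in cur."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(prev[py + dy][px + dx] - cur[cy + dy][cx + dx])
    return total


def find_motion_vector(prev, cur, bx, by, size=4, search=2):
    """Exhaustively search prev for the best match of the cur block at (bx, by).

    Returns (dx, dy, error): the offset from the current block to its best
    match in the previous frame, and the remaining prediction error.
    """
    height, width = len(prev), len(prev[0])
    # Start with "no motion" so an unchanged block yields the vector (0, 0).
    best = (0, 0, sad(prev, cur, bx, by, bx, by, size))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            px, py = bx + dx, by + dy
            # Skip candidate blocks that would fall outside the frame.
            if px < 0 or py < 0 or px + size > width or py + size > height:
                continue
            err = sad(prev, cur, px, py, bx, by, size)
            if err < best[2]:
                best = (dx, dy, err)
    return best


previous = make_frame(8, 8, 2, 2)   # object at (2, 2)
current = make_frame(8, 8, 3, 3)    # object moved one pixel right and down
print(find_motion_vector(previous, current, 2, 2))
```

In this toy case the search finds the offset (-1, -1) with zero remaining error, so only the vector would need to be stored for that block. A non-zero error corresponds to the prediction error that has to be encoded in addition to the vector.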
The group of pictures (GOP) is a sequence of images with inter-frame compression, in which the information of multiple images is packed together. A group of pictures can have any length; usual groups consist of 6, 12 or 15 images. The sequence always begins with an I-frame (intra-coded) as the anchor of the sequence. The next images are forward-predicted images, P-frames. The first P-frame is decoded using the I-frame as a basis, with the predicted moving parts defined by motion vectors. The more I-frames a sequence contains, the less problematic video editing is. Cutting at the position of a B-frame needs further processing to supply all the necessary image information.
I-frame: reference image with full image information, independent of all other image types and without references to other frames. An I-frame is always the starting point of a GOP.
P-frame: predictive. The frame only 'looks backward' to I-frames and P-frames; it contains only the information for image parts that differ from the previous I-frame or P-frame, and it is motion-compensated.
B-frame: bi-directional. The frame 'looks forward and backward' to I-frames and P-frames; it is bi-directionally interpolated, so its image information is built on previous and following I-frames and/or P-frames. (Because of 'looking at following' frames, a certain processing delay is necessary.)
B-frames compress most efficiently but are the hardest to decode.
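Because a B-frame depends on a following reference frame, frames cannot be decoded in the order in which they are displayed. The sketch below reorders a display-order pattern so that each I- or P-frame comes before the B-frames that need it. The pattern "IBBPBBP" and the function name coded_order are illustrative assumptions, and the model is simplified: trailing B-frames, which in practice would depend on the next GOP's I-frame, are simply appended at the end.

```python
def coded_order(display):
    """Reorder a display-order frame pattern (e.g. "IBBPBBP") so that every
    reference frame (I or P) precedes the B-frames that depend on it.

    Returns labels such as "P3", where the number is the display position.
    """
    ordered = []
    pending_b = []  # B-frames still waiting for their following reference
    for index, kind in enumerate(display):
        label = f"{kind}{index}"
        if kind == "B":
            pending_b.append(label)
        else:
            # The reference frame goes first, then the B-frames before it.
            ordered.append(label)
            ordered.extend(pending_b)
            pending_b = []
    # Simplification: leftover B-frames would really need the next I-frame.
    ordered.extend(pending_b)
    return ordered


print(coded_order("IBBPBBP"))
```

For "IBBPBBP" this gives I0, P3, B1, B2, P6, B4, B5: each P-frame is moved ahead of the two B-frames that are displayed before it.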
The typical GOP size for broadcast applications is 30 frames. IPTV typically uses more than 30 frames, and streaming video often uses more than 300 frames.
A typical GOP at p50 (50 progressive frames per second) with 6 images covers 6 frames of 20 ms each = 120 ms. A 12-image sequence covers 240 ms, about a quarter of a second.
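The arithmetic above generalises to any GOP length and frame rate. A minimal sketch follows; the function name gop_duration_ms is an assumption made for illustration.

```python
def gop_duration_ms(frames, fps):
    """Time covered by one GOP in milliseconds: frames x (1000 / fps)."""
    return frames * 1000.0 / fps


for length in (6, 12):
    print(length, "frames at 50 fps:", gop_duration_ms(length, 50), "ms")
```

At 50 frames per second this reproduces the 120 ms and 240 ms figures given in the text.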
Because of the necessary forward and backward processing, GOP encoding causes a fairly large processing delay. This is caused not by the processor itself but by the need to 'look into the future' at the following frames in order to calculate the image content properly.
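A rough lower bound for this 'look into the future' delay can be estimated from the frame pattern alone, assuming that a B-frame cannot be encoded until the next I- or P-frame in display order has been captured. The sketch ignores the actual computation time and any buffering in a real encoder; the function names and the pattern "IBBPBBP" are illustrative assumptions.

```python
def max_lookahead_frames(display):
    """Largest number of frames any B-frame must wait for its following
    reference frame (I or P) in a display-order pattern such as "IBBPBBP"."""
    worst = 0
    for i, kind in enumerate(display):
        if kind != "B":
            continue
        # Find the next reference frame after this B-frame.
        for j in range(i + 1, len(display)):
            if display[j] != "B":
                worst = max(worst, j - i)
                break
    return worst


def lookahead_delay_ms(display, fps):
    """Minimum waiting time in milliseconds implied by the look-ahead."""
    return max_lookahead_frames(display) * 1000.0 / fps


print(lookahead_delay_ms("IBBPBBP", 50))
```

With two B-frames between reference frames at 50 frames per second, the first B-frame has to wait two frame periods, i.e. 40 ms, before it can be encoded. A pattern without B-frames, such as "IPPP", has no such wait in this model.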
Figure: GOP with two P-frames and six B-frames.
Figure: theoretical image sequence with only four frames: two I-frames with two B-frames between them.