==Part 2: Video== <!-- Link Twango to this section. -->
Part 2 of the MPEG-1 standard covers video and is defined in ISO/IEC-11172-2. The design was heavily influenced by [[H.261]].

MPEG-1 Video exploits perceptual compression methods to significantly reduce the data rate required by a video stream. It reduces or completely discards information in certain frequencies and areas of the picture that the human eye has limited ability to fully perceive. It also exploits temporal (over time) and spatial (across a picture) redundancy common in video to achieve better data compression than would be possible otherwise. (See: [[Video compression]])

===Color space===
[[File:Yuvformats420sampling.svg|thumb|right|Example of 4:2:0 subsampling. The two overlapping center circles represent chroma blue and chroma red (color) pixels, while the 4 outside circles represent the luma (brightness).]]
Before encoding video to MPEG-1, the color space is transformed to [[Y′CbCr]] (Y′=Luma, Cb=Chroma Blue, Cr=Chroma Red). [[Luma (video)|Luma]] (brightness, resolution) is stored separately from [[Chrominance|chroma]] (color, hue, phase) and even further separated into red and blue components. The chroma is also subsampled to [[4:2:0]], meaning it is reduced to half resolution vertically and half resolution horizontally, i.e., to just one quarter the number of samples used for the luma component of the video.<ref name=mpeg_faqs1/> This use of higher resolution for some color components is similar in concept to the [[Bayer filter|Bayer pattern filter]] that is commonly used for the image capturing sensor in digital color cameras. Because the human eye is much more sensitive to small changes in brightness (the Y component) than in color (the Cr and Cb components), [[chroma subsampling]] is a very effective way to reduce the amount of video data that needs to be compressed. However, on videos with fine detail (high [[Spatial frequency#Visual perception|spatial complexity]]) this can manifest as chroma [[aliasing]] artifacts. Compared to other digital [[compression artifact]]s, this issue is rarely a source of annoyance.

Because of the subsampling, Y′CbCr 4:2:0 video is ordinarily stored using even dimensions ([[divisible]] by 2 horizontally and vertically). Y′CbCr color is often informally called [[YUV]] to simplify the notation, although that term more properly applies to a somewhat different color format. Similarly, the terms [[luminance]] and [[chrominance]] are often used instead of the (more accurate) terms luma and chroma.
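As an illustration of the conversion and subsampling described above, the following sketch converts an RGB frame to Y′CbCr and discards three quarters of the chroma samples. The BT.601-style conversion coefficients and the simple 2×2 averaging are illustrative choices; MPEG-1 itself codes the resulting 4:2:0 data, and the conversion from RGB happens before encoding.

<syntaxhighlight lang="python">
import numpy as np

def rgb_to_ycbcr_420(rgb):
    """Convert an H x W x 3 RGB image (0-255) to Y' plus 4:2:0-subsampled Cb/Cr.

    Uses BT.601-style coefficients and simple 2x2 averaging for the chroma
    subsampling; both are illustrative choices, not mandated by MPEG-1.
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Luma and chroma planes, still at full resolution
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0

    # 4:2:0: average each 2x2 block of chroma samples, halving both dimensions,
    # so each chroma plane carries one quarter as many samples as the luma plane
    h, w = y.shape
    cb420 = cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    cr420 = cr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, cb420, cr420

# Example: a 288x352 (SIF) frame keeps 288x352 luma samples but only
# 144x176 samples for each chroma plane.
frame = np.random.randint(0, 256, size=(288, 352, 3), dtype=np.uint8)
y, cb, cr = rgb_to_ycbcr_420(frame)
print(y.shape, cb.shape, cr.shape)  # (288, 352) (144, 176) (144, 176)
</syntaxhighlight>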
===Resolution/bitrate===
MPEG-1 supports resolutions up to 4095×4095 (12 bits), and bit rates up to 100 Mbit/s.<ref name=bmrc_mpeg2_faq/>

MPEG-1 videos are most commonly seen using [[Source Input Format]] (SIF) resolution: 352×240, 352×288, or 320×240. These relatively low resolutions, combined with a bitrate less than 1.5 Mbit/s, make up what is known as a [[constrained parameters bitstream]] (CPB), later renamed the "Low Level" (LL) profile in MPEG-2. This is the minimum video specification any [[Codec|decoder]] should be able to handle to be considered MPEG-1 [[wikt:compliant|compliant]]. This was selected to provide a good balance between quality and performance, allowing the use of reasonably inexpensive hardware of the time.<ref name=Didier_MPEG/><ref name=bmrc_mpeg2_faq/>

===Frame/picture/block types===
MPEG-1 has several frame/picture types that serve different purposes. The most important, yet simplest, is the '''I-frame'''.

====I-frames====
"I-frame" is an abbreviation for "[[wikt:I-frame|Intra-frame]]", so-called because they can be decoded independently of any other frames. They may also be known as I-pictures, or keyframes due to their somewhat similar function to the [[key frame]]s used in animation. I-frames can be considered effectively identical to baseline [[JPEG]] images.<ref name=bmrc_mpeg2_faq/>

High-speed seeking through an MPEG-1 video is only possible to the nearest I-frame. When cutting a video it is not possible to start playback of a segment of video before the first I-frame in the segment (at least not without computationally intensive re-encoding). For this reason, I-frame-only MPEG videos are used in editing applications.

I-frame-only compression is very fast, but produces very large file sizes: a factor of 3× (or more) larger than normally encoded MPEG-1 video, depending on how temporally complex a specific video is.<ref name=Didier_MPEG/> I-frame-only MPEG-1 video is very similar to [[MJPEG]] video, so much so that very high-speed and theoretically lossless (in reality, there are rounding errors) conversion can be made from one format to the other, provided a couple of restrictions (color space and quantization matrix) are followed in the creation of the bitstream.<ref name=smith_transcoding>{{Citation |first1=Soam |last1=Acharya |first2=Brian |last2=Smith |title=Compressed Domain Transcoding of MPEG |pages=3 |year=1998 |publisher=[[Cornell University]], [[IEEE Computer Society]], [[IEEE]] International Conference on Multimedia Computing and Systems |url=http://citeseer.ist.psu.edu/acharya98compressed.html |access-date=2016-11-11 |url-status=live |archive-url=http://archive.wikiwix.com/cache/20110223164151/http://citeseer.ist.psu.edu/acharya98compressed.html |archive-date=2011-02-23 }} – (Requires clever reading: says quantization matrices differ, but those are just defaults, and selectable){{registration required|s}}</ref>

The length between I-frames is known as the [[group of pictures]] (GOP) size. MPEG-1 most commonly uses a GOP size of 15–18, i.e. one I-frame for every 14–17 non-I-frames (some combination of P- and B- frames). With more intelligent encoders, GOP size is dynamically chosen, up to some pre-selected maximum limit.<ref name=bmrc_mpeg2_faq/>

Limits are placed on the maximum number of frames between I-frames due to decoding complexity, decoder buffer size, recovery time after data errors, seeking ability, and accumulation of IDCT errors in low-precision implementations most common in hardware decoders (See: [[IEEE]]-1180).
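Because decoding can only begin at an I-frame, a player seeking to an arbitrary frame must locate the closest preceding I-frame and decode forward from there. A minimal sketch of that lookup, assuming the frame types of the stream are already known (the 15-frame GOP pattern used here is only an example):

<syntaxhighlight lang="python">
def seek_start(frame_types, target):
    """Return the index of the I-frame from which decoding must start
    in order to reach frame `target`.

    `frame_types` is a hypothetical list of 'I', 'P', 'B' labels in display
    order; real players derive this information from bitstream headers.
    """
    for i in range(target, -1, -1):
        if frame_types[i] == 'I':
            return i
    raise ValueError("no I-frame precedes the target frame")

# A common 15-frame GOP pattern (display order): one I-frame, then
# repeating B/B/P groups.
gop = list("IBBPBBPBBPBBPBB")
stream = gop * 4                 # four GOPs, 60 frames
print(seek_start(stream, 37))    # 30 -> decode from the I-frame opening the third GOP
</syntaxhighlight>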
====P-frames====
"P-frame" is an abbreviation for "Predicted-frame". They may also be called forward-predicted frames or [[wikt:inter-|inter-]]frames (B-frames are also inter-frames).

P-frames exist to improve compression by exploiting the [[wikt:temporal|temporal]] (over time) [[wikt:redundancy|redundancy]] in a video. P-frames store only the ''difference'' in image from the frame (either an I-frame or P-frame) immediately preceding it (this reference frame is also called the ''[[wikt:anchor|anchor]] frame'').

The difference between a P-frame and its anchor frame is calculated using ''motion vectors'' on each ''macroblock'' of the frame (see below). Such motion vector data will be embedded in the P-frame for use by the decoder.

A P-frame can contain any number of intra-coded blocks (DCT and quantized), in addition to any forward-predicted blocks (motion vectors).<ref name=hp_transcoding/>

If a video drastically changes from one frame to the next (such as a [[cut (transition)|cut]]), it is more efficient to encode it as an I-frame.

====B-frames====
"B-frame" stands for "bidirectional-frame" or "bipredictive frame". They may also be known as backwards-predicted frames or B-pictures. B-frames are quite similar to P-frames, except they can make predictions using both the previous and future frames (i.e. two anchor frames).

It is therefore necessary for the player to first decode the next I- or P- anchor frame sequentially after the B-frame, before the B-frame can be decoded and displayed. This means decoding B-frames requires larger [[data buffer]]s and causes an increased delay on both decoding and during encoding. This also necessitates the decoding time stamps (DTS) feature in the container/system stream (see above). As such, B-frames have long been the subject of much controversy; they are often avoided in videos, and are sometimes not fully supported by hardware decoders.

No other frames are predicted from a B-frame. Because of this, a very low bitrate B-frame can be inserted, where needed, to help control the bitrate. If this were done with a P-frame, future P-frames would be predicted from it and would lower the quality of the entire sequence. However, the next P-frame must still encode all the changes between it and the previous I- or P- anchor frame. B-frames can also be beneficial in videos where the background behind an object is being revealed over several frames, or in fading transitions, such as scene changes.<ref name=Didier_MPEG/><ref name=bmrc_mpeg2_faq/>

A B-frame can contain any number of intra-coded blocks and forward-predicted blocks, in addition to backwards-predicted, or bidirectionally predicted blocks.<ref name=bmrc_mpeg2_faq/><ref name=hp_transcoding/>

====D-frames====
MPEG-1 has a unique frame type not found in later video standards. "D-frames" or DC-pictures are independently coded images (intra-frames) that have been encoded using DC transform coefficients only (AC coefficients are removed when encoding D-frames; see DCT below) and hence are very low quality. D-frames are never referenced by I-, P- or B- frames. D-frames are only used for fast previews of video, for instance when seeking through a video at high speed.<ref name=Didier_MPEG/>

Given moderately higher-performance decoding equipment, fast preview can be accomplished by decoding I-frames instead of D-frames. This provides higher quality previews, since I-frames contain AC coefficients as well as DC coefficients. If the encoder can assume that rapid I-frame decoding capability is available in decoders, it can save bits by not sending D-frames (thus improving compression of the video content). For this reason, D-frames are seldom actually used in MPEG-1 video encoding, and the D-frame feature has not been included in any later video coding standards.

===Macroblocks===
{{Main|Macroblock}}
MPEG-1 operates on video in a series of 8×8 blocks for quantization. However, to reduce the bit rate needed for motion vectors and because chroma (color) is subsampled by a factor of 4, each pair of (red and blue) chroma blocks corresponds to 4 different luma blocks. That is, for 4 luma blocks of size 8×8, there is one Cb block of 8×8 and one Cr block of 8×8.

This set of 6 blocks, covering a 16×16 pixel area of the picture, is processed together and called a ''macroblock''. All of these 8×8 blocks are independently put through DCT and quantization. A macroblock is the smallest independent unit of (color) video. Motion vectors (see below) operate solely at the macroblock level.

If the height or width of the video is not an exact [[wikt:multiple|multiple]] of 16, full rows and full columns of macroblocks must still be encoded and decoded to fill out the picture (though the extra decoded pixels are not displayed).
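Since pictures are coded as whole macroblocks, the padding rule above amounts to rounding each dimension up to the next multiple of 16. A small sketch of that arithmetic (the 350-pixel-wide picture is a hypothetical example):

<syntaxhighlight lang="python">
import math

def macroblock_grid(width, height):
    """Number of 16x16 macroblocks needed to cover a picture.

    Dimensions are rounded up to the next multiple of 16, since full rows
    and columns of macroblocks are coded even if not fully displayed.
    """
    mb_cols = math.ceil(width / 16)
    mb_rows = math.ceil(height / 16)
    return mb_cols, mb_rows

# 352x240 (SIF) divides evenly: 22x15 = 330 macroblocks.
print(macroblock_grid(352, 240))   # (22, 15)
print(22 * 15 * 6)                 # 1980 coded 8x8 blocks (4 luma + 2 chroma each)

# A hypothetical 350x240 picture still needs 22x15 macroblocks; the
# rightmost two coded pixel columns are simply never displayed.
print(macroblock_grid(350, 240))   # (22, 15)
</syntaxhighlight>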
===Motion vectors===
To decrease the amount of temporal redundancy in a video, only blocks that change are updated (up to the maximum GOP size). This is known as conditional replenishment. However, this is not very effective by itself. Movement of the objects, and/or the camera, may result in large portions of the frame needing to be updated, even though only the position of the previously encoded objects has changed. Through motion estimation, the encoder can compensate for this movement and remove a large amount of redundant information.

The encoder compares the current frame with adjacent parts of the video from the anchor frame (previous I- or P- frame) in a diamond pattern, up to an (encoder-specific) predefined [[radius]] limit from the area of the current macroblock. If a match is found, only the direction and distance (i.e. the [[wikt:vector|''vector'']] of the ''motion'') from the previous video area to the current macroblock need to be encoded into the inter-frame (P- or B- frame). The reverse of this process, performed by the decoder to reconstruct the picture, is called [[motion compensation]].

A predicted macroblock rarely matches the current picture perfectly, however. The difference between the estimated matching area and the real frame/macroblock is called the prediction error. The larger the prediction error, the more data must be additionally encoded in the frame. For efficient video compression, it is very important that the encoder is capable of effectively and precisely performing motion estimation.

Motion vectors record the ''distance'' between two areas on screen based on the number of pixels (also called pels). MPEG-1 video uses a motion vector (MV) precision of one half of one pixel, or half-pel. The finer the precision of the MVs, the more accurate the match is likely to be, and the more efficient the compression. There are trade-offs to higher precision, however: finer MV precision requires a larger amount of data to represent the MV, as larger numbers must be stored in the frame for every single MV; it increases coding complexity, as increasing levels of interpolation on the macroblock are required for both the encoder and decoder; and it brings [[wikt:law of diminishing returns|diminishing returns]] (minimal gains) with higher precision MVs. Half-pel precision was chosen as the ideal trade-off for that point in time. (See: [[qpel]])

Because neighboring macroblocks are likely to have very similar motion vectors, this redundant information can be compressed quite effectively by being stored [[Pulse-code modulation|DPCM]]-encoded. Only the (smaller) amount of difference between the MVs for each macroblock needs to be stored in the final bitstream.
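The standard defines how motion vectors are decoded, but not how an encoder must find them. The following sketch shows the simplest form of the block matching described above: an exhaustive integer-pel search using the sum of absolute differences (SAD) as the matching criterion. Real encoders typically use faster search patterns (such as the diamond pattern mentioned above) and refine the result to half-pel precision.

<syntaxhighlight lang="python">
import numpy as np

def motion_search(ref, cur, mb_row, mb_col, radius=8):
    """Find the best integer-pel motion vector for one 16x16 macroblock.

    `ref` and `cur` are 2-D luma arrays (anchor frame and current frame).
    Exhaustive search within +/- `radius` pixels, scored by the sum of
    absolute differences (SAD). Purely illustrative: real encoders use
    faster search strategies and half-pel refinement.
    """
    block = cur[mb_row:mb_row + 16, mb_col:mb_col + 16].astype(np.int32)
    best, best_sad = (0, 0), np.inf
    h, w = ref.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = mb_row + dy, mb_col + dx
            if y < 0 or x < 0 or y + 16 > h or x + 16 > w:
                continue  # candidate block would fall outside the anchor frame
            cand = ref[y:y + 16, x:x + 16].astype(np.int32)
            sad = np.abs(block - cand).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    # The vector is what gets coded; the residual (current block minus the
    # motion-compensated block) is the prediction error that is DCT-coded.
    return best, best_sad
</syntaxhighlight>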
P-frames have one motion vector per macroblock, relative to the previous anchor frame. B-frames, however, can use two motion vectors: one from the previous anchor frame, and one from the future anchor frame.<ref name=hp_transcoding>{{Citation |first1=Susie J. |last1=Wee |first2=Bhaskaran |last2=Vasudev |first3=Sam |last3=Liu |title=Transcoding MPEG Video Streams in the Compressed Domain |date=March 13, 1997 |publisher=[[Hewlett-Packard]] |url=http://www.hpl.hp.com/personal/Susie_Wee/PAPERS/hpidc97/hpidc97.html |access-date=2016-11-11 |archive-url=https://web.archive.org/web/20070817191927/http://www.hpl.hp.com/personal/Susie_Wee/PAPERS/hpidc97/hpidc97.html |archive-date=2007-08-17|citeseerx=10.1.1.24.633 }}</ref>

Partial macroblocks, and black borders/bars encoded into the video that do not fall exactly on a macroblock boundary, cause havoc with motion prediction. The block padding/border information prevents the macroblock from closely matching any other area of the video, and so significantly larger prediction error information must be encoded for every one of the several dozen partial macroblocks along the screen border. DCT encoding and quantization (see below) are also not nearly as effective when there is large/sharp picture contrast in a block.

An even more serious problem exists with macroblocks that contain significant, random ''edge noise'', where the picture transitions to (typically) black. All the above problems also apply to edge noise. In addition, the added randomness is simply impossible to compress significantly. All of these effects will lower the quality (or increase the bitrate) of the video substantially.

===DCT===
Each 8×8 block is encoded by first applying a ''forward'' [[discrete cosine transform]] (FDCT) and then a quantization process. The FDCT process (by itself) is theoretically lossless, and can be reversed by applying an ''inverse'' DCT ([[IDCT]]) to reproduce the original values (in the absence of any quantization and rounding errors). In reality, there are some (sometimes large) rounding errors introduced both by quantization in the encoder (as described in the next section) and by IDCT approximation error in the decoder. The minimum allowed accuracy of a decoder IDCT approximation is defined by ISO/IEC 23002-1. (Prior to 2006, it was specified by [[IEEE 1180]]-1990.)

The FDCT process converts the 8×8 block of uncompressed pixel values (brightness or color difference values) into an 8×8 indexed array of ''frequency coefficient'' values. One of these is the (statistically high in variance) "DC coefficient", which represents the average value of the entire 8×8 block. The other 63 coefficients are the statistically smaller "AC coefficients", which have positive or negative values each representing sinusoidal deviations from the flat block value represented by the DC coefficient.

An example of an encoded 8×8 FDCT block:

:<math>
\begin{bmatrix}
 -415 & -30 & -61 &  27 &  56 & -20 &  -2 &  0 \\
    4 & -22 & -61 &  10 &  13 &  -7 &  -9 &  5 \\
  -47 &   7 &  77 & -25 & -29 &  10 &   5 & -6 \\
  -49 &  12 &  34 & -15 & -10 &   6 &   2 &  2 \\
   12 &  -7 & -13 &  -4 &  -2 &   2 &  -3 &  3 \\
   -8 &   3 &   2 &  -6 &  -2 &   1 &   4 &  2 \\
   -1 &   0 &   0 &  -2 &  -1 &  -3 &   4 & -1 \\
    0 &   0 &  -1 &  -4 &  -1 &   0 &   1 &  2
\end{bmatrix}
</math>

Since the DC coefficient value is statistically correlated from one block to the next, it is compressed using [[Pulse-code modulation|DPCM]] encoding. Only the (smaller) amount of difference between each DC value and the value of the DC coefficient in the block to its left needs to be represented in the final bitstream.

Additionally, the frequency conversion performed by applying the DCT provides a statistical decorrelation function to efficiently concentrate the signal into fewer high-amplitude values prior to applying quantization (see below).
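A direct, unoptimized implementation of the forward 8×8 DCT described above, written from the textbook DCT-II definition rather than the fast algorithms production codecs use. The subtraction of 128 centers pixel values around zero, matching the convention behind the example coefficient block; that level shift is an illustrative choice here.

<syntaxhighlight lang="python">
import numpy as np

def fdct_8x8(block):
    """Forward 8x8 DCT-II, straight from the definition (not optimized).

    `block` holds 8x8 pixel values; subtracting 128 centers them around
    zero. Coefficient [0, 0] is the DC value; the other 63 are AC values.
    """
    f = block.astype(np.float64) - 128.0
    coeffs = np.zeros((8, 8))
    for u in range(8):
        for v in range(8):
            cu = 1 / np.sqrt(2) if u == 0 else 1.0
            cv = 1 / np.sqrt(2) if v == 0 else 1.0
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += f[x, y] * np.cos((2 * x + 1) * u * np.pi / 16) \
                                 * np.cos((2 * y + 1) * v * np.pi / 16)
            coeffs[u, v] = 0.25 * cu * cv * s
    return coeffs

# A perfectly flat block produces only a DC coefficient; all 63 AC terms are zero.
flat = np.full((8, 8), 80)
print(np.round(fdct_8x8(flat)))   # DC = 8 * (80 - 128) = -384, everything else 0
</syntaxhighlight>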
===Quantization===
[[Quantization (image processing)|Quantization]] is, essentially, the process of reducing the accuracy of a signal by dividing it by some larger step size and rounding to an integer value (i.e. finding the nearest multiple, and discarding the remainder).

The frame-level quantizer is a number from 0 to 31 (although encoders will usually omit/disable some of the extreme values) which determines how much information will be removed from a given frame. The frame-level quantizer is typically either dynamically selected by the encoder to maintain a certain user-specified bitrate, or (much less commonly) directly specified by the user.

A "quantization matrix" is a string of 64 numbers (ranging from 0 to 255) which tells the encoder how relatively important or unimportant each piece of visual information is. Each number in the matrix corresponds to a certain frequency component of the video image.

An example quantization matrix:

:<math>
\begin{bmatrix}
 16 & 11 & 10 & 16 &  24 &  40 &  51 &  61 \\
 12 & 12 & 14 & 19 &  26 &  58 &  60 &  55 \\
 14 & 13 & 16 & 24 &  40 &  57 &  69 &  56 \\
 14 & 17 & 22 & 29 &  51 &  87 &  80 &  62 \\
 18 & 22 & 37 & 56 &  68 & 109 & 103 &  77 \\
 24 & 35 & 55 & 64 &  81 & 104 & 113 &  92 \\
 49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\
 72 & 92 & 95 & 98 & 112 & 100 & 103 &  99
\end{bmatrix}
</math>

Quantization is performed by taking each of the 64 ''frequency'' values of the DCT block, dividing them by the frame-level quantizer, then dividing them by their corresponding values in the quantization matrix. Finally, the result is rounded down. This significantly reduces, or completely eliminates, the information in some frequency components of the picture. Typically, high frequency information is less visually important, and so high frequencies are much more ''strongly quantized'' (drastically reduced). MPEG-1 actually uses two separate quantization matrices, one for intra-blocks (I-blocks) and one for inter-blocks (P- and B- blocks), so quantization of different block types can be done independently, and so, more effectively.<ref name=Didier_MPEG/> This quantization process usually reduces a significant number of the ''AC coefficients'' to zero (known as [[wikt:sparse|sparse]] data), which can then be more efficiently compressed by entropy coding (lossless compression) in the next step.

An example quantized DCT block:

:<math>
\begin{bmatrix}
 -26 & -3 & -6 &  2 &  2 & -1 & 0 & 0 \\
   0 & -2 & -4 &  1 &  1 &  0 & 0 & 0 \\
  -3 &  1 &  5 & -1 & -1 &  0 & 0 & 0 \\
  -4 &  1 &  2 & -1 &  0 &  0 & 0 & 0 \\
   1 &  0 &  0 &  0 &  0 &  0 & 0 & 0 \\
   0 &  0 &  0 &  0 &  0 &  0 & 0 & 0 \\
   0 &  0 &  0 &  0 &  0 &  0 & 0 & 0 \\
   0 &  0 &  0 &  0 &  0 &  0 & 0 & 0
\end{bmatrix}
</math>

Quantization eliminates a large amount of data, and is the main lossy processing step in MPEG-1 video encoding. This is also the primary source of most MPEG-1 video [[compression artifacts]], like [[blockiness]], [[color banding]], [[noise]], [[Ringing (signal)|ringing]], [[discoloration]], etc. This happens when video is encoded with an insufficient bitrate, and the encoder is therefore forced to use high frame-level quantizers (''strong quantization'') through much of the video.
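A simplified model of the divide-and-round step just described, assuming a frame-level quantizer of 1 so that each coefficient is divided by its matrix entry alone. The exact arithmetic, rounding and clamping rules of the standard are more involved and are omitted here.

<syntaxhighlight lang="python">
import numpy as np

def quantize(dct_block, quant_matrix, quantizer=1):
    """Simplified model of the quantization step described above.

    Each DCT coefficient is divided by the frame-level quantizer and by its
    entry in the quantization matrix, then rounded to an integer. The real
    standard adds further scaling and clamping, omitted here for clarity.
    """
    return np.round(dct_block / (quantizer * quant_matrix)).astype(int)

def dequantize(levels, quant_matrix, quantizer=1):
    """Approximate the original coefficients; the rounding error is the loss."""
    return levels * quantizer * quant_matrix

# With the example matrix, the large low-frequency coefficients survive as
# small integers while most high-frequency coefficients collapse to zero:
print(int(np.round(-415 / 16)))   # -26, the DC entry of the quantized example above
print(int(np.round(-2 / 51)))     # 0, a high-frequency entry that is discarded
</syntaxhighlight>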
===Entropy coding===
Several steps in the encoding of MPEG-1 video are lossless, meaning they will be reversed upon decoding to produce exactly the same (original) values. Since these lossless data compression steps don't add noise into, or otherwise change, the contents (unlike quantization), they are sometimes referred to as [[Source coding theorem|noiseless coding]].<ref name=mpeg1_audio/> Since lossless compression aims to remove as much redundancy as possible, it is known as [[entropy coding]] in the field of [[information theory]].

The coefficients of quantized DCT blocks tend to zero towards the bottom-right. Maximum compression can be achieved by zig-zag scanning of the DCT block starting from the top left and using run-length encoding techniques. The DC coefficients and motion vectors are [[Pulse-code modulation|DPCM]]-encoded.

[[Run-length encoding]] (RLE) is a simple method of compressing repetition. A sequential run of identical characters, no matter how long, can be replaced with a few bytes noting the value that repeats and how many times. For example, if someone were to say "five nines", you would know they mean the number 99999. RLE is particularly effective after quantization, as a significant number of the AC coefficients are now zero (called [[wikt:sparse|sparse]] data), and can be represented with just a couple of bytes. This is stored in a special 2-[[dimensional]] Huffman table that codes the run-length and the run-ending character.

[[Huffman coding]] is a very popular and relatively simple method of entropy coding, and is used in MPEG-1 video to reduce the data size. The data is analyzed to find strings that repeat often. Those strings are then put into a special table, with the most frequently repeating data assigned the shortest code. This keeps the data as small as possible with this form of compression.<ref name=mpeg1_audio/> Once the table is constructed, those strings in the data are replaced with their (much smaller) codes, which reference the appropriate entry in the table. The decoder simply reverses this process to produce the original data.

This is the final step in the video encoding process, so the result of [[Huffman coding]] is known as the MPEG-1 video "bitstream."
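The following sketch combines the zig-zag scan and run-length pairing described above into a minimal model of how a quantized block is serialized. The (run, level) tuples and the 'EOB' marker stand in for the actual variable-length code tables of the standard.

<syntaxhighlight lang="python">
# Conventional zig-zag scan order for an 8x8 block: walk the anti-diagonals
# from the top-left, alternating direction, so low-frequency coefficients
# come first and the trailing zeros bunch together at the end.
ZIGZAG = sorted(((i, j) for i in range(8) for j in range(8)),
                key=lambda p: (p[0] + p[1],
                               p[0] if (p[0] + p[1]) % 2 else p[1]))

def run_level_pairs(block):
    """Turn a quantized 8x8 block into (zero-run, value) pairs.

    A simplified stand-in for the run/level variable-length codes in the
    real bitstream; 'EOB' marks end-of-block once only zeros remain.
    """
    scanned = [block[i][j] for i, j in ZIGZAG[1:]]   # AC only; DC is DPCM-coded
    while scanned and scanned[-1] == 0:
        scanned.pop()          # the trailing zeros are covered by 'EOB'
    pairs, run = [], 0
    for value in scanned:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append('EOB')
    return pairs

# Applied to the sparse quantized block above, the long tail of zeros
# collapses into the single 'EOB' marker.
</syntaxhighlight>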
===GOP configurations for specific applications===
I-frames store complete frame info within the frame and are therefore suited for random access. P-frames provide compression using motion vectors relative to the previous frame (I or P). B-frames provide maximum compression, but require the previous as well as the next frame for computation. Therefore, processing of B-frames requires more buffers on the decoder side. A configuration of the [[Group of Pictures]] (GOP) should be selected based on these factors. I-frame-only sequences give the least compression, but are useful for random access, FF/FR (fast forward/fast reverse) and editability. I- and P-frame sequences give moderate compression, but add a certain degree of random access and FF/FR functionality. I-, P- and B-frame sequences give very high compression, but also increase the coding/decoding delay significantly. Such configurations are therefore not suited for video-telephony or video-conferencing applications.

The typical data rate of an I-frame is 1 bit per pixel, while that of a P-frame is 0.1 bit per pixel and that of a B-frame, 0.015 bit per pixel.<ref>{{cite web|url=http://bmrc.berkeley.edu/frame/research/mpeg/mpeg_overview.html |title=BMRC |access-date=2009-05-03 |url-status=dead |archive-url=https://web.archive.org/web/20090503020732/http://bmrc.berkeley.edu/frame/research/mpeg/mpeg_overview.html |archive-date=2009-05-03 }}</ref>
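Using these rough per-frame-type figures, the effect of GOP composition on data rate can be estimated with a short back-of-the-envelope calculation. The SIF resolution, 25 frame/s rate and 15-frame GOP pattern below are assumed purely for illustration.

<syntaxhighlight lang="python">
# Rough per-frame-type data rates cited above, in bits per pixel.
BITS_PER_PIXEL = {'I': 1.0, 'P': 0.1, 'B': 0.015}

def average_bitrate(gop_pattern, width, height, fps):
    """Average video bitrate (bit/s) implied by a GOP pattern.

    Assumes every frame of a given type costs the rough per-pixel figure
    above; real encoders vary these costs with content and quantizer.
    """
    pixels = width * height
    bits_per_gop = sum(BITS_PER_PIXEL[t] * pixels for t in gop_pattern)
    return bits_per_gop / len(gop_pattern) * fps

# 15-frame GOP with 1 I-frame, 4 P-frames and 10 B-frames, at SIF resolution:
print(average_bitrate("IBBPBBPBBPBBPBB", 352, 240, 25))   # ~218,000 bit/s

# The same stream coded with I-frames only needs roughly ten times as much:
print(average_bitrate("I" * 15, 352, 240, 25))            # ~2,112,000 bit/s
</syntaxhighlight>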