
== Design ==

=== File structure ===
{{Panorama
|image   = File:Mp3filestructure.svg
|height  = 400
|alt     = Diagram of the structure of an MP3 file
|caption = Diagram of the structure of an MP3 file (MPEG version 2.5, not described here, changes the last bit of sync word to "0" as an indication, effectively moving one bit to the version field<ref name="MPEG-2.5-2" />).
}}

An MP3 file is made up of MP3 frames, each of which consists of a header and a data block. This sequence of frames is called an [[elementary stream]]. Due to the "bit reservoir", frames are not independent items and cannot usually be extracted on arbitrary frame boundaries. The MP3 data blocks contain the (compressed) audio information in terms of frequencies and amplitudes. The diagram shows that the MP3 header begins with a [[sync word]], which is used to identify the beginning of a valid frame. This is followed by a bit indicating that this is the [[MPEG]] standard and two bits that indicate that layer 3 is used; hence MPEG-1 Audio Layer 3 or MP3. After this, the values will differ, depending on the MP3 file. ''ISO/IEC 11172-3'' defines the range of values for each section of the header along with the specification of the header. Most MP3 files today contain [[ID3]] [[metadata]], which precedes or follows the MP3 frames, as noted in the diagram. The data stream can contain an optional [[checksum]].
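
The header layout described above can be illustrated with a short Python sketch. This is only a minimal example, not a reference parser: the function name <code>parse_frame_header</code> and the dictionary keys are hypothetical, and the sketch assumes the common reading in which the first 11 bits are the sync pattern and the next two bits select the version (which also covers the MPEG-2.5 case mentioned in the diagram caption).

```python
import struct

# Field values of the MPEG audio frame header.
VERSIONS = {0b11: "MPEG-1", 0b10: "MPEG-2", 0b00: "MPEG-2.5"}  # 0b01 is reserved
LAYERS = {0b01: "Layer III", 0b10: "Layer II", 0b11: "Layer I"}  # 0b00 is reserved


def parse_frame_header(data: bytes) -> dict:
    """Decode the leading fields of a 4-byte MPEG audio frame header."""
    if len(data) < 4:
        raise ValueError("need at least 4 bytes")
    (word,) = struct.unpack(">I", data[:4])  # big-endian 32-bit header word

    # The top 11 bits must all be set. The 12th bit of the classic sync word
    # is read here as part of the 2-bit version field (the MPEG-2.5 view).
    if (word >> 21) != 0x7FF:
        raise ValueError("sync word not found")

    version_bits = (word >> 19) & 0b11
    layer_bits = (word >> 17) & 0b11
    if version_bits not in VERSIONS or layer_bits not in LAYERS:
        raise ValueError("reserved version or layer value")

    return {
        "version": VERSIONS[version_bits],
        "layer": LAYERS[layer_bits],
        "crc_protected": ((word >> 16) & 1) == 0,  # 0 means a checksum follows
        "bitrate_index": (word >> 12) & 0b1111,
        "sample_rate_index": (word >> 10) & 0b11,
        "padding": bool((word >> 9) & 1),
    }


if __name__ == "__main__":
    print(parse_frame_header(bytes([0xFF, 0xFB, 0x90, 0x00])))
```

A real parser would also have to skip metadata tags and confirm that the next frame starts where the current one ends, since the sync pattern can occur by chance inside audio data.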

[[Joint stereo]] is done only on a frame-to-frame basis.<ref name="Limitations"/>

=== Encoding and decoding ===
In short, MP3 compression works by reducing the accuracy of certain components of sound that are considered (by psychoacoustic analysis) to be beyond the [[Hearing range#Humans|hearing capabilities]] of most humans. This method is commonly referred to as perceptual coding or [[psychoacoustic]] modeling.<ref name="Jayant1993" /> The remaining audio information is then recorded in a space-efficient manner using [[MDCT]] and [[FFT]] algorithms.

The MP3 encoding algorithm is generally split into four parts. Part 1 divides the audio signal into smaller pieces, called frames, and an MDCT filter is then applied to the output. Part 2 passes the sample into a 1024-point [[fast Fourier transform]] (FFT), then the [[psychoacoustic]] model is applied and another MDCT filter is applied to the output. Part 3 quantizes and encodes each sample, a step known as noise allocation, which adjusts itself to meet the bit rate and [[sound masking]] requirements. Part 4 formats the [[bitstream]], called an audio frame, which is made up of four parts: the [[Header (computing)|header]], [[Error checking|error check]], [[audio data]], and [[#Ancillary data|ancillary data]].<ref name="Guckert"/>

The [[MPEG-1]] standard does not include a precise specification for an MP3 encoder but does provide examples of psychoacoustic models, rate loops, and the like in the non-normative part of the original standard.<ref name="mpeg1" /> MPEG-2 doubles the number of sampling rates that are supported and MPEG-2.5 adds 3 more. When this was written, the suggested implementations were quite dated. Implementers of the standard were supposed to devise algorithms suitable for removing parts of the information from the audio input. As a result, many different MP3 encoders became available, each producing files of differing quality. Comparisons were widely available, so it was easy for a prospective user of an encoder to research the best choice. Some encoders that were proficient at encoding at higher bit rates (such as [[LAME]]) were not necessarily as good at lower bit rates. Over time, LAME evolved on the SourceForge website until it became the de facto CBR MP3 encoder. Later an ABR mode was added. Work progressed on true [[variable bit rate]] using a quality goal between 0 and 10. Eventually, numbers (such as -V 9.600) could generate excellent quality low bit rate voice encoding at only {{nowrap|41 kbit/s}} using the MPEG-2.5 extensions.

MP3 uses an overlapping MDCT structure. Each MPEG-1 MP3 frame is 1152 samples, divided into two granules of 576 samples. These samples, initially in the time domain, are transformed in one block to 576 [[Fourier Transform|frequency-domain samples]] by MDCT.<ref>{{cite web |last=Taylor |first=Mark |date=June 2000 |title=LAME Technical FAQ |url=https://lame.sourceforge.io/tech-FAQ.txt |access-date=9 December 2023 |archive-date=8 December 2023 |archive-url=https://web.archive.org/web/20231208232048/https://lame.sourceforge.io/tech-FAQ.txt |url-status=live }}</ref> MP3 also allows the use of shorter blocks in a granule, down to a size of 192 samples; this feature is used when a [[Transient (acoustics)|transient]] is detected. Doing so limits the temporal spread of quantization noise accompanying the transient (see [[psychoacoustics]]). Frequency resolution is limited by the small long block window size, which decreases coding efficiency.<ref name="Limitations"/> Time resolution can be too low for highly transient signals and may cause smearing of percussive sounds.<ref name="Limitations" />
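
The frame and granule sizes above translate directly into timing figures. The following Python sketch works through the arithmetic; the function names are hypothetical, and the mean frame size ignores the per-frame padding byte, so it is an average rather than the size of any particular frame.

```python
SAMPLES_PER_FRAME = 1152   # MPEG-1 Layer III frame, as stated in the text
GRANULES_PER_FRAME = 2
SHORT_BLOCK_SAMPLES = 192  # smallest block size mentioned in the text

samples_per_granule = SAMPLES_PER_FRAME // GRANULES_PER_FRAME          # 576
short_blocks_per_granule = samples_per_granule // SHORT_BLOCK_SAMPLES  # 3


def frame_duration_ms(sample_rate_hz: int) -> float:
    """Playing time covered by one frame, in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME / sample_rate_hz


def mean_frame_bytes(bit_rate_bps: int, sample_rate_hz: int) -> float:
    """Average bytes per frame at a constant bit rate (padding ignored)."""
    return bit_rate_bps * SAMPLES_PER_FRAME / sample_rate_hz / 8


if __name__ == "__main__":
    print(samples_per_granule, short_blocks_per_granule)
    print(round(frame_duration_ms(44_100), 3), "ms per frame at 44.1 kHz")
    print(round(mean_frame_bytes(128_000, 44_100), 1), "bytes per frame")
```

At 44.1&nbsp;kHz a frame therefore covers about 26&nbsp;ms of audio, and a short block covers one third of a granule.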

Due to the tree structure of the filter bank, pre-echo problems are made worse, as the combined impulse response of the two filter banks does not, and cannot, provide an optimum solution in time/frequency resolution.<ref name="Limitations"/> Additionally, the combining of the two filter banks' outputs creates aliasing problems that must be handled partially by the "aliasing compensation" stage; however, that creates excess energy to be coded in the frequency domain, thereby decreasing coding efficiency.<ref>{{Cite book|last=Liberman|first=Serbio|title=DSP - The Technology Behind Multimedia|language=English}}</ref>

Decoding, on the other hand, is carefully defined in the standard. Most [[Codec|decoders]] are "[[Elementary stream|bitstream]] compliant", which means that the decompressed output that they produce from a given MP3 file will be the same, within a specified degree of [[rounding]] tolerance, as the output specified mathematically in the ISO/IEC standard document (ISO/IEC 11172-3). Therefore, the comparison of decoders is usually based on how computationally efficient they are (i.e., how much [[computer memory|memory]] or [[CPU]] time they use in the decoding process). Over time this concern has become less of an issue as [[CPU clock rate]]s transitioned from MHz to GHz. Encoder/decoder overall delay is not defined, which means there is no official provision for [[gapless playback]]. However, some encoders such as LAME can attach additional metadata that will allow players that can handle it to deliver seamless playback.

=== Quality ===
When performing lossy audio encoding, such as creating an MP3 data stream, there is a trade-off between the amount of data generated and the sound quality of the results. The person generating an MP3 selects a bit rate, which specifies how many [[kilobits]] per second of audio is desired. The higher the bit rate, the larger the MP3 data stream will be, and, generally, the closer it will sound to the original recording. With too low a bit rate, [[compression artifact]]s (i.e., sounds that were not present in the original recording) may be audible in the reproduction. Some audio is hard to compress because of its randomness and sharp attacks. When this type of audio is compressed, artifacts such as ringing or [[pre-echo]] are usually heard. A sample of applause or a [[Triangle (musical instrument)|triangle instrument]] with a relatively low bit rate provides good examples of compression artifacts. Most subjective tests of perceptual codecs tend to avoid using these types of sound material; however, the artifacts generated by percussive sounds are barely perceptible due to the specific temporal masking feature of the 32 sub-band filterbank of Layer II on which the format is based.

Besides the bit rate of an encoded piece of audio, the quality of MP3-encoded sound also depends on the quality of the encoder algorithm as well as the complexity of the signal being encoded. As the MP3 standard allows quite a bit of freedom with encoding algorithms, different encoders do feature quite different quality, even with identical bit rates. As an example, in a public listening test featuring two early MP3 encoders set at about {{nowrap|128 kbit/s}},<ref name="Amorim" /> one scored 3.66 on a 1–5 scale, while the other scored only 2.22. Quality is dependent on the choice of encoder and encoding parameters.<ref name="listening-test-128-2006" />

This observation caused a revolution in audio encoding. Early on, bit rate was the prime and only consideration. At the time, MP3 files were of the very simplest type: they used the same bit rate for the entire file, a process known as [[constant bit rate]] (CBR) encoding. Using a constant bit rate makes encoding simpler and less CPU-intensive. However, it is also possible to optimize the size of the file by creating files where the bit rate changes throughout the file. These are known as variable bit rate (VBR) files. The bit reservoir and VBR encoding were part of the original MPEG-1 standard. The concept behind them is that, in any piece of audio, some sections are easier to compress, such as silence or music containing only a few tones, while others will be more difficult to compress. So, the overall quality of the file may be increased by using a lower bit rate for the less complex passages and a higher one for the more complex parts. With some advanced MP3 encoders, it is possible to specify a given quality, and the encoder will adjust the bit rate accordingly. Users who desire a particular "quality setting" that is [[Transparency (data compression)|transparent]] to their ears can use this value when encoding all of their music, and generally need not worry about performing personal listening tests on each piece of music to determine the correct bit rate.

Perceived quality can be influenced by the listening environment (ambient noise), listener attention, listener training, and in most cases by listener audio equipment (such as sound cards, speakers, and headphones). Furthermore, a lower quality setting may be sufficient for lectures and other human speech applications, and it reduces encoding time and complexity. A test given to new students by [[Stanford University]] Music Professor Jonathan Berger showed that student preference for MP3-quality music has risen each year. Berger said the students seem to prefer the 'sizzle' sounds that MP3s bring to music.<ref name="Dougherty"/>

An in-depth study of MP3 audio quality, sound artist and composer [[Ryan Maguire]]'s project "The Ghost in the MP3" isolates the sounds lost during MP3 compression. In 2015, he released the track "moDernisT" (an anagram of "Tom's Diner"), composed exclusively from the sounds deleted during MP3 compression of the song "Tom's Diner",<ref name="noisey" /><ref name="schroeder2015" /><ref name="hull2015" /> the track originally used in the formulation of the MP3 standard. A detailed account of the techniques used to isolate the sounds deleted during MP3 compression, along with the conceptual motivation for the project, was published in the 2014 Proceedings of the International Computer Music Conference.<ref name="Maguire2014" />

=== Bit rate ===
{| class="wikitable infobox"
|+MPEG Audio Layer III<br />available bit rates ({{nowrap|kbit/s}})<ref name="neuron2-cd-1991" /><ref name="MPEG-2.5" /><ref name="MPEG-2.5-2" /><ref name="mp3tech-iso13818-3" /><ref>{{cite web |title=Guide to command line options (in CVS) |url=https://lame.cvs.sourceforge.net/viewvc/lame/lame/USAGE |url-status=dead |archive-url=https://web.archive.org/web/20130408110355/http://lame.cvs.sourceforge.net/viewvc/lame/lame/USAGE |archive-date=8 April 2013 |access-date=4 August 2010}}</ref>
|-
! MPEG-1<br />Audio Layer III
! MPEG-2<br />Audio Layer III
! MPEG-2.5<br />Audio Layer III
|-
| –
| 8
| 8
|-
| –
| 16
| 16
|-
| –
| 24
| 24
|-
| 32
| 32
| 32
|-
| 40
| 40
| 40
|-
| 48
| 48
| 48
|-
| 56
| 56
| 56
|-
| 64
| 64
| 64
|-
| 80
| 80
| –
|-
| 96
| 96
| –
|-
| 112
| 112
| –
|-
| 128
| 128
| –
|-
| –
| 144
| –
|-
| 160
| 160
| –
|-
| 192
| –
| –
|-
| 224
| –
| –
|-
| 256
| –
| –
|-
| 320
| –
| –
|}

{| class="wikitable infobox"
|+Supported sampling rates<br />by MPEG Audio Format<ref name="neuron2-cd-1991" /><ref name="MPEG-2.5" /><ref name="MPEG-2.5-2" /><ref name="mp3tech-iso13818-3" />
|-
! MPEG-1<br />Audio Layer III
! MPEG-2<br />Audio Layer III
! MPEG-2.5<br />Audio Layer III
|-
| –
| –
| 8&nbsp;kHz
|-
| –
| –
| 11.025&nbsp;kHz
|-
| –
| –
| 12&nbsp;kHz
|-
| –
| 16&nbsp;kHz
| –
|-
| –
| 22.05&nbsp;kHz
| –
|-
| –
| 24&nbsp;kHz
| –
|-
| 32&nbsp;kHz
| –
| –
|-
| 44.1&nbsp;kHz
| –
| –
|-
| 48&nbsp;kHz
| –
| –
|}
{{more citations needed section|date=July 2020}}
Bit rate is the product of the sample rate and number of bits per sample used to encode the music. CD audio is 44100 samples per second. The number of bits per sample also depends on the number of audio channels. The CD is stereo and 16 bits per channel. So, multiplying 44100 by 32 gives 1411200—the bit rate of uncompressed CD digital audio. MP3 was designed to encode this {{nowrap|1411 kbit/s}} data at {{nowrap|320 kbit/s}} or less. If less complex passages are detected by the MP3 algorithms then lower bit rates may be employed. When using MPEG-2 instead of MPEG-1, MP3 supports only lower sampling rates (16,000, 22,050, or 24,000 samples per second) and offers choices of bit rate as low as {{nowrap|8 kbit/s}} but no higher than {{nowrap|160 kbit/s}}. By lowering the sampling rate, MPEG-2 layer III removes all frequencies above half the new sampling rate that may have been present in the source audio.
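
The CD figure quoted above follows from a single multiplication, sketched here in Python. The function name <code>pcm_bit_rate</code> is hypothetical and is used only for illustration.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


if __name__ == "__main__":
    cd = pcm_bit_rate(44_100, 16, 2)  # stereo CD audio, 16 bits per channel
    print(cd)         # 1411200
    print(cd / 1000)  # 1411.2
```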

As shown in these two tables, 14 selected bit rates are allowed in MPEG-1 Audio Layer III standard: 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256 and {{nowrap|320 kbit/s}}, along with the 3 highest available sampling rates of 32, 44.1 and 48&nbsp;[[kHz]].<ref name="MPEG-2.5-2" /> MPEG-2 Audio Layer III also allows 14 somewhat different (and mostly lower) bit rates of 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, {{nowrap|160 kbit/s}} with sampling rates of 16, 22.05 and 24&nbsp;[[kHz]] which are exactly half that of MPEG-1.<ref name="MPEG-2.5-2" /> MPEG-2.5 Audio Layer III frames are limited to only 8 bit rates of 8, 16, 24, 32, 40, 48, 56 and {{nowrap|64 kbit/s}} with 3 even lower sampling rates of 8, 11.025, and 12&nbsp;kHz.{{Citation needed|reason=Based on results from the LAME encoder, these do seem to be the actual bit rates supported by MPEG-2.5, but official documents claim MPEG-2.5 has the same possible bit rates as MPEG-2. Answer: Bitrate switching implies VBR so, it is not CBR anymore. When MPEG-2 frames are needed instead of the smaller 2.5 frames, the former are generated. Can we find a source that mentions this limitation?|date=December 2013}} On earlier systems that only support the MPEG-1 Audio Layer III standard, MP3 files with a bit rate below {{nowrap|32 kbit/s}} might be played back sped-up and pitched-up.
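
The two tables can be expressed as a small lookup structure. The Python sketch below simply transcribes the values listed in this section; the names <code>ALLOWED</code> and <code>is_listed_combination</code> are hypothetical, and the shorter MPEG-2.5 bit rate list is taken as given here even though it is marked above as needing a source.

```python
# Bit rates in kbit/s and sampling rates in Hz, transcribed from the tables.
ALLOWED = {
    "MPEG-1": {
        "bit_rates": (32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320),
        "sample_rates": (32_000, 44_100, 48_000),
    },
    "MPEG-2": {
        "bit_rates": (8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160),
        "sample_rates": (16_000, 22_050, 24_000),
    },
    "MPEG-2.5": {
        # As listed in this article; the limit to 8 rates is not sourced.
        "bit_rates": (8, 16, 24, 32, 40, 48, 56, 64),
        "sample_rates": (8_000, 11_025, 12_000),
    },
}


def is_listed_combination(version: str, bit_rate_kbps: int, sample_rate_hz: int) -> bool:
    """Return True if the pair appears in the tables for the given version."""
    entry = ALLOWED.get(version)
    if entry is None:
        return False
    return bit_rate_kbps in entry["bit_rates"] and sample_rate_hz in entry["sample_rates"]


if __name__ == "__main__":
    print(is_listed_combination("MPEG-1", 128, 44_100))  # True
    print(is_listed_combination("MPEG-1", 144, 44_100))  # False
    print(is_listed_combination("MPEG-2", 144, 22_050))  # True
```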

Earlier systems also lack [[fast forward]]ing and rewinding playback controls on MP3.<ref>{{cite web|quote=Search – locating a desired position on thedisc (audio CD only) |url=http://resources.jvc.com/Resources/00/00/95/lvt1213-001b.pdf |archive-url=https://web.archive.org/web/20200820112149if_/http://resources.jvc.com/Resources/00/00/95/lvt1213-001b.pdf  |archive-date=2020-08-20 |language=mul |page=14 |title=JVC RC-EX30 operation manual |date=2004 }} (2004 [[boombox]])</ref><ref>{{cite web |url=https://www.sharp.co.uk/cps/rde/xbcr/documents/documents/om/13_dvd/DVRW250H_OM_GB.pdf |quote=• Fast forward and review playback does not work with a MP3/WMA/JPEG-CD. |page=33 |language=en-gb |title=DV-RW250H Operation-Manual GB |date=2004 |access-date=20 August 2020 |archive-date=20 August 2020 |archive-url=https://web.archive.org/web/20200820113949/https://www.sharp.co.uk/cps/rde/xbcr/documents/documents/om/13_dvd/DVRW250H_OM_GB.pdf |url-status=live }}</ref>

MPEG-1 frames contain the most detail in {{nowrap|320 kbit/s}} mode, the highest allowable bit rate setting,<ref>{{cite web |title=Sound Quality Comparison of Hi-Res Audio vs. CD vs. MP3 |url=https://www.sony.com/electronics/hi-res-audio-mp3-cd-sound-quality-comparison |website=www.sony.com |publisher=[[Sony]] |access-date=11 August 2020 |language=en |archive-date=14 September 2020 |archive-url=https://web.archive.org/web/20200914005253/https://www.sony.com/electronics/hi-res-audio-mp3-cd-sound-quality-comparison |url-status=live }}</ref> with silence and simple tones still requiring {{nowrap|32 kbit/s}}. MPEG-2 frames can reproduce frequencies up to 12&nbsp;kHz at bit rates up to {{nowrap|160 kbit/s}}. MP3 files made with MPEG-2 do not have 20&nbsp;kHz bandwidth because of the [[Nyquist–Shannon sampling theorem]]. Frequency reproduction is always strictly less than half of the sampling rate, and imperfect filters require a larger margin for error (noise level versus sharpness of filter), so an 8&nbsp;kHz sampling rate limits the maximum frequency to 4&nbsp;kHz, while a 48&nbsp;kHz sampling rate limits an MP3 to a maximum 24&nbsp;kHz sound reproduction. MPEG-2 uses half, and MPEG-2.5 only a quarter, of the MPEG-1 sample rates.
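
The relationship between sampling rate and reproducible bandwidth can be checked with a few lines of Python. This is a sketch with hypothetical names; it computes only the theoretical upper bound, and real encoders place their lowpass filter somewhat below it.

```python
MPEG1_SAMPLE_RATES = (32_000, 44_100, 48_000)  # Hz


def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Upper bound on reproducible frequency for a given sampling rate."""
    return sample_rate_hz / 2


# MPEG-2 halves and MPEG-2.5 quarters the MPEG-1 sampling rates.
mpeg2_rates = tuple(rate // 2 for rate in MPEG1_SAMPLE_RATES)
mpeg25_rates = tuple(rate // 4 for rate in MPEG1_SAMPLE_RATES)

if __name__ == "__main__":
    print(mpeg2_rates)   # (16000, 22050, 24000)
    print(mpeg25_rates)  # (8000, 11025, 12000)
    for rate in (8_000, 24_000, 48_000):
        print(rate, "Hz sampling ->", nyquist_limit_hz(rate), "Hz bound")
```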

For the general field of human speech reproduction, a bandwidth of 5,512&nbsp;Hz is sufficient to produce excellent results (for voice) using a sampling rate of 11,025&nbsp;Hz and VBR encoding from a standard 44,100&nbsp;Hz WAV file. English speakers average 41–{{nowrap|42 kbit/s}} with the -V 9.6 setting, but this may vary with the amount of silence recorded or the rate of delivery (wpm). Resampling to 12,000&nbsp;Hz (6&nbsp;kHz bandwidth) is selected by the LAME parameter -V 9.4. Likewise, -V 9.2 selects a 16,000&nbsp;Hz sample rate and a resultant 8&nbsp;kHz lowpass filter. Older versions of LAME and FFmpeg only support integer arguments for the variable bit rate quality selection parameter. The n.nnn quality parameter (-V) is documented at lame.sourceforge.net but is only supported in LAME with the new-style VBR quality selector, not with average bit rate (ABR).

A sample rate of 44.1&nbsp;kHz is commonly used for music reproduction because this is also used for [[Red Book (audio CD standard)|CD audio]], the main source used for creating MP3 files. A great variety of bit rates are used on the Internet. A bit rate of {{nowrap|128 kbit/s}} is commonly used,<ref name="Woon-Seng" /> at a compression ratio of 11:1, offering adequate audio quality in a relatively small space. As Internet [[bandwidth (computing)|bandwidth]] availability and hard drive sizes have increased, higher bit rates up to {{nowrap|320 kbit/s}} are widespread. Uncompressed audio as stored on an audio-CD has a bit rate of {{nowrap|1,411.2 kbit/s}}, (16 bit/sample&nbsp;× 44,100 samples/second&nbsp;× 2 channels&nbsp;/ 1,000 bits/kilobit), so the bit rates 128, 160, and {{nowrap|192 kbit/s}} represent [[Data compression ratio|compression ratios]] of approximately 11:1, 9:1 and 7:1 respectively.
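
The compression ratios quoted above can be reproduced with a short Python sketch. The function names are hypothetical, and the storage figure uses decimal megabytes (1,000,000 bytes) as an assumption.

```python
CD_KBPS = 1411.2  # uncompressed CD audio, from the text


def compression_ratio(mp3_kbps: float) -> float:
    """Ratio of the CD bit rate to the MP3 bit rate."""
    return CD_KBPS / mp3_kbps


def megabytes_per_minute(kbps: float) -> float:
    """Storage needed for one minute of audio, in decimal megabytes."""
    return kbps * 1000 * 60 / 8 / 1_000_000


if __name__ == "__main__":
    for rate in (128, 160, 192):
        ratio = compression_ratio(rate)
        size = megabytes_per_minute(rate)
        print(f"{rate} kbit/s: about {ratio:.1f}:1, {size:.2f} MB per minute")
```

Rounded to whole numbers, the three ratios come out as 11:1, 9:1 and 7:1, matching the figures in the text.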

Non-standard bit rates up to {{nowrap|640 kbit/s}} can be achieved with the [[LAME]] encoder and the free format option, although few MP3 players can play those files. According to the ISO standard, decoders are only required to be able to decode streams up to {{nowrap|320 kbit/s}}.<ref name="Bouvigne" /><ref>{{Cite web|title=lame(1): create mp3 audio files - Linux man page|url=https://linux.die.net/man/1/lame|access-date=2020-08-22|website=linux.die.net|archive-date=22 August 2020|archive-url=https://web.archive.org/web/20200822103430/https://linux.die.net/man/1/lame|url-status=live}}</ref><ref>{{Cite web|title=Linux Manpages Online - man.cx manual pages|url=https://man.cx/lame|access-date=2020-08-22|website=man.cx|archive-date=22 August 2020|archive-url=https://web.archive.org/web/20200822103425/https://man.cx/lame|url-status=live}}</ref> Early MPEG Layer III encoders used what is now called [[constant bit rate]] (CBR). The software was only able to use a uniform bit rate on all frames in an MP3 file. Later more sophisticated MP3 encoders were able to use the bit reservoir to target an [[average bit rate]] selecting the encoding rate for each frame based on the complexity of the sound in that portion of the recording.

A more sophisticated MP3 encoder can produce variable bit rate audio. MPEG audio may use bit rate switching on a per-frame basis, but only layer III decoders must support it.<ref name="MPEG-2.5-2" /><ref name="LAME_GPSYCHO" /><ref name="TwoLAME" /><ref name="MPEG-1 and MPEG-2 BC" /> VBR is used when the goal is to achieve a fixed level of quality. The final file size of a VBR encoding is less predictable than with constant bit rate. Average bit rate is a type of VBR implemented as a compromise between the two: the bit rate is allowed to vary for more consistent quality, but is controlled to remain near an average value chosen by the user, for predictable file sizes. Although an MP3 decoder must support VBR to be standards compliant, historically some decoders have bugs with VBR decoding, particularly before VBR encoders became widespread. The most evolved LAME MP3 encoder supports the generation of VBR, ABR, and even the older CBR MP3 formats.
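
The difference between CBR and VBR can be illustrated by treating a stream as a list of per-frame bit rates. The Python sketch below is a simplified model, not an encoder: the function names are hypothetical, every frame is assumed to be an MPEG-1 frame of 1152 samples at 44.1&nbsp;kHz, and padding bytes and the bit reservoir are ignored.

```python
def average_bit_rate_kbps(frame_bit_rates_kbps: list) -> float:
    """Mean bit rate; valid because every frame covers the same duration."""
    return sum(frame_bit_rates_kbps) / len(frame_bit_rates_kbps)


def stream_size_bytes(frame_bit_rates_kbps: list,
                      sample_rate_hz: int = 44_100,
                      samples_per_frame: int = 1152) -> float:
    """Approximate size of the audio frames, ignoring padding bytes."""
    seconds_per_frame = samples_per_frame / sample_rate_hz
    return sum(rate * 1000 * seconds_per_frame / 8 for rate in frame_bit_rates_kbps)


if __name__ == "__main__":
    cbr = [128, 128, 128, 128]  # same rate for every frame
    vbr = [32, 32, 128, 320]    # low rate for simple frames, high for complex
    print(average_bit_rate_kbps(cbr), average_bit_rate_kbps(vbr))
    print(round(stream_size_bytes(cbr)), round(stream_size_bytes(vbr)))
```

In this toy example both streams have the same average bit rate and therefore the same size, but the variable stream spends its bits unevenly across frames.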

Layer III audio can also use a "bit reservoir", a partially full frame's ability to hold part of the next frame's audio data, allowing temporary changes in effective bit rate, even in a constant bit rate stream.<ref name="MPEG-2.5-2" /><ref name="LAME_GPSYCHO" /> Internal handling of the bit reservoir increases encoding delay.{{citation needed| date=December 2010}} There is no scale factor band 21 (sfb21) for frequencies above approx 16&nbsp;[[kHz]], forcing the encoder to choose between less accurate representation in band 21 or less efficient storage in all bands below band 21, the latter resulting in wasted bit rate in VBR encoding.<ref name="LAME Y" />

=== Ancillary data ===
The ancillary data field can be used to store user-defined data. The ancillary data is optional and the number of bits available is not explicitly given. The ancillary data is located after the Huffman code bits and extends to the position that the next frame's main_data_begin points to. The [[mp3PRO]] encoder used ancillary data to encode extra information which could improve audio quality when decoded with its algorithm.

=== Metadata ===
{{main|ID3|APEv2 tag}}

A "tag" in an audio file is a section of the file that contains [[metadata]] such as the title, artist, album, track number, or other information about the file's contents. The MP3 standards do not define tag formats for MP3 files, nor is there a standard [[container format]] that would support metadata and obviate the need for tags. However, several ''de facto'' standards for tag formats exist. As of 2010, the most widespread are [[ID3|ID3v1 and ID3v2]], and the more recently introduced [[APEv2 tag|APEv2]]. These tags are normally embedded at the beginning or end of MP3 files, separate from the actual MP3 frame data. MP3 decoders either extract information from the tags or just treat them as ignorable, non-MP3 junk data.
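
Because ID3 tags sit outside the frame data, they can be located by looking at the start and end of a file. The Python sketch below shows one way this might be done; the function names are hypothetical, and it relies on details of the ID3 formats that are not described in this section (a 10-byte ID3v2 header beginning with "ID3" and carrying a four-byte "syncsafe" size, and a fixed 128-byte ID3v1 block beginning with "TAG" at the end of the file). APEv2 tags and the optional ID3v2 footer are not handled.

```python
def syncsafe_to_int(four_bytes: bytes) -> int:
    """Decode an ID3v2 size field, which stores 7 useful bits per byte."""
    value = 0
    for byte in four_bytes:
        value = (value << 7) | (byte & 0x7F)
    return value


def _text(field: bytes) -> str:
    """Cut an ID3v1 fixed-width field at the first NUL and decode it."""
    return field.split(b"\x00", 1)[0].decode("latin-1").strip()


def find_id3_tags(data: bytes) -> dict:
    """Report the ID3v2 tag length (if any) and the ID3v1 fields (if any)."""
    result = {"id3v2_size": None, "id3v1": None}

    # ID3v2: 10-byte header at the very start, size excludes the header itself.
    if len(data) >= 10 and data[:3] == b"ID3":
        result["id3v2_size"] = 10 + syncsafe_to_int(data[6:10])

    # ID3v1: fixed 128-byte block at the very end of the file.
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        tag = data[-128:]
        result["id3v1"] = {
            "title": _text(tag[3:33]),
            "artist": _text(tag[33:63]),
            "album": _text(tag[63:93]),
            "year": _text(tag[93:97]),
        }
    return result
```

A decoder that does not interpret tags can use the reported ID3v2 length to skip straight to the first frame.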

Playing and editing software often contains tag editing functionality, but there are also [[tag editor]] applications dedicated to the purpose. Aside from metadata about the audio content, tags may also be used for [[Digital rights management|DRM]].<ref name="Rae" /> [[ReplayGain]] is a standard for measuring and storing the loudness of an MP3 file ([[audio normalization]]) in its metadata tag, enabling a ReplayGain-compliant player to automatically adjust the overall playback volume for each file. [[MP3Gain]] may be used to reversibly modify files based on ReplayGain measurements so that adjusted playback can be achieved on players without ReplayGain capability.
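
The two approaches to loudness adjustment can be contrasted with a small Python sketch. The names are hypothetical, and the step size is an assumption not stated in this section: it supposes that a file-modifying tool changes a per-frame gain field in which one unit scales the amplitude by a factor of 2<sup>1/4</sup>, or roughly 1.5&nbsp;dB, so that only whole steps can be applied.

```python
import math

# Assumed size of one gain-field step, in decibels (about 1.505 dB).
GAIN_STEP_DB = 20 * math.log10(2 ** 0.25)


def gain_db_to_linear(gain_db: float) -> float:
    """Amplitude factor a player would apply for a stored gain value."""
    return 10 ** (gain_db / 20)


def nearest_gain_steps(gain_db: float) -> int:
    """Whole number of gain-field steps closest to the requested gain."""
    return round(gain_db / GAIN_STEP_DB)


if __name__ == "__main__":
    wanted_db = -6.0  # example adjustment taken from a loudness measurement
    steps = nearest_gain_steps(wanted_db)
    print(f"player scales samples by {gain_db_to_linear(wanted_db):.3f}")
    print(f"file edit applies {steps} steps = {steps * GAIN_STEP_DB:.2f} dB")
```

Under this assumption, a player that reads the tag can apply the exact gain, while a modified file can only approximate it to the nearest step; the change stays reversible because the same number of steps can be subtracted again.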