{{Short description|Signal representation used in automatic speech recognition}}
In [[sound processing]], the '''mel-frequency cepstrum''' ('''MFC''') is a representation of the short-term [[power spectrum]] of a sound, based on a [[Cosine transform|linear cosine transform]] of a [[Power spectrum|log power spectrum]] on a [[Nonlinear system|nonlinear]] [[mel scale]] of frequency.

'''Mel-frequency cepstral coefficients''' ('''MFCCs''') are coefficients that collectively make up an MFC.<ref>{{cite book | chapter = HMM-based audio keyword generation | author = Min Xu | title = Advances in Multimedia Information Processing – PCM 2004: 5th Pacific Rim Conference on Multimedia | editor1 = Kiyoharu Aizawa | editor2 = Yuichi Nakamura | editor3 = Shin'ichi Satoh | publisher = Springer | year = 2004 | isbn = 978-3-540-23985-7 | chapter-url = http://cemnet.ntu.edu.sg/home/asltchia/publication/AudioAnalysisUnderstanding/Conference/HMM-Based%20Audio%20Keyword%20Generation.pdf | archive-url = https://web.archive.org/web/20070510193153/http://cemnet.ntu.edu.sg/home/asltchia/publication/AudioAnalysisUnderstanding/Conference/HMM-Based%20Audio%20Keyword%20Generation.pdf | url-status = dead | archive-date = 2007-05-10 | display-authors = etal }}</ref> They are derived from a type of [[cepstrum|cepstral]] representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the [[cepstrum]] and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the ordinary cepstrum. This frequency warping can allow for a better representation of sound, for example in [[Data compression#Audio|audio compression]], where it may reduce the transmission [[Bandwidth (computing)|bandwidth]] and the storage requirements of audio signals.

MFCCs are commonly derived as follows:<ref>{{cite journal|last=Sahidullah|first=Md.|author2=Saha, Goutam|title=Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition|journal=Speech Communication|date=May 2012|volume=54|issue=4|pages=543–565|doi=10.1016/j.specom.2011.11.004|s2cid=14985832 }}</ref><ref>{{Cite journal |last1=Abdulsatar |first1=Assim Ara |last2=Davydov |first2=V V |last3=Yushkova |first3=V V |last4=Glinushkin |first4=A P |last5=Rud |first5=V Yu |date=2019-12-01 |title=Age and gender recognition from speech signals |journal=Journal of Physics: Conference Series |volume=1410 |issue=1 |pages=012073 |doi=10.1088/1742-6596/1410/1/012073 |bibcode=2019JPhCS1410a2073A |s2cid=213065622 |issn=1742-6588|doi-access=free }}</ref>

# Take the [[Fourier transform]] of (a windowed excerpt of) a signal.
# Map the powers of the spectrum obtained above onto the [[mel scale]], using [[Window function#Triangular window|triangular overlapping windows]] or alternatively, [[Hann function|cosine overlapping windows]].
# Take the [[logarithm|logs]] of the powers at each of the mel frequencies.
# Take the [[discrete cosine transform]] of the list of mel log powers, as if it were a signal.
# The MFCCs are the amplitudes of the resulting spectrum.
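The steps above can be sketched in Python with NumPy for a single frame. This is an illustrative sketch, not a reference implementation: the mel formula <code>2595·log10(1 + f/700)</code>, the Hamming window, the 26 filters and 13 retained coefficients, the unnormalised DCT-II, and all function names (<code>hz_to_mel</code>, <code>mel_filterbank</code>, <code>dct_ii</code>, <code>mfcc_frame</code>) are assumptions chosen for the example; published implementations differ in each of these choices.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; others exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular overlapping windows whose edges are equally spaced in mel.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        left, centre, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def dct_ii(x):
    # Unnormalised DCT-II: X[k] = sum_m x[m] * cos(pi * k * (2m + 1) / (2n)).
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    # Step 1: Fourier transform of a windowed excerpt (power spectrum).
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    # Step 2: map the powers onto the mel scale with triangular windows.
    mel_power = mel_filterbank(n_filters, frame.size, sample_rate) @ power
    # Step 3: take logs (a small offset avoids log(0)).
    log_mel = np.log(mel_power + 1e-10)
    # Steps 4-5: DCT of the log powers; keep the first n_coeffs amplitudes.
    return dct_ii(log_mel)[:n_coeffs]

# Usage: a 25 ms frame of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(400) / sample_rate
coeffs = mfcc_frame(np.sin(2 * np.pi * 440.0 * t), sample_rate)
```

In practice the signal is split into short overlapping frames and the function is applied to each one, giving one coefficient vector per frame.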

Variations on this process exist, for example differences in the shape or spacing of the windows used to map the scale,<ref name=":0">Fang Zheng, Guoliang Zhang and Zhanjiang Song (2001), "[https://link.springer.com/article/10.1007%2FBF02943243?LI=true#page-1 Comparison of Different Implementations of MFCC]," ''J. Computer Science & Technology,'' 16(6): 582–589.</ref> or the addition of dynamics features such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients.<ref name=":1">S. Furui (1986), "Speaker-independent isolated word recognition based on emphasized spectral dynamics"</ref>
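As a rough illustration of the dynamics features, delta coefficients can be approximated by a central difference between neighbouring frames, and delta-delta coefficients by applying the same operation twice. The function name <code>delta</code>, the two-frame span, and the edge padding are assumptions made for this sketch; practical front ends often fit a regression over several frames instead.

```python
import numpy as np

def delta(features):
    # features has shape (n_frames, n_coeffs), one row per frame.
    # Repeat the first and last frames so the output keeps the same shape.
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    # Central difference: (next frame - previous frame) / 2.
    return (padded[2:] - padded[:-2]) / 2.0

# Usage: a toy sequence of four frames with one coefficient each.
frames = np.arange(4.0)[:, None]
d1 = delta(frames)   # first-order ("delta") features
d2 = delta(d1)       # second-order ("delta-delta") features
```

The static, delta, and delta-delta coefficients of each frame are typically concatenated into a single feature vector.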

The [[European Telecommunications Standards Institute]] in the early 2000s defined a standardised MFCC algorithm to be used in [[mobile phone]]s.<ref name="etsi01">European Telecommunications Standards Institute (2003), [http://webapp.etsi.org/workprogram/Report_WorkItem.asp?wki_id=18820 Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms]. Technical standard ES 201 108, v1.1.3.</ref>