{{Short description|Signal representation used in automatic speech recognition}}
In [[sound processing]], the '''mel-frequency cepstrum''' ('''MFC''') is a representation of the short-term [[power spectrum]] of a sound, based on a [[Cosine transform|linear cosine transform]] of a [[Power spectrum|log power spectrum]] on a [[Nonlinear system|nonlinear]] [[mel scale]] of frequency.

'''Mel-frequency cepstral coefficients''' ('''MFCCs''') are coefficients that collectively make up an MFC.<ref>{{cite book | chapter = HMM-based audio keyword generation | author = Min Xu | title = Advances in Multimedia Information Processing – PCM 2004: 5th Pacific Rim Conference on Multimedia | editor1 = Kiyoharu Aizawa | editor2 = Yuichi Nakamura | editor3 = Shin'ichi Satoh | publisher = Springer | year = 2004 | isbn = 978-3-540-23985-7 | chapter-url = http://cemnet.ntu.edu.sg/home/asltchia/publication/AudioAnalysisUnderstanding/Conference/HMM-Based%20Audio%20Keyword%20Generation.pdf | archive-url = https://web.archive.org/web/20070510193153/http://cemnet.ntu.edu.sg/home/asltchia/publication/AudioAnalysisUnderstanding/Conference/HMM-Based%20Audio%20Keyword%20Generation.pdf | url-status = dead | archive-date = 2007-05-10 | display-authors = etal }}</ref> They are derived from a type of [[cepstrum|cepstral]] representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the [[cepstrum]] and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal spectrum. This frequency warping can allow for a better representation of sound, for example in [[Data compression#Audio|audio compression]], where it may reduce the transmission [[Bandwidth (computing)|bandwidth]] and the storage requirements of audio signals.
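
As an illustration of what "equally spaced on the mel scale" means in practice, the following minimal sketch converts between hertz and mel and lists band edges that are uniform in mel. It assumes the widely used formula ''m'' = 2595 log<sub>10</sub>(1 + ''f''/700), which is only one of several mel-scale conventions and is not specified in the text above; the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert hertz to mel (assumed convention: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel under the same assumed convention."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_edges(f_min: float, f_max: float, n_bands: int) -> list:
    """Return n_bands + 2 band-edge frequencies in Hz, equally spaced in mel."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]


# Edges for 10 bands between 0 Hz and 4 kHz: uniform in mel, widening in Hz.
for edge in mel_spaced_edges(0.0, 4000.0, 10):
    print(round(edge, 1))
```

Because the edges are uniform in mel, the bands are narrow at low frequencies and progressively wider at high frequencies when measured in hertz.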
MFCCs are commonly derived as follows:<ref>{{cite journal|last=Sahidullah|first=Md.|author2=Saha, Goutam|title=Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition|journal=Speech Communication|date=May 2012|volume=54|issue=4|pages=543–565|doi=10.1016/j.specom.2011.11.004|s2cid=14985832 }}</ref><ref>{{Cite journal |last1=Abdulsatar |first1=Assim Ara |last2=Davydov |first2=V V |last3=Yushkova |first3=V V |last4=Glinushkin |first4=A P |last5=Rud |first5=V Yu |date=2019-12-01 |title=Age and gender recognition from speech signals |journal=Journal of Physics: Conference Series |volume=1410 |issue=1 |pages=012073 |doi=10.1088/1742-6596/1410/1/012073 |bibcode=2019JPhCS1410a2073A |s2cid=213065622 |issn=1742-6588|doi-access=free }}</ref>
# Take the [[Fourier transform]] of (a windowed excerpt of) a signal.
# Map the powers of the spectrum obtained above onto the [[mel scale]], using [[Window function#Triangular window|triangular overlapping windows]] or alternatively, [[Hann function|cosine overlapping windows]].
# Take the [[logarithm|logs]] of the powers at each of the mel frequencies.
# Take the [[discrete cosine transform]] of the list of mel log powers, as if it were a signal.
# The MFCCs are the amplitudes of the resulting spectrum.

There can be variations on this process, for example: differences in the shape or spacing of the windows used to map the scale,<ref name=":0">Fang Zheng, Guoliang Zhang and Zhanjiang Song (2001), "[https://link.springer.com/article/10.1007%2FBF02943243?LI=true#page-1 Comparison of Different Implementations of MFCC]," ''J. Computer Science & Technology,'' 16(6): 582–589.</ref> or addition of dynamics features such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients.<ref name=":1">S. Furui (1986), "Speaker-independent isolated word recognition based on emphasized spectral dynamics"</ref>

The [[European Telecommunications Standards Institute]] in the early 2000s defined a standardised MFCC algorithm to be used in [[mobile phone]]s.<ref name="etsi01">European Telecommunications Standards Institute (2003), [http://webapp.etsi.org/workprogram/Report_WorkItem.asp?wki_id=18820 Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms]. Technical standard ES 201 108, v1.1.3.</ref>
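
The five derivation steps listed above can be sketched for a single frame as follows. This is a minimal illustration using only the Python standard library, not the ETSI algorithm or any cited implementation. Several choices are assumptions not taken from the text: the Hann analysis window, the mel formula 2595 log<sub>10</sub>(1 + ''f''/700), the unnormalised DCT-II, the defaults of 20 filters and 13 coefficients, the 10<sup>−10</sup> floor before the logarithm, and all function names. The naive DFT is written for clarity rather than speed.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # assumed mel convention


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Step 1: Hann-window the frame, return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Step 2: triangular overlapping windows, edges equally spaced in mel."""
    nyquist = sample_rate / 2.0
    m_hi = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies, expressed as fractional bin positions.
    edges = [mel_to_hz(m_hi * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, n_out):
    """Step 4: unnormalised DCT-II, keeping the first n_out outputs."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (j + 0.5) / n)
                for j, v in enumerate(values))
            for k in range(n_out)]


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    power = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(power), sample_rate)
    # Step 3: log of the power in each mel band (floored to avoid log(0)).
    log_mel = [math.log(max(sum(w * p for w, p in zip(row, power)), 1e-10))
               for row in bank]
    # Step 5: the DCT amplitudes are the MFCCs.
    return dct2(log_mel, n_coeffs)


# Usage: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
sr = 8000
frame = [math.sin(2.0 * math.pi * 440.0 * t / sr) for t in range(256)]
coeffs = mfcc(frame, sr)
print(len(coeffs))  # 13
```

Delta and delta-delta features, mentioned above as a variation, would be computed afterwards from the differences between the coefficient vectors of successive frames.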