==Applications==
MFCCs are commonly used as [[Features (pattern recognition)|features]] in [[speech recognition]]<ref>T. Ganchev, N. Fakotakis, and G. Kokkinakis (2005), "[http://www.wcl.ece.upatras.gr/ganchev/Papers/ganchev17.pdf Comparative evaluation of various MFCC implementations on the speaker verification task] {{webarchive|url=https://web.archive.org/web/20110717210107/http://www.wcl.ece.upatras.gr/ganchev/Papers/ganchev17.pdf |date=2011-07-17 }}," in ''10th International Conference on Speech and Computer (SPECOM 2005),'' Vol. 1, pp. 191–194.</ref> systems, such as systems that automatically recognize numbers spoken into a telephone. MFCCs are also increasingly used in [[music information retrieval]] applications such as [[genre]] classification and audio similarity measures.<ref>{{cite book | title = Information Retrieval for Music and Motion | author = Meinard Müller | publisher = Springer | year = 2007 | isbn = 978-3-540-74047-6 | page = 65 | url = https://books.google.com/books?id=kSzeZWR2yDsC&dq=mfcc+music+applications&pg=PA65 }}</ref>

=== MFCC for speaker recognition ===
Because the mel-frequency bands are spaced to approximate the response of the human auditory system, MFCCs can efficiently characterize speakers. For instance, they can be used to recognize the speaker's cell phone model and, beyond that, details of the speaker's voice.<ref name=":0" />

This type of device recognition is possible because the electronic components of a phone are manufactured with tolerances: different electronic circuit [[Realization (systems)|realizations]] do not have exactly the same [[Transfer function|transfer functions]]. The dissimilarities between transfer functions become more prominent when the circuits come from different manufacturers. Hence, each cell phone introduces a [[Convolution|convolutional]] distortion on the input speech that leaves a unique imprint on its recordings. In the frequency domain, this distortion corresponds to multiplying the original [[frequency spectrum]] of the speech by a transfer function specific to the phone, and signal processing techniques can extract this device signature from a recording. Thus, MFCCs can be used to characterize cell phone recordings and identify the brand and model of the recording phone.<ref name=":1" />

Consider the recording section of a cell phone as a linear time-invariant ([[LTI system|LTI]]) filter with impulse response ''h(n)''. The recorded speech signal ''y(n)'' is the output of this filter in response to the input ''x(n)'':

:<math>y(n) = x(n) * h(n)</math> (convolution)

Since speech is not a stationary signal, it is divided into overlapping frames within which the signal is assumed to be stationary. The <math>p</math>-th short-term segment (frame) of the recorded speech is then

:<math>y_p^w(n) = [\, x(n)\, w(pW-n) \,] * h(n),</math>

where ''w(n)'' is a window function of length ''W''. The footprint of the mobile phone on the recorded speech is thus the convolutional distortion, which helps to identify the recording phone.
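A minimal sketch of this framing model in Python with NumPy (the signals <code>x</code> and <code>h</code> below are synthetic placeholders, not real speech or a measured phone response):

<syntaxhighlight lang="python">
import numpy as np

# Toy stand-ins (assumptions for the example, not real data):
# x plays the role of speech, h the phone's impulse response.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)        # 1 s of signal at 16 kHz
h = np.array([1.0, -0.5, 0.25])       # short LTI impulse response

# y(n) = x(n) * h(n): the recording is the convolution of the
# speech with the device's impulse response.
y = np.convolve(x, h)

# Overlapping windowed frames. In practice the window is applied to
# the recorded signal y, which approximates
# y_p^w(n) = [x(n) w(pW - n)] * h(n) when the window is long
# compared to h.
W, hop = 320, 160                     # 20 ms frames, 50% overlap at 16 kHz
w = np.hamming(W)                     # Hamming window, as mentioned below
frames = np.stack([w * y[p * hop : p * hop + W]
                   for p in range((len(y) - W) // hop + 1)])
</syntaxhighlight>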
To make the embedded identity of the cell phone easier to extract, take the short-time Fourier transform:

:<math>Y_p^w(f) = X_p^w(f)\, H(f).</math>

The speech spectrum <math>X_p^w(f)</math> is itself the product of an excitation spectrum and a vocal tract transfer function, so the recorded speech <math>Y_p^w(f)</math> can be treated as original speech passed through an equivalent transfer function combining the vocal tract and the cell phone recorder. Writing

:<math>X_p^w(f) = Xe_p^w(f)\, X_v(f), \qquad H'(f) = H(f)\, X_v(f),</math>

where <math>Xe_p^w(f)</math> is the excitation function, <math>X_v(f)</math> is the vocal tract transfer function for the speech in the <math>p</math>-th frame, and <math>H'(f)</math> is the equivalent transfer function that characterizes the cell phone, we obtain

:<math>Y_p^w(f) = Xe_p^w(f)\, H'(f).</math>

This approach is useful for speaker recognition because device identification and speaker identification are closely connected.

To give importance to the envelope of the spectrum, it is multiplied by a filter bank (a mel-scale filter bank is suitable for cepstral analysis). After smoothing with a filter bank of transfer function ''U(f)'', the logarithm of the output energies is

:<math>\log |Y_p^w(f)| = \log \big[\, |U(f)|\, |Xe_p^w(f)|\, |H'(f)| \,\big].</math>

Writing <math>H_w(f) = U(f)\, H'(f)</math>,

:<math>\log |Y_p^w(f)| = \log |Xe_p^w(f)| + \log |H_w(f)|.</math>

MFCC is successful because this nonlinear (logarithmic) transformation turns the multiplicative device distortion into an additive term. Transforming back to the time domain gives

:<math>c_y(j) = c_e(j) + c_w(j),</math>

where <math>c_y(j)</math>, <math>c_e(j)</math> and <math>c_w(j)</math> are the cepstra of the recorded speech, the excitation, and the weighted equivalent impulse response of the cell phone recorder, respectively, and <math>j</math> ranges over the filters in the filter bank. In other words, the device-specific information in the recorded speech is converted into an additive form suitable for identification, and <math>c_y(j)</math> can be further processed to identify the recording phone.

Commonly used frame lengths are 20 or 30 ms, and commonly used window functions are the Hamming and Hanning windows.

The mel scale is a frequency scale that is linear up to 1000 Hz and logarithmic above it. The central frequencies of the filters are computed on the mel scale:

:<math>f_{\text{mel}} = \frac{1000}{\log_{10} 2}\, \log_{10}\!\left(1 + \frac{f}{1000}\right).</math>

Basic procedure for MFCC calculation (illustrated in the sketch below):
# Logarithmic filter bank outputs are produced and multiplied by 20 to obtain spectral envelopes in decibels.
# MFCCs are obtained by taking the Discrete Cosine Transform (DCT) of the spectral envelope.
# The cepstrum coefficients are obtained as <math>c_i = \sum_{n=1}^{N_f} S_n \cos\left(i\,(n-0.5)\, \frac{\pi}{N_f} \right)</math>, <math>i = 1, \dots, L</math>, where <math>c_i = c_y(i)</math> is the <math>i</math>-th MFCC, <math>N_f</math> is the number of triangular filters in the filter bank, <math>S_n</math> is the log energy output of the <math>n</math>-th filter, and <math>L</math> is the number of MFCCs to calculate.
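The procedure can be sketched in Python with NumPy. This is a minimal illustration rather than a reference implementation; the function names (<code>hz_to_mel</code>, <code>mel_to_hz</code>, <code>mfcc_from_magnitude_spectrum</code>) and parameter defaults are assumptions made for the example, and the mel mapping follows the formula given above:

<syntaxhighlight lang="python">
import numpy as np

def hz_to_mel(f):
    # Mel scale as defined above: f_mel = 1000 log10(1 + f/1000) / log10(2).
    return 1000.0 * np.log10(1.0 + f / 1000.0) / np.log10(2.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 1000.0 * (10.0 ** (m * np.log10(2.0) / 1000.0) - 1.0)

def mfcc_from_magnitude_spectrum(mag, sample_rate, n_filters=24, n_coeffs=13):
    """Steps 1-3 above: mel filter bank -> 20*log10 envelope (dB) -> DCT."""
    n_bins = len(mag)                       # bins of a one-sided spectrum
    # Centre frequencies equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bin_pts = np.floor((n_bins - 1) * mel_to_hz(mel_pts)
                       / (sample_rate / 2.0)).astype(int)

    # Triangular filters between successive centre frequencies.
    fbank = np.zeros((n_filters, n_bins))
    for n in range(1, n_filters + 1):
        lo, mid, hi = bin_pts[n - 1], bin_pts[n], bin_pts[n + 1]
        fbank[n - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[n - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    # Step 1: log filter-bank outputs, multiplied by 20 -> dB envelope S_n.
    S = 20.0 * np.log10(fbank @ mag + 1e-10)

    # Steps 2-3: DCT, c_i = sum_n S_n cos(i (n - 0.5) pi / N_f).
    n = np.arange(1, n_filters + 1)
    return np.array([np.sum(S * np.cos(i * (n - 0.5) * np.pi / n_filters))
                     for i in range(1, n_coeffs + 1)])
</syntaxhighlight>

For a windowed frame <code>frame</code> sampled at 16 kHz, the coefficients would be obtained as <code>mfcc_from_magnitude_spectrum(np.abs(np.fft.rfft(frame)), 16000)</code>.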
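As a numerical illustration of the additive property <math>c_y(j) = c_e(j) + c_w(j)</math> derived above (a toy check with synthetic signals, not device data): the cepstrum of a circular convolution equals the sum of the individual cepstra, because convolution becomes multiplication in the frequency domain and the logarithm turns that product into a sum.

<syntaxhighlight lang="python">
import numpy as np

def real_cepstrum(sig):
    # Inverse FFT of the log magnitude spectrum.
    return np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(sig)) + 1e-12)))

rng = np.random.default_rng(1)
e = rng.standard_normal(512)                  # toy excitation frame
hw = np.zeros(512)
hw[:4] = [1.0, 0.4, 0.1, 0.02]                # toy equivalent impulse response H_w

# Circular convolution via the frequency domain: Y = E * Hw.
y = np.real(np.fft.ifft(np.fft.fft(e) * np.fft.fft(hw)))

print(np.allclose(real_cepstrum(y),
                  real_cepstrum(e) + real_cepstrum(hw)))   # True: cepstra add
</syntaxhighlight>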