
What are MFCC values?

I know what MFCC (Mel Frequency Cepstral Coefficients) is, but I need to understand what each value means. Is it some sort of sound frequency value, or something else?

[Image: matrix of MFCC values, one row per frame]

Let's assume we have this kind of matrix. Each row represents the coefficients of one frame, but what are those numbers? Is it maybe the highest frequency or something?

Nikas Žalias asked Dec 04 '25 09:12



1 Answer

The cepstrum is typically derived by computing the Discrete Cosine Transform (DCT) of the (symmetric) log power spectrum of a frame of speech; here, the log power spectrum curve is treated as a signal (https://en.wikipedia.org/wiki/Mel-frequency_cepstrum). So the cepstral coefficients measure the similarity between a sequence/curve (representing the log power spectrum) and cosine waves of different 'frequencies'. In other words, the cepstral coefficients capture the rate at which the values of this sequence vary.
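To make the "dot product with cosine waves" idea concrete, here is a minimal sketch in plain Python. The function name `dct_ii` and the sample values in `log_spectrum` are made up for illustration, and the DCT is left unnormalized; real MFCC pipelines apply the DCT to log mel filterbank energies and usually add a scaling factor.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II: c[k] = sum_n x[n] * cos(pi * k * (n + 0.5) / N)."""
    n_points = len(values)
    return [
        sum(x * math.cos(math.pi * k * (n + 0.5) / n_points)
            for n, x in enumerate(values))
        for k in range(n_points)
    ]

# Hypothetical log power spectrum of one frame, sampled from f=0 to f=Pi.
log_spectrum = [6.0, 5.5, 4.0, 3.5, 3.0, 2.0, 1.5, 1.0]

# Each entry is the dot product of the spectrum with one cosine wave;
# one such list would correspond to one row of the MFCC matrix.
cepstrum = dct_ii(log_spectrum)
print(cepstrum)
```

Each coefficient is a single number summarizing how strongly the spectral curve resembles the cosine of that index, which is why the values are not frequencies themselves.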

The first cepstral coefficient is the dot product of the log power spectrum with the [periodic] cosine wave whose single period begins at the origin (f=0) in the frequency domain and ends at f=2*Pi radians (or equivalently, the sampling frequency). An illustration: the log power spectrum of a vowel has high energy in the low frequency region (near f=0) and low energy in the high frequency region (near f=Pi). In other words, the log power spectrum curve has a negative slope over the range [0,Pi] in the case of vowels. Since this variation of the log power spectrum is similar to that of the cosine wave mentioned above, the first cepstral coefficient of a vowel speech frame will have a positive value. In contrast, cepstrum[1] of an unvoiced fricative such as /s/ will have a negative value, because its log power spectrum has a positive slope due to low energy at low frequency and high energy at high frequency.
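The sign argument can be checked with a small sketch. The two spectra below are idealized straight lines, not measured data, and `cepstral_coefficient` is a hypothetical helper that computes one unnormalized DCT-II coefficient.

```python
import math

def cepstral_coefficient(log_spectrum, k):
    """Dot product of the spectrum with the k-th DCT-II cosine."""
    n_points = len(log_spectrum)
    return sum(x * math.cos(math.pi * k * (n + 0.5) / n_points)
               for n, x in enumerate(log_spectrum))

n_points = 16
# Idealized shapes over f in [0, Pi]: energy falls for a vowel-like
# frame and rises for an /s/-like frame.
vowel_like = [10.0 - 0.5 * n for n in range(n_points)]
fricative_like = [2.0 + 0.5 * n for n in range(n_points)]

c1_vowel = cepstral_coefficient(vowel_like, 1)
c1_fricative = cepstral_coefficient(fricative_like, 1)
print(c1_vowel, c1_fricative)  # positive for the falling slope, negative for the rising one
```

The constant offsets (10.0 and 2.0) do not affect cepstrum[1], because the cosine sums to zero over the range; only the tilt of the curve matters.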

Similarly, cepstrum[2] would be positive if there is a major valley in the log power spectrum at f=Pi/2. The log power spectrum of a voiced fricative (e.g. /z/) comes close to this description, because there is significant energy at high frequency due to the fricative nature of the sound as well as significant energy at low frequency due to voicing. Of course, cepstrum[0] is (up to scaling) the average of the log power spectral values; it captures the volume/loudness of the signal.
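A rough sketch of both claims, again with made-up numbers: `valley_spectrum` is a synthetic curve with energy at both ends and a dip at f=Pi/2, and `louder_spectrum` is the same curve shifted up by a constant. The helper `cepstral_coefficient` is hypothetical and unnormalized.

```python
import math

def cepstral_coefficient(log_spectrum, k):
    """Dot product of the spectrum with the k-th DCT-II cosine."""
    n_points = len(log_spectrum)
    return sum(x * math.cos(math.pi * k * (n + 0.5) / n_points)
               for n, x in enumerate(log_spectrum))

n_points = 16
# Synthetic /z/-like shape: high at f=0 and f=Pi, valley at f=Pi/2.
valley_spectrum = [5.0 + 3.0 * math.cos(2 * math.pi * (n + 0.5) / n_points)
                   for n in range(n_points)]
# Same shape with every log power value raised by a constant.
louder_spectrum = [x + 2.0 for x in valley_spectrum]

c0 = cepstral_coefficient(valley_spectrum, 0)
c2 = cepstral_coefficient(valley_spectrum, 2)
c0_louder = cepstral_coefficient(louder_spectrum, 0)
c2_louder = cepstral_coefficient(louder_spectrum, 2)

mean_level = c0 / n_points  # cepstrum[0] divided by N is the average level
print(c0, c2, c0_louder, c2_louder, mean_level)
```

Raising the overall level changes cepstrum[0] but leaves cepstrum[2] alone, which matches the reading of cepstrum[0] as loudness and the higher coefficients as spectral shape.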

speech.tifr answered Dec 07 '25 14:12



