Mel-Frequency Cepstral Coefficients Explained Easily

Valerio Velardo - The Sound of AI

51m 43s · 7,272 words · ~37 min read

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:00]Hi everybody, and welcome to a new exciting video in the audio signal processing for machine learning series. This time we'll look into a very important audio feature. In other words, mel-frequency cepstral coefficients or, if we use their acronym, MFCCs.

[0:18]But before we get started with this super cool topic, I want to remind you about the Sound of AI Slack community. So, if you sign up there, you can get feedback, share projects and share ideas with a community of people who are interested in AI audio, AI music, and audio signal processing. So I really invite you to check this community out and I'll leave you the link, the sign up link to the Slack workspace in the description box below. Now, let's move on to the cool stuff. But before we get to MFCCs, I want just let to remind you about what we did in the previous couple of videos and we focused on mel spectrograms. Now, mel spectrograms are going to be like an important building block to understanding MFCCs. So, if you are really not that familiar with that, I highly suggest you to go check out my previous couple of videos on mel spectrograms. Okay, but now let's get started with MFCCs that as I said, build on top of the concepts of mel spectrogram, to a certain extent. Okay, so now we have this audio feature, it's called mel-frequency cepstral coefficients, right? So, in this feature we have many different words. So, now let's try to understand which word means what. Okay, so mel-frequency, well, mel-frequency, as I said, refers somewhat to the concept of a mel spectrogram. Basically, the idea is that we are using the mel scale here, which is a perceptually relevant scale for pitch. And there's something that has to do like with mel spectrograms and mel scale, like, in MFCCs. Okay, so we know that and we know that what mel spectrograms are from previous videos. Okay, so now let's move on. The last point here is coefficients. Well, this isn't really like that difficult to understand because the idea that probably you may guess, like from from like this name is that out of this features, you're going to get a number of coefficients, a number of values. And those coefficients will describe some characteristic of a piece of sound, right? 
That's all it is, right? Okay, and finally, we have probably the most interesting part here, that's cepstral. Right. So, this is a weird word, right? And cepstral is the adjective, but if we want to move to the noun, the noun is cepstrum. Okay. Does this word ring a bell at all? No? If not, I'll give you a hint. Ceps. Just like focus on this, like, four letters here. Any idea? If not, I'll give you the answer. It's spectrum. Right? So, if you, if you just like take ceps and you spell it like backwards, you'll have spec and spectrum. Okay, so cepstrum is somewhat related to spectrum, okay? So, here we have clearly a word play. And so, it's going to take us like some time to understand why this is like relevant and why researchers who came up with the idea of cepstrum, um, used like this word and they had this kind of like word play on spectrum. So, I suggest you just like to bear with me because this is going to be like a quite intense and in depth session to understand cepstrum and then once we understand cepstrum, we're going to use like this concepts to build MFCCs or to see how we can build MFCCs on top of cepstrum. Okay, so now let's put cepstrum and spectrum like down there. But when we are talking about cepstrum, it's not only cepstrum, the, the weird words that we, we have, or that these researchers who came up with this idea, came out with. So, there are a bunch of other concepts there. So, there's the concept of quefrency, liftering and rhamonic, for example. Now, I guess like you you you you have an idea of how like to translating like these things into stuff that makes sense. And indeed, right, quefrency is a wordplay on frequency. Liftering is connected to some sort of filtering. And rhamonic is connected to harmonic. Okay, so now we are entering the world of cepstrum, where we don't have frequencies, but we have quefrencies. We don't have filtering, but we have liftering, and we don't have harmonics, but we have rhamonics. 
Sounds a little bit weird, right? Yeah, and it is. So, bear with me to understand what all of these things like really mean. Okay, so now, uh, let's get like a, an historical understanding of the the concept of cepstrum, where like it it came out from and how it developed over time. So, researchers, I believe at MIT during the '60s, came out with this concept of cepstrum and they used it to study specifically, um, echoes in seismic signals. Then other researchers noticed that like this concept of cepstrum could be nicely applied to speech processing. And indeed, towards the end of of the '60s, uh, cepstrum started to be used like in the speech processing community. And during the '70s and onwards, it became the kind of like audio feature of choice for speech recognition, speech identification and all sorts of speech processing related problems. And that remained like that for a long time, until I think like the advent of deep learning, so, very recent stuff then. Okay, and in the 2000s, um, cepstrum started to be adopted in the form of MFCCs also in music processing, and specifically in music information retrieval. So, as you see, now we have an audio feature, a miss a mystery audio feature that can serve many different purposes, or many different applications. It works well for seismic signals, it works great for audio, uh, I mean, speech, uh, processing, and it also works really well for music processing. Cool. Okay. So, now it's time to understand like this mysterious, uh, audio feature a little bit more. So, and here we'll do like, uh, we'll try to understand this like in a few different, on a few different levels. So, the first level will be a mathematical formalization of the concept of cepstrum. Then we'll look into the visualization of cepstrum so that hopefully you can understand what's going on there for real. And finally, we'll look at cepstrum in the context of speech. 
And there probably you'll have like the better intuition out of all of these approaches. Okay, let's get started with the math behind it. So, how do we compute the cepstrum? Well, we compute it like this. So, now, here we have like our cepstrum and we indicate that like as capital C. And it the cepstrum is provided by like this formula here. So, let's get started with x of t. Well, x of t is just like a normal, uh, signal in the time domain, right? It's just like a normal waveform. Then out of this normal waveform, what we do is we take the, uh, discrete Fourier transform, which here I've indicated with this capital F. And so, when we do that, we come up with a spectrum. And we move from the time domain to the frequency domain. Okay. Now, the next step that we want to do is apply a logarithm to the spectrum. And in this way, we get the log amplitude spectrum. So, in other words, we are applying the logarithm on the amplitudes of the spectrum. Now, uh, if you, if you are not familiar with the Fourier transform or log logarithm amplitude spectrum, all of this kind of stuff, I highly suggest you to go check out my previous videos on the Fourier transform, uh, because all of these things I've addressed them time and again like in my previous videos. Okay. Good. So, we said we start from the signal. We take the Fourier transform, so we, we move to a spectrum. We take the log amplitude spectrum. And finally, at this point, we do the the, the kind of the key step to get to a cepstrum, which is basically applying an inverse Fourier transform to a log amplitude spectrum. And when we do that, we come up with a cepstrum.
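Written out, the recipe just described looks roughly like the formula below. The symbols follow the ones spoken in the video (capital C for the cepstrum, capital F for the discrete Fourier transform); the magnitude bars are an assumption, reading "log amplitude spectrum" as the logarithm of the spectrum's magnitude.

```latex
C\bigl(x(t)\bigr) = F^{-1}\Bigl[\log\bigl(\lvert F[x(t)] \rvert\bigr)\Bigr]
```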

[9:30]Well, if you think about it, what we are actually doing is we are taking a spectrum, specifically a log amplitude spectrum, and then we are calculating a spectrum of a spectrum, right? So, and that's because we are applying the inverse Fourier transform at this point. Okay. So, we could have called the cepstrum spectrum of a spectrum, right? But that wouldn't sound really cool. So, what the researchers decided to do is use like this word play and so they decided to use like this cepstrum. And the reason why they they used like this, they just like took like the first like four letters in spectrum and kind of like use them like backwards, it's going to be clear like in a few moments. So, bear with me. Okay. So, now we we know like that on a very high level, the spectrum, well, the cepstrum is the spectrum of a spectrum. Okay, but now let's try to visualize this concept because this is going to help us understand what's going on here. Okay, so we start with a normal waveform.

[10:45]We are in the time domain, and we have a very short amount of sound, just 40 milliseconds for example. As a first step, we take the discrete Fourier transform. What we get out of that is our usual power spectrum, where on the x-axis we have frequency and on the y-axis we have power. We've seen this time and again during this series: basically, what we've done here is move from the time domain to the frequency domain, and the value that we have for each frequency tells us how much that frequency component is present in the original signal, in the original waveform. Okay, so this is the first step. The next step is applying a logarithm to the power spectrum. Let me just shift the power spectrum here to the left. And now we can apply the logarithm. What we do here is a simple transformation.

[11:57]So, we take like all the amplitudes and we apply like a logarithm so that we get decibels, right? On the y-axis as the, the unit of reference, and on the x-axis, still, we have frequency, right? Because the, um, transformation was only like on the y-axis, really. Okay. So, now we have the log power spectrum. Now, let me shift this like onto the left once again. So, what about the log power spectrum? So, first of all, it is a continuous signal, right? Second point, it has some periodic structures, right? And those periodic structures are present because the log power spectrum, uh, has like some harmonic components, or like the original signal has some harmonic components that gets, uh, that become, kind of like, periodic in the spectrum. And so, when we have a signal, even if it's just like a spectrum, that has some periodicities, what we can do is apply a transformation, like a Fourier transform, to understand like the different components, and try to find like which frequencies, right, are present in the signal. So, in other words, what we can do here is treat this log power spectrum as a signal, a time domain signal. And we can apply a Fourier transform-like transformation, right? And specifically, we'll be applying an inverse Fourier transform. And what we'll get is a spectrum of this signal, which is a spectrum. So, in other words, is the spectrum of a spectrum, which is the cepstrum. And here we go. We apply the inverse discrete Fourier transform and we get the cepstrum, which is the spectrum of a spectrum. Now, the cool thing, though, or the thing that we should think about is what do we actually have on the x-axis, right? So, because now we have the spectrum of a spectrum. I mean, if you are in the time domain, and you, and you take like a Fourier transform, then you move in the frequency domain. So, on the x-axis, you'll have frequencies, obviously. 
But if you start from a signal that has like frequencies on the x-axis, what do you actually get on the x-axis of the transformation, right? And the, the answer is that we get like some sort of pseudo-frequency axis. And this pseudo-frequency axis was termed by, uh, the researchers as quefrency. And the unit of reference here is milliseconds, or seconds. Now, let me show you why we are talking about quefrency and cepstrum. So, why this wordplay makes sense. And the reason is because we are starting from the time domain, originally with a waveform, then we go to the frequency domain, with the initial discrete Fourier transform. Now, we apply another discrete inverse discrete Fourier transform at this point, and we go back to to something that resembles like a frequency domain, but it's not really a frequency domain, right? And so they just like decided to take like the opposite of that. So, it's not frequency, it's quefrency. And this is not a spectrum. This is a cepstrum. Right. Okay. So, here you have like the intuition. Okay. So, what are like all of these values here, right? So, in the cepstrum, um, visualization, right? We have certain peaks. And here we have like a very high peak there. So, what do they represent, right? And so, this basically represents how present these different quefrencies are in the log power spectrum. Right. Okay. So, here we have like this huge peak. And that is the first rhamonic. I bet like you guess you, you realize this like yourself. This is like the equivalent of a harmonic, right? And this is the rhamonic that provides us information, or this is like, let's put it this way, this is the quefrency, where that is associated with the fundamental frequency of the original signal, of the original waveform. And indeed, one way of using cepstrum, uh, cepstra, I should say, is one application is for pitch detection. 
Because you take like the log power spectrum and then you take the the cepstrum, and the peak that you're going to have like this is going to be like the first rhamonic, and you can use that to then move back to the frequency domain and then understand where you have like the fundamental frequency in your original signal. And so, why is this like such a peak? Well, this is a peak because this reflects the harmonic structure of the original signal that gets that gets like represented in a periodic way here in the log power spectrum. So, this is like the the key idea there. Okay, so I guess like now we have a, an understanding of the math behind the cepstrum, and we also have an understanding of what the cepstrum looks like. But I think like what what still is missing here is understanding, to having like an intuition of the cepstrum. So, and why it is so important? Why should we bother taking the inverse discrete Fourier transform of the log power spectrum? Why should we bother? Right? Okay. So, for understanding that, we have to take a little detour into how speech works. And into speech processing, really.
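The pitch-detection idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the test signal is a synthetic 200 Hz harmonic tone (not speech), the sample rate, frame length, Hann window and 60 to 400 Hz search range are all arbitrary choices, and the naive `dft` helper stands in for a real FFT routine.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse includes the 1/n factor."""
    n = len(values)
    sign = 1 if inverse else -1
    twiddles = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    result = []
    for k in range(n):
        acc = sum(v * twiddles[(k * t) % n] for t, v in enumerate(values))
        result.append(acc / n if inverse else acc)
    return result


def real_cepstrum(frame):
    """Inverse DFT of the log power spectrum of a frame."""
    spectrum = dft(frame)
    # The small constant avoids log(0) in empty frequency bins.
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in spectrum]
    return [c.real for c in dft(log_power, inverse=True)]


# Assumed toy setup: 50 ms of a 200 Hz tone with 19 harmonics, sampled at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LENGTH = 400
TRUE_F0 = 200.0

hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / FRAME_LENGTH) for t in range(FRAME_LENGTH)]
frame = [
    hann[t] * sum(math.sin(2 * math.pi * TRUE_F0 * h * t / SAMPLE_RATE) / h for h in range(1, 20))
    for t in range(FRAME_LENGTH)
]

cepstrum = real_cepstrum(frame)

# Look for the first rhamonic only among quefrencies that map to 60-400 Hz.
min_lag = SAMPLE_RATE // 400
max_lag = SAMPLE_RATE // 60
peak_lag = max(range(min_lag, max_lag + 1), key=lambda q: cepstrum[q])

peak_quefrency_ms = 1000.0 * peak_lag / SAMPLE_RATE
estimated_f0 = SAMPLE_RATE / peak_lag
print(f"peak quefrency: {peak_quefrency_ms} ms, estimated pitch: {estimated_f0} Hz")
```

The quefrency of the peak is a time (samples divided by the sample rate), and its reciprocal is the pitch estimate, which is the "move back to the frequency domain" step mentioned above.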

[18:19]Okay, and so, the first thing that we need to do is understand how we produce speech. And a key element to understanding how humans produce speech is the vocal tract. So, the vocal tract is kind of a very complex system that has like multiple elements. So, it has like the tongue, it has the teeth, it has the nasal cavity, your throat. And the basic idea is that, depending on how you shape your vocal tract, you're going to produce different sounds, different what like linguists, linguists, I believe it's called, like, call phonemes, or like different consonants, different vowels that you can produce. Okay. And so, depending on how you put your tongue, how you, you stretch your throat, or you contract it, right? And but if we think about this, uh, that in terms of like digital signal processing, we can think of the vocal tract as a filter. In other words, the vocal tract acts as a filter. So, how do we actually generate, produce speech? Well, this is like quite fascinating and I'll give you like a a simplification of what like the real thing is, but it's going to be instrumental to understand cepstrum fully. Okay. So, speech generation acts in a kind of like pipeline form. So, initially, you have what we call a glottal pulse. And this is like a signal, a noisy signal, a high pitched signal that gets generated by the vocal folds, right? And that signal passes through the vocal tract, and the vocal tract acts as a filter on the glottal pulse. And by filtering like the initial signal, it creates the speech signal. Now, the basic idea, once again, is that depending on how you shape your vocal tract, then you're going to have like a different speech signal starting from the more or less the same glottal pulse. Now, the intuition here is that the glottal pulse carries information about pitch, or high frequency, kind of like information. 
Whereas, like the vocal tract, or I should say, like the, the frequency response provided by the vocal tract, by this filter, provides is going to, kind of like, carry information about the timbre of, of the sound, of the speech. And specifically, the timbre, when we talk about speech, is like the actual phonemes that you utter, that you produce, right? The different consonants, or the different vowels that you can produce. Okay, so this is kind of like the high-level idea. Now, let's take a look at a kind of visualization of all of this. So, we start with a speech signal that looks like this, right? Okay, so and here we are representing like this, well, it's not really like a speech signal in the time domain, is a speech is the log spectrum, log amplitude spectrum of a short amount of speech. Okay. And it looks like this. So, we can think of this like as I, as I said, like as a log amplitude spectrum. But now, one thing that we could do is kind of like try to smooth, uh, this signal here, right? And so, how can we do that? Well, we can take the envelope. And what we actually do is we take the so-called spectral envelope. And now, we already said like a similar idea in the time domain, when we discussed the amplitude envelope. And I have a couple of videos on that. One is like fully theoretical, so you can understand what the amplitude envelope, so how to calculate the amplitude envelope, and then I have another video where I actually implement the amplitude envelope, obviously, in the time domain, with Python from scratch. But, but basically, like, we we take that idea and we put it here like in the spectral domain, in the frequency domain. And so, here we have like the spectral envelope. It's basically like smoothing like all the complexity, or like the, the, the quickly changing like information like here like in this signal, right? Okay. Now, what's cool about this? 
Well, it turns out that there's like something that's extremely important in how we perceive speech and sound that the spectral envelope captures. And it's these peaks in red that you see there. So, those peaks are called formants. Now, formants are responsible for for, kind of like, identify for carrying the identity of sound. So, yeah, identity of sound, sounds really wishy-washy, so what's that? Well, that is like the timbre. So, depending on the formants that you have in a speech signal, then you're going to perceive certain phonemes instead of others. In other words, the spectral envelope provides us information about timbre, about the the different like phonemes that we have in speech. So, this is extremely important because like this is like a feature that we want to isolate to do speech processing. Okay. Okay. So, the spectral envelope turns out is something like very similar to the vocal tract frequency response, right? And this is like the the impulse response like of the vocal tract, depending on how we shape the vocal tract, and it's going to give us like a signal like this that resembles like this spectral envelope here, right? And depending on how you shape your vocal tract, uh, you're going to have like a slightly different vocal tract frequency response, with different formants, right? And that is going to determine different timbres, different identities of sound. Okay, so this I I hope like you are starting to understand how important like this is, right? And now, if we think about like, if we take like this initial signal, so now we have like the smoothed version of the signal, right? So, now we can, kind of like, subtract the two, and what remains is something like this, right? And it's a lot, like of quickly changing information here. And we can call this like the spectral detail. And the cool thing is that the spectral detail maps really nicely into the glottal pulse. Okay. Wow, that that that that this is like really fascinating stuff. 
So, we have like an initial speech signal, and we can decompose that like into two parts. So, one that's connected with the vocal tract frequency response, and the other one that's just connected with the glottal pulse. Right. Okay, but are we really interested in the glottal pulse? Well, really not that much in terms of audio, well, speech processing. And that's because, yeah, pitch is important, but not really that important. What we really care about is the identity of sound, so it's the phonemes, it's the timbre, it's the formants, right? And the formants and all of this stuff is carried by these component of the speech signal. So, what we want to get at is a set of features that enables us to work only with this part of the speech so that we can just like throw out the glottal pulse, because we don't need that for audio processing, for speech processing, or speech recognition, right? Okay. So, we should find a process through which we can start from a speech signal like this, or log spectrum speech like this. And then move and isolate the vocal tract frequency response component. How can we achieve that? Well, cepstrum comes to the rescue. Here, guys, we have the visualization for three log spectra. So, up here you have the log spectral relative to speech, and then down here you have like the two different components.

[27:10]Okay, so now if we want to take the cepstrum, what we should do is apply the inverse discrete Fourier transform to this speech spectral signal. If we do so, we move from the frequency domain to the quefrency domain. If we want to see the details of how to do that, basically we take sine waves with different frequencies and we try to fit them onto the spectral signal up here. What we want to do is decompose that signal into its quefrency components and see how present the different quefrency components are. Okay, so we start with low-frequency sine waves. If you take a look at this speech spectral signal here, you see that you have four peaks. It's easier to see down here in the spectral envelope.

[28:34]So, you have a peak one, two, three, four. So, probably a sine wave that has a frequency of 4 Hz, is going to do a pretty good job at approximating, um, this spectral signal.

[28:54]And so, what that means is that when we move to the quefrency domain, in other words, the cepstrum, is that we're going to have a natural, physical separation of the information that's relative to the spectral envelope, or in other words, the vocal tract frequency response, and the information that's connected to the spectral details, or glottal pulse.

[29:49]Okay. We can capture all of this information through the mathematical formalization. And here you can see that the cepstrum, which is capital X of T, is given by the sum of two components. So, all the cepstral coefficients that are relative to the glottal pulse, uh, added to all the cepstral coefficients that are connected to, uh, the spectral envelope.
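One common way to write this formalization is sketched below. Only "capital X of T" is named in the recording; E (glottal pulse) and H (vocal tract frequency response) are assumed names, and X here is read as the speech spectrum. The key assumption is that the vocal tract filters the glottal pulse, so the two multiply in the spectrum, add once the logarithm is taken, and stay added after the inverse transform because that transform is linear.

```latex
\begin{aligned}
X(t) &= E(t)\cdot H(t) \\
\log\lvert X(t)\rvert &= \log\lvert E(t)\rvert + \log\lvert H(t)\rvert \\
F^{-1}\bigl[\log\lvert X(t)\rvert\bigr] &= F^{-1}\bigl[\log\lvert E(t)\rvert\bigr] + F^{-1}\bigl[\log\lvert H(t)\rvert\bigr]
\end{aligned}
```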

[30:22]Now, if you remember, our goal, and the reason why we moved to the cepstrum, is that we want to focus just on the features relative to the spectral envelope. So, how do we do that? Well, here comes the last weird word that we introduced earlier, in other words, liftering, or lifter. What we want to use here is a low-pass lifter, which is basically a nice way of saying that we want a low-pass filter in the quefrency domain that removes all the high-quefrency values related to the spectral details, or glottal pulse. Okay? And so, once we do that, we are left with only the cepstral coefficients connected to the spectral envelope, which is the stuff that we wanted.
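A low-pass lifter can be sketched as simply zeroing the high-quefrency part of the cepstrum and transforming back, which leaves a smoothed log spectrum. The snippet below is a toy illustration, not a reference implementation: the signal is a synthetic harmonic tone, the cutoff of 15 samples is an arbitrary value chosen to sit below the pitch peak, and `dft`, `low_pass_lifter` and `spread` are hypothetical helper names.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse includes the 1/n factor."""
    n = len(values)
    sign = 1 if inverse else -1
    twiddles = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    result = []
    for k in range(n):
        acc = sum(v * twiddles[(k * t) % n] for t, v in enumerate(values))
        result.append(acc / n if inverse else acc)
    return result


def low_pass_lifter(cepstrum, cutoff):
    """Keep quefrencies up to `cutoff` (and their mirrored copies), zero the rest."""
    n = len(cepstrum)
    return [c if (q <= cutoff or q >= n - cutoff) else 0.0 for q, c in enumerate(cepstrum)]


def spread(values):
    """Sum of squared deviations from the mean; a rough measure of how wiggly a curve is."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# Assumed toy input: a Hann-windowed 200 Hz harmonic tone, 400 samples at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LENGTH = 400
frame = [
    (0.5 - 0.5 * math.cos(2 * math.pi * t / FRAME_LENGTH))
    * sum(math.sin(2 * math.pi * 200.0 * h * t / SAMPLE_RATE) / h for h in range(1, 20))
    for t in range(FRAME_LENGTH)
]

log_spectrum = [math.log(abs(c) ** 2 + 1e-12) for c in dft(frame)]
cepstrum = [c.real for c in dft(log_spectrum, inverse=True)]

# Liftering: drop everything above the cutoff, including the pitch peak at 40 samples.
liftered = low_pass_lifter(cepstrum, cutoff=15)

# Going back to the frequency domain gives a smoothed log spectrum (the envelope).
envelope = [c.real for c in dft(liftered)]

print(f"spread of log spectrum: {spread(log_spectrum):.1f}")
print(f"spread of envelope:     {spread(envelope):.1f}")
```

The retained low-quefrency coefficients are the ones associated with the spectral envelope; the discarded ones carry the fast-changing spectral detail.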

[31:21]Now that we know about the cepstrum, we can move on and understand what mel-frequency cepstral coefficients are. The cool thing is that MFCCs build on top of the cepstrum, so that is going to be a piece of cake for us. The best way to understand how MFCCs work is by looking at how we compute them. This is a multi-step process, and many of the steps are shared between the cepstrum and MFCCs. So, let's get started. We begin with a simple waveform, so a signal in the time domain. As usual, we apply a Fourier transform and we get a spectrum out of that. The next step is to apply a logarithm to the amplitude so that we get a log spectrum. Up until this point, the process for getting the cepstrum and mel-frequency cepstral coefficients is actually the same, right? But here we have the first divergence. What we do next is apply mel scaling. What this means is that we take the log spectrum and we apply the mel filter banks, which are these triangular filters. If you've followed along with my series, you should be familiar with this image, because I've used it in the previous couple of videos when we were talking about mel spectrograms. Now, if you're not familiar with mel spectrograms or the mel scale, I highly suggest you once again go check out my previous videos. 
But at the end of this step, we now have a mel spectrogram. We now enter the final step for getting MFCCs, which is applying the equivalent of the inverse discrete Fourier transform that we used for the cepstrum. In this case, that transformation is the discrete cosine transform. Now, I'm not going to get into the details of why we're using the discrete cosine transform instead of the inverse Fourier transform just yet; I'll do that in a few moments. For now, all you need to understand is that once we apply the discrete cosine transform, we get a number of coefficients, a number of values, or MFCCs, which are the ones that we are interested in. Okay. So, one thing I want to draw your attention to is the type of transformations, the type of steps, that we are using in this multi-step process for getting MFCCs. The cool thing is that at each step we have a process that's somewhat perceptually informed. It's perceptually relevant. Let me explain what I mean by that. We start with the signal, a waveform. At that point, we apply a discrete Fourier transform so that we can move to the frequency domain. All well and good. At this point, we apply a logarithm on the amplitude, and this is something that's perceptually relevant. You may be familiar with this, because you've seen it in earlier videos in this series: we don't perceive amplitude or loudness linearly, but rather logarithmically. So, by applying a logarithm at this point, we are adding a step that's perceptually relevant. The next step is similar, right? Because when we apply mel scaling, we are basically passing from a linear frequency representation to a mel-based representation, which is perceptually relevant in the realm of frequencies, right? 
And finally, we apply the discrete cosine transform, which is kind of similar to applying the inverse Fourier transform to get the cepstrum. What we get out of that is information about the different values that form the different formants, or the timbre, or the basic information about the spectrum that we need in order to understand speech, understand formants and just recognize speech, really. The question we should now ask is: why use the discrete cosine transform? Can't we just use the inverse Fourier transform? Well, it turns out there are a bunch of reasons why we prefer to use the discrete cosine transform for getting MFCCs. The first one is that the discrete cosine transform is a simplified version of a Fourier transform: it gives us back real-valued coefficients. This is different from what a Fourier transform does. If you're not familiar with the Fourier transform, I highly suggest you go check out this video. There you'll find that a Fourier transform returns complex numbers, but we don't really need complex coefficients here. Real-valued coefficients are more than enough for our purposes with MFCCs. So, the discrete cosine transform is way simpler to handle than a Fourier transform. Okay. Now, one thing that I want to show you guys is how we can apply this discrete cosine transform and move from the log spectrum to the MFCCs. Basically, the idea is that we take cosines with different frequencies and we try to fit them to the log spectrum, right? Each cosine has a different frequency, and each one comes up with a value. That value is how well the cosine with that specific frequency fits the original log spectrum. And that value is an MFCC. 
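The "fit cosines to the log spectrum" picture can be made concrete with an unnormalized DCT-II, where each coefficient is the dot product of the input with one cosine. The sketch below uses a made-up 8-band input (a constant level of 3 plus one cosine-shaped ripple), so the numbers are purely illustrative; library implementations usually add a scaling factor that is left out here.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II: coefficient k measures how well cosine number k fits."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


# Assumed toy "log spectrum": a flat level of 3 plus a ripple shaped like cosine number 2.
n_bands = 8
log_spectrum = [3.0 + math.cos(math.pi * 2 * (i + 0.5) / n_bands) for i in range(n_bands)]

coefficients = dct_ii(log_spectrum)
for index, value in enumerate(coefficients):
    print(f"coefficient {index}: {value:.3f}")
```

Only coefficient 0 (the overall level) and coefficient 2 (the cosine that matches the ripple) come out non-zero, up to rounding; every other cosine does not fit this input at all.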
The higher the index of the MFCC, the higher the frequency of the cosine that we try to fit to the log spectrum. Okay. Good. So, now I hope you have this idea of how to apply the discrete cosine transform. Now, moving on, another advantage of the discrete cosine transform is that it enables us to decorrelate the energy in different mel bands. Okay, so what's this all about? Here we have once again the mel filter banks, and these are triangular filters. You can see that the center of a mel band, for example this one here, which we can call mel band number two, is somewhat correlated with what comes after it, the subsequent mel band, and with the previous mel band, right? You can see it here: there's some overlap. That means that information is somewhat correlated, it is shared across multiple mel bands. Now, when we apply the discrete cosine transform, what we do is decorrelate the energy in the different mel bands, which is a really good thing to have, because with machine learning algorithms we want features that are as uncorrelated as possible. Okay. So, one final thing that comes with the discrete cosine transform is that it reduces the number of dimensions that we use to represent the log spectrum. In other words, we can think of the discrete cosine transform as a dimensionality reduction algorithm that takes the input, which is this log spectrum, and gives us back a set of features that has a smaller dimensionality. Fewer dimensions. Okay. Good. So, now, I guess one important question that you may have is: how many coefficients should I take? How many MFCCs? Traditionally, we consider the first 12 to 13 coefficients. And why do we take these, right? 
We take the first coefficients because these are the ones that keep the most relevant information, which is the information about formants and the spectral envelope. And this is the same stuff that we had with the cepstrum. If you recall, on the quefrency axis, the quefrency values on the lower end are the ones that provide information about the spectral envelope. Right? The quefrency values on the higher end are the ones that provide information about the glottal pulse. We're not interested in the glottal pulse. We are interested in the spectral envelope, or in other words, in the vocal tract frequency response, because that gives us the information that's perceptually the most relevant: the phonemes, the formants, okay? So, the moment you take the initial coefficients, you are taking information about those formants. Higher coefficients provide us information about fast-changing spectral details, right? And we don't really need that much of that for speech recognition. We're more interested in the formants, as we said multiple times. So, all of this to say that, of course, you can take more MFCCs, but that's not necessarily going to improve the quality of your algorithms that much. But there's another strategy that's probably going to boost the accuracy of your machine learning algorithms quite a lot, and it's taking the first and second derivatives of MFCCs, or in other words, taking the delta and the delta delta of MFCCs. Okay. What is this? Well, let's think about MFCCs. If you remember the multi-step pipeline for extracting them, it is applied to each frame in a signal. Which basically means that if we have a 10-second-long signal, we're going to have a ton of frames. And at each frame, you're going to get a handful of MFCCs.
Now, if you want to take the delta MFCCs, what you do is you take all the MFCC values at one frame, and you subtract from them the MFCC values from the previous frame. That way you get the delta. Now, if you want to get the delta delta, you do the same thing, but you start with the delta MFCCs: you take the delta MFCCs for one frame, and you subtract from them the delta MFCCs from the previous frame. And in that way, you get the delta delta MFCCs. So, if you think about this, it's more or less equivalent to taking the first and the second derivative of the MFCCs. Okay.
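The frame-by-frame subtraction described here can be sketched as follows. This is a minimal illustration with made-up numbers; the helper name `frame_deltas` is hypothetical, and giving the first frame a delta of zero is an assumption made only to keep the number of frames unchanged. Audio libraries often estimate the derivative over a window of several frames instead of a plain difference.

```python
def frame_deltas(frames):
    """Subtract the previous frame from each frame, coefficient by coefficient."""
    # Assumption: the first frame has no previous frame, so its delta is zeros.
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for c, p in zip(cur, prev)])
    return deltas

# Hypothetical MFCCs: 4 frames, 3 coefficients per frame (made-up numbers).
mfccs = [
    [1.0, 0.5, 0.0],
    [2.0, 0.5, -1.0],
    [4.0, 0.5, -1.0],
    [7.0, 0.5, 0.0],
]

delta = frame_deltas(mfccs)        # first difference of the MFCCs
delta_delta = frame_deltas(delta)  # same operation applied to the deltas

# Stack MFCCs, deltas and delta deltas into one feature vector per frame.
features = [m + d + dd for m, d, dd in zip(mfccs, delta, delta_delta)]
```

With 3 coefficients per frame, each stacked vector has 9 values. With 13 coefficients per frame, the same stacking gives 13 + 13 + 13 = 39 values per frame.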

[43:46]So, the whole figure there is probably something like 39 coefficients: for example, 13 MFCCs plus 13 deltas plus 13 delta deltas. And remember, this is for each single frame. Okay. So, now, I think we are at a point where we want to actually visualize these MFCCs.

[44:20]And as you can see, when we visualize them, the impression we get is very similar to a spectrogram, which we saw in earlier videos, right? And that's because we can think of MFCCs as a matrix, right? In the rows, we have the different indexes, the different MFCCs, the different coefficients. And here, on the columns, we have the different frames, okay?

[45:09]And so, we have as many discrete segments here as the number of frames that we have in a given chunk of audio signal. And here, we have as many discrete segments as the number of coefficients that we have. So, for example, in this case, you can count them. I believe we take 16 mel-frequency cepstral coefficients. And so, at each point in this matrix, we have the value for a given coefficient at a given point in time, or at a given frame. And that value is expressed visually through some kind of color coding here, right? And this is very similar to what we did with the spectrograms, as well. Cool. Okay. So, now, I want to talk about a couple of things just to wrap up this very long video. I want to talk about some of the advantages of MFCCs, so some of the benefits that come with MFCCs, some of the shortcomings, and finally, the applications of MFCCs. So, let's start with the good stuff: the advantages of MFCCs. A great thing about MFCCs is that they're able to describe the large structures of the spectrum. We said this multiple times. The moment we take the first MFCCs, we are mainly focusing on the spectral envelope, on the formants. And so, the different MFCC coefficients are basically providing us information about the formants, about the phonemes. And this is all we really need. We're just cutting out the details and the noise that come with the spectral details, and focusing on the main stuff, on the phonemes, on the formants. Okay. And yeah, the second point: we ignore the fine spectral structures, which we really don't care about that much. So, mostly, we don't care about pitch when we do speech recognition, for example. Okay.
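To make the point about the first coefficients concrete, here is a small sketch of keeping only the low-index coefficients and transforming back. Everything in it is an assumption for illustration: the 32 made-up band values, the helper names, and the unnormalized DCT-II with its matching inverse. The example is deliberately contrived, since the slow envelope and the fast ripple are each built from exact cosines, so the split between them is perfectly clean. Real log spectra will not separate this neatly.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (m + 0.5) / n) for m, v in enumerate(values))
        for k in range(n)
    ]

def inverse_from_first(coeffs, keep):
    """Invert the unnormalized DCT-II using only the first `keep` coefficients."""
    n = len(coeffs)
    rebuilt = []
    for m in range(n):
        total = coeffs[0] / n
        for k in range(1, keep):
            total += 2.0 * coeffs[k] * math.cos(math.pi * k * (m + 0.5) / n) / n
        rebuilt.append(total)
    return rebuilt

num_bands = 32

def basis(k, m):
    """Cosine number k evaluated at band m."""
    return math.cos(math.pi * k * (m + 0.5) / num_bands)

# Made-up log spectrum: a slowly varying envelope plus a fast ripple.
envelope = [3.0 + 2.0 * basis(1, m) + 1.0 * basis(3, m) for m in range(num_bands)]
ripple = [0.5 * basis(20, m) for m in range(num_bands)]
log_spectrum = [e + r for e, r in zip(envelope, ripple)]

coeffs = dct_ii(log_spectrum)
smoothed = inverse_from_first(coeffs, keep=13)  # keep coefficients 0..12 only

# How far the smoothed curve is from the envelope, and from the full input.
envelope_error = max(abs(s - e) for s, e in zip(smoothed, envelope))
detail_removed = max(abs(s - x) for s, x in zip(smoothed, log_spectrum))
```

In this toy case the first 13 coefficients recover the envelope essentially exactly, while the fast ripple, which sits at coefficient 20, is dropped. That is the sense in which the first MFCCs describe the large structures of the spectrum and ignore the fine ones.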
So, the great thing about MFCCs is that they've been tested for a long time, both in speech and music processing, and they work quite well. That's the point. They've been the audio feature of choice for speech and music processing for a long time. Right? Okay. Now, let's take a look at some of the disadvantages. First of all, MFCCs are really not that robust with respect to additive noise. Then, the second point, I think, is at the same time a great thing and also a shortcoming. As we said, to come up with MFCCs, we have to put in a lot of knowledge about the way we humans perceive speech or music. And that can be a shortcoming. Why is that? Well, that's because we're somewhat biasing this audio feature based on biases that we have. We are not letting the machine decide for itself what the most relevant elements are, for example, in a raw audio file or in a spectrogram, right? One of these arbitrary decisions that we take is the mel scaling that we do on the spectrogram. So, when we move from the spectrogram to the mel spectrogram, for example, there we use the mel scale, right? And the mel scale is definitely a valid, perceptually informed pitch scale, but it's not the only one. We could have used another one that's called the Bark scale, for example.

[49:22]Or even when we're dealing with the filters, the mel filter banks, there we are using triangular filters, but that is an arbitrary decision. So, right, we are injecting some level of bias in there. And the machine may not need that. It could just use raw data, like spectrograms, for example, and figure out what it needs, learning it directly from data, without us biasing the data. Okay. And finally, a major disadvantage is that MFCCs are great for analysis, but not for synthesis. We can do all types of analytical stuff, like speech recognition or music genre classification, but they're not great for synthesis, because we don't really have an inverse. We're not capable of moving from MFCCs back to raw audio in a perfect manner. We only have an approximation of that. So, you can't really use MFCCs for synthesizing audio. Cool. Okay. So, now, I think you should really, really congratulate yourself because we've done a lot of work in this video. We've looked at the cepstrum. We looked at the math behind the cepstrum. We looked at the visualization of the cepstrum, and at how we can interpret the cepstrum in the context of speech. And then we moved on to mel-frequency cepstral coefficients. But up until now, we've just dealt with theory. So, as is usually the case in this series, the next step is going to be that of implementation, or just playing around with Python. In the next video, we'll use all of the information that we've covered here to extract MFCCs with Python and Librosa, and we'll also visualize MFCCs for different audio files. Okay. So, I really hope you enjoyed this video. And if you found it useful, yeah, please leave a like. If you haven't subscribed and you want to have more videos like this, please do remember to subscribe. And I guess that's it for today. I'll see you next time. Cheers.
