Mel Spectrograms Explained Easily

Valerio Velardo - The Sound of AI

30m 32s · 4,478 words · ~23 min read

YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:00]Hey everybody and welcome to a new exciting video in the audio signal processing for machine learning series. This time we'll be looking into Mel-spectrograms. This is a flavor of spectrograms that is extensively used in AI audio research and real-world AI audio applications. But before we get started with Mel-spectrograms, I want to give you an overview of what we did over the last couple of videos: we covered the short-time Fourier transform, and we also looked into extracting and visualizing spectrograms. If you're not familiar with these concepts, I highly suggest you go check out my previous two videos, because what I'm going to say next builds extensively on those two notions. Now, without further ado, let's move on to this visualization of a spectrogram. As you can see here, on the x-axis we have time and on the y-axis, as usual, we have frequency. Each point, which is represented by a color in this 2D representation, tells us how present a certain frequency is at each point in time. The thing I want you to understand here is that the frequency representation in a normal, vanilla spectrogram is linear and uses hertz. And this is kind of problematic in terms of the way we perceive frequency. Now, I don't just want to tell you that, I want to show you that, and to do so we're going to do a little psychoacoustic experiment. We have a couple of audio samples, and each of them has a pair of notes. In the first one we have C2 that moves to C4, C4 being middle C, the C note at the center of a piano keyboard. In the second sample we have G6 that goes up to A6. Now, if you take a look at the difference in frequency, expressed in hertz, between the two notes in each pair, you'll notice that these two differences are more or less the same, around 200 hertz.
But will these two pairs of notes sound the same in terms of their pitch distance? This is my question to you. Okay, so let's listen to these two pairs of notes. The first one, C2 to C4. Okay, the second one, G6 to A6. I'm sure you'll agree with me that the two notes in the second sample are way closer in pitch than the two notes in the first. And this is a little bit weird, isn't it? The difference in frequency as expressed in hertz is roughly the same, but that isn't true from a perceptual perspective, because the second pair of notes sounds way closer in pitch than the first. This basically means that the way we perceive pitch is non-linear. Indeed, we have way better resolution at lower frequencies than at higher frequencies. Specifically, humans perceive frequency logarithmically. And this is a bit of a problem with vanilla spectrograms because, as I said, frequency is expressed linearly there, so we don't account for the way we human beings perceive it. Now, let's think about an ideal audio feature that would make the most sense to feed to our machine learning and deep learning algorithms. First of all, we want some kind of time-frequency representation, because we want to know how the different frequencies in a signal evolve over time. Second, we want this feature to have a perceptually-relevant amplitude representation, once again because the way we perceive amplitude is logarithmic, not linear. Both of these things we can achieve with vanilla spectrograms, and indeed we have log-amplitude spectrograms, as we saw in the previous video. But what we can't achieve with vanilla spectrograms is the third aspect, which is a perceptually-relevant frequency representation.
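The note-pair experiment above can be checked numerically. The following is a minimal sketch, assuming twelve-tone equal temperament with A4 tuned to 440 Hz (a tuning reference not stated in the video); `note_to_hz` and `semitones_between` are illustrative helper names. It shows that both pairs are roughly 190 to 200 Hz apart, yet the first pair spans two octaves while the second spans only two semitones.

```python
import math

def note_to_hz(name: str, octave: int) -> float:
    """Equal-temperament frequency of a natural note, assuming A4 = 440 Hz."""
    semitone_index = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}[name]
    midi_number = 12 * (octave + 1) + semitone_index  # MIDI convention: C4 = 60, A4 = 69
    return 440.0 * 2 ** ((midi_number - 69) / 12)

def semitones_between(f_low: float, f_high: float) -> float:
    """Pitch distance on a logarithmic scale: 12 semitones per doubling of frequency."""
    return 12 * math.log2(f_high / f_low)

pairs = [(("C", 2), ("C", 4)), (("G", 6), ("A", 6))]
for (n1, o1), (n2, o2) in pairs:
    f1, f2 = note_to_hz(n1, o1), note_to_hz(n2, o2)
    print(f"{n1}{o1} -> {n2}{o2}: {f2 - f1:.1f} Hz apart, "
          f"{semitones_between(f1, f2):.1f} semitones apart")
```

The linear difference in hertz is nearly identical for the two pairs, while the logarithmic distance differs by a factor of twelve, which matches what the ear reports.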

[4:43]And this is what Mel-spectrograms are all about: they check all three of these conditions. Now, "Mel-spectrogram" is a term with two words. One is spectrogram, and by now you should be more than familiar with it. The other one is mel. So what's a mel? This concept is connected with the idea of a Mel-scale, and the Mel-scale is a perceptually-relevant, or perceptually-informed, scale for pitch. Here we have a nice representation, so I suggest you focus on this graph right now. On the x-axis we have frequency expressed in hertz, on the y-axis we have frequency expressed in mels, and this blue curve is just the mapping between hertz and mels. As you can see, it follows a logarithmic kind of curve, and indeed the Mel-scale is logarithmic in nature, and by being logarithmic it becomes perceptually-relevant. What does this mean, really? It means that equal distances on the scale have the same "perceptual" distance. Let's try to understand this with an example. Say we have a couple of pairs of notes. The first pair is 500 mels and 510 mels. The second one is 1,000 mels and 1,010 mels. The perceptual difference between the two notes is the same in both pairs, so the pitch distance is going to be the same, and that's obviously not true for hertz, as we demonstrated with our little experiment earlier. One thing that researchers did with the Mel-scale is standardize it so that at 1,000 hertz we have 1,000 mels. Now, you may be wondering how we arrived at such a perceptual scale. Well, we arrived at it empirically, through psychological experiments.
Now, one last question you may have regarding the Mel-scale is: where does the term mel come from?

[7:19]Well, mel is an abbreviation for melody, and melody is very connected with the concept of pitch, because pitch is the main thing that you have in a melody, along with rhythm, I would say. So melody, mel, pitch: you get the idea, right? Okay, now let's take a look at how we can move from frequency expressed in hertz to frequency expressed in mels. It's this empirical formula here, found, as I said, by trial and error with experiments. The important thing here, apart from all of these constants, is the logarithm that we use: we have this logarithmic kind of relationship. We can also take the inverse of this function, and with it we're able to move from mels back to hertz. I want you to keep these two formulas in mind, because we're going to need them moving forward. Okay, so now you should have a basic idea of the Mel-scale and why it is important in terms of the way we perceive pitch. But how does this relate to Mel-spectrograms? Here I'll give you a kind of recipe for extracting Mel-spectrograms, and this recipe has three steps. The first thing we do is extract the short-time Fourier transform. The second is to convert the amplitude to decibels; in other words, we take a logarithmic representation of amplitude. As I said, these two steps are the ones we use for a vanilla spectrogram, so nothing new here. The new thing is the third step, which is converting frequencies to the Mel-scale. This is the whole point of Mel-spectrograms: we take a spectrogram and convert its frequencies to the Mel-frequency representation. Okay, so the next question is: how do we do that?
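The two conversion formulas mentioned above can be sketched as follows. The transcript does not spell out the constants shown on the slide, so this sketch assumes the commonly cited form m = 2595 · log10(1 + f / 700) and its inverse f = 700 · (10^(m / 2595) − 1); the function names are illustrative. It also revisits the equal-mel-distance example: a 10-mel step covers more hertz at 1,000 mels than at 500 mels.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Hertz to mels, using the commonly cited constants 2595 and 700 (assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, mels back to hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# With these constants, 1,000 Hz maps to approximately 1,000 mels.
print(f"1000 Hz is about {hz_to_mel(1000.0):.1f} mels")

# Equal steps in mels correspond to unequal steps in hertz.
for m_low, m_high in [(500.0, 510.0), (1000.0, 1010.0)]:
    width_hz = mel_to_hz(m_high) - mel_to_hz(m_low)
    print(f"{m_low:.0f} -> {m_high:.0f} mels spans about {width_hz:.1f} Hz")
```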
Well, for converting these frequencies to the Mel-scale we have three steps. We choose the number of Mel-bands, we compute the Mel-filter banks, and finally we apply the Mel-filter banks to the spectrogram. Don't be scared by the term Mel-filter banks, because I'll show you what it is in a few minutes. Now I'm going to break down each of these steps for you, so that it becomes super clear how to do this. Let's start with the first thing, choosing the number of Mel-bands. When you're reading some research, say in AI audio or music information retrieval, and researchers are using Mel-spectrograms, one fundamental parameter is the number of Mel-bands, and it can vary. So the question is, how many Mel-bands should you pick? Well, 40 is totally fine. 60 is fine. Sometimes you find numbers close to 90, or even 128. The reality is that it really depends on the problem; there isn't one single answer to this question. It's a bit like asking what learning rate you should use: there isn't just one answer, and you have to try out different values and see what works best. Indeed, the number of Mel-bands is a hyperparameter that you can vary to see the impact it has on your deep learning algorithms. But all in all, the numbers you usually see are between, I would say, 40 and 120 or 130 at most. A little heuristic here: if you think about the number of notes on a piano keyboard, that is 88, and that reflects the way we tend to resolve and represent frequencies.
Now, the idea of a Mel-band is that of a range of frequencies that is perceptually relevant. So I would expect these numbers to make sense, because they're close to the number of notes we usually experience, for example in Western music: they're comparable, they have the same order of magnitude. Okay, good. So now we know more or less what numbers of Mel-bands to choose from. The next step is to actually construct the Mel-filter banks, and here we'll actually understand what these Mel-bands are. So how do we construct the Mel-filter banks? It's not really that complex, it's more that it's a multi-step process. It has five different steps, and I'm going to break down all five for you. I'm not going to get too much into the math here, nor am I going to implement it. But I highly suggest you try to implement all of these steps that I'm going to talk about in a theoretical way, because that is a very good exercise to check whether you understand them precisely. Okay, so how do we build these Mel-filter banks? First of all, we take the lowest and highest frequencies in the frequency range of the short-time Fourier transform that we've just extracted. Then we convert the lowest and highest frequency to a Mel representation. How do we do that? We use the formula we already encountered when talking about the Mel-scale. So now we have the lowest and highest frequency that we want to consider, expressed in mels. Next step.
Given that we've already chosen the number of Mel-bands we want to use for our Mel-spectrogram, the next step is to take the frequency range in mels, with the lowest mel over here and the highest mel here, and create as many equally spaced points as the number of Mel-bands that we want to use.

[14:41]This is very simple to visualize. Let's say we want to use six Mel-bands. If this is the lowest Mel-frequency and this is the highest Mel-frequency, we'll just take these points here. So we create these six points, and they'll be equally spaced in this frequency range.
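The first steps described above (convert the frequency range to mels, then place equally spaced points) can be sketched as below. Several things here are assumptions rather than details from the video: the sample rate of 22,050 Hz, the constants in `hz_to_mel`, and the choice to generate `n_bands + 2` points so that the two outermost points serve as the lower edge of the first band and the upper edge of the last band, which is a common implementation detail. The six inner points are the band centers.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Hertz to mels, using the commonly cited constants 2595 and 700 (assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def equally_spaced(start: float, stop: float, num: int) -> list:
    """Return num evenly spaced points from start to stop, both included."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

sample_rate = 22050                     # assumed sample rate in Hz
n_bands = 6                             # six Mel-bands, as in the spoken example
mel_low = hz_to_mel(0.0)                # lowest frequency covered by the STFT
mel_high = hz_to_mel(sample_rate / 2)   # highest frequency covered by the STFT

# Assumption: two extra points act as the outer edges of the first and last band.
mel_points = equally_spaced(mel_low, mel_high, n_bands + 2)
mel_centers = mel_points[1:-1]
print([round(m, 1) for m in mel_centers])
```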

[15:06]Okay. What we do next is quite simple, really: we convert these points back to hertz by using the function that we found when we were talking about the Mel-scale. If you're wondering what these points are, they are the center frequencies of the different Mel-bands that we are talking about here. We'll see this with a visualization in a couple of moments. Okay, next step. We've now converted these center frequency points for our Mel-bands back to hertz, and now we want to round these points to the nearest frequency bin. Why do we need to do that? It's because we are dealing with discrete signals, which means that we don't have infinite, perfect resolution in frequency. Our resolution is constrained by the frame size of the short-time Fourier transform. What that means is that when we get back these center frequency points, we can't just take them as they are: we need to round them to the nearest frequency bin that we have available. And finally, we have the last step, which is probably the most important: creating triangular filters. These triangular filters are the building blocks of a Mel-filter bank, and they are connected with the idea of the different Mel-bands. I'll visualize what these things are, and then I'll explain how we get to this visualization. So don't be scared by all of this complexity, because it's way easier than it looks.
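The conversion back to hertz and the rounding step described above can be sketched like this. It assumes that STFT bin k sits at frequency k · sample_rate / frame_size, and it uses an assumed frame size of 2,048 and sample rate of 22,050 Hz; `nearest_bin` is a hypothetical helper name, and the mel values in the example are arbitrary.

```python
def mel_to_hz(m: float) -> float:
    """Mels back to hertz, using the commonly cited constants 2595 and 700 (assumed)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def nearest_bin(f_hz: float, frame_size: int, sample_rate: int) -> int:
    """Index of the STFT frequency bin closest to f_hz."""
    k = round(f_hz * frame_size / sample_rate)
    return min(max(k, 0), frame_size // 2)  # clamp to the valid range 0 .. frame_size / 2

frame_size = 2048     # assumed STFT frame size
sample_rate = 22050   # assumed sample rate in Hz

mel_points = [500.0, 1000.0, 1500.0, 2000.0]   # arbitrary example points in mels
hz_points = [mel_to_hz(m) for m in mel_points]
bin_points = [nearest_bin(f, frame_size, sample_rate) for f in hz_points]

for m, f, k in zip(mel_points, hz_points, bin_points):
    print(f"{m:.0f} mels -> {f:.1f} Hz -> bin {k}")
```

A larger frame size gives narrower bins, so the rounded positions land closer to the exact center frequencies.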
Okay, so here on the x-axis we have the frequency: at the bottom it is expressed in hertz, at the top in mels. On the y-axis we have weights, and a weight is between zero and one. Why do we have weights there? Because these are filters: what they do is filter sounds. When you have a weight equal to one, you're not touching the signal. But with a weight below one, you are damping the signal, because you're scaling it by a value that's below one. Now, as you can see, we have an example with six Mel-bands. Indeed, you have six points overall, and these are the center frequencies for our six Mel-bands. Let's take, for example, this second point here. Let's say it's at 2,000 hertz, which is 1,526 mels. What you can notice here is that the difference between the center points of the Mel-bands is the same when it's calculated in mels: we always have the same difference between two subsequent center frequency points. But this is not true in the case of hertz, and that's the whole point, right?

[19:12]That's the whole point, because with frequency expressed in hertz, when we go towards higher frequencies, we have to spread out the frequencies to have the same perceptual difference in terms of frequency distances. Okay, but now let's focus only on constructing a single triangular filter for one Mel-band. Let's take Mel-band number two here. The center of this Mel-band, as we said, is 1,526 mels. The lower end and the higher end are given, respectively, by the center frequency of the previous Mel-band and the center frequency of the subsequent Mel-band. And at the lower end and the higher end of a Mel-band, we have a weight equal to zero.

[20:20]In other words, below the lower end frequency the signal is completely muted, it is set to zero, and the same goes for the frequencies above the higher end here.

[20:40]Okay. Now what we do is trace lines from the lower end and from the higher end up to the center frequency, where the weight is equal to one. If we do that, we build a triangular filter like this. If you do the same thing again for the third Mel-band, fourth, fifth, and sixth, you come up with the whole Mel-filter bank. And these are triangular filters, okay? So I hope that you now have an understanding of what a Mel-filter bank is. The whole purpose of this is that we can then apply it to a normal spectrogram and convert the frequencies from hertz to mels. But here we just have a visualization of a Mel-filter bank. Obviously, in digital signal processing, as well as in machine learning, we don't do operations with visualizations, we use math and linear algebra most of the time. What this means is that we can represent this Mel-filter bank using a matrix, or a two-dimensional array. The number of rows that we're going to have in the Mel-filter bank matrix is equal to the number of bands that we chose in our example.

[22:31]It's going to be equal to six, so six rows. And then the number of columns is equal to the number of frequency bins up to the Nyquist frequency, or, in other words, frame size divided by two, plus one. So this is the matrix that we'll be using for representing a Mel-filter bank. Okay.
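The triangular filters and the resulting matrix can be sketched as below. This is a minimal illustration, not the video's implementation: `triangular_filter_bank` is a hypothetical helper that takes already-rounded bin indices (lower edge, band centers, upper edge, as an assumed convention) and returns a nested list with one row per band and one column per frequency bin. A tiny frame size of 16 is used so the whole matrix can be printed.

```python
def triangular_filter_bank(bin_points: list, n_bins: int) -> list:
    """Build one triangular filter per band.

    bin_points holds n_bands + 2 increasing bin indices: each band uses three
    consecutive entries as its lower end, center, and higher end.
    """
    n_bands = len(bin_points) - 2
    bank = [[0.0] * n_bins for _ in range(n_bands)]
    for band in range(n_bands):
        left, center, right = bin_points[band:band + 3]
        # Rising slope: weight 0 at the lower end, growing towards the center.
        for k in range(left + 1, center):
            bank[band][k] = (k - left) / (center - left)
        bank[band][center] = 1.0  # weight 1 at the center frequency
        # Falling slope: weight shrinking to 0 at the higher end.
        for k in range(center + 1, right):
            bank[band][k] = (right - k) / (right - center)
    return bank

frame_size = 16                  # tiny frame size, for readability only
n_bins = frame_size // 2 + 1     # number of columns: frame size / 2 + 1
bank = triangular_filter_bank([0, 2, 4, 8], n_bins)  # two bands

print(len(bank), "rows x", len(bank[0]), "columns")
for row in bank:
    print(row)
```

Note how the second triangle is wider than the first: bands higher up cover more bins, mirroring the wider spacing in hertz.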

[22:54]So, I hope you now have an understanding of the Mel-filter bank. With all of this information, we can go to the final step in the conversion from frequencies to the Mel-scale for our spectrogram. In other words, we are moving from the spectrogram to the Mel-spectrogram. How do we do that? We have to apply the Mel-filter banks to the spectrogram, and here linear algebra is going to help us quite a lot. Here we have the shape of the Mel-filter bank matrix, and here we have the shape of another matrix, the one associated with the spectrogram. If you followed my last couple of videos, you know by now that the shape of the matrix that represents the spectrogram is given by these numbers here. The number of rows in the spectrogram is equal to frame size divided by two, plus one, and the number of columns is given by the number of frames, or, in other words, the temporal bins.

[24:10]Now, we said that the whole point here is to apply the Mel-filter banks to the spectrogram. But what does that mean mathematically? If you're familiar with linear algebra, you've probably noticed something interesting about these two matrices: the number of columns of the first matrix, the Mel-filter bank matrix, is equal to the number of rows of the matrix that represents the spectrogram. When this happens, it means that we can apply matrix multiplication. In other words, applying the Mel-filter banks to the spectrogram, from a mathematical standpoint, means doing matrix multiplication between the Mel-filter banks and the spectrogram. The result of this multiplication is the Mel-spectrogram. I'm not going to get into the details of matrix multiplication, because it's outside the scope of this series. But if you're interested and you don't know that much about linear algebra, I have a video over here that you can check out, where I talk about basic operations with matrices, and one of those is matrix multiplication.
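The multiplication described above can be illustrated with toy numbers. In this sketch the filter bank has shape (2 bands, 5 bins) and the spectrogram has shape (5 bins, 3 frames), so the product has shape (2 bands, 3 frames). All values are made up for illustration, and `matmul` is a plain-Python stand-in for what a numerical library would normally do.

```python
def matmul(a: list, b: list) -> list:
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must equal rows of b")
    n_cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(n_cols)]
            for i in range(len(a))]

# Toy Mel-filter bank: 2 bands x 5 frequency bins (made-up weights).
mel_filter_bank = [
    [0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5],
]

# Toy spectrogram: 5 frequency bins x 3 frames (made-up magnitudes).
spectrogram = [
    [1.0, 1.0, 1.0],
    [2.0, 0.0, 2.0],
    [4.0, 0.0, 0.0],
    [2.0, 2.0, 0.0],
    [1.0, 1.0, 4.0],
]

mel_spectrogram = matmul(mel_filter_bank, spectrogram)
print(len(mel_spectrogram), "bands x", len(mel_spectrogram[0]), "frames")
for row in mel_spectrogram:
    print(row)
```

Each row of the result is a weighted sum of the spectrogram rows that fall under one triangular filter, which is how a band of linear bins collapses into a single Mel-band.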

[25:42]Okay, now let's take a look at the Mel-spectrogram. The Mel-spectrogram is a matrix itself, and its shape is given by the number of bands, so we have as many rows as the number of bands that we chose, and the number of columns is equal to the number of frames of the original spectrogram. In other words, this multiplication, applying the Mel-filter banks to the spectrogram, enables us to convert the frequencies from hertz to Mel-bands. That's the whole point of a Mel-spectrogram, and the cool thing is that we are now using Mel-bands, which are psychologically relevant in terms of the way we perceive pitch. But now, let's take a look at what this means in terms of a visual representation. If you're familiar with spectrograms, Mel-spectrograms are no different at all. You still have a heat map, and that's because we have a matrix. On the x-axis we have time, and on the y-axis, once again, we have frequency, but this time the frequency bins are not the linear bins that we use with the spectrogram, but rather the Mel-bands. Each frequency bin is a different Mel-band, which is perceptually relevant. And each point that we have here has an associated color that represents how present a certain Mel-band is at a certain point in time.

[27:43]So basically, the conceptual understanding of the visualization of Mel-spectrograms is the same as the one we had for spectrograms. The only thing that really differs here is the way we represent frequency, which is psychologically relevant this time, unlike in a spectrogram.

[28:06]Okay. The last question we might ask here is why we should bother learning about Mel-spectrograms. The answer is that they are extensively used in a lot of AI audio and AI music research. For example, Mel-spectrograms are overwhelmingly used in automatic mood recognition, music genre classification, music instrument classification, and many more applications.

[28:49]Okay. This time we only focused on the theory of Mel-spectrograms, and I hope that by now you have a clear understanding of what mels are, what the Mel representation is, why Mel-spectrograms are different from spectrograms, and why they are very convenient to use in a lot of applications involving audio and music. Next time, we'll move on to the implementation. We'll be extracting Mel-spectrograms from audio files using Python and Librosa, which has a lot of very nice utility functions that will save us a lot of time. Then we'll visualize Mel-spectrograms, and we'll also try to extract and visualize Mel-filter banks, so that you can actually see how we represent them and what they represent. Before I sign off, I want to remind you about The Sound of AI Slack community. There, if you have any questions about this stuff, your projects, or anything else regarding AI audio and AI music, you can just ask. You'll find a community of people interested in AI audio, AI music, and audio processing who are ready to help you and to share their ideas and projects. So I really suggest you go check out this community, and I'll leave you the link to sign up in the description below. That's all for today. I hope you really enjoyed the video, and I guess I'll see you next time.
