[0:00]Hi everybody, and welcome to a new exciting video in the audio signal processing for machine learning series. Last time we extracted the discrete Fourier transform with Python from a bunch of audio files. This time, we're back to theory, specifically, we'll be addressing a key topic in AI audio, the short-time Fourier transform. So why is the short-time Fourier transform so important? Well, that's because it enables us to extract spectrograms and spectrograms are really the, probably the most important feature that you can feed to deep learning audio models. But before we get into the in and outs of the short-time Fourier transform, I want to remind you once again about the sound of AI Slack community. On this community, you'll find people with interests in AI music, AI audio, audio digital signal processing. And so, if you're there, you can ask for feedback, you can share your projects and network with a bunch of very cool people. So if you're interested in joining, I'll leave you the sign-up link in the description section below. Okay, now on to the real cool stuff. So, before we get to the STFT, I want to remind you about the discrete Fourier transform. And here we have its mathematical formulation. Now, I'm not going to get too much into the details here, that's just because I have a whole video on the discrete Fourier transform, so if you're interested, you can go and check that out. But what we need here is the high-level intuition. So, we start with our signal in the time domain, so a waveform like this. Then we apply the discrete Fourier transform, and what we get back is basically a picture of the presence of the different frequency components in the original signal. And visually, we get like a magnitude spectrum like this, but this is a still image. What do I mean by that? Well, it's a still image in the sense that it only provides us like one picture that averages the presence of the frequency components across the whole duration of the signal. And here we actually have a problem, because we know what um, frequency components are present in a signal, but we don't know when they are more or less present. Because all of them are averaged across the whole duration of uh the entire signal. And this is a little bit of a problem because we know with audio data, it's all about the um evolution of frequency components like over time. And so, audio data is very, very dynamic, and we want to know how like these different frequency components evolve over time. And this is the whole point of the short-time Fourier transform. So moving from a still image to a video that provides us information about the um different frequency components across time. So how can we do that? Well, that's the whole point of the short-time Fourier transform, and the high-level idea here is that we don't perform the Fourier transform across the whole duration of the signal, but rather, we consider a small segments or chunks of the signal, which technically we call frames, and then we apply a discrete Fourier transform for each frame. Now, I know this could sound a little bit like abstract, so let's visualize this. So we start with a signal, audio signal like this. Then we consider only like the first chunk, the first frame, and at this point, only on these samples belonging to this frame, we apply the discrete Fourier transform, and we get back like our nice magnitude spectrum. Then we slide on to the next frame and once again, we apply to that frame the discrete Fourier transform. Third frame, same thing until we bounce through all the duration of the signal. One way we can use to derive the segments is through windowing. In other words, we apply a window function to a signal. What does that mean? Well, it means that we take the original signal, and then we multiply that by a window function sample by sample, and we obtain a windowed signal. Now, this feels a little bit abstract, doesn't it? So let me give you an example. So we start from a, an audio signal, and then here we're going to be applying a rectangle windowing function. And this is the result. So the rectangle window function is this red curve here and yeah, it has like a rectangle shape, right? And this function is zero everywhere apart from a segment where it is equal to one. So if we multiply this signal with the rectangle window, we obtain these windowed signal down here. Now, I want to introduce a couple of parameters that are very important for what we are discussing today. So one is the window size, the other one is the frame size. They're both measured in number of samples, but they refer to two slightly different things. So, let's take a look at the window size first. So, the window size is basically the amount of samples we apply windowing to. The frame size, on the other hand, is the kind of like the, the number of samples that we consider in each chunk of the signal when we segment the, the signal, and then we pass it to the, the short-time Fourier transform for just like calculating the Fourier transform for each frame, for each segment. Okay. Usually, the window size and the frame size coincide. They are, they have the same value, the same number of samples. But sometimes it happens that the frame size is larger than the window size. Now this is like quite unusual, I would say, and most of the time, most of your applications, the window size and the frame size will coincide. And this is like so uh, so true that, for example, in Librosa, when we um extract the short-time Fourier transform, we we're not forced to pass the window size, and the default value for the window size is the frame size, right? Okay. But what happens if the window size is smaller than the frame size? Well, still we the chunk, so we, we apply the Fourier transform to is the whole frame, but then the windowing happens only on the window size number of samples, right? And the remaining samples, which are the difference between the frame size and the window size, are going to be zero padded. Okay, but for the sake of this video, we'll assume that we have the window size, which is equal to the frame size. So if that happens, what this is, at this point, is already the, um, so we, we have like one frame here, and we've also applied the window function here. And at this point is when we apply the, uh, Fourier discrete Fourier transform so that we can get the frequency components out of this frame. Then we move on, we slide to the right and we get like a second frame like this, sliding also like the window function. And here, once again, we apply the discrete Fourier transform. We move to the third frame. Same thing until we get to the end of the signal. But this is a simplified version of what usually happens in a short-time Fourier transform, because the frames are overlapping like this. So the second frame is overlapping with the first one as you can see from here. Now, I need to introduce another parameter here, that's called hop size, or capital H, and it's given by this visually. And this basically provides us, or tells us how many samples we slide to the right when we take a new frame. Okay. If you want to know why um the hop size like is so important and we need overlapping frames, I really suggest you to go check out my video on audio feature extraction pipelines. It's up here, and there you'll learn about a lot of topics about, for example, like spectral leakage and a bunch of other things that are related to like the points that I'm making here, and I'm not going to get into the details here, because I've already done that. Okay, moving on. It's time to move from this kind of like visual intuition of the short-time Fourier transform to its mathematical formulation. So what I want to do here, and don't be scared, is to compare the digital Fourier transform, or its mathematical form, which is this formula in the top with the formulation for, mathematical formulation for the short-time Fourier transform, which is this one down here. Okay, so let's go item by item and see how they map to each other. Okay, the first one is just like the definition, right? The output that we get. So in the case of the discrete Fourier transform, we get like this X hat as a function of K, where K is a proxy for a frequency. And um, so in other words, like the, uh, discrete Fourier transform depends on the frequency. And this means that uh, each of these like formulas is going to give us a complex Fourier coefficient for the K frequency. And the complex Fourier coefficient provides us information to, about two parameters, the phase and the magnitude. Now, if you want to really know what phase and magnitude mean in terms of the Fourier transform, I suggest you to go check out these video on uh the Fourier transform, and there you'll get like a better picture, but I hope like you've already watched that. Okay. So this is for the discrete Fourier transform. What about the short-time Fourier transform? As we can see here, a capital S depends not only on K, on frequency, but also on M. So what's this M? Well, M is a proxy for time, so that the short-time Fourier transform depends both on frequency and time. So now let's understand a little bit better like what this M is, and uh, like nominally, uh, is just like the, the frame, the, the frame number we are currently in. But let's visualize this. Okay, so here we have like we're back again with our original signal and like the different frames here. And so, for the first frame, we have M, which is equal to one. Here we have for the second frame M equal to two and M equal to three, you get the idea, right? So M is just like the frame number. So in other words, uh, the result that we get from the short-time Fourier transform is the Fourier coefficient for the K frequency at the N temporal bin or N frame. Okay, but still, like the, the Fourier coefficient that we get, it's still like a complex number that has information about phase and magnitude. Okay, moving on. So the next step is to compare these two sums. If you guys remember from the DFT, so what was happening here is we were summing all across like all the samples. So basically, we were summing across like all the time, all the duration of the signal. And we do like something similar also in the STFT, kind of like intuitively, we are doing the same thing. We are summing like across time, or given like we are in a discrete domain, across like all the different samples. But what's different between these two sums is capital N. In the case of the discrete Fourier transform, we are summing across all of the samples in the signal. So N is equal to all the samples in the signal. In the case of the short-time Fourier transform, capital N over here is equal to the frame size. And that's because we're not considering all of the signal, but just one frame, and in one frame, we have a number of samples that's equal to the frame size by definition, right? So now you start to get the idea of how like this STFT works. But to see this like even more specific, we need to move to the next element of this, uh, formulas, which is the signal itself. So let's analyze the one in the top formula. So for the discrete Fourier transform, X of N is just like the signal considered on, I mean, the whole signal. So all of these samples in the short-time Fourier transform, by contrast, we're only considering the signal that's present like in the current M frame. So in other words, we're considering all the samples that are present in the current M frame. And so, why is that the case? Well, that's the case because, first of all, this M multiplied by capital H is the starting sample of the current frame. Because M is the current frame and H capital H is the hop size. So if you multiply two, you realize that this is like the starting sample of the current frame. And then we add N, but this N moves like from zero to the frame size minus one, which basically means like that we are kind of like covering all the samples in that frame, okay? Now, if we want to visualize this, we can just go back to the signal. And here, like we have like this rectangle, and here we have like all of the the signal like for one frame. Uh, and here, like on the left of this rectangle, this vertical, uh, line here is the starting sample of the frame, which is equal to M multiplied by capital H. Okay. So now, uh, we, like, the next step, though, in the case of the short-time Fourier transform is that we should multiply this signal by the windowing function.
[16:12]And again, we already saw this, right? And so we are multiplying the original uh, signal like for a specific frame by uh, the windowing function, and we obtain the windowed signal.
[16:31]Okay, for that one frame. Okay, so moving on. We have the last step, and the last step is the same for both the discrete Fourier transform and the short-time Fourier transform. In other words, we are multiplying by a pure tone that has frequency given by K divided by capital N. And so by doing so, what we're doing is we're taking the, uh, the signal, uh, and then we are decomposing it, and projecting it onto the, uh, pure tone with frequency, um, K divided by capital N. Okay. So here you have the comparison between the math behind the discrete Fourier transform, and the math for the short-time Fourier transform. But now you may be wondering, okay, now, more or less like I get like the math here, but what are like the outputs? So what, what do we get out of a DFT and an STFT? Let's take a look at that. Okay, for the DFT, we extract a spectral vector with, uh, for a number of like frequency bins. In other words, like we, we get a Fourier coefficient for each of the frequency components with, uh, decomposed our original signal into. And this is a one-dimensional array. It's just like a vector, right?
[18:03]And there's no mention of time in here, because everything like is averaged across the whole duration of a signal. But with the STFT, we have like something that's quite different. In this case, we don't have a one-dimensional, uh, value, one-dimensional array, but rather, a two-dimensional array, or in other words, a spectral matrix that has a number of frequency bins and a number of frames.
[18:47]And so, in other words, we have both a reference to frequency, as we had with the discrete Fourier transform, but now we've gained also information about time through the different frames, which are proxies for time. Okay, but, um, you may be wondering, but can we calculate the actual number of frequency bins and number of frames that we get out of an STFT? Well, yes, of course we can, and we'll do that. Okay. So, how do we get the number of frequency bins? Well, this is quite easy to get. And, uh, you have the formula here. So you get the frame size, you divide the frame size by two, and then you add up one. And this gives us the number of frequency bins.
[19:35]Now, let's try to understand why this is the case. If you guys remember from my earlier video on the discrete Fourier transform, you should know that the number of frequency bins that we get out of a discrete Fourier transform is equal to the number of samples that we have in the, in the, in the whole signal. Now, in the case of an STFT, we don't, uh, average, uh, we don't consider the whole samples at once, but rather like a frame size number of samples.
[20:10]So we would expect that the number of frequency bins, uh, for each Fourier transform that we get is equal to the frame size. But we don't get that. We get the frame size divided by two. Plus one. Why is that the case? Well, if you remember, once again, from the discrete Fourier transform video, we saw that the discrete Fourier transform is symmetric, has a mirror symmetry around the center frequency, which is the Nyquist frequency.
[20:42]And what happens there is that basically like the, the first half, uh, has some information, and then that gets like mirrored in the second half. And so, in a short-time Fourier transform, we, we consider that, and so we don't need to take information about all of these bins, because it's just like redundant. We only take information about like the first half. So frame size divided by two plus one. So that's the reason why. Now, if you haven't fallen, that's completely, I highly suggest you to go check out my video on the discrete Fourier transform to understand what I meant more specifically there. Okay, now let's move on to the number of frames. And so here we have another very nice little formula. And the number of frames is given by the total number of samples that we have in the signal, minus the frame size, divided by the hop size plus one. Now, I'm not going to get into the details of explaining this visually, and I highly suggest you as an exercise to play around with this and understand why this formula gives us the number of frames. Okay. But I know this can feel a little bit abstract, so let's go move on with an example. So here we have like a bunch of like uh, STFT parameters. And we want to find the actual output shape. So we have a signal with 10,000 samples. We have a frame size, which is equal to 1000 samples, and we have a hop size of 500 samples. So, uh, for the number of frequency bins, so we take the frame size, we divided it by two, we add one, and we get 501 frequency bins.
[22:21]Okay, but these are frequency bins, and we know that, uh, they divide a certain frequency range equally. And so we have like a frequency range that's divided in 501 bins in this case. Uh, now, what, what is that range? Well, that range, the frequency range is between zero hertz and the sampling rate divided by two hertz. And that is the Nyquist frequency, once again. Uh, so if you want to know why that's the case, once again, just go back to my video on discrete Fourier transform. Okay, moving on. The number of frames. So here we have like a little formula, and so we have to take the number of samples in the signal, so 10,000, minus the frame size, and this is going to be divided by the hop size, and to all of this, we have to add one, and the result is 19. So we have 19 frames this signal is going to be divided into. So the overall output shape of the STFT in this particular case is going to be 501 and 19. So it's a two-dimensional array. The first dimension, uh, provides us information about frequency, the second, uh, provides us information about the temporal, uh, bins, or the number of frames. Okay.
[23:59]So now, uh, I think like we should, uh, take a look at the short-time, um, Fourier transform and try to understand the different parameters. So the important thing that you should understand here is that really, the short-time Fourier transform depends on a bunch of parameters that we pass. So one of these parameters, and we've already encountered it, is the frame size, or in other words, uh, how big are the chunks we divide our original signal into. And this is measured in frames.
[24:33]And the usual values that we have here are like, like this, like 512, 1024, 8,192. As you can see, these are power of two numbers. And as we already discussed in a previous video, it's important that the frame size is a power of two number. And that's because, um, with that specific number, we can use the Fast Fourier Transform to calculate the discrete Fourier transform, which is a very quick and computationally efficient, um, way of extracting the discrete Fourier transform. Now, there's an interesting aspect in, when we choose the frame size, and it's called the time-frequency trade-off. So, if we get like a larger, a large frame size, what usually, what happens is that the frequency resolution is going to, uh, increase, and the time resolution is going to be degraded. So where's that the case? Well, so we know that if we, uh, enlarge the frame size, so we take more samples, we're going to be having like more frequency bins. And so, if you have more frequency bins, it means that your frequency resolution overall improves. But if you take more, more samples in one frame size, it means that you are taking like a larger, you're considering a larger chunk of time, because samples like are proxies for, for time, okay? And in other words, if you're taking like like a larger chunk of time, it means that the time resolution goes down, right? And the opposite is also true. In other words, if you take a smaller frame size, then the frequency resolution is going to go down, and that's because you're going to have like a smaller number of frequency bins as output. But you're going to have a higher, a better time resolution, just because you're considering less samples, which equates like to, uh, like a less amount of time. And so, you're going to be calculating like the, the Fourier transform like on smaller chunks of time. So your time resolution is going to be better. Now, this is like a time-frequency trade-off, as you can see here. So when you are trying to improve the frequency resolution, then the time resolution is going to go down, and vice versa. Now, how do we solve that? Well, we don't really solve that. We just have like some heuristics. Most of the time, you want to find a value of the frame size that's okay, and it's a good trade-off between frequency and time resolution. Um, but this really depends on the type of application that you're, um, or problem that you are interested in. So certain problems, for certain problems, it's more important that you have a higher frequency resolution, and in that case, you should take like a bigger frame size. For other applications, like, for example, onset detection, you're not really super interested in the frequency, frequency resolution. Perhaps you're more interested in just like having like a very precise, or like high, highly resolved like time, um, so that you can really know what happens like at each point in time. Okay. So I think like this is like very important to keep in mind when you decide like which frame size like to, to take, because I mean, the two things, frequency and time resolution are, uh, related, uh, together and inversely, uh, related in a sense, right? Okay. Now, let's move on to the next, uh, short-time Fourier transform parameter, and that's the hop size. We already saw that, multiple times in this video and in earlier videos, and we know that's the number of samples that we slide to the right when we want to take a new frame. So, usual values here, once again, 256, 512, 1024, 2048, 4096. And we can also define these as a fraction of the frame size, so a half of the frame size, or a fourth, or an eighth of the frame size. So you have like both definitions, absolute and relative. Okay. Now, a third very important parameter is the windowing function. Obviously, the short-time Fourier transform is not only like a function of the signal itself, but it's a function also of the windowing function that we choose, because different windowing functions are going to kind of like modulate the original signal in different manners. And then this is going to have like an effect on the short-time Fourier transform results. Okay, so we introduce the rectangle window function, but that's not really used at all in, um, in digital signal processing, and that's because like it creates, uh, discontinuities like on, uh, the edges like all of the windows. Rather, to avoid those, you want to use like a bell-shaped curve, one of which, the most important, probably, is the Hann window. So 90% of the time, probably you're going to be using the Hann window when you perform a short-time Fourier transform, perhaps without even knowing that. And so this, uh, function is given like by this, uh, formula here, which is obviously like a periodic, uh, formula, a periodic like function over here. And here like you have visualization of this. So, um, so let, let's see this like in action. So here we have like a signal, here we have like our bell-shaped Hann window. So when we apply the Hann window to the signal, you see that the signal gets modulated, and towards the end, the, um, the values like of the of the samples tend to get squashed like towards zero so that we avoid discontinuities like on the edges like all of the windows. Once again, if you want to know why that is so important, you should go check out my video on audio feature extraction pipelines where I talk about spectral leakage. Okay. So now let's move on to the final topic of this, uh, video. And this is like what you probably came here for, and that's the spectrogram. So through the spectrogram, we can visualize sound. So, but how do we get to the spectrogram? Because up until now, we know that we have the short-time Fourier transform, and that's a matrix that has complex numbers or Fourier coefficients for each item in the matrix. So what we do is we take the squared magnitude of the short-time Fourier transform, and what we get is a matrix which has the same shape as the original short-time Fourier transform. But the difference is that now we have all, all of the items are not complex numbers anymore, but they are real numbers. And now we can visualize them using a heatmap. And the visualization is called a spectrogram.
[32:21]And this is, I mean, this is like so important for all applications in AI audio, because like so many times we're going to be using spectrograms as features that we feed into the algorithms. So now let's take a look at the spectrogram here. So on the X-axis, we have time, and these are like discrete times, and you can see it here that you have like these tiny like discontinuities. And these are like all the frames, all the temporal bins. And on the Y-axis, we have frequency, which all of the different frequency bins. And so what we are seeing here is how the different frequency bins, how the different frequency components evolve over time across the different, um, frames that we have in the original signal. And so, and now this is actually the dream that we wanted to come true. So now not only we have information about the frequency components, which was something that we already had with the spectrum, but we also have information about the components evolving over time, which is the information that we usually get from the time domain.
[33:43]And this is why spectrograms are so important in AI audio. Now, I'm not going to get into the details of implementation and all of these things like for spectrograms, because that's the, um, topic of the next video. So in the next video, we're going to be using Python and Librosa specifically, for extracting spectrograms. We're going to be looking into different flavors of spectrograms, and understand which ones to use, and then we're going to be examining like different audio samples and comparing them, perhaps like different musical genres and how like they, their spectrograms differ. Okay. So, I hope like you've found, uh, this video instructive and useful. If that's the case, please consider leaving a like. And if you haven't subscribed yet to the channel and you want to see more videos like this, please consider subscribing. So if you have any questions, please leave them in the comment section below. I think that's all for today. I'll see you next time. Cheers.



