[0:05]Hello everyone, welcome to this lecture in the Build Large Language Models from Scratch series. Up till now in this series, we have looked at a number of things. In particular, we have spent a lot of time discussing the data preparation and sampling stage of building a large language model. If you want to build an entire large language model pipeline, it has to be done in three stages. In stage one, you have the data preparation, then comes the attention mechanism, then comes the LLM architecture. In stage two, we have the training and the model evaluation, and in stage three, we have the fine-tuning. So we have spent around five to six lectures on the data preparation and sampling part, where we looked at word embeddings, tokenization, byte pair encoding, and positional encoding. Essentially, we have looked at the entire data pre-processing pipeline of the LLM in a lot of detail. Now it's time to move to the second building block of stage one, and that is the attention mechanism. In one of the earlier lectures of this series, I told all of you that I think of transformers as the secret sauce behind the LLM. So, if the transformer is like a car, then the attention mechanism is essentially the engine which drives the car. This is the mechanism, I think, which gives so much power to large language models, and that's why ChatGPT performs so well. There are a few lectures on the attention mechanism on YouTube, but they are not comprehensive at all. It's impossible to cover everything related to the attention mechanism in one lecture; it's one of the most important concepts. So I've planned a series of four to five lectures on the attention mechanism. Today's lecture will be a foundational overview, where we will cover an introduction to the attention mechanism: what it is, why it is really needed, and the types of attention mechanisms.
Then, what we'll be doing from the next lecture onwards is coding out the entire attention mechanism completely from scratch. We are not going to assume even a single thing. Okay. So, in this lecture, we are going to look at this sub-section, which is essentially the sub-section on the attention mechanism. So let me switch my color, I think I'll choose purple here, and let's start looking at the attention mechanism in detail. First, let me motivate this so that you get an intuition of why the name "attention mechanism" comes up and what we are essentially trying to solve here. Let's look at this example. Let's say you are a large language model like GPT and you have received this sentence: "The cat that was sitting on the mat, which was next to the dog, jumped." Now, as a human, I can say that, okay, there is a cat here, the cat was sitting next to a dog, the cat was also on a mat, and when I read this, I know that the cat jumped. Okay. But as a large language model, if you look at the sentence, you'll soon realize that this sentence is a bit confusing. I can very clearly see that the cat was sitting on the mat. If that alone were the sentence, I could easily analyze that the cat is the main subject in this sentence, the cat was sitting, and the object is the mat. But the thing is, complex sentences like this involve what are called long-term dependencies, where a second clause is attached to the first.
[4:00]So then it becomes a bit difficult for the large language model, because after this sentence, there will be a number of other sentences, right? But the main thing which the large language model really needs to understand from this sentence is that the cat, which is the main subject, is the one that actually jumped. So the action which that subject performed is jumping. The LLM should understand that when it looks at "cat", the word which it should be paying the most attention to is "jumped". Notice how I used the word attention. When I look at the word "cat", of course "sitting" is also important, because the cat was earlier sitting on the mat, but now the cat has jumped. So there are a few words in this sentence which the LLM needs to pay the most attention to in association with "cat". And if you don't introduce the attention mechanism, it's very difficult for the LLM to know that the cat is the one who has jumped. If the attention mechanism were not there, the LLM might have been confused: it might think, oh, the dog has jumped, or it might think that the main point of this sentence is simply that the cat is on a mat. It would not know that it has to give a lot of attention to "jumped" in association with "cat". This is the broad-level intuition for why we need to learn about the attention mechanism. When you have sentences such as this, and then there is a big story after this, the LLM needs to analyze this sentence and process it in relation to a particular word: let's say, in relation to "cat", which other word should I pay the most attention to? That's where the attention mechanism comes into the picture. It turns out that without the attention mechanism, if we use a recurrent neural network or some other neural network, it does not capture the long-term dependencies within and between sentences.
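To make this intuition concrete, here is a tiny sketch of the idea of attention weights. The word vectors below are completely made up for illustration; the point is only that a dot product between embeddings can score how strongly "cat" should attend to each of the other words, and a softmax turns those scores into weights that sum to one.

```python
import numpy as np

# Toy 3-dimensional embeddings for a few words from the example
# sentence. The numbers are invented purely for illustration.
embeddings = {
    "cat":    np.array([1.0, 0.2, 0.1]),
    "mat":    np.array([0.3, 0.9, 0.0]),
    "dog":    np.array([0.2, 0.1, 0.8]),
    "jumped": np.array([0.9, 0.1, 0.3]),
}

def attention_weights(query_word):
    """Score every word against the query with a dot product,
    then normalize the scores with a softmax."""
    query = embeddings[query_word]
    words = list(embeddings)
    scores = np.array([query @ embeddings[w] for w in words])
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return dict(zip(words, exp / exp.sum()))

weights = attention_weights("cat")
# Among the other words, "jumped" gets the highest weight because its
# toy vector points in a similar direction to "cat": exactly the
# association we want the model to learn.
others = {w: v for w, v in weights.items() if w != "cat"}
print(max(others, key=others.get))  # prints: jumped
```

Of course, in a real LLM these weights are produced by trainable parameters, which is exactly what the later lectures on self-attention build up to.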
That's the broad-level intuition. Now let's dive deeper into what all we will be covering about attention in the subsequent lectures. If you look at the attention mechanism itself, there are essentially four types. The main attention mechanism used in GPT, the generative pre-trained transformer, and all the modern LLMs is multi-head attention. And many YouTube videos and many courses just directly start with multi-head attention. It's a very difficult concept to understand if you start learning it directly, so you have to go in a sequential manner. What I'll be covering in this series of lectures is, first, something called simplified self-attention. This is the purest and most basic form of the attention technique, so that you understand what attention is. Then we will move to self-attention. Here we will also introduce trainable weights, which form the basis of the actual mechanism used in LLMs. Until this part we are still not at the actual mechanism, but we are building up slowly. After I cover self-attention, the next thing I'll move to is causal attention. This is when things really start to get interesting. We are predicting the next word, right, by looking at the past words. So causal attention is a type of self-attention that allows the model to consider only the previous and current inputs in a sequence, and it masks out the future inputs. No need to pay too much attention to this right now; I am just giving you a broad overview of what I'll cover in the subsequent lectures when we look at attention. Today we are not going to cover all of these. Today we are just going to look at more details about the history of how attention came into the picture, why it is needed, why it's better than RNNs, etcetera. And then finally we'll move to multi-head attention.
Only when you have understood simplified self-attention, self-attention, and causal attention will you be able to understand multi-head attention. This is the main concept which is actually used in building GPT. Multi-head attention is basically a bunch of causal attention heads stacked together, and we'll code out this multi-head attention fully from scratch. I'll show you the dimensions, how they work, etcetera. All of that is planned in the subsequent lectures. So multi-head attention is essentially an extension of self-attention and causal attention that enables the model to simultaneously attend to information from different representation sub-spaces. Don't worry about this; just remember that multi-head attention allows the LLM to look at input data and then process many parts of that input data in parallel. For example, if this is the sentence, multi-head attention allows the LLM to have, let's say, one attention head look at this part, one attention head look at this part, one attention head look at this part, etcetera. This is just a crude description so that you get an understanding of what multi-head attention means. I just wanted to show you this overview so that you get an idea of how these four to five lectures are actually planned. It is impossible, as I mentioned, to cover all of this in one lecture, and that's why I will follow a very comprehensive approach. I'll show everything on the whiteboard, and then I have this Google Colab notebook where everything has already been implemented, and we'll go through this entire notebook. See, hiding future words with causal attention, and then I also have a section on multi-head attention. Yeah, see. So at the end of these four to five lectures, we'll be implementing this multi-head attention in Python and coding it out from scratch. Okay.
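Since causal attention was just described as "masking out the future inputs", here is a minimal sketch of what that masking looks like mechanically. The 4x4 score matrix is made up; the point is only that setting everything above the diagonal to negative infinity before the softmax forces each token to attend only to itself and earlier tokens.

```python
import numpy as np

# Made-up raw attention scores for a 4-token sequence: row i holds
# token i's scores against every token j.
scores = np.array([
    [0.9, 0.4, 0.2, 0.7],
    [0.1, 0.8, 0.3, 0.5],
    [0.6, 0.2, 0.9, 0.1],
    [0.3, 0.7, 0.4, 0.8],
])

# Causal masking: token i may only attend to positions j <= i, so we
# overwrite everything above the diagonal with -inf. After the softmax,
# exp(-inf) = 0, so those future positions get exactly zero weight.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked = np.where(mask, -np.inf, scores)

exp = np.exp(masked - masked.max(axis=1, keepdims=True))
weights = exp / exp.sum(axis=1, keepdims=True)

# Row 0 attends only to token 0; row 3 may attend to all four tokens.
print(np.round(weights, 2))
```

Each row still sums to one, so every token distributes its full attention budget, just never over future words. The later lecture on causal attention builds this up properly.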
For now, let's continue with today's lecture, which is an introduction to the attention mechanism and how researchers got to discovering attention. Let's go back in time a bit, because to appreciate something new, we need to know the history of how we came to this innovation. So, let's go right to the start, where we are modeling long sequences. We have one sequence in English and, let's say, we want to translate it into the German language. So what's the problem in modeling long sequences? Let's look at this question: what is the problem with the architectures without the attention mechanism which came before LLMs? For reference, we'll start with a language translation model. So let's look at this figure here. Let's say I have a German input sentence which I want to translate into English. So this is the input sentence: I have the first word, the second word, then the third word, etcetera. And I want to translate this into English, okay. So let's say we do a word-by-word translation. If I translate the first word to English, it's "can". If I translate the second word, it's "you". If I translate the third word, it's "me". The fourth word, "help". So if you translate every German word one by one, the translation comes out to be "can you me help this sentence to translate". That's obviously not correct, right? So the main takeaway here is that word-by-word translation does not work. And you can also see this in Hindi. If the main text is in English, "Can you help me?", and you want to translate it into Hindi, the Hindi translation is "Kya tum meri madad karoge?". So "Kya" is associated with "can", that's fine. "You" is associated with "tum". But "meri" is the third word in the Hindi translation, right, while its counterpart "me" is actually the fourth word in the English sentence.
Similarly, "madad" is the fourth word in the Hindi translation, but "help" is actually the third word in the English. So the main point here is that word-by-word translation does not work in this case. And that was a major realization when people started modeling long sequences. This is a general problem when you deal with sequences: you cannot just do word-by-word translation; you need contextual understanding and grammar alignment. Whenever you are developing a model which, let's say, translates one sequence to another sequence, or tries to find the meaning of a sequence, or makes the next-word prediction from a sequence, you need to really understand the context. You need to understand how different words relate to each other and what the grammar of that particular language is, and only then will you be able to process sequences and model long sequences of textual information. That's understanding number one. Okay. With this understanding, what people realized is that we cannot just use a normal neural network, because a normal neural network does not have memory. We are going to use this word memory a lot. Just like humans have memory and store information about the past, in order to do a good job in sequence-to-sequence translation, the models need to have a memory. The models need to know what has come in the past. Why? Because let's say I have a sentence that Harry Potter went to station number 93x4, he did this, he did this, etcetera. Then, three to four sentences later, the station is mentioned again. I should not forget what came before, because the station number 93x4 is very important for me to know even when I reach the end of the paragraph. So if I'm making some prediction at the end of the paragraph and the word station comes over there, I need to go back to the start.
I need to have memory of what came at the start, that it was station number 93x4. And this happens a lot with textual data. If you want to have meaningful outcomes in text summarization, next-word prediction, or language translation, you definitely need to have an understanding of the meaning, and for that, you need the model to retain memory. So to address the issue that word-by-word translation does not work, people realized that a normal neural network will not work, and they augmented the neural network with two sub-modules. The first sub-module is an encoder and the second sub-module is a decoder. What the encoder does, in the example we saw, is receive the German text. It will read and process the German text and then pass it to the decoder, and the decoder will translate the German text into English. This is the simplest explanation of the encoder-decoder, and there is a nice animation here which actually shows how it works. Here you can see the input sequence comes in the German language and goes to the encoder. A context is generated by the encoder; it's called a context vector. The context vector essentially captures meaning. So it has memory, and instead of just a word-by-word translation, it captures what the sentence as a whole represents. The encoder processes the entire input sequence and sends the context over to the decoder. Let me play this again. The input sequence comes to the encoder. It generates a context vector, which basically encodes meaning. Then the encoder transfers the context vector to the decoder, and the decoder generates the output. In this case, the output is the translated English text. Okay. This is how the encoder and decoder blocks work. And the architecture which really employed the encoder-decoder blocks successfully is called the Recurrent Neural Network.
Before transformers came into the picture, the Recurrent Neural Network was the architecture which was extremely popular for language translation, and it really employed the encoder-decoder architecture. It was implemented in the 1980s. So let's look a bit more at how the RNN actually works, because if we understand how the RNN works, we'll understand the limitations of recurrent neural networks, and then we will really appreciate why the attention mechanism needed to be discovered. Here is how the encoder-decoder in the RNN actually works. You first receive an input text, which, let's say, is the German text. The input text is passed to the encoder. What the encoder will do is that at every step, it will take the input and maintain something called the hidden state. This hidden state was the biggest innovation in the recurrent neural network. This hidden state essentially captures the memory. So imagine the first input, which is the first German word, comes in; the encoder maintains a hidden state. Then you go to the next iteration: the second input word comes and the hidden state gets updated. As the hidden state gets updated, it accumulates more and more memory of what has come previously. The hidden state gets updated at each step, and then there is a final hidden state. The final hidden state is basically the encoder output: the context vector we saw over here. So when we looked at the context vector which is passed from the encoder to the decoder, let's see over here, yeah. Here you see a context vector is passed from the encoder to the decoder. This context vector is the final hidden state. This is basically the encoder telling the decoder: hey, I have looked at the input text; here's the meaning of this text, here's how I've encoded it, here's the context vector.
Take this final hidden state and try to decode it. And then the decoder uses this final hidden state to generate the translated sentence, one word at a time. Here's a schematic which actually explains this pretty well. I have an input text here: the first word in German, the second word, the third word, and the fourth word. What the encoder block will do is take each input sequentially and update its hidden state. So for the first input it has the first hidden state. Then we move to the next iteration and get the second hidden state, then the third hidden state, and then finally, when we have the last input, we have this final hidden state. The final hidden state essentially contains the accumulation of all previous hidden states. So it contains, or encapsulates, memory. This is how memory is incorporated, which was missing earlier with just a normal neural network. So this is the final hidden state, and this final hidden state essentially summarizes the entire input, and then this hidden state is passed to the decoder. And then the decoder produces the final output, which is the translated English sentence. I want to show you another animation of this so that you understand it much better. Here's how the RNN actually works, right? See, input vector number one is the first word in German, which needs to be translated. Here you will see in this animation how the hidden state gets updated. The first word of German comes, and the RNN maintains a hidden state zero. And here you see, hidden state zero and input one are used to produce output one, and then we also have hidden state one. Then as we move further, we have hidden state two, hidden state three, hidden state four, and a final hidden state when the last word needs to be processed. And let's look at the hidden states for the encoder.
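The hidden-state update described above can be sketched in a few lines. This is a minimal toy encoder, not the lecture's exact model: the weight matrices and inputs are random, and the update rule is the classic h_t = tanh(W_x x_t + W_h h_{t-1}), which folds each new word into the running memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 3-dim word vectors, 5-dim hidden state, 4 input words.
d_in, d_hidden, seq_len = 3, 5, 4
W_x = rng.normal(size=(d_hidden, d_in))      # input-to-hidden weights
W_h = rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden weights
inputs = rng.normal(size=(seq_len, d_in))    # one vector per input word

h = np.zeros(d_hidden)                       # hidden state zero
hidden_states = []
for x in inputs:
    # Each step mixes the current word with the previous hidden state,
    # so the hidden state accumulates memory of everything seen so far.
    h = np.tanh(W_x @ x + W_h @ h)
    hidden_states.append(h)

# The final hidden state is the context vector handed to the decoder:
# in a plain RNN, it is the ONLY thing the decoder ever sees.
context_vector = hidden_states[-1]
print(context_vector.shape)  # prints: (5,)
```

Notice that the loop produces one hidden state per input word, but only the last one survives to the decoder; that single bottleneck is exactly the limitation discussed next.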
Notice how the last hidden state is actually the context we pass along to the decoder. Actually, I can show you this again here. This is a French-to-English translation using the recurrent neural network. Look here. We have a French input coming in: "Je suis étudiant". The first word goes into the encoder. See, now we have a hidden state, so let me expand it and play from the start. Okay. So now the first word of French, which is "Je", goes into the encoder. It has gone into the encoder right now, and the first hidden state is generated, see, in the orange color. Great. Now hidden state number one and the second input are used. Look at this animation again. Hidden state number one and the second input, which is "suis", are used, and then we have hidden state number two. Then hidden state number two and the third input, which is "étudiant", will be used to produce the final hidden state. Great. This final hidden state, hidden state number three, essentially contains all the information in the given sequence, plus some memory or context regarding what came in the past. Now this final hidden state is passed to the decoder, and the decoder produces the output in English one word at a time. This is exactly how the recurrent neural network works. Okay. Now, you might think, awesome, right? This is already doing sequence-to-sequence translation, and we are translating from one language to another. So why do we need the attention mechanism? Memory is being encoded here and we are passing in the context, which means that we will be able to identify how different words of the sequence are related to each other. So why do we need attention? Well, there is a big problem with the recurrent neural network, and that problem is that the decoder has essentially no access to the previous hidden states.
So if you look at this video, you'll see that the decoder has access to only the final hidden state. So hidden state one, hidden state two, hidden state three. And then hidden state three is passed to the decoder. See, the decoder has no access to the previous hidden states.
[22:36]Now, why is this a big problem? The reason is that when we have to process long sequences, if the decoder relies on just one final hidden state, that puts a lot of pressure on that single hidden state: it needs to carry the entire information. And for long sequences, it usually fails, because it's very hard for one final hidden state to hold all the information. Let me explain this a bit more. So, as we saw, the encoder, let me change my color here, I think I'll change it to green, yeah. As we saw, the encoder processes the entire input text into one final hidden state, which is the memory cell, and then the decoder takes this hidden state to produce an output. Great.
[23:36]Now, here's the biggest issue with the RNN, and please pay very close attention to this point, because if you understand this, you will understand why attention mechanisms were needed. The biggest issue is that the recurrent neural network cannot directly access earlier hidden states from the encoder during the decoding phase.
[24:03]It relies only on the current hidden state, which is the final hidden state. And this leads to a loss of context, especially in complex sentences where dependencies might span long distances. Okay. So let me explain this further. What does "loss of context" mean? As we saw, the encoder compresses the entire input sequence into a single hidden state vector. I hope you have understood up to this point. Now, the problem happens when the input sentence is very long: it becomes very difficult for the recurrent neural network to capture all of that information in one single final hidden state. That is the main drawback of the RNN. For example, let's take a practical case. Let's take the example which we looked at at the start of the lecture: "The cat that was sitting on the mat, which was next to the dog, jumped." And let's say we want to translate this English sentence into French. The French translation will be "Le chat...", I cannot spell this out fully, but that would be the French translation for this English sequence. Now, as I mentioned before, this English sequence is pretty long. The RNN encoder really needs to capture the dependencies very well. The final hidden state needs to capture that the cat is the subject here and the cat is the one who has jumped. This context needs to be captured by the final hidden state, and that is very hard if you are putting all the pressure on one final hidden state to capture all this context, especially in long sequences. The key action, "jumped", depends on the subject "cat", but also on understanding the longer dependencies: that the cat was sitting on the mat, next to the dog.
[26:20]So, to understand the action "jumped", we of course need the subject "cat", but we also need to capture the longer dependencies: that the cat was sitting on the mat, which was next to the dog. The RNN decoder struggles to capture these longer dependencies, this more difficult context, because it has only one final hidden state to get all the information from. This is called loss of context. And loss of context was one of the biggest issues because of which the RNN was not as good as the GPT models which exist right now, which are based on the attention mechanism. Okay. So these are the issues with the RNN. The decoder cannot access the hidden states of the inputs which came earlier, so we cannot capture long-range dependencies. This is where the attention mechanism actually comes into the picture. We will capture long-range dependencies with attention mechanisms, and let's see how. RNNs work fine for translating short sentences, but researchers soon discovered that they don't work for long texts, as they don't have direct access to previous words in the input.
[27:40]When an RNN decoder receives only the final hidden state, it does not have direct access to all the prior words which came in the input. So let's say I'm decoding, and I'm looking at the word "jumped". "Cat" is a word which came much earlier in the sequence. When I'm looking at the word "jumped", I need to give a lot of attention to the word "cat". But the RNN decoder only gets the entire encoded version of this sentence. So how would it know that, when looking at "jumped", it should pay a lot of attention to the word "cat"? It does not even have access to the input vector for "cat". This is where the attention mechanism actually comes into the picture. Essentially, using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively.
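The idea of "accessing all input tokens selectively" can be sketched in a few lines. This is a toy, Bahdanau-style dot-product sketch with random vectors, not the lecture's exact model: instead of receiving one final hidden state, the decoder scores every encoder hidden state against its own current state and takes a weighted sum.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 5
encoder_states = rng.normal(size=(4, d))  # one hidden state per input word
decoder_state = rng.normal(size=d)        # decoder's state while generating a word

# Score every encoder hidden state against the decoder state: one score
# per input token, not just the final one.
scores = encoder_states @ decoder_state

# Softmax turns the scores into attention weights that sum to one.
exp = np.exp(scores - scores.max())
weights = exp / exp.sum()

# The context is now a weighted sum over ALL encoder hidden states, so
# a relevant early word (a high-scoring "cat") contributes strongly no
# matter how far back it appeared.
context = weights @ encoder_states
print(context.shape)  # prints: (5,)
```

Contrast this with the plain RNN, where the context was a single fixed vector: here the context is rebuilt at every decoding step, with the weights deciding which input positions matter right now.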
[28:43]So let's say you're looking at jumped, right? Let's say I'm looking at the word jumped. Cat is a word which has come way prior in the sequence. So when I'm looking at the word jumped, I need to give a lot of attention to the word cat. But an RNN gets the entire encoded version of this sentence. So how would the RNN know that jumped actually, if you're looking at jumped, you should pay a lot of attention to the word cat? It does not even have access to this input vector for cat. This is where attention mechanism actually comes into the picture. Essentially using an attention mechanism, the text generating decoder part of the network can access all input tokens selectively.
[29:32]So let's say you're looking at jumped, right? Let's say I'm looking at the word jumped. Cat is a word which has come way prior in the sequence. So when I'm looking at the word jumped, I need to give a lot of attention to the word cat. But an RNN gets the entire encoded version of this sentence. So how would the RNN know that jumped actually, if you're looking at jumped, you should pay a lot of attention to the word cat? It does not even have access to this input vector for cat. This is where attention mechanism actually comes into the picture. Essentially using an attention mechanism, the text generating decoder part of the network can access all input tokens selectively.
[30:21]So let's say you're looking at jumped, right? Let's say I'm looking at the word jumped. Cat is a word which has come way prior in the sequence. So when I'm looking at the word jumped, I need to give a lot of attention to the word cat. But an RNN gets the entire encoded version of this sentence. So how would the RNN know that jumped actually, if you're looking at jumped, you should pay a lot of attention to the word cat? It does not even have access to this input vector for cat. This is where attention mechanism actually comes into the picture. Essentially using an attention mechanism, the text generating decoder part of the network can access all input tokens selectively.
[31:10]So let's say you're looking at jumped, right? Let's say I'm looking at the word jumped. Cat is a word which has come way prior in the sequence. So when I'm looking at the word jumped, I need to give a lot of attention to the word cat. But an RNN gets the entire encoded version of this sentence. So how would the RNN know that jumped actually, if you're looking at jumped, you should pay a lot of attention to the word cat? It does not even have access to this input vector for cat. This is where attention mechanism actually comes into the picture. Essentially using an attention mechanism, the text generating decoder part of the network can access all input tokens selectively.
[31:59]So let's say you're looking at jumped, right? Let's say I'm looking at the word jumped. Cat is a word which has come way prior in the sequence. So when I'm looking at the word jumped, I need to give a lot of attention to the word cat. But an RNN gets the entire encoded version of this sentence. So how would the RNN know that jumped actually, if you're looking at jumped, you should pay a lot of attention to the word cat? It does not even have access to this input vector for cat. This is where attention mechanism actually comes into the picture. Essentially using an attention mechanism, the text generating decoder part of the network can access all input tokens selectively.
[32:48]So let's say you're looking at jumped, right? Let's say I'm looking at the word jumped. Cat is a word which has come way prior in the sequence. So when I'm looking at the word jumped, I need to give a lot of attention to the word cat. But an RNN gets the entire encoded version of this sentence. So how would the RNN know that jumped actually, if you're looking at jumped, you should pay a lot of attention to the word cat? It does not even have access to this input vector for cat. This is where attention mechanism actually comes into the picture. Essentially using an attention mechanism, the text generating decoder part of the network can access all input tokens selectively.
[33:37]So let's say you're looking at jumped, right? Let's say I'm looking at the word jumped. Cat is a word which has come way prior in the sequence. So when I'm looking at the word jumped, I need to give a lot of attention to the word cat. But an RNN gets the entire encoded version of this sentence. So how would the RNN know that jumped actually, if you're looking at jumped, you should pay a lot of attention to the word cat? It does not even have access to this input vector for cat. This is where attention mechanism actually comes into the picture. Essentially using an attention mechanism, the text generating decoder part of the network can access all input tokens selectively.



