[0:00] Hi everyone, this is GKCS. In today's video we'll cover some of the most commonly used terms in the AI space. If you're an engineer building applications, you'll find these terms useful when communicating with people inside or outside your team, and knowing them also makes it easier to learn the deeper subjects around AI. So by the end of this video, you'll have a list of terms whose definitions you understand quite well, and I'll link some references in the description so you can dig into them further. Let's start. The first term you should know is Large Language Model, also known as LLM. The definition: a neural network that is trained to predict the next token of an input sequence. For example, if I pass the query "All that glitters" to a large language model, it will come up with the response "is not gold", at which point the complete sentence "all that glitters is not gold" is returned to the user. What do we mean by training, and what do we mean by neural network? As we go through this video, you'll come to understand these terms one by one. Okay, the second term is tokenization. This has to do with processing the input of a large language model. If "all that glitters" is passed into a large language model, the first thing it does is break the text into discrete tokens. That's the process of tokenization. The first token might be "all", then a space character, then "that", after which you have "glitt" and finally "ers". You might think: shouldn't you just split on spaces and get the job done? But humans don't talk like that, and we are, after all, trying to process natural language. "ers" is a common suffix: shimmers, murmurs, flickers all end in "ers", which signals that the action is being performed by some object. Another example is "ing": eating, dancing, singing all carry the suffix "ing", and a large language model can look at the token "ing" and know that the preceding action is ongoing. Remember, the core problem for a large language model is to truly understand human language so that it can speak it really well. Tokenization is an essential part of that, and its end result is that the input text is broken into tokens.
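As a concrete illustration, here's a minimal sketch using OpenAI's open-source tiktoken library (the choice of library is mine, not from the video, and real token boundaries vary from one tokenizer to another):

```python
# pip install tiktoken
import tiktoken

# Load the byte-pair-encoding tokenizer used by several OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "all that glitters"
token_ids = enc.encode(text)

# Decode each id on its own to see where the token boundaries fall.
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)  # a list of integer ids
print(pieces)     # sub-word pieces; exact splits depend on the tokenizer
```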
Which brings us to our third term: vectors. [3:04] Tokens tell you what to focus on, the smallest units you can derive meaning from; what that meaning actually is gets represented by vectors. If the large language model can map words into a two-dimensional, or more generally an n-dimensional, space such that all the words which are close in meaning are placed close to each other, then the meaning of each word is turned into a coordinate in this n-dimensional space. That coordinate is called a vector. This mapping of a word into an n-dimensional space, where similar-meaning words are clustered together and opposite-meaning words sit far away, comes from the process of vectorization.
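To make "meaning as coordinates" concrete, here's a toy sketch with made-up 3-dimensional vectors (real embeddings are learned by the model and have hundreds or thousands of dimensions; the numbers below are invented purely for illustration):

```python
import numpy as np

# Toy 3-d "embeddings"; a real model learns these, with far more dimensions.
vectors = {
    "king":   np.array([0.90, 0.80, 0.10]),
    "queen":  np.array([0.88, 0.82, 0.15]),
    "banana": np.array([0.10, 0.20, 0.95]),
}

def cosine_similarity(a, b):
    # Close to 1.0 means same direction (similar meaning); near 0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))   # high
print(cosine_similarity(vectors["king"], vectors["banana"]))  # low
```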
[3:49] The end result is that large language models know the inherent meaning of the words in the vocabulary, and they know how to break any input text into small tokens. Words which are similar in meaning are placed close to each other, and once the model knows meaning, it can construct sentences effectively. So now you have large language models which can tokenize input text and convert it into vectors. But there was one major challenge whose solution changed the entire industry and made large language models popular: the attention mechanism. We just said that all the input tokens for a large language model are converted into vectors, and the vectors encapsulate the meaning of those words. But what about the word "apple"? When you say "a tasty apple", you mean the fruit, the edible apple. When you say "Apple's revenue", you probably mean the company. And if you say "the apple of my eye", you're probably talking about a young person you have affection for. So "apple" has different meanings, and the only way to resolve the meaning is not by looking at the word itself, because the spelling is exactly the same, but by looking at nearby words which add context. The moment I said "tasty", you knew some sort of food was coming. That's how humans derive meaning, and large language models can now derive meaning this way too. They look at nearby words in a sentence and generate those vectors. Nearby contextual vectors are picked up, and for ambiguous terms you start with an ambiguous vector but derive the exact meaning by combining the nearby contextual vectors with it. So take the vector of "Apple" and the vector of "revenue". When you combine these two vectors, it's not a direct addition but the attention operation, and you effectively push the vector of Apple in the direction of the company Apple, over where Google, Meta, and Microsoft sit. If you instead apply attention with the vector of "tasty", it pushes the vector of apple toward banana, chikoo, and guava. So: you can tokenize input text, you can derive the inherent meaning of those tokens, and for ambiguous tokens, for tokens which are difficult to pin down, you have a mechanism to add context by looking at nearby words. This was another breakthrough for large language models. The paper came out in 2017, but in 2022 this became really, really famous with the release of ChatGPT. The quality of responses of a large language model far exceeds anything we had seen earlier, because it can derive contextual meaning and construct sentences the way humans speak.
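Here's a minimal sketch of that attention operation in the scaled dot-product form from the 2017 "Attention Is All You Need" paper (heavily simplified: a single head, and the query/key/value projection matrices are random here rather than learned):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Three token vectors, say for "apple", "revenue", "grew" (toy values).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))          # 3 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

# Each output row is the input vector "pushed" toward its context.
contextual = attention(X @ Wq, X @ Wk, X @ Wv)
print(contextual.shape)  # (3, 8): one context-aware vector per token
```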
Okay, so now we know how LLMs can process input. But how do you train the LLM to predict the next token? Here's where there was a major breakthrough: the concept of self-supervised learning became very popular. Self-supervised learning means that instead of telling the model exactly what it needs to do, the structure of the input data itself tells the model what it should do. [8:04] For example, you're watching this video right now, and I'm going to blank out a part of it. Five, four, three, two... what do you think is hidden right now? What number comes to mind? Let's see if that's right. Yes, most of you guessed one, because we were counting down: 5, 4, 3, 2, 1. But with video you can do something else too. Let me blank out another part of the frame. Where do you think the other eye is looking? Let's check. Most of you got it right: both eyes are looking upwards. So a section of the input can be predicted even if you blank that section out, which means there is inherent structure in the input that your mind can fill in with the expected token or output. Now, the standard way to train such a model would be supervised learning, where a human being says: if the input text is "all that glitters", the model should predict "is not gold"; if the input is "Et tu", the output should be "Brutus". Self-supervised learning instead makes training data much cheaper to obtain. Here, if you have the text "Et tu, Brutus", the model is fed this text and makes three predictions: one, what comes after "Et"; two, what comes after "Et tu"; and three, what comes after "Et tu, Brutus". No humans are involved. You had some text that exists in the world, maybe scraped off the internet, and you're telling the model: look, I have three questions for you, tell me the right answers. The model looks at these three puzzles, all running in parallel, and makes predictions. After "Et", the model might say anything, but you train it that "tu" is the expected response; if it makes a mistake, you penalize the model, the loss increases, and the neural network weights are updated. In the second task you have "Et tu". If the model predicts "Brutus", you tell it this is great and the weights don't need to be updated. But if it says "Caesar", the model is penalized, and the internal weights are updated. In the third case, if the model predicts a stop token after "Et tu, Brutus", as in that's it, the sentence ends there, you'll mark it wrong. If it predicts a comma, it's right, and if it predicts "then", maybe that's also right.
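In code, generating these self-supervised prediction puzzles from raw text might look like this (a minimal sketch; real training pipelines work on integer token ids and batch millions of such windows):

```python
def make_training_pairs(tokens):
    """Turn one piece of text into (context -> next token) puzzles."""
    pairs = []
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        pairs.append((context, target))
    return pairs

# No human labeling needed: the text itself supplies the answers.
for context, target in make_training_pairs(["Et", "tu", ",", "Brutus"]):
    print(f"predict {target!r} given {context}")
```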
[11:17] Okay, so what you're doing is looking at text which already exists in the world and creating multiple challenges for yourself, without human intervention. That is what makes the model self-supervised. It might seem like a small thing, but this property of large language models makes them really, really scalable. In fact, most AI models are now moving to self-supervised learning. Even image models are trained by removing some patches of an image and predicting those patches; the benefit is that the model learns the underlying structure and inherent meaning of the data. In the case of text that means tokens, in the case of images it's a bunch of pixels, and in the case of video you might even learn how an object moves. Okay, so that explains what self-supervised learning is. Next is the Transformer. Most people confuse the transformer with the large language model, which is completely understandable actually, but they aren't the same thing. A large language model is something which predicts the next token given an input sequence. A transformer does the exact same thing, but it's a specific architecture, a specific method, by which you predict the next token. In a transformer, input tokens are run through an attention block, which is then forwarded to a feed-forward neural network, and out come a bunch of outputs; you can think of these as output vectors. These vectors are then passed into another layer of attention. The first layer of attention, like we said, disambiguates terms. The second layer might find more complex relationships: it might find sarcasm, it might find implications. For example, "a crane was hunting a crab". The first layer understands it's not the machine crane, it's the bird. The second layer might infer that the crab is fearful, or that the crane is hungry. Then there's another feed-forward neural network, and so on, until finally the model is confident enough to generate an output. So these blocks are stacked, sometimes 12 layers deep, sometimes more; I believe recent GPT architectures run to around a hundred layers. The main idea is that you extract all the meaning from your input tokens and manipulate it again and again to finally predict what the next word should be. Note that the attention block is order N squared in the number of input tokens. And you could replace the transformer inside a large language model with something else: tomorrow a new architecture could come along, say state space models, or a diffusion model that constructs essays or text, and the transformer would be swapped out. So the large language model is actually the product; you can think of it as the car, and the transformer is the engine. Many people say a car is just its engine, but no, there are other important things around it, and the internal algorithm can be different.
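Here's a minimal sketch of that stacked structure with toy dimensions (a deliberate simplification: no residual connections, layer normalization, or learned parameters, all of which a real transformer has):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 3, 8  # toy sizes

def attention(X, Wq, Wk, Wv):
    # Scaled dot-product attention over all token pairs: O(n^2) in tokens.
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ (X @ Wv)

def feed_forward(X, W1, W2):
    return np.maximum(X @ W1, 0) @ W2  # linear -> ReLU -> linear

X = rng.normal(size=(n_tokens, d))     # input token vectors
for layer in range(12):                # stacked attention + feed-forward blocks
    params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
    X = attention(X, *params[:3])      # mix context between tokens
    X = feed_forward(X, *params[3:])   # transform each token on its own
print(X.shape)  # (3, 8): final vectors used to predict the next token
```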
Term number seven is fine-tuning. We said that a large language model is trained to predict the next token of an input sequence. The question is: what kind of next token are we talking about? If you're building a medical large language model, something which helps doctors explain a patient's diagnosis, then you want it thinking in medical terms. If you have a model trained on financial operations, then the same query should make it think in financial terms. So the next token the model comes up with is not always going to be generic. You first train your base model in a self-supervised fashion. Then you take that model and run it through a series of questions and answers; this process is called fine-tuning. It goes something like this: "Who is the president of the USA?" "Donald Trump." The desired behavior is a direct answer, or an honest "I don't know." But a base model might instead continue with "I would like to know that too," which is a perfectly plausible continuation of the text, and that's where things go wrong: the model should not respond like that. A flat refusal is also very bad, because these models are trained to be helpful. So what's happening is that plausible responses which are not wrong, but are not desirable, get penalized in the fine-tuning process. You have these questions and answers, and fine-tuning forces the model to take a question and answer it in the expected way. When it comes to medical diagnosis, the internal weights get updated in such a way that the model learns to speak in medical jargon, in medical terms. So this step, where a base model is trained to answer in a specific way, is called fine-tuning. The same base model can be run through different sets of questions and answers to produce multiple fine-tuned models; the base Llama model, for example, can be fine-tuned by a company to answer its customers' specific queries.
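A fine-tuning dataset is often just a file of question-and-answer pairs. Here's a hedged sketch of what one might look like, using a JSONL chat format similar to what several fine-tuning APIs accept (field names vary by provider, and both Q&A examples below are invented for illustration):

```python
import json

# Each line is one training example: the question and the desired answer.
examples = [
    {"messages": [
        {"role": "user", "content": "Who is the president of the USA?"},
        {"role": "assistant", "content": "Donald Trump."},
    ]},
    {"messages": [
        {"role": "user", "content": "What does a high HbA1c indicate?"},
        {"role": "assistant",
         "content": "Poorly controlled blood sugar over the past 2-3 months."},
    ]},
]

with open("finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```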
The eighth term: few-shot prompting. The main idea is that before you send a plain vanilla query to a large language model and ask it for a response, [17:25] you augment the query. You add more information by saying: look, if the query is "where is my parcel", here are some examples I want you to go through first. This happens at inference time, at response time, in production, live. Your server sends the original query along with examples to the model so that it takes them into context and gives an appropriate response. The quality of the response goes up. This is called few-shot prompting; it's basically example prompting, examples in the prompt, that's it. That brings us to point number nine, which is very interesting and has completely exploded: Retrieval Augmented Generation. In fact, the AI space is moving so quickly that some people are already saying RAG is dead. The basic idea, again, is that you have a large language model and your server passes in the input. A customer connects, hits your API, and the server says: this is the customer query, let's forward it to the large language model. Along with that, let's give some examples; that's few-shot prompting. And along with that, since there are company policies the model should know about, let's give it those documents too. So in real time the server fetches the most relevant documents, maybe your policy document, maybe your terms and conditions for placing an order, and maybe many more things. You send these documents along with examples of how the model should respond. The examples give it the format of the response, the documents give it company-specific context, and then there is the direct user query. With all of this, the large language model tends to give very high-quality responses. Now the question is: where are you getting these documents from? How does the server know which documents relate to which query? There are many ways to do this. If you talk to Neo4j, a graph database company, they'll tell you to store things in a graph DB. If you talk to Neon, they'll tell you to store things in a vector DB. And some people will say just keep everything in memory, in a cache. How you fetch the documents doesn't matter so much; usually it's a vector DB, by the way, because it's easy to find relevant documents with a similarity search. Once you have the documents, you pass them to the large language model, which internally converts them into vectors and gives you a response. At a high level, you just want to add more and more context: you retrieve the context, augment the query, and then generate a response.
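Putting the last two terms together, the prompt your server assembles might look something like this (a sketch under assumptions of mine: `retrieve_relevant_docs` is a hypothetical helper standing in for whatever document store you use, and the prompt layout is just one reasonable choice):

```python
def build_rag_prompt(user_query, examples, retrieve_relevant_docs):
    """Assemble a prompt: few-shot examples + retrieved context + query."""
    docs = retrieve_relevant_docs(user_query, top_k=2)
    example_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    context_text = "\n---\n".join(docs)
    return (
        "Answer in the style of these examples:\n"
        f"{example_text}\n\n"
        "Use only this company context:\n"
        f"{context_text}\n\n"
        f"Customer query: {user_query}"
    )

# Usage with canned data (a real system would query a vector DB here):
prompt = build_rag_prompt(
    "Where is my parcel?",
    examples=[("Where is my order?", "It ships within 2 days of purchase.")],
    retrieve_relevant_docs=lambda q, top_k: ["Parcels ship from Mumbai."],
)
print(prompt)
```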
The tenth term: Vector Database. We just said a vector database is used to find relevant documents for an incoming query. Let's see how that happens. You have the request: "I am upset with your payment system. I expect a refund." There are a lot of terms in this query. A human being can read it and easily understand what the user is feeling: they're upset, they've clearly said so, and they're looking for a refund; if you give them a refund, maybe the upset feeling goes away. So what do you do? Which documents do you search for? You could search for all documents containing the word "upset", but maybe that word appears nowhere in your company policy. Maybe no document mentions an upset user, but you do have a document about users giving low ratings, or users dropping off. How do you decide that "upset" as a word is close to "low rating" or "drop off"? We spoke about vectors. Vectors can encapsulate semantic meaning, which means documents containing similar words will be close in distance. Remember, vectors are basically coordinates, so the distance between "upset" and documents mentioning low ratings will be small. You fetch the documents which mention low ratings or drop-offs and use them to add context for your large language model. When a query comes in from the user, you find which documents are closest to the query and add those to the large language model's context: the documents are sent along with the original user query and maybe a system prompt, to generate the response. Where do you store these documents? In a vector database, which helps you perform these similarity searches efficiently. One of the indexing algorithms behind this is Hierarchical Navigable Small World (HNSW); we have covered it in detail in the Interview Ready course. At the end of the day, the vector database is like a black box to you: you store documents, and you can quickly retrieve the relevant ones when you need them. Great, so you can store internal company documents and information in a vector database to get context for a large language model.
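Here's a toy sketch of that similarity search (brute force over fake embeddings; a real vector database would use an index like HNSW, and a learned embedding model instead of the made-up `embed` below):

```python
import numpy as np

def embed(text):
    # Stand-in for a real embedding model: a deterministic fake vector per text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=16)
    return v / np.linalg.norm(v)

docs = [
    "Refund policy for users who give a low rating",
    "Steps to track a parcel",
    "How to update your delivery address",
]
doc_vectors = np.stack([embed(d) for d in docs])

query = "I am upset with your payment system. I expect a refund."
scores = doc_vectors @ embed(query)  # cosine similarity (unit vectors)
best = docs[int(np.argmax(scores))]  # closest document wins; with real
print(best)                          # embeddings, the refund doc scores highest
```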
But what if the context exists outside your system? That challenge is addressed by the Model Context Protocol. [23:25] As the name suggests, it's a protocol, a standard way to communicate, for transferring context into a model. I've made a detailed video on this which you can check out, but the basic idea is that between the user and the large language model sits an MCP client, a Model Context Protocol client, which forwards the initial user query to the LLM. The LLM then makes a decision: it says there may be external tools or databases that I want to connect to. The client learns of this and connects to external MCP servers. In one case, that might be Indigo.
[24:20] In another case, it will be Air India, whose MCP server can give you details about Air India's flights. You can think of one server as a wrapper for Air India's database and the other as a wrapper for Indigo's database. As a response, you get flight details from each of these airlines. Once you have the details, you forward them to the LLM, saying: hey, along with the user query, and along with whatever system prompt or relevant context I could get from my vector database, I'm also adding flight details, real-time information from external servers, which you can now consume to come to a decision. The large language model at this point might say: okay, book Indigo flight 1020, which results in another API call to book on Indigo's MCP server. The final response is given to the MCP client, and the client forwards it back to the user, resulting in customer satisfaction. Notice that the user is no longer just able to get data; they don't have to execute things themselves after being handed a recipe. The recipe can be completely executed through the MCP client. This makes LLMs a lot more powerful, and MCP has picked up a lot of popularity now.
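Here's an illustrative pseudo-flow of that loop. To be clear, this is not the real MCP SDK: the actual protocol exchanges JSON-RPC messages between client and servers, and every object and method below is hypothetical, just to show the shape of the interaction.

```python
def handle_user_query(query, llm, mcp_servers):
    """mcp_servers: dict mapping server name -> a client connection (hypothetical)."""
    # 1. The MCP client forwards the query plus a catalog of available tools.
    catalog = [s.describe() for s in mcp_servers.values()]
    decision = llm.decide(query, tools=catalog)

    context = []
    if decision.wants_tool:
        # 2. The client calls the chosen external MCP server (say, an airline's
        #    wrapper) and collects real-time data such as flight details.
        context.append(mcp_servers[decision.server].call(decision.args))

    # 3. The LLM answers with that context; the client returns it to the user.
    return llm.respond(query, extra_context=context)
```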
Okay, so all of this put together is called context engineering. If you're an AI engineer, you've probably heard this term, and it's basically an umbrella over many of the things we've already discussed: few-shot prompting, which is giving examples; Retrieval Augmented Generation, which is fetching relevant documents from a vector database and using them to add context to a query; and Model Context Protocol, which is hitting external servers and performing actions as needed. On top of these, there are two newer challenges we face as AI engineers. One is user preferences, and the second is prompt summarization; you can call it context summarization. [26:58] For example, you might use a sliding window where the last 100 chat messages are sent directly to the large language model, and all the previous chats are summarized into five sentences. This caps the amount of chat history you send to the large language model.
[27:24] You could use other techniques too: some people focus just on keywords; some send only the last chat message plus a summary of the entire previous history. The idea is to achieve context summarization one way or another. When you fetch a document, you can likewise summarize it first and then send it. This summarization can be done with a cheap small language model or a distilled model, and once you have generated the context, you send it to the expensive large language model. The main difference between prompt engineering and context engineering is that prompt engineering is for one single prompt; it's stateless. Whenever you ask the large language model to behave in a particular way, the system prompt stays the same. Context engineering, by contrast, evolves with the user's declared preferences and the previous chat history. It's similar in spirit, but more long-term.
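A minimal sketch of the sliding-window idea described above (here `summarize` is a hypothetical call to a cheap smaller model, and the window size is illustrative):

```python
def build_context(chat_history, summarize, window=100):
    """Send recent chats verbatim; compress everything older into a summary."""
    recent = chat_history[-window:]   # last N messages, kept as-is
    older = chat_history[:-window]    # everything before the window

    parts = []
    if older:
        # A cheap small/distilled model condenses old chats into a few sentences.
        parts.append("Summary of earlier conversation: " + summarize(older))
    parts.extend(recent)
    return "\n".join(parts)

# Usage with a stub summarizer standing in for the real model call:
history = [f"message {i}" for i in range(250)]
print(build_context(history, summarize=lambda msgs: f"{len(msgs)} older messages.")[:200])
```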
[29:20] Which brings us to the most hyped term here: reinforcement learning. It's a way to train models to behave in particular ways. For example, if you give a user query to the model, the model can generate two responses, response one and response two. You must have seen this in ChatGPT: choose the one which is better. The one you choose gets a plus one, the other gets a minus one. What effectively happened is this: you took a user query, and that whole thing can be mapped to a vector in an n-dimensional space, right? So you go to that coordinate, and you tell the model: look, after reaching here, you generated further tokens, further vectors. That's your path; you went from here to here to here, and this was the final point of the response, and it got a score of plus one. So that point gets plus one, and so does each point along the path: plus one, plus one, plus one. There's also discounting you can apply, but for now let's keep things simple. This is a good path; you always want to follow this path. Response two was bad. There, you followed this point to this point to this point, and then you deviated.
[30:48] Say the next token generated after the first three tokens was "is not gold", and then it added a comma and went "but it may be". So: token one, two, three, four. This response was bad; it got a score of minus one, which means that region gets minus one, and so does each point along it: minus one, minus one, minus one. Where a minus one and a plus one overlap, they cancel to zero. So what you end up with is a space with negative scores, positive scores, and neutral scores. Do this enough, and given an input query, a starting point, you'll have a region of negatives where you don't want to go and a region of positives where you definitely do, and the more positive it is, the more you want to go there. Maybe you go here, and from here there's another very positive region over there. This is like hill climbing: as a large language model, you're basically trying to optimize the path you take, with the expectation that the final result makes the end user happy. If the end-user experience is good, the model is trained to make users happy. That's reinforcement learning with human feedback: the human feedback tells you whether it's a plus one or a minus one, and the feedback helps you reinforce good outputs. This is an extremely powerful technique; in fact, you see it in nature. If you know about Pavlov's dog: Pavlov would ring a bell and give food to the dog when it came over. Eventually he realized that if he just rang the bell without giving food, the dog would still come and start salivating, because it was expecting food. Its behavior had been reinforced. Fortunately, this is not the only capability human beings have; you cannot model human intelligence using reinforcement learning alone. Let me take an example. Say you have a coin which keeps giving you heads: heads, heads, heads, heads, heads. If you know it's a fair coin, if you have a mental understanding of how the coin works, what do you think comes next, heads or tails, and with what probability? As a human being, you should look at this and say: if it's a fair, unbiased coin, the next flip can be heads or tails; you can't guarantee heads. But reinforcement learning just observes the real world and decides based on outcomes: when it predicts heads it gets reinforced, great job; when it predicts tails it gets punished, bad job. The reality is that it's a fair coin, so there's a 50-50 chance either way. Show a human being the coin, tell them it's fair, and keep flipping heads; they'll still say 50-50, because they have an internal representation of how the coin works, a mental model of its physics. Reinforcement learning cannot build such mental models; it can only tell you, based on outcomes, what is more likely and which path is more rewarding. We are not crocodiles, we are humans; we have a deeper understanding of how things work. Having said that, reinforcement learning is a powerful technique, and it does make models quite a bit smarter.
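Before moving on, here's a toy sketch of the scoring idea just described (a deliberate simplification: real RLHF trains a separate reward model and updates the LLM's weights with an algorithm like PPO, rather than keeping a score table):

```python
from collections import defaultdict

# scores[path] accumulates +1/-1 feedback for every step along a response.
scores = defaultdict(int)

def record_feedback(response_tokens, reward):
    # Propagate the end-of-response reward back over the whole path.
    # (Real systems discount earlier steps; we skip that, as in the video.)
    path = []
    for token in response_tokens:
        path.append(token)
        scores[tuple(path)] += reward

record_feedback(["is", "not", "gold"], +1)              # preferred response
record_feedback(["is", "not", "gold", ",", "but"], -1)  # rejected response

# Shared prefixes cancel toward zero; the divergent suffix stays negative.
for path, score in scores.items():
    print(score, path)
```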
Next: Chain of Thought. A pretty simple concept, but very powerful. When training the model, the training examples clearly spell out the step-by-step thought process. The expectation is that as the model learns to break a problem down step by step, it can look at new problems with different parameters and still reason through them, because it has been trained to reason step by step. This is called Chain of Thought: the model goes through a series of deductions or inferences and arrives at the final response, and the quality of that response is usually much higher than a direct answer. You can see this is similar to few-shot prompting, in that the response quality goes up and there are worked examples to learn from, but the key difference here is the step-by-step breakdown, and that new steps can be added by the model as it sees fit. Because it is trained on so much data, it can reason through and add more steps as the problem gets more difficult. In fact, this has been observed with DeepSeek: make the problem harder and it reasons for more steps; make it easier and it uses fewer. A model like this is called a reasoning model. It doesn't necessarily have to use chain of thought; there are also tree-of-thought and graph-of-thought approaches you can look into, and models can use tools to reason better. But a model that can take a problem and figure out how to solve it step by step is a reasoning model, also known as a Large Reasoning Model (LRM). Examples are DeepSeek's reasoning models and OpenAI's o1, o3, and other o-series models. But there are newer models with new capabilities now: multi-modal models. The basic idea is that most large language models we know of operate on text. But what about models which can accept and generate images, or accept and generate video? They can analyze images, tell you the number of apples in a picture, say, or modify an image to create a new one, and similarly for video. These have tremendous applications. Just as large language models changed the marketing space through textual content, and social media is now rife with LLM-generated text, images are going to get better and better, and video could be a really big deal: if celebrities can create videos and ads through these models, the expected cost of producing video goes down. This is already happening to some extent, though the quality of the models is not yet very good. Multi-modal in general means handling more than one mode of input data, and it turns out such models often perform better than models trained on text alone, because they build a deeper understanding of what objects mean. If you train a model on the words "cat" and "feline" and so on, and also show it images of cats, the output quality is usually better; the training is better. Finally, let's get to three major topics, which is where the AI space is heading. People are looking for more company-specific smaller models and foundation models. The reason is that companies want more control over what they generate, and they want to keep their data close to themselves, not exposed to any third-party company.
So one of the things happening is a move toward smaller models. Small language models, as the name suggests, have fewer parameters than large language models; a small language model may have, say, 3 million to 300 million parameters, meaning the neural network internally has fewer connections, fewer weights. Contrast that with large language models at 3 to 300 billion parameters: an LLM is a very large neural network with a lot of weights, while an SLM is smaller. SLMs are useful because they are trained on less data, which can be company-specific or task-specific.
[39:37] For example, a bot trained just on customer queries, how to manage them, how to make sales, is likely to perform decently well; it's going to be an expert at sales. But it probably can't give you a detailed weather analysis. For most companies, that doesn't matter. NASA, on the other hand, is probably not selling anything (hopefully, maybe they are, who knows?) and would be more interested in building a foundation model which can predict the weather, without being bothered about the sales part. So in this way, smaller language models are being trained by companies on their specific, proprietary data to produce reasonably good responses for specific use cases. And the process of building small language models is usually distillation. The basic idea: you have a large language model, the teacher, and you pass in some input. You look at the teacher's output, and in parallel you send the same input to a small language model, the student, with far fewer parameters, which also tries to predict the output. The teacher produces an output, and the student tries to mimic the teacher. If the two outputs match, the small language model is doing well and no weights need to change; if not, the student's internal weights are updated. But the student has a limited number of weights, those 3 to 300 million, so what you're really doing is condensing the information in the complex neural network into the most compact representation you can, such that performance stays okay while costs are significantly reduced. At run time, in production, at inference time, the small model responds much faster than the large one, and it's also easier to host. Distilled models take us to the last term you really should know as an AI engineer, and that is Quantization. The idea here is that a neural network's weights are each basically a number, say a 32-bit number. What if you could take these weights and condense that information into eight bits?
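A minimal sketch of that idea, compressing float32 weights to int8 with a simple linear scale quantization (just one of several schemes; real libraries also handle zero points, per-channel scales, and so on):

```python
import numpy as np

weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)

# Linear quantization: map the float range onto the 255 signed levels of int8.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequantized = q.astype(np.float32) * scale       # what inference actually uses
print(weights.nbytes, "->", q.nbytes, "bytes")   # 4000 -> 1000: 75% smaller
print("max error:", np.abs(weights - dequantized).max())
```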
[42:20] Then roughly 75% of your memory is expected to be saved. It doesn't map over one-to-one, because quantization is often applied just to the feed-forward weights and you still have the attention mechanism. The training cost also stays the same, because you first produce a really good model with zero quantization, and only once the model is completely trained do you apply quantization. So the training cost doesn't reduce; this is mainly about reducing inference cost, the cost of running the model in production. So those are the 20 most important terms I wanted to discuss in the AI engineering space. I think knowing these terms will help you communicate effectively with any other AI engineer or anyone on your team. I couldn't go into full detail here because topics like the attention mechanism or KV caching can't be covered in a 20-30 minute video, but these are the terms you should know, along with most of the things I've mentioned in the AI engineering course at Interview Ready. If you know them, you truly understand how these models work, and all the hype and nonsense going on in this space becomes recognizable as hype and nonsense; you're able to spot it much better. Thank you for watching, I hope you enjoyed the video. I'll see you next time, bye bye.