[0:00]Hey everyone and welcome back. This is my second shot at recording for this series. Apologies for the technical difficulties last time, my iPad recording was getting turned off more frequently than my ex-girlfriend whenever I tried to make sexual advances. But the great thing about this series is that I can really crank these out. I'm not going to finish that joke off. Uh because I have pretty much already made them all before. So as you can see, I'm making my second one of the day, which is great. So, um, you know, if you recall from last time, we started to get into, uh, the contextualization of the database problem. How we want to actually organize our data, that we can read fast, write fast, keep it persistent. And so today, what we're going to do is take a look at the actual internals of a database and how we want to put our data on the disk in order to maximize performance. All right, welcome back everybody. I have now gotten the iPad and I'm ready to do some work. You may see this is a different day. I'm just coming home from the office right now and I have to take a shit very badly, so let's get this done with relatively quickly. So we established basically three facts about the data, last video that we're first going to talk about. Number one is the following, that we want the data to be stored persistently. Number two, even more importantly, is that we want to be doing this on a hard drive. That kind of follows from persistence. We know that RAM does not store data between computer shutting off and on, so it's more durable on a hard drive. The third thing is of course that the hard drive is in a database. Or at least we're running database software on some computer and the actual data itself is being stored at that hard drive. So before we go ahead and move further, let's go ahead and look at our hard drive in more depth. So I'm going to draw another picture of it. This time a little bit bigger, so hopefully that helps. We've got basically some rectangle with a metallic disc and then some moving arm that goes to the proper place on disc in order to read or write data. So generally the point of this is that because this little arm has to actually move around the metallic disc, the closer that you put data together on disk, the faster it is to access. So for example, you know, if I have one piece of data here and the other here, that's kind of bad. Instead, it would be better if, you know, we had our first piece of data here again and the second one right next to it. We're more likely to get better read speeds.
[3:22]Okay. So, now let's actually go ahead and start thinking about, you know, the data is going to just look like an array on disc, right? We have all of our little bytes that we can write down and slots. And so again, you know, the closer the data is to one another when we want to access it together, the better. So that's how I'm going to preface this entire lesson, but we'll see this come into play more and more. So let me go ahead and scroll down a little bit.
[4:00]So now let's imagine a website, right? Where basically all it does is it stores a bunch of names of people and their shoe sizes. And every time you load up the website, it'll show you all those people in their shoe sizes. So I'm now going to draw up an example database table for this website. So you can see we're going to have a few rows of people and their respective shoe size. So the first person is going to be me, Jordan, and you know, my shoe size is 18 because, you know, big shoes means quite a little bit about what else is going on in your body and we know that, uh, I'm packing. Number two is going to be Donald, for Trump. He has small hands so he probably has small shoes as well. We're going to go ahead and throw him a seven. And then we've got Shaq, for Shaquille O'Neal. That dude is massive, probably has like a size 24. I can concede to Shaq. So what is my point about this table? Well, the first thing is, let's say I want to find rows where name is Jordan. This is something you can do in most databases.
[5:30]Right? In order for me to go ahead and do that, I have to check each row individually. I have to go here, then row number two, then row number three. And what that means, when you have to do proportional work to the number of rows in the actual table itself is that this is going to be an O of N time complexity. Right? If there are N rows, we have to do N amounts of work basically, or work proportional to the N rows in order to actually find the row that we care about. So, in addition to that, let's say I wanted to actually edit a size. Turns out Shaq is bigger than we thought. His shoe size is actually going to be 26. So again, if we're going to write, first what that means is that we have to actually find the proper row, right? We're going all the way through these rows again, and then we're going and changing our existing row from a 24 to a 26. So that means O of N time complexity for both reads and writes. Okay. So, let me quickly erase this 26 and throw back our 24. Because what I'm now going to show is a slightly better design of a database table that would basically allow us to improve at least our write complexity. So, let's say this time instead of editing Shaq, I want to edit Donald. So we knew normally that in order to make that edit, we would actually have to find the row, Donald in the table and edit it. But now, what I'm proposing is instead of actually doing any fiscal edits, what we do instead, sorry, I don't know how to use this iPad yet, is we would go ahead and add a new row to the table, where I would now just say Donald with the new shoe size. Let's say his foot is actually even smaller than we realized. Now, it's a six. And then every time we search for a row now, we actually just go ahead and search from the bottom of the table up. So in the case of Donald, I would say, 'Oh, it's got to be this row here, not this one, because this Donald is actually lower to the bottom. And so now, because we know that we're always going to be writing in the same place every time, which is the bottom of the table, we can basically, uh, keep a pointer to that address in disc, and that makes right times O of one. Uh, however, this does come on the trade-off of a worse read time. The reason for that is that even though reads are still O of N, it means that, uh, there are going to actually be more rows in the table overall, so that N is going to be bigger. Obviously, this isn't going to be acceptable, as O of N reads are basically slower than Joe Biden trying to read from a teleprompter, and he is a boomer, so it's not ideal. But the truth of the matter is, actually, in a lot of these problems, or a lot of these big websites, we care a lot more about read speeds than write speeds. And at least as far as I've just showed you, those read speeds were pretty unacceptable. O of N is not good enough, right? We're going to cross that out. I'm just go ahead and say it right there. The reason O of N isn't good enough is because when we have hundreds of thousands, or even millions, or possibly more rows in our database, we don't want to have to literally loop through every single one in order to find the proper row that we want. We need to make things faster than that. Think about Facebook, for example. This is the example I keep using. If I have to go through every single post in the database, in order to go ahead and load our newsfeed, that is going to be impractical and it simply doesn't make sense. We need a better way of actually organizing our data. And this is where the need for indexes come in. So, before I actually explain what an index is, let's think about some things that an index, or just a better way of reading from databases in general, could help us do. Well, for one, we care about finding rows with specific key values, right? So in the last example, uh, basically that was finding a row by a name. Sorry, I can't write, terrible handwriting. In the last example, it was finding a specific row, uh, with a name that matched the parameter that I provided. Um, but in another example, you know, maybe it could be names, maybe we care about finding a row with a specific shoe size, that could help as well. And we want to be able to do this faster than just by linearly scanning through that whole table. Another possibility is performing something known as a range query. So we gave some, uh, names before. An example of a range query would be, show rows with names between A and B, right? So, if there are, you know, Abraham Lincoln in there, uh, you know, Alex Adams, anyone like that, you want to go ahead and return all those rows. And this is a very common query. If you think about it, in something like a social media site, maybe we want all posts from, you know, one hour ago until now. Right? So it's basically any sort of range query where we have these kind of two delimiting parameters and we ultimately want to get every single thing in between. So these are very common cases where we care about optimizing them. Even if it means potentially taking a penalty in our write speeds, right? Because at the end of the day, if we're doing a lot more reading than writing, something in the database that can help us improve our read speeds, even if it makes writing more expensive, is still going to be a big win in the average case for the user experience. And so ultimately, what we want is going to be called a database index. A database index, like I've started to hint at, basically does the following. Basically does the following. The read speeds go up and the write speeds go down. And this is for a specific key. Right? So we can have a database index on one field, for example, like, uh, names in that table, but, uh, you know, when you mention an index, typically it is on one field at a time. You can create multiple indexes, uh, where there is an index on both the name and the shoe size. You could even potentially have an index on both the name and the shoe size. But we'll kind of explain what all of that is in the next video, where I am going to show you one example of how to implement an index and kind of the pitfalls and the benefits of that specific implementation, which is going to be a hash index. All right, everybody, thanks again for all the support on the first video of this series. Again, I'm sorry about the flickering, but uh hopefully we can get that fixed up this time around. I am now going to go ahead and take a fat crap, and I will see you in the next video.



