[0:02]Hello and welcome to the K-Nearest Neighbors Algorithm Tutorial. My name is Richard Kirchner and I'm with the Simplilearn team. Today we're going to cover K-Nearest Neighbors, often referred to as KNN. And KNN is really a fundamental place to start in machine learning. It's the basis of a lot of other things, and the logic behind it is easy to understand and is incorporated in other forms of machine learning. So today, what's in it for you? Why do we need KNN? What is KNN? How do we choose the factor K? When do we use KNN? How does the KNN algorithm work? And then we'll dive in to my favorite part, the use case: predict whether a person will have diabetes or not. That is a very common and popular data set for testing out models and learning how to use the different models in machine learning. By now, we all know machine learning models make predictions by learning from the past data available. So we have our input values, our machine learning model builds on those inputs of what we already know, and then we use that to create a predicted output. Is that a dog? A little kid is looking over there and watching the black cat cross their path. No dear, you can differentiate between a cat and a dog based on their characteristics. Cats have sharp claws, which they use to climb, shorter ears, they meow and purr, and they don't love to play around. Dogs have dull claws, longer ears, they bark, and they love to run around. You usually don't see a cat running around people, although I do have a cat that does that, where dogs do. And we can look at these and we can evaluate the sharpness of the claws, how sharp their claws are, and we can evaluate the length of the ears. And we can usually sort out cats from dogs based on even those two characteristics. Now tell me if it is a cat or a dog? Not a hard question, usually little kids know cats and dogs by now, unless they live in a place where there are not many cats or dogs.
So if we look at the sharpness of the claws, the length of the ears, and we can see that the cat has smaller ears and sharper claws than the other animals. Its features are more like cats, it must be a cat! Sharp claws, length of ears, and it goes in the cat group. Because KNN is based on feature similarity, we can do classification using KNN Classifier! So we have our input value, the picture of the black cat, it goes into our trained model, and it predicts that this is a cat coming out. So what is KNN? What is the KNN algorithm? K-Nearest Neighbors is what that stands for, is one of the simplest supervised machine learning algorithms mostly used for classification. So we want to know, is this a dog or is not a dog? Is it a cat or not a cat? It classifies a data point based on how its neighbors are classified. KNN stores all available cases and classifies new cases based on a similarity measure. And here we got from cats and dogs right into wine, another favorite of mine. KNN stores all available cases and classifies new cases based on a similarity measure. And here you see we have a measurement of sulfur dioxide versus the chloride level, and then the different wines they've tested and where they fall on that graph based on how much sulfur dioxide and how much chloride. K in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process. And so if we had a new glass of wine there, red or white, we want to know what the neighbors are. In this case, we're going to put K equals five. We'll talk about K in just a minute. A data point is classified by the majority of votes from its five nearest neighbors. Here, the unknown point would be classified as red, since four out of five neighbors are red. So how do we choose K? How do we know K equals five? I mean, that's what was the value we put in there, so we're going to talk about it. How do we choose the factor K? KNN algorithm is based on feature similarity. 
Choosing the right value of K is a process called parameter tuning and is important for better accuracy. So at K equals three, we can classify and we have a question mark in the middle, as either a as a square or not. Is it a square or is it in this case a triangle? And so if we set K equals to three, we're going to look at the three nearest neighbors, we're going to say this is a square. And if we put K equals to seven, we classify as a triangle depending on what the other data is around it. And you can see as the K changes, depending on where that point is, that drastically changes your answer. And uh, we jump here where we go, how do we choose the factor of K? You'll find this in all machine learning, choosing these factors, that's the face you get. You're like, oh my gosh, you say choose the right K. Did I set it right my values in whatever machine learning tool you're looking at so that you don't have a huge bias in one direction or the other. And in terms of KNN, the number of K, if you choose it too low, the bias is based on it's just too noisy, it's it's right next to a couple things and it's going to pick those things and you might get a skewed answer. And if your K is too big, then it's going to take forever to process. So you're going to run into processing issues and resource issues. So what we do, the most common use, and there's other options for choosing K is to use the square root of N. So N is the total number of values you have and you take the square root of it. In most cases, you also, if it's an even number, so if you're using like in this case squares and triangles, if it's even, you want to make your K value odd. That helps it select better, so in other words, you're not going to have a balance between two different factors that are equal. So usually take the square root of N and if it's even, you add one to it or subtract one from it, and that's where you get the K value from. That is the most common use and it's pretty solid. 
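The square-root rule of thumb described above can be sketched in a few lines of Python. This is only an illustration of the heuristic as stated in the talk; the function name `choose_k` is an assumption made for this example, and other ways of tuning K exist.

```python
import math

def choose_k(n_samples: int) -> int:
    """Rule of thumb: k is roughly sqrt(n), nudged down to an odd number."""
    k = max(1, int(math.sqrt(n_samples)))
    if k % 2 == 0:
        k -= 1  # an odd k avoids tied votes when there are two classes
    return k

print(choose_k(100))  # sqrt(100) = 10 is even, so k becomes 9
```

Subtracting one rather than adding one is an arbitrary choice here; the talk allows either.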
It works very well. When do we use KNN? We can use KNN when data is labeled. So you need a label on it. We know we have a group of pictures with dogs, dogs, cats, cats. Data is noise-free, and so you can see here, when we have a class and we have entries like underweight, 140, 23, Hello Kitty, normal, that's pretty confusing. We have a high variety of data coming in, so it's very noisy, and that would cause an issue. Data set is small. So we're working with smaller data sets, where, I mean, you might get into a gig of data if it's really clean and doesn't have a lot of noise. That's because KNN is a lazy learner, i.e., it doesn't learn a discriminative function from the training set. So it's very lazy, and if you have very complicated data and you have a large amount of it, you're not going to use KNN. But it's a really great place to start. Even with large data, you can sort out a small sample and get an idea of what that looks like using KNN, and for smaller data sets KNN works really well. How does the KNN algorithm work? Consider a data set having two variables: height (cm) and weight (kg), and each point is classified as Normal or Underweight. So we can see right here we have two variables, and two classes: either they're normal or they're underweight. On the basis of the given data, we have to classify the below set as Normal or Underweight using KNN. So if we have new data coming in that says 57 kilograms and 170 centimeters, is that going to be normal or underweight? To find the nearest neighbors, we'll calculate Euclidean distance. According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by: dist(d) = sqrt((x - a)^2 + (y - b)^2). And you can remember that from the sides of a right triangle: we're computing the third side, the hypotenuse, since we know the X side and the Y side. Let's calculate it to understand clearly.
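As a quick check of the formula above, here is a minimal Python sketch. The function name `euclidean_distance` and the sample points are assumptions chosen for illustration (a 3-4-5 right triangle), not values from the tutorial.

```python
import math

def euclidean_distance(p, q):
    """Distance between points (x, y) and (a, b) in the plane."""
    (x, y), (a, b) = p, q
    return math.sqrt((x - a) ** 2 + (y - b) ** 2)

# Legs of length 3 and 4 give a hypotenuse of 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```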
So we have our unknown point, and we placed it there in red, and we have our other points where the data is scattered around. The distance d1 is sqrt((170 - 167)^2 + (57 - 51)^2), which is about 6.7. And distance two is about 13. And distance three is about 13.4. Similarly, we will calculate the Euclidean distance of the unknown data point from all the points in the dataset. And because we're dealing with a small amount of data, that's not that hard to do, and it's actually pretty quick for a computer, and it's not really complicated math. So you can just see how close the data is based on the Euclidean distance. Hence, we have calculated the Euclidean distance of the unknown data point from all the points as shown. Where (x1, y1) = (57, 170) is the point whose class we have to determine. So now we're looking at this and saying, well, here's the Euclidean distance. Who's going to be its closest neighbors? Now, let's find the nearest neighbors at K equals three. And we can see the three closest neighbors put it at normal. And that's pretty self-evident. When you look at this graph, it's pretty easy to say, okay, we're just voting: normal, normal, normal, three votes for normal. This is going to be a normal weight. So, the majority of neighbors are pointing towards 'Normal'. Hence, as per the KNN algorithm, the class of (57, 170) should be 'Normal'. So a recap of KNN. A positive integer K is specified along with a new sample. We select the K entries in our database which are closest to the new sample. We find the most common classification of these entries. This is the classification we give to the new sample. So, as you can see, it's pretty straightforward. We're just looking for the closest things that match what we've got. So let's take a look and see what that looks like in a use case in Python. So let's dive in to the predict diabetes use case. So use case, predict diabetes. The objective: Predict whether a person will be diagnosed with diabetes or not.
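The recap above (pick K, find the K closest entries, take the most common class) can be sketched from scratch in Python. The labelled samples below are hypothetical values made up for this illustration, since the talk's full table is not in the transcript, and `knn_classify` is an assumed name.

```python
import math
from collections import Counter

# Hypothetical labelled samples: (height_cm, weight_kg, label).
training_data = [
    (167, 51, "Underweight"),
    (182, 62, "Normal"),
    (176, 69, "Normal"),
    (173, 64, "Normal"),
    (174, 56, "Underweight"),
    (169, 58, "Normal"),
    (173, 57, "Normal"),
    (170, 55, "Normal"),
]

def knn_classify(samples, query, k):
    """Return the majority label among the k samples closest to query."""
    # Sort every stored sample by its Euclidean distance to the query point.
    by_distance = sorted(samples, key=lambda s: math.dist(s[:2], query))
    # Let the k nearest neighbors vote on the label.
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_classify(training_data, (170, 57), k=3))  # Normal
```

With these made-up points, the three nearest neighbors of (170, 57) are all labelled Normal, so the vote is unanimous. `math.dist` needs Python 3.8 or later.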
We have a data set of 768 people who were or were not diagnosed with diabetes. And let's go ahead and open that file and just take a look at that data. And this is in a simple spreadsheet format. The data itself is comma separated, very common set of data, and it's also a very common way to get the data. And you can see here we have columns A through I, that's what one, two, three, four, five, six, seven, eight, um, eight columns with a particular attribute, and then the ninth column, which is the outcome, is whether they have diabetes. As a data scientist, the first thing you should be looking at is insulin. Well, you know, if someone has insulin, they have diabetes because that's why they're taking it. And that could cause issue when some of the machine learning packages, but for very basic setup, this works fine for doing the KNN. And the next thing you notice is it, it didn't take very much to open it up. Um, I can scroll down to the bottom of the data, there's 768. It's pretty much a small data set. You know, at 769, I can easily fit this into my RAM on my computer. I can look at it, I can manipulate it, and it's not going to really tax just a regular desktop computer. You don't even need an enterprise version to run a lot of this. So let's start with importing all the tools we need. And before that, of course, we need to discuss what IDE I'm using. Certainly can use any particular editor for Python, but I like to use for doing a very basic visual stuff, the Anaconda, which is great for doing demos with the Jupiter notebook. And just a quick view of the Anaconda Navigator, which is the new release out there, which is really nice. You can see under home, I can choose my application. We're going to be using Python 36. I have a couple different versions on this particular machine. If I go under environments, I can create a unique environment for each one, which is nice. And there's even a little button there where I can install different packages. 
So if I click on that button and open the terminal, I can then use a simple pip install to install different packages I'm working with. Let's go ahead and go back under home, and we're going to launch our notebook. And I've already, you know, kind of like the old cooking shows, I've already prepared a lot of my stuff, so we don't have to wait for it to launch. Because it takes a few minutes for it to open up a browser window. In this case, I'm going to, it's going to open up Chrome, because that's my default that I use. And since the script is pre-done, you'll see I have a number of windows open up at the top, the one we're working in. And, uh, since we're working on the KNN predict whether a person will have diabetes or not, let's go ahead and put that title in there. And I'm also going to go up here and click on cell, actually, we want to go ahead and first insert a cell below. And then I'm going to go back up to the top cell, and I'm going to change the cell type to markdown. That means this is not going to run as Python, it's a markdown language. So if I run this first one, it comes up in nice big letters, which is kind of nice. Remind us what we're working on. And by now, you should be familiar with doing all of our imports. We're going to import the Pandas as PD, import Numpy as NP. Pandas is the Pandas data frame, and Numpy is a number array. Very powerful tools to use in here. So we have our imports. So we've brought in our Pandas, our Numpy, our two general Python tools. And then you can see over here we have our train test split. By now you should be familiar with splitting the data. We want to split part of it for training our thing and then training our particular model. And then we want to go ahead and test the remaining data just to see how good it is. Pre-processing, a standard scalar pre-processor, so we don't have a bias of really large numbers. 
Remember, in the data we had, a column like number of pregnancies isn't going to get very large, whereas the amount of insulin they take can get up to 256. So 256 versus six, that will skew results. So we want to go ahead and change that so that they're all on a uniform scale, with each column centered on zero with a standard deviation of one. And then the actual tool, this is the KNeighborsClassifier we're going to use. And finally the last three are three tools to test. They're all about testing our model, how good is it? We just put down test on there. And we have our confusion matrix, our F1 score and our accuracy. So we have our two general Python modules we're importing, and then we have our six modules specific to the sklearn setup. And then we do need to go ahead and run this so that these are actually imported. There we go. And then move on to the next step. And so in this step, we're going to go ahead and load the data set. We're going to use Pandas, remember Pandas is pd. And we'll take a look at the data in Python. We looked at it in a simple spreadsheet, but usually I like to also pull it up so that we can see what we're doing. So here's our dataset equals pd.read_csv. That's a Pandas command, and the diabetes file, I just put it in the same folder where my IPython notebook is. If you put it in a different folder, you'd need the full path on there. We can also do a quick length of the data set. That is a simple Python command, len for length. Let's go ahead and print that. We'll go print. And if you do it on its own line, len(dataset), in the Jupyter notebook, it'll automatically print it. But in most of your different setups, you want to put the print in front of there. And then we want to take a look at the actual data set, and since we're in Pandas, we can simply do dataset.head(). And again, let's go ahead and add the print in there. If you put a bunch of these in a row, you know, data set one head, data set two head, it only prints out the last one.
So I use always like to keep the print statement in there. But because most projects only use one data frame Pandas data frame, doing it this way doesn't really matter, the other way works just fine. And you can see when we hit the run button, we have the 768 lines, which we knew, and we have our pregnancies. It's automatically given a label on the left. Remember, the head only shows the first five lines. So we have zero through four, and just a quick look at the data, you can see it matches what we looked at before. We have pregnancy, glucose, blood pressure, all the way to age, and then the outcome on the end. And we're going to do a couple things in this next step. We're going to create a list of columns where we can't have zero. There's no such thing as zero skin thickness or zero blood pressure, zero glucose. Any of those you'd be dead. So not a really good factor if they don't, if they have a zero in there, because they didn't have the data. And we'll take a look at that because we're going to start replacing that information with a couple of different things. And let's see what that looks like. So first we create a nice list, as you can see, we have the values we talked about, glucose, blood pressure, skin thickness. Uh, and this is a nice way when you're working with columns is to list the columns you need to do some kind of transformation on. Very common thing to do. And then for this particular setup, we certainly could use the, there's some Pandas tools that will do a lot of this, where we can replace the NA. But we're going to go ahead and do it as a data set column equals data set column.replace. This is this is still Pandas. You can do a direct, there's also one that says that you look for your NA. A lot of different options in here, but the NAN, Numpy NAN is what that stands for is is non doesn't exist. So the first thing we're doing here is we're replacing the zero with a Numpy none. There's no data there. That's what that says. 
That's what this is saying right here. So put the zero in and we're going to replace zeros with no data. So if it's a zero, that means the person's, well, hopefully not dead. Hopefully it just didn't get the data. The next thing we want to do is we're going to create the mean, which is the integer from the data set from the column dot mean where we skip NAs. We can do that. That is a Pandas command, there skip NA. So we're going to figure out the mean of that data set. And then we're going to take that data set column and we're going to replace all the NPN with the means. Why did we do that? We could have actually just taken this step and gone right down here and just replaced zero and skip anything where except you could actually, there's a way to skip zeros and then just replace all the zeros. But in this case, we want to go ahead and do it this way, so you could see that we're switching this to a non-existent value. Then we're going to create the mean. Well, this is the average person. So, if we don't know what it is, if they did not get the data, and the data is missing, one of the tricks is you replace it with the average. What is the most common data for that? This way, you can still use the rest of those values to do your computation, and it kind of just brings that particular value, those missing values out of the equation. Let's go ahead and take this and we'll go ahead and run it. Doesn't actually do anything. We're still preparing our data. If you wanted to see what that looks like, we don't have anything in the first few lines, so it's not going to show up. But we certainly could look at a row. Let's do that. Let's go into our data set. We'll just print a data set. And let's pick in this case. Let's just do glucose. And if I run this, this is going to print all the different glucose levels going down, and we thankfully don't see anything in here that looks like missing data, at least on the ones it shows. You can see it skipped a bunch in the middle. 
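The zero-to-NaN-to-mean replacement described above can be sketched with pandas as follows. The tiny DataFrame, its values, and the column names are hypothetical stand-ins for the real diabetes file, which is not reproduced in the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the diabetes table; names and values are assumed.
dataset = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 64],
})

zero_not_accepted = ["Glucose", "BloodPressure"]
for column in zero_not_accepted:
    # Mark impossible zeros as missing so they do not distort the mean.
    dataset[column] = dataset[column].replace(0, np.nan)
    # Mean of the values that are present, truncated to an integer as in the talk.
    mean = int(dataset[column].mean(skipna=True))
    # Fill the missing entries with that column mean.
    dataset[column] = dataset[column].fillna(mean)

print(dataset)
```

Converting the zeros to NaN first matters: if the zeros stayed in place, they would pull the computed mean down.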
Because that's what it does. If you have too many lines in Jupyter notebook, it'll skip a few and and go on to the next in a data set. Let me go ahead and remove this. We'll just zero out that. And of course, before we do any processing, before proceeding any further, we need to split the data set into our train and testing data. That way we have something to train it with and something to test it on. And you're going to notice we did a little something here with the Pandas database code. There we go, my drawing tool. We've added in this right here off the data set. And what this says is that the first one in Pandas, this is from the PD Pandas, it's going to say within the data set, we want to look at the i location and it is all rows, that's what that says. So we're going to keep all the rows, but we're only looking at zero column zero to eight. Remember, column nine, here it is right up here. We printed it in here as outcome. Well, that's not part of the training data. That's part of the answer. Yes, column nine, but it's listed as eight. Number eight, so zero to eight is nine columns. So, uh, eight is the value. And when you see it in here, zero, this is actually zero to seven. It doesn't include the last one. And then we go down here to Y, which is our answer, and we want just the last one, just column eight. And you can do it this way with this particular notation. And then if you remember, we imported the train test split, that's part of the SK learn right there. And we simply put in our X and our Y. We're going to do random state equals zero. You don't have to necessarily seed it. That's a seed number. I think the default is one when you seat it, I'd have to look that up. And then the test size, test size is point two. That simply means we're going to take 20% of the data and put it aside so that we can test it later. That's all that is. And again, we're going to run it. Not very exciting. 
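The column slicing described above is easy to get wrong because the end of an `iloc` range is excluded. Here is a small sketch; the column names and the two rows of values are assumptions for illustration only.

```python
import pandas as pd

# Assumed column names: eight feature columns followed by the label column.
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.DataFrame(
    [[6, 148, 72, 35, 155, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 155, 26.6, 0.351, 31, 0]],
    columns=columns,
)

X = dataset.iloc[:, 0:8]  # all rows, columns 0 through 7 (8 is excluded)
y = dataset.iloc[:, 8]    # all rows, column 8 only: the Outcome label

print(X.shape, y.tolist())
```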
So far we haven't had any printout other than to look at the data, but a lot of this is prepping the data. Once you prep it, the actual lines of code are quick and easy. And we're almost there with the actual running of our KNN. We need to go ahead and scale the data. If you remember correctly, we're fitting the data with a standard scaler, which means instead of the data being from, you know, five to 303 in one column, and the next column is one to six, we're going to rescale it so that every column has a mean of zero and a standard deviation of one. That's what the standard scaler does. It keeps it standardized. And we only want to fit the scaler with the training set, but we want to make sure the testing set, the X test going in, is also transformed. So it's processing it the same. So, here we go with our standard scaler. We're going to call it sc_X for the scaler, and we're going to assign a StandardScaler to this variable. And then our X train equals sc_X.fit_transform, so we're fitting the scaler on the X train variable. And then our X test, we're also going to transform it. So we've fitted and transformed the X train, and the X test isn't part of that fitting. It isn't part of training the transformer. It just gets transformed, that's all it does. And again, we're going to go ahead and run this. And if you look at this, we've now gone through these steps, all three of them. We've taken care of replacing our zeros for key columns that shouldn't be zero, and we replaced those with the means of those columns. That way they fit right in with our data models. We've come down here and we split the data. So now we have our test data and our training data. And then we've scaled the data. So that's all of our data going in. Note we don't scale the Y part, the Y train and Y test. That never has to be scaled. It's only the data going in. That's what we want to scale in there.
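To make the fit-on-train, transform-both idea concrete, here is a plain-Python sketch of standardization, the same calculation a standard scaler performs: subtract the training mean and divide by the training (population) standard deviation. The values are hypothetical. Note that standardized values are centered on zero with unit standard deviation, so they are not strictly confined between minus one and one.

```python
import statistics

# Hypothetical values for one feature column.
x_train = [5.0, 100.0, 303.0, 60.0]
x_test = [80.0, 400.0]

# "Fit": learn the mean and standard deviation from the training data only.
mean = statistics.fmean(x_train)
std = statistics.pstdev(x_train)

def scale(values):
    """Apply the training-set mean and standard deviation to any values."""
    return [(v - mean) / std for v in values]

x_train_scaled = scale(x_train)  # fit and transform
x_test_scaled = scale(x_test)    # transform only, reusing the training statistics

print([round(v, 2) for v in x_train_scaled])
print([round(v, 2) for v in x_test_scaled])
```

Reusing the training statistics for the test set keeps information about the test data from leaking into the preprocessing step.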
Then define the model using KNeighborsClassifier and fit the train data in the model. So we do all that data prep, and you can see down here, we're only going to have a couple of lines of code where we're actually building our model and training it. That's one of the cool things about Python and how far we've come. It's such an exciting time to be in machine learning because there are so many automated tools. Let's see, before we do this, let's do a quick length of Y. And we get 768. And if we import math, we can do math.sqrt of the length of Y test. And when I run that, we get 12.409. I wanted to show you where the number we're about to use comes from. 12 is an even number, and if you're ever voting on things, remember, the neighbors all vote, you don't want to have an even number of neighbors voting, so we want to do something odd. And let's just take one away, we'll make it 11. Let me delete this out of here. This is one of the reasons I love Jupyter notebook, because you can flip around and do all kinds of things on the fly. So we'll go ahead and put in our classifier. We're creating our classifier now, and it's going to be the KNeighborsClassifier. n_neighbors equals 11. Remember, we did 12 minus one for 11. So we have an odd number of neighbors. p equals two, which is the power parameter for the distance calculation, and with p equals two that's the Euclidean distance. And we're using the Euclidean metric. There are other ways of measuring the distance, all kinds of ways to measure this, but the Euclidean is the most common one, and it works quite well. It's important to evaluate the model. Let's use the confusion matrix to do that. Wonderful tool, and then we'll jump into the F1 score.
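Pulling the steps described so far together, here is a hedged end-to-end sketch with scikit-learn. Because the diabetes CSV is not available here, it assumes a synthetic stand-in from `make_classification` with the same shape (768 rows, 8 features); the classifier settings follow the values mentioned in the talk, and the fit and predict calls are the usual scikit-learn pattern rather than a quote from the video.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# Hold back 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.2
)

# Fit the scaler on the training rows only, then apply it to both sets.
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# 11 neighbors, Euclidean distance (p=2), as in the talk.
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric="euclidean")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(len(y_test), y_pred[:10])
```

The scores on this synthetic data will not match the numbers quoted in the video; the sketch only shows how the pieces connect.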
And finally, the accuracy score, which is probably the most commonly quoted number when you go into a meeting or something like that. So let's go ahead and paste that in there, and we'll set cm equal to confusion_matrix of Y test, Y predict. So those are the two values we're going to put in there. And let me go ahead and run that and print it out. And the way you interpret this is you have the Y predicted, which would be your title up here. Let's just write P R E D, predicted, across the top, and actual going down. It's always hard to write in here, actual. That means that this diagonal here down the middle, that's the important part. And I believe the zero row is the people that don't have diabetes: 94 of them were correctly predicted as not having diabetes, and the prediction said that 13 of those people did have diabetes and were at high risk. And of the people that had diabetes, it got 32 correct, but our prediction classified another 15 incorrectly as not having diabetes. So you can see where that classification comes in and how that works on the confusion matrix. Then we're going to go ahead and print the F1 score. Let me just run that. And you can see we get a 0.69 in our F1 score. The F1 score takes into account both false positives and false negatives. Whereas if we go ahead and just do the accuracy score, that's what most people think of, it looks at just how many we got right out of the total. So when you're a data scientist and you're talking to other data scientists, they're going to ask you what the F1 score, what the F score is. If you're talking to the general public or the decision makers in the business, they're going to ask what the accuracy is. And the accuracy will often look better than the F1 score, as it does here, but the F1 score is more telling. It lets us know that there are more false positives and false negatives than we would like on here.
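The accuracy and F1 figures quoted in the talk can be reproduced by hand from the four confusion matrix counts mentioned above. This sketch assumes the usual layout with actual classes as rows and predicted classes as columns, so 94 and 32 are the correct predictions, 13 are false positives and 15 are false negatives.

```python
# Counts as read from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 94, 13   # actual 0: correctly rejected, falsely flagged
fn, tp = 15, 32   # actual 1: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # how many flagged cases were real
recall = tp / (tp + fn)                      # how many real cases were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(f1, 3))  # 0.818 0.696
```

The gap between the two numbers comes from the 13 false positives and 15 false negatives, which F1 penalizes while accuracy is buoyed by the 94 easy negatives.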
But 82% not too bad for a quick flash look at people's different statistics and running an SK learn and running the KNN, the K-nearest neighbor on it. So, we have created a model using KNN which can predict whether a person will have diabetes or not. Or at the very least, whether they should go get a checkup and have their glucose checked regularly or not. The print accuracy score, we got the 0.818, which was pretty close to what we got. And we can pretty much round that off and just say we have an accuracy of 80%. Tells us that it is a pretty fair fit in the model. To pull that all together, it's always a lot of fun, make sure we cover everything we went over today. We covered why we need a KNN, looking at cats and dogs. Great if you have a cat door and you want to figure out whether it's a cat or dog coming in. Don't let the dog in or out. Using Euclidean distance, the simple distance calculated by the two sides of the triangle or the square root of the two sides squared. Choosing the value of K, we discussed that a little bit. At least one of the main choices that people use for choosing K. And how KNN works? And then finally, we did a full KNN classifier for diabetes prediction. Thank you for joining us today. For more information, visit www.simplylearn.com. Feel free to post your questions in the YouTube video here or visit us on our website, and we'll also have a support page there you can go to and post additional questions. Again, thank you for joining us today.

KNN Algorithm In Machine Learning | KNN Algorithm Using Python | K Nearest Neighbor | Simplilearn
Simplilearn
29m 58s · 5,602 words · ~29 min read
This transcript was generated from the video's audio because no usable YouTube caption track was available. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.


