Heart Disease Prediction using Machine Learning | Part 3 | End to End ML Project| Data Preprocessing

At A Glance!

23m 29s · 3,424 words · ~18 min read
Auto-Generated

[0:09]Hello everyone, welcome back to my YouTube channel. In this video, we'll perform the data preprocessing step. Here I am inside Google Colab. You can use Jupyter Notebook also. I have uploaded the data set to my Google Drive, and to access it, I'm going to mount the Google Drive in this Colab notebook. For that, you can run this command. Let me run it. You can see that the drive has already been mounted inside this notebook. Now, we'll import the libraries that we are going to use in this step. Here I'm going to use NumPy and Pandas, and you can see I have given NumPy the alias np and Pandas the alias pd. Along with that, I have also imported warnings, just to suppress the irrelevant warnings that pop up after executing the code. Now, let me run this cell. You can see it has executed successfully. Now we'll load the data set. We have the read_csv function inside Pandas, and since Pandas has the alias pd, we write pd.read_csv and pass in the path where we have stored our data set. So here is the path, and the file name is heart.csv. We'll run this and store the data set inside the heart_df variable. Now, let's print this variable. You can see we have successfully accessed the data set and stored it in the variable heart_df. The next step is to get the summary, but before that, I have also displayed five random samples from our data set. The sample function takes a value, and that many rows are selected at random and displayed. So you can see here are five random rows from the data set. The next step is to get the summary. Here you can see that with the .info() function, we can get the summary of our data set. Let's see what we get in this summary.
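
The loading and inspection steps described so far might look roughly like the sketch below. It is not the notebook's exact code: the CSV is replaced by a small in-memory stand-in so it runs anywhere, and the column names and row values are assumptions for illustration.

```python
import io
import warnings

import pandas as pd

warnings.filterwarnings("ignore")  # hide non-critical warnings, as in the video

# Hypothetical stand-in for heart.csv; in the notebook this is a Drive path.
csv_text = """Age,Sex,ChestPainType,RestingBP,Cholesterol,Oldpeak,HeartDisease
40,M,ATA,140,289,0.0,0
49,F,NAP,160,180,1.0,1
37,M,ATA,130,283,0.0,0
48,F,ASY,138,214,1.5,1
54,M,NAP,150,195,0.0,0
39,M,NAP,120,339,0.0,0
"""
heart_df = pd.read_csv(io.StringIO(csv_text))

# Five rows picked at random (random_state only makes the pick repeatable).
print(heart_df.sample(5, random_state=0))

# Column names, non-null counts and dtypes in one summary.
heart_df.info()
```

In Colab, the path passed to read_csv would point at the file under the mounted drive instead of the StringIO object.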
You can see each column name is given with its non-null count, which tells us whether the column has any null values or not. Here, you can see that none of the columns has any null value, which is a good sign. Since there are no null values in any of the columns, our workload is reduced. The count is also given: every column has 918 values in total. Next is the data type. The data type of every column is mentioned. You can see the age data type is integer. Basically, all the numerical columns have the data type integer, except Oldpeak. The Oldpeak column has floating-point values, which is why its data type is float, and the others, which are categorical features, have their data type as object. Now, let's get the statistics of our data set. For all the numerical columns, what are the basic statistics that we can compute? We can get them with the help of the describe function on this data frame. So let's execute this. Now here you can see we have got the basic statistics of all the numerical columns present in our data set. You can see the count of each column, then the mean value of every single column. Take the age column: there are 918 values in total, and the mean of all these 918 values is given as 53.5. Similarly for all the other columns. The standard deviation is mentioned, and then the minimum and the maximum values are also mentioned. You can see that in the resting blood pressure column, the minimum value is zero. Similarly, in the cholesterol column, the minimum value is zero, which is not possible. Surely a mistake occurred while the values were being recorded, so we'll have to clean this data, and in this video we'll see how we can clean it. So I hope the basic statistics are clear to you all. You can read this in detail.
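
As a rough illustration of how describe exposes the impossible zero minimums, here is a minimal sketch on a toy frame. The column names and values are assumptions, not the real 918-row data set.

```python
import pandas as pd

# Toy data with deliberately planted zeros (hypothetical values).
heart_df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "RestingBP": [140, 160, 0, 138, 150],
    "Cholesterol": [289, 0, 283, 214, 0],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
})

# count, mean, std, min, quartiles and max for every numeric column.
stats = heart_df.describe()
print(stats)

# A minimum of zero is not physically plausible for these two measurements.
mins = stats.loc["min", ["RestingBP", "Cholesterol"]]
suspicious = mins[mins == 0].index.tolist()
print(suspicious)
```

Oldpeak is left out of the check on purpose, since a zero there can be a legitimate reading.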
So now, you know that in this step we didn't get the statistics of each and every column. We only got those columns whose data type is numeric. To get the statistics of every single column irrespective of the data type, we can pass an argument to the .describe() function. We can write include="all", which will include all the columns irrespective of their data type. So you can see that we have got the statistics of every single column. The statistics of the gender column and of the chest pain type are also shown. Here you can see the frequency, the number of unique values, and the top value, that is, the most frequently occurring value, which is asymptomatic. So, I hope you have got the basic idea of our data set. Now the next step is to start with the preprocessing. So here we go. First we'll check for the null values. Null values can easily be checked with this step. We write heart_df and then .isna(), or we can also write .isnull(), and then we can write .sum(). If we only print the result of isna, let's see what we get. See, here we are getting the same data set, but filled with Boolean values that say whether the value at that particular record is null or not. But if we want it in a summarized way, meaning we only want the count of the null values, then we can add the .sum() function to it. This will print the count of null values. So here you can see the age column has zero null values, the sex column has zero null values, and so on. So I hope this is clear. Now, let's check for duplicates. For checking duplicated values, we have the function duplicated.
This will check whether there are any duplicates present inside the data set, and then we apply the sum function over it. This gives us the total count for the entire data set, that is, how many duplicated rows we have. Here there are zero duplicated rows, so we don't have to worry about it. Next, we'll check the number of unique values in each feature. The unique values in each feature can be checked with the help of the unique function. You can just execute this and we'll get the unique values in each and every column. Now we have checked the unique values. Next we'll convert the categorical feature values into numeric values, because in the end the machine learning model will only accept numeric values. To do that, we'll first extract the names of the columns whose data type is object, which means categorical.
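
The checks described above (full statistics, null counts, duplicate rows, unique values per column) can be sketched as follows. The toy data is an assumption, and nunique is my guess at the call used for the per-column unique counts; the video does not spell it out.

```python
import pandas as pd

# Hypothetical sample; text columns are forced to object dtype to mirror
# the dtypes that .info() reported in the video.
heart_df = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": pd.Series(["M", "F", "M", "F"], dtype="object"),
    "ChestPainType": pd.Series(["ATA", "NAP", "ATA", "ASY"], dtype="object"),
    "Cholesterol": [289, 180, 283, 214],
})

# Statistics for every column; text columns get unique/top/freq instead of mean.
print(heart_df.describe(include="all"))

# isna() returns a Boolean frame; sum() collapses it to a count per column.
null_counts = heart_df.isna().sum()

# duplicated() flags repeated rows; sum() counts them.
dup_count = heart_df.duplicated().sum()

# Number of distinct values in each column.
unique_counts = heart_df.nunique()

print(null_counts, dup_count, unique_counts, sep="\n")
```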

[8:26]So let's write some code for extracting the categorical columns. Let me first show you how the column names are extracted. We have this heart_df data set, and if we write .columns on it, we'll get the list of columns which are present inside heart_df. Out of these, we want only those columns whose data type is object.

[8:57]So we have a function which selects only the data types that we mention. We write select_dtypes, and inside it we pass include with the data type. Here we want only the columns whose data type is object, and then we can write .columns because we want only the names of the columns. If we run this, we'll get the list of only those columns whose data type is object. We'll store it in the variable cat_col, and let's execute this. So we have got this variable, in which we have stored the names of the columns whose data type is object, which means categorical. Now, I have written here the names of the columns with their values. Here we want to convert male inside the sex column to the value zero, and female to the value one. Similarly, inside chest pain type, as you know, we have four values in total. Here, we are going to convert ATA, that is atypical angina, to zero, then non-anginal pain to one, and so on. So we'll write this piece of code to convert it. What is written inside it? Let's see line by line. Here you can see that we are iterating over the cat_col list, which we have already created, that is, the list of columns whose data type is object. For every single column, we print the name of the column, and then we print all the unique values inside that column. Let me show you how we can print the unique values and what exactly the output is. If I specifically mention a single column name, let's say chest pain type, and run this, we'll get the list of the unique values present inside it. Now, the order of these unique values is not going to change. We will always have this order, and following this order, we are going to replace these values with 0, 1, 2, 3, and so on.
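
A minimal sketch of the column extraction and the unique-value order, assuming column names such as Sex and ChestPainType and a handful of made-up rows:

```python
import pandas as pd

# Hypothetical rows; object dtype is set explicitly to match the video's data.
heart_df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Sex": pd.Series(["M", "F", "M", "F", "M"], dtype="object"),
    "ChestPainType": pd.Series(["ATA", "NAP", "ATA", "ASY", "TA"], dtype="object"),
})

# Keep only object-typed columns, then take just their names.
cat_col = heart_df.select_dtypes(include="object").columns
print(list(cat_col))

# unique() returns values in order of first appearance, which is why the
# order stays the same every time the cell is run on the same data.
print(heart_df["ChestPainType"].unique())
```

Because the order depends on which value appears first in the file, the integer assigned to each category is tied to this particular data set.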
So here ATA will have the value zero, then NAP will have the value one, then ASY will have the value two, and TA will have the value three. We don't want to change the order, and that is why we are using the range function. You can also use the one-hot encoding technique or the label encoding technique for this. But since we want to be very specific here, as there are a lot of categorical values, we're going to use this technique, which converts this order into values ranging from zero to the number of unique values minus one. So here you can see, if we write this piece of code, let me copy it and show you what exactly we get. Let me mention the column name. Once I execute this, we get range(0, 4), because there are four unique values in total here, which means the values 0 to 3. These are going to be the values for the unique values present inside this column. Now, the next step is that we actually replace these unique values. Here you can see that for every single column, we replace the list of unique values. For the chest pain type column specifically, there will be a list of four values, and we replace those four values with the range of four values, which means that ATA will be replaced with zero, NAP will be replaced with one, and so on. We are also specifying the inplace=True parameter, which will permanently replace the values in the data set. We'll print it and see how it looks. As I already told you, we first print the column. So here you can see the output for this command, that is, the unique values inside the sex column, which are male and female.
And then this shows with what values we are going to replace male and female. Here you can see we are getting a list. Zero will be the replacement value for male, and for female the value will be one. Similarly, for the chest pain type column, there are four different unique values, so with the help of the range function we convert them to 0 through 3. This is how we have replaced the unique values. I hope it is clear. Now let me show you how the data set looks. Here you can see that the data set no longer has categorical values. We have numerical values. They have been replaced successfully.
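
The encoding loop can be sketched like this. Note one deliberate swap: the video calls replace with inplace=True, while this sketch builds a dict and uses map with plain assignment, which I find easier to reason about across pandas versions. Column names and rows are assumptions.

```python
import pandas as pd

# Hypothetical rows with object-typed categorical columns.
heart_df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Sex": pd.Series(["M", "F", "M", "F", "M"], dtype="object"),
    "ChestPainType": pd.Series(["ATA", "NAP", "ATA", "ASY", "TA"], dtype="object"),
})
cat_col = heart_df.select_dtypes(include="object").columns

for col in cat_col:
    uniques = heart_df[col].unique()     # order of first appearance
    codes = range(len(uniques))          # 0 .. number of uniques - 1
    mapping = dict(zip(uniques, codes))  # e.g. {"M": 0, "F": 1}
    print(col, mapping)
    # Swap each category for its integer code.
    heart_df[col] = heart_df[col].map(mapping)

print(heart_df)
```

With these rows the mapping comes out as M to 0 and F to 1, and ATA, NAP, ASY, TA to 0, 1, 2, 3, matching the order described in the video.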

[15:03]Now we are good to go with this data set. One thing we are left to do is to find those values in the cholesterol and resting blood pressure columns that are zero, because zero cannot be a proper value for any patient. A patient won't have zero blood pressure or zero cholesterol. We'll have to find them and then replace them with some proper values. So let's see the value counts inside the cholesterol column. You can see these are all the unique values, and for the value zero there are 172 occurrences. That means there are 172 records in total whose cholesterol value is zero. This might be a technical issue. The cholesterol might not have been measured, so we'll have to find these records and replace the zeros with some proper values. We can use the KNN imputer technique for replacing these zero values. You can also use the mean or the median: look at the distribution of the cholesterol column and replace the zeros with the mean or median. But I'm going to use the KNN imputer technique in this case. So let's see how we can use it. First we'll replace all these zero values with NaN values. Basically, np.nan is a NaN value, where np is the NumPy alias. If I write np.nan, you can see what it gives: a NaN value. So the first thing we'll do is replace all the zeros inside the cholesterol column with NaN, and we'll write inplace=True so that the change is permanent inside the cholesterol column. So now all the zero values are replaced with NaN values. Now, we have KNNImputer inside the sklearn.impute module. We'll import it, and then we'll write imputer = KNNImputer, and inside it we'll specify the number of neighbors.
So here we'll specify the number of neighbors as three, and then we write imputer, that is the object, then .fit_transform, and inside it we pass the heart_df data set. Across the entire data set, wherever it finds NaN values, it will replace each NaN with a value calculated from the nearest neighbors. You can see we have got the result, and we'll store it in the variable after_impute. Now we'll convert this back into a data frame, and the columns are simply heart_df.columns. So let's execute this cell, and you can see it has been executed. And let's look at the data set. Here we won't be able to tell whether the NaN values have been replaced or not.
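
A minimal sketch of the zero-to-NaN replacement followed by KNN imputation, assuming made-up rows and column names. It uses plain assignment instead of inplace=True on the column, which is a substitution on my part rather than the video's exact call.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical, already-numeric data; two cholesterol readings are zero.
heart_df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39],
    "RestingBP": [140, 160, 130, 138, 150, 120],
    "Cholesterol": [289, 0, 283, 214, 0, 339],
    "MaxHR": [172, 156, 98, 108, 122, 170],
})

# Mark the impossible zeros as missing so the imputer will treat them as gaps.
heart_df["Cholesterol"] = heart_df["Cholesterol"].replace(0, np.nan)

# Each missing value is filled with the mean of that column over the
# three rows that are closest on the other features.
imputer = KNNImputer(n_neighbors=3)
after_impute = imputer.fit_transform(heart_df)  # returns a NumPy array

# Put the column names back by rebuilding the data frame.
heart_df = pd.DataFrame(after_impute, columns=heart_df.columns)
print(heart_df)
```

fit_transform returns floats for every column, which is one reason the integer columns need a type conversion afterwards.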

[18:39]So let's look specifically at the cholesterol column and check whether there are any NaN values present. So .isna().sum(). You can see there are now zero null values in this column. Also, if you want to check whether there are any zero values left, you can write this piece of code, which iterates over every single value and checks whether it is zero. If we print this count, it will be zero, because we have already imputed the zero values inside the cholesterol column with the k-nearest neighbors algorithm and filled them with proper values. Now we'll do the same thing with the blood pressure. We have already seen that the blood pressure column has zero values in it. So let's see what they are. You can see there is only one row in which the resting blood pressure value is zero. We'll replace this also. We'll use the same technique, the KNN imputer, and this time we'll mention resting BP as the column. We'll replace the zero with NaN, and then the NaN values will be filled in by the KNN imputer. So let's run this, and it is done. Now if we look at the unique values inside the resting blood pressure column, there won't be any zero values in it. We'll also write this command to check whether there are any null values present in the resting blood pressure column. Let's execute this. We don't have any null values either. So there are no zero values and no null values inside the resting blood pressure column. So now we are good to go. Now, for all those categorical columns whose values we have changed to numeric, we'll change the data type of the columns.
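
Since the same steps are repeated for resting blood pressure, they could be wrapped in a small helper. The function name impute_zeros below is hypothetical and not from the video, the data is made up, and the zero count uses a vectorized comparison rather than the explicit loop the video describes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_zeros(df, column, n_neighbors=3):
    """Treat zeros in `column` as missing and fill them with KNN imputation."""
    out = df.copy()
    out[column] = out[column].replace(0, np.nan)
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(out)
    return pd.DataFrame(imputed, columns=out.columns, index=out.index)


# Hypothetical data with a single zero in RestingBP.
heart_df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39],
    "RestingBP": [140, 160, 130, 0, 150, 120],
    "Cholesterol": [289, 180, 283, 214, 195, 339],
})

heart_df = impute_zeros(heart_df, "RestingBP")

zero_count = int((heart_df["RestingBP"] == 0).sum())  # zeros left over
null_count = int(heart_df["RestingBP"].isna().sum())  # NaNs left over
print(zero_count, null_count)
```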
So you can see, we'll store the heart_df.columns value inside a variable. Here I have named the variable without_oldpeak, because Oldpeak is the only column whose data type is float. So let's exclude it, and you can see we have simply dropped the Oldpeak column from the list of columns of our data frame. We'll now change the data type of all the other columns, except Oldpeak, to integer with the help of the astype function. So let's execute this.
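
The type conversion might look like the sketch below, under the assumption that the float column is named Oldpeak. The variable name without_oldpeak follows the video; the rows are invented.

```python
import pandas as pd

# After imputation every column is float; these values are hypothetical.
heart_df = pd.DataFrame({
    "Age": [40.0, 49.0, 37.0],
    "Cholesterol": [289.0, 262.0, 283.0],
    "Oldpeak": [0.0, 1.0, 1.5],
})

# Every column name except Oldpeak, which should stay a float.
without_oldpeak = heart_df.columns.drop("Oldpeak")

# Cast only the selected columns to 32-bit integers.
heart_df = heart_df.astype({col: "int32" for col in without_oldpeak})

print(heart_df.dtypes)
```

Casting a float to an integer drops the fractional part, so an imputed value that is not a whole number loses its decimals here.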

[21:49]So you can see, the data type has been changed for all those columns. And if we check the info of our data set, let's see how it looks now. You can see all the other columns have their data type as int32, except the Oldpeak column. So I hope it is clear. In this entire video, we have checked for null values. We have checked for duplicated values. We changed the values of the categorical columns to numeric and did the preprocessing. We have also converted all the zero values inside the cholesterol and resting blood pressure columns to proper values with the help of the KNN imputer technique, and finally we have converted the data type of the columns to integer. So I hope this entire preprocessing step is clear to you all. Now, in the next video, we'll perform the data visualization step. We'll have a lot of graphs, we'll plot different things, and we'll try to understand the data in a more detailed way. So I hope you are excited for that. If you like this video, please hit the like button, and I would also like to request you all to post your suggestions and queries in the comment section. If you like this video, please like, share and subscribe to my channel. Also hit the bell icon and don't forget to follow me on Instagram. And also don't forget to follow me on Telegram.