[0:03]Welcome to linear regression. My name is Richard Kirchner, and I'm with Simplilearn. Let's look at an example of a common use for linear regression: profit estimation of a company. If I were going to invest in a company, I would like to know how much money I could expect to make. So we'll take a look at a venture capitalist firm and try to understand which companies they should invest in. To decide which companies to invest in, we need to predict the profit each company makes, and we're going to do it based on the company's expenses, or even just one specific expense. In this case, we have our company and its different expenses: R&D, which is research and development, marketing, and we might also have the location and the administration costs. Based on all this information, we would like to calculate the profit. Now, in actuality, there are usually about 23 to 27 different markers that a heavy-duty investor looks at. We're only going to look at one basic one. For simplicity, let's consider a single variable, R&D, and find out which companies to invest in based on that. So we take the R&D expenditure, how much money each company put into research and development, and plot the profit that goes with it. We can fit a line to estimate the profit, drawing it right through the data. When you look at that, you can see that how much a company invests in R&D is a good marker of how much profit it's going to have. We can also note that companies spending more on R&D make good profit, so let's invest in the ones that spend more on R&D. What's in it for you? First, we'll have an introduction to machine learning, followed by machine learning algorithms. These will be specific to linear regression and where it fits into the larger picture. 
Then we'll take a look at applications of linear regression, understanding linear regression, and multiple linear regression. Finally, we'll roll up our sleeves and do a little programming in our use case, profit estimation of companies. Let's go ahead and jump in. Let's start with our introduction to machine learning, along with some machine learning algorithms and where linear regression fits in. Let's look at another example of machine learning. Based on the amount of rainfall, how much would the crop yield be? So here we have our crops, we have our rainfall, and we want to know how much we're going to get from our crops this year. We're going to introduce two kinds of variables, independent and dependent. The independent variable is a variable whose value does not change by the effect of other variables and is used to manipulate the dependent variable. It is often denoted as X. In our example, rainfall is the independent variable. This is a wonderful example because you can easily see that we can't control the rain, but the rain does control the crop. That's what we mean when we talk about the independent variable controlling the dependent variable. Let's define the dependent variable as a variable whose value changes when there is any manipulation in the values of the independent variables. It is often denoted as Y. You can see here that our crop yield is the dependent variable, and it depends on the amount of rainfall received. Now that we've taken a look at a real-life example, let's go a little bit into the theory and some definitions in machine learning and see how that fits together with linear regression. Numerical and categorical values. Let's take our data coming in, and this is kind of random data from any kind of project. We want to divide it up into numerical and categorical. Numerical is numbers: age, salary, height. Categorical would be a description: the color, a dog's breed, gender. 
Categorical data is limited to a set of very specific items, whereas numerical data covers a range of values. Now that you've seen the difference between numerical and categorical data, let's take a look at some machine learning definitions. When we look at the different machine learning algorithms, we can divide them into three areas: supervised, unsupervised, and reinforcement. We're only going to look at supervised today. Unsupervised means we don't have the answers and we're just grouping things. Reinforcement is where we give positive and negative feedback to our algorithm to train it, and it doesn't have the information until after the fact. But today we're just looking at supervised, because that's where linear regression fits in. In supervised learning, we already have our data and the answers for a group of it, and we use that to train our model and come up with an answer for new data. The two most common uses for that are regression and classification. Now, we're doing linear regression, so we're just going to focus on the regression side. In regression, we have simple linear regression, multiple linear regression, and polynomial linear regression. Of these three, simple linear regression is what we've seen in the examples so far, where we have a lot of data and we draw a straight line through it. Multiple linear regression means we have multiple variables. Remember where we had the rainfall and the crops? We might add additional variables in there, like how much food we give our crops or when we harvest them. Those would be additional inputs to our model, and that's why it would be multiple linear regression. And finally, we have polynomial linear regression. That is, instead of drawing a straight line, we can draw a curved line through the data. Now that you see where the regression model fits into the machine learning algorithms, and that we're specifically looking at linear regression, let's go ahead and take a look at applications of linear regression. 
Let's look at a few applications of linear regression. Economic growth: it's used to determine the economic growth of a country or a state in the coming quarter, and it can also be used to predict the GDP of a country. Product price: it can be used to predict what the price of a product will be in the future. We can guess whether it's going to go up or down, or whether I should buy today. Housing sales: to estimate the number of houses a builder will sell, and at what price, in the coming months. Score predictions, cricket fever: to predict the number of runs a player will score in the coming matches based on previous performance. I'm sure you can figure out other applications you could use linear regression for. So, let's jump in, understand linear regression, and dig into the theory. Understanding linear regression: linear regression is a statistical model used to predict the relationship between independent and dependent variables by examining two factors. The first is which variables in particular are significant predictors of the outcome variable. The second, which we need to look at closely, is how well the regression line fits, so that we make predictions with the highest possible accuracy. If it's inaccurate, we can't use it, so it's very important that we find the most accurate line we can get. Since linear regression is based on drawing a line through data, we're going to jump back and take a look at some Euclidean geometry. The simplest form of a simple linear regression equation, with one dependent and one independent variable, is Y = m * X + c. If you look at our model here, we've plotted two points, (X1, Y1) and (X2, Y2), with Y being the dependent variable, remember that from before, and X being the independent variable. So Y depends on whatever X is. m in this case is the slope of the line, where m equals the difference Y2 - Y1 divided by the difference X2 - X1. 
And finally, we have c, which is the intercept of the line, the point where it crosses the Y axis, that is, the value of Y when X is zero. Let's go back and look at an example of linear regression we used earlier. We're going to go back to plotting the amount of crop yield based on the amount of rainfall. Here we have our rainfall, remember we cannot change rainfall, and we have our crop yield, which is dependent on the rainfall. We're going to take this and draw a line as best we can through the middle of the data. Then, when we look at that, the red point on the Y axis is the amount of crop yield you can expect for the amount of rainfall represented by the green dot. So if we have an idea what the rainfall is for this year, then we can guess how good our crops are going to be. And we've created a nice line right through the middle to give us a nice mathematical formula. Let's look at the intuition behind the regression line and see what the math looks like. Now, before we dive into the math and the formulas behind the scenes, I want you to note that when we get into the case study and actually apply some Python script, the math you're going to see here is done automatically for you. You don't have to have it memorized. It is, however, good to have an idea of what's going on, so if people reference the different terms, you'll know what they're talking about. Let's consider a sample data set with five rows and find out how to draw the regression line. We're only going to use five rows, because if we used something like the rainfall with hundreds of data points, it would be very hard to see what's going on with the mathematics. So we'll go ahead and create our own two sets of data, our independent variable X and our dependent variable Y. When X was one, we got Y equals two. When X was two, Y was four, and so on. 
If we go ahead and plot this data on a graph, we can see how a nice line could run through the middle of it. You can see where the points are grouped, going upwards to the right. The next thing we want to know is the mean of each of the data columns coming in, the X and the Y. The mean is nothing other than the average, so we add up all the numbers and divide by how many there are: 1 + 2 + 3 + 4 + 5 over 5 = 3. We do the same for Y and get four. If we plot the means on the graph, we get the point (3, 4), which sits right in the middle of the data, and the best-fit line passes through it. A good starting estimate. Here we're going to dig deeper into the math behind the regression line. Now remember, I said before that you don't have to have all these formulas memorized or fully understand them, even though we're going to go into a little more detail on how they work. And if you're not a math whiz and you've never seen the sigma character before, which looks a little bit like an E that's opened up, it just means summation. That's all it is. So when you see the sigma character, it just means we're adding up everything in that column. For computers this is great, because as a programmer you can easily iterate through each of the XY points and create all the information you need. On the top half, you can see where we've broken that down into pieces. As it goes through the first two points, it computes the squared value of X, the squared value of Y, and X times Y. Then it takes all of the X values and adds them up, all of the Y values and adds them up, all of the X squared values and adds them up, and so on. You can see we have the sum of X equal to 15, the sum of Y equal to 20, all the way up to X times Y, where the sum equals 66. This all feeds into our formula for a straight line, where Y equals the slope m times X plus the intercept c. 
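The column sums described above can be sketched in a few lines of plain Python. This is only an illustration: X is 1 through 5 as in the walkthrough, but only the first two Y values (2 and 4) are actually stated, so the last three Y values below are an assumption, picked so that the totals match the stated sum of Y (20) and sum of X times Y (66).

```python
# Sample data. xs is 1..5 as in the walkthrough. Only the first two y values
# are stated; the last three are ASSUMED so the totals match the stated sums.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)

# Each "sigma" is just a sum over one column of the table.
sum_x = sum(xs)                               # sum of X
sum_y = sum(ys)                               # sum of Y
sum_x2 = sum(x * x for x in xs)               # sum of X squared
sum_y2 = sum(y * y for y in ys)               # sum of Y squared
sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of X times Y

# The mean is the column sum divided by the number of rows.
mean_x = sum_x / n
mean_y = sum_y / n

print("sum x:", sum_x, "sum y:", sum_y, "sum xy:", sum_xy)
print("sum x^2:", sum_x2, "sum y^2:", sum_y2)
print("means:", mean_x, mean_y)
```

Each quantity is computed one piece at a time, which is the same row-by-row bookkeeping the table performs.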
So when we go down below, we compute something more like the averages of these. Where that formula comes from is minimizing the squared error, and we'll go into that in a few minutes. For now, all you need to do is look at the formula and see how we've gone about computing it line by line, instead of trying to push a huge set of numbers into it all at once. Down here you'll see what the slope m equals. For the top part, if you read through the brackets, you have the number of data points times the sum of X times Y, which we computed one line at a time and which comes to 66, and from that you subtract the sum of X times the sum of Y. Both of those have been computed already, so you have 15 times 20. On the bottom, we have the number of data points times the sum of X squared, which is 1 + 4 + 9 + 16 + 25 = 55, and from that you subtract the square of the sum of X, which is 15 squared. So the top is 5 times 66 minus 15 times 20, which is 30, and the bottom is 5 times 55 minus 225, which is 50. You can plug in all those numbers, which is very easy to do on the computer, so you don't have to do the math on a piece of paper or a calculator, and you'll get a slope of 0.6. If you continue to follow the formula through, you'll see the intercept c comes out equal to 2.2. Continuing deeper into what's going on behind the scenes, let's find the predicted values of Y for the corresponding values of X using the linear equation where m = 0.6 and c = 2.2. We're going to take these values and plot them. For X equals one, Y predicted equals 0.6 times 1 plus 2.2, which equals 2.8, and so on. Here the blue points represent the actual Y values, and the brown points represent the predicted Y values based on the model we created. The distances between the actual and predicted values are known as residuals or errors. The best-fit line should have the least sum of squares of these errors, also written as e squared. 
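As a rough sketch of the calculation just described, the slope, intercept, predictions, and residuals can be computed directly from the column sums. As before, only the first two Y values come from the walkthrough; the remaining three are an assumption chosen to reproduce the stated results (m = 0.6, c = 2.2).

```python
# xs is 1..5 as in the walkthrough; the last three ys are ASSUMED values
# chosen to be consistent with the stated sums and results.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Least-squares slope: (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Least-squares intercept: (sum(y) - m*sum(x)) / n, i.e. mean_y - m*mean_x
c = (sum_y - m * sum_x) / n

# Predicted y for each x, and the residual (actual minus predicted).
predicted = [m * x + c for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# Sum of squared errors for this line.
sse = sum(r * r for r in residuals)

print("slope:", round(m, 2), "intercept:", round(c, 2))
for x, y, p, r in zip(xs, ys, predicted, residuals):
    print(x, y, round(p, 2), round(r, 2), round(r * r, 2))
print("sum of squared errors:", round(sse, 2))
```

Rounding is applied only when printing, because floating-point arithmetic can leave tiny differences in the last digits.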
If we put these into a nice chart, you can see X, you can see Y, which is the actual value, and you can see Y predicted. You can easily see where we take Y minus Y predicted and get an answer: what is the difference between those two? If we square that, Y minus Y predicted, squared, we can then sum those squared values. That's where we get the 0.64 plus the 0.36 plus 1, all the way down, until we have a summation equal to 2.4. So the sum of squared errors for this regression line is 2.4. We check this error for each candidate line and conclude that the best-fit line is the one with the least e squared value. In a nice graphical representation, we can see here where we keep moving this line through the data points to make sure the best-fit line has the least squared distance between the data points and the regression line. Now, we only looked at the most commonly used formula for minimizing the distance. There are lots of ways to measure the distance between the line and the data points, like the sum of squared errors, the sum of absolute errors, the root mean square error, and so on. What you want to take away from this is that whatever formula is being used, you can easily calculate the different parts of it by iterating through the data in a computer program. That way, these complicated formulas you see with the different summations and absolute values are easily computed one piece at a time. Up until this point, we've only been looking at two values, X and Y. Well, in the real world, it's very rare that you only have two values when you're figuring out a solution. So let's move on to the next topic, multiple linear regression, and take a brief look at what happens when you have multiple inputs. We'll start with the simple linear regression, where we had Y = m * X + c and we were trying to find the value of Y. Now, with multiple linear regression, we have multiple variables coming in. So instead of having just X, we have X1, X2, X3. 
And instead of having just one slope, each variable has its own slope attached to it. As you can see here, we have m1, m2, m3, and we still have just the single intercept. So when you're dealing with multiple linear regression, you basically take your simple linear regression and spread it out. You have Y = m1 * x1 + m2 * x2, and so on, all the way to mn * xn, and then you add your intercept c on the end. Implementation of linear regression. Now we get into my favorite part. Let's understand how multiple linear regression works by implementing it in Python. If you remember, before we were looking at a company and trying to figure out its profit based just on its R&D. We're going to go back to looking at the expenditure of the company and predicting its profit. But instead of predicting it just from the R&D, we're going to look at other factors like administration costs, marketing costs, and so on. From there, we're going to see if we can figure out what the profit of that company is going to be. To start our coding, we're going to begin by importing some basic libraries. We're going to look through the data before we do any kind of linear regression, so we can see what we're playing with. Then we'll format the data the way we need it to run in the linear regression model. And from there, we'll go ahead and solve it and see how valid our solution is. So let's start with importing the basic libraries. Now, I'm going to be doing this in an Anaconda Jupyter Notebook, a very popular IDE. I enjoy it, it's so visual to look at and so easy to use. But any IDE for Python will work just fine for this, so break out your favorite Python IDE. So here we are in our Jupyter Notebook. Let me go ahead and paste our first piece of code in there, and let's walk through what libraries we're importing. 
First, we're going to import NumPy as np. Then I want you to skip one line and look at import pandas as pd. These are very common tools that you need for most of your linear regression work. NumPy, which stands for Numerical Python, is usually denoted as np, and you almost always need it alongside your scikit-learn (sklearn) toolbox, so you import it right at the beginning. As for Pandas, although you don't have to have it for your sklearn libraries, it does such a wonderful job of importing data and setting it up in a data frame that we can manipulate it rather easily. It has a lot of other tools in addition to that, so we usually like to use Pandas when we can, and I'll show you what that looks like. The other three lines are for us to get a visual of this data and take a look at it. We're going to import matplotlib.pyplot as plt and then Seaborn as sns. Seaborn works with the Matplotlib library, so you always import Matplotlib, and then Seaborn sits on top of it. We'll take a look at what that looks like. You could use any plotting library you want. There are all kinds of ways to look at the data. These are just very common ones, and Seaborn is so easy to use. It just looks beautiful, and it's a nice representation that you can actually take and show somebody. The final line is %matplotlib inline, with a percent sign at the front. That is only there because I'm working in an inline IDE. My interface in the Anaconda Jupyter Notebook requires I put that in there, or you're not going to see the graph when it comes up. Let's go ahead and run this. It's not going to be that interesting, since we're just setting things up. In fact, it's not going to do anything that we can see, but it is importing these different libraries. The next step is to load the dataset and extract the independent and dependent variables. Now, here in the slide you'll see companies equals pd.read_csv, and it has a long path there with the file at the end, 1000 companies.csv. 
You're going to have to change this path to fit whatever setup you have. As for the file itself, you can request it. Just go down to the comments below this video and put a note in there, and Simplilearn will try to get in contact with you and supply you with that file so you can try this coding yourself. So, we're going to add this code in here, and you'll see that I have companies equals pd.read_csv, and I've changed the path to match my computer, C colon, Simplilearn, 1000 Companies.csv. Then below that, we're going to set X equal to companies.iloc, and because companies is a Pandas DataFrame, I can use this nice notation that says take every row, that's what the first colon is, comma, and every column except the last one. That's what the second part is, where we have colon minus one. And we want the values from that. So X is no longer a Pandas DataFrame, but we can easily extract the data from our DataFrame with this notation. Then Y, we're going to set equal to the last column. Well, the question is going to be, what are we actually looking at? So let's go ahead and take a look at that with companies.head, which lists the first five rows of data. I'll open up the file in just a second so you can see where that's coming from, but let's first look at the data the way Pandas sees it. When I hit run, you'll see it breaks it out into a nice layout. One of the things Pandas is really good at is that it looks just like an Excel spreadsheet. You have your rows, and remember, when we're programming, we always start with zero, we don't start with one. So it shows the first five rows, 0, 1, 2, 3, 4. Then it shows your different columns: R&D spend, administration, marketing spend, state, profit. It even recognizes that the top row holds the column names. 
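The loading and slicing steps described above might look roughly like the sketch below. The column names follow the walkthrough, but the rows are made-up illustrative values, and the CSV text is read from an in-memory string (via `io.StringIO`) instead of the real file so that the example runs on its own. With the actual file you would pass its path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: real column names, MADE-UP rows.
csv_text = """R&D Spend,Administration,Marketing Spend,State,Profit
1000.0,500.0,300.0,New York,2000.0
800.0,450.0,250.0,California,1700.0
600.0,400.0,200.0,Florida,1300.0
"""

# read_csv treats the first line as the header row by default.
companies = pd.read_csv(io.StringIO(csv_text))

# Every row, every column except the last: the independent variables.
X = companies.iloc[:, :-1].values

# Every row, only the last column: the dependent variable (profit).
y = companies.iloc[:, -1].values

# head() shows up to the first five rows, indexed from 0.
print(companies.head())
print(X.shape, y.shape)
```

Note that `.values` returns NumPy arrays, so `X` and `y` are no longer DataFrames after this step.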
Pandas was never told which row holds the names, but it is able to recognize a lot of things like that, such as a header row that is not the same as the data rows. Why don't we go ahead and open this CSV file up so you can actually see the raw data. So, here I've opened it up in a text editor. You can see at the top we have R&D spend, comma, administration, comma, marketing spend, comma, state, comma, profit, carriage return. I don't know about you, but I'd go crazy trying to read files like this. That's why we use Pandas. You could also open this up in Excel, and it would separate the columns, since it is a comma-separated values file. But we don't want to look at this one. We want to look at something we can read rather easily. So let's flip back and take a look at that top part, the first five rows. Now, as nice as this format is where you can see the data, to me it doesn't mean a whole lot. Maybe you're an expert in business and investments and you understand how an R&D spend of 165,349.20, compared to the administration cost of 136,897.80, and so on, helps to create the profit of 192,261.83. That makes no sense to me whatsoever, no pun intended. So let's flip back here and take a look at our next set of code, where we're going to graph it so we can get a better understanding of our data and what it means. At this point, we're going to use a single line of code to get a lot of information, so we can see where we're going with this. Let's go ahead and paste that into our notebook and see what we've got. So we have the visualization, and again, we're using sns, which is Seaborn. As you can see, we imported matplotlib.pyplot as plt, which Seaborn then uses, and we imported Seaborn as sns. That final line of the imports, %matplotlib inline with the percent sign at the beginning, is what lets us show the plot inline. Without it, the plot wouldn't display in the notebook, although you could still send it to a file or display it by other means. So here we come down to the single line of code. 
Seaborn is great because it actually recognizes the Pandas data frame. So I can just take companies.corr, for correlations, and put that right into Seaborn. When we run this, we get this beautiful plot. Let's take a look at what this plot means. If you look at this plot on my screen, the colors are probably a little bit more purplish and blue than in the original one. We have the columns and the rows: R&D spending, administration, marketing spending, and profit. You can cross-index any two of these. Since we're interested in profit, if you cross-index profit with profit, it's going to show up, if you look at the scale on the right, way up in the dark. Why? Because those are the same data, exactly the same. R&D spending is going to be the same as R&D spending, and the same thing goes for administration costs. So right down the middle you get this dark diagonal that shows the highest possible correlation, because the data there is exactly the same. As it becomes lighter, there is less connection between the data. So we can see with profit that, obviously, profit is the same as profit, and next it has a very high correlation with R&D spending, which we looked at earlier. It has a slightly weaker connection to marketing spending, and an even weaker one to how much money we put into administration. So now that we have a nice look at the data, let's go ahead and dig in and create some actually useful linear regression models, so that we can predict values and make a better profit.
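To make the correlation table behind that plot concrete, here is a minimal sketch using Pandas alone. The column names follow the walkthrough, but the numbers are invented for illustration. The plotting call itself is left as a comment: it is presumably `sns.heatmap`, though the walkthrough does not name the Seaborn function, so treat that as an assumption.

```python
import pandas as pd

# MADE-UP illustrative data with the same column names as the walkthrough.
companies = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0, 400.0, 500.0],
    "Administration": [150.0, 130.0, 160.0, 140.0, 155.0],
    "Marketing Spend": [300.0, 250.0, 420.0, 380.0, 500.0],
    "State": ["New York", "Florida", "New York", "Florida", "New York"],
    "Profit": [120.0, 210.0, 330.0, 390.0, 520.0],
})

# Correlation only makes sense for numeric columns, so drop the text
# column ("State") before calling corr().
corr = companies.select_dtypes(include="number").corr()

# Each cell is the correlation between one pair of columns; the diagonal
# is 1.0 because every column is perfectly correlated with itself.
print(corr.round(2))

# In a notebook with Seaborn installed, the plot would then be drawn with
# something like (assumed call, not named in the walkthrough):
#   import seaborn as sns
#   sns.heatmap(corr)
```

Reading across the Profit row of the printed table gives the same information as the colors in the plot: the closer a value is to 1, the stronger the linear relationship with profit.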
[29:05]Now that we've got to the linear regression model, we get the next piece of the puzzle. Let's go ahead and put that code in there and walk through it. So here we go, we're going to paste it in. And since this is a shorter piece of code, let's zoom in so we can get a good look. From sklearn.linear_model, we're going to import LinearRegression. Now, I don't know if you recall from earlier, when we were doing all the math, so let's go ahead and flip back there and take a look. Do you remember this, where we had this long formula on the bottom and we were doing all this summation?
[29:46]All of that is wrapped up in this one section. So what's going on here is that I'm going to create a variable called regressor, and the regressor equals LinearRegression. That's the linear regression model that has all that math built in, so we don't have to have it all memorized or compute it individually. Then we do regressor.fit. In this case, we pass X train and Y train because we're using the training data, X being the data going in and Y being the profit, which is what we're looking at. And this does all that math for us. So with one click and one line, we've created the whole linear regression model and fit the data to it. You can see that when I run the regressor, it gives an output, LinearRegression, and it says copy_X equals True, fit_intercept equals True, n_jobs equals 1, normalize equals False. It's just giving you some general information on what's going on with that regressor model. Now that we've created our linear regression model, let's go ahead and use it. If you remember, we kept a bunch of data aside. So we're going to create a Y pred variable and pass in X test. Let's see what that looks like in code. So here we go, we're going to paste that in here, and I'll scroll up a little bit. Predicting the test set results: Y pred equals regressor.predict, with X test going in, and this gives us Y pred. Now, because I'm in Jupyter inline, I can just put the variable name there, and when I hit the run button, it'll print that array out. I could just as easily have done print Y pred. So if you're in a different IDE that's not an inline setup like the Jupyter Notebook, you can do it this way, print Y pred. And you'll see that for the 200 test rows we kept off to the side, it's going to produce 200 answers. These are the predicted profits for those 200 rows. But let's not stop there. 
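The fit and predict steps above can be sketched as follows. This is an illustration under stated assumptions, not the walkthrough's exact script: the data is synthetic (random numbers with a known linear relationship), and the split uses `train_test_split` with `test_size=0.2`, which is inferred from the 200 test rows mentioned for the 1000-company file, since the splitting code itself is not shown in this part of the transcript.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# SYNTHETIC stand-in data: 1000 rows, 3 numeric expense columns, and a
# "profit" built from known slopes plus some random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 3))
true_slopes = np.array([0.8, 0.1, 0.3])
y = X @ true_slopes + 50 + rng.normal(0, 5, size=1000)

# ASSUMED split: hold back 20% of the rows (200 of 1000) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the model and fit it on the training data only.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict profit for the held-back test rows: one prediction per row.
y_pred = regressor.predict(X_test)

print(y_pred.shape)
print(y_pred[:5])
```

Outside a notebook, the explicit `print` calls are needed to see the predictions; in Jupyter, leaving `y_pred` as the last line of a cell displays the array.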
Let's keep going. We're going to take a short detour here and calculate the coefficients and the intercept, so you can see what those look like. This gives us a quick glimpse of what's going on behind the line. What's really nice about the regressor we created is that it already has the coefficients for us, and we can simply print regressor.coef_. When I run this, you'll see our coefficients here. And if we can get the regressor's coefficients, we can also get the regressor's intercept with regressor.intercept_. Let's run that and take a look. This all came from the multiple regression model, and we'll flip over so you can remember where this is going into and where it's coming from. You can see the formula down here, where Y = m1 * X1 + m2 * X2, and so on, plus c, the intercept. So these values fit right into this formula: Y equals slope one times the column one variable, plus slope two times the column two variable, all the way to mn times xn, plus c, the intercept. In this case, you have minus 8.89 times ten to the power of two, and so on, times the first column, and the second column, and the third column. And then our intercept is minus 103009. Boy, it gets kind of complicated when you look at it. This is why we don't do this by hand anymore. This is why we have the computer, to make these calculations easy to run and understand. I told you that was a short detour, and we're coming towards the end of our script. As you remember from the beginning, I said that if we're going to split this data, we have to make sure the model is valid, that it works, and understand how well it works. So, calculating the R squared value. That's what we're going to use to measure how good our prediction is. Let's take a look at what that looks like in code. 
So from sklearn.metrics, we're going to import r2_score. That's the R squared value, and it's how we look at the error. In the r2_score call, we pass our Y test and our Y pred. Y test holds the actual values we're testing against. Those were given to us, and we know they are true. Y pred holds the 200 values that our model thinks are true. When we go ahead and run this, we see we get 0.9352. That's the R2 score. Now, it's not exactly a straight percentage, so it's not saying the model is 93% correct. But you do want it up around 0.9 or higher, and a value like this shows that it is a very valid prediction based on the R2 score.
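The coefficient, intercept, and R squared steps above can be tied together in one short sketch. The data is synthetic and the 80/20 split is an assumption, so the numbers printed will not match the 0.9352 from the walkthrough. The two manual checks are there to show that `predict` is just the multiple regression formula, and that `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# SYNTHETIC stand-in data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 3))
y = X @ np.array([0.8, 0.1, 0.3]) + 50 + rng.normal(0, 5, size=1000)

# ASSUMED 80/20 split, matching the 200 test rows mentioned above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# One slope per input column, plus a single intercept.
print("coefficients:", regressor.coef_)
print("intercept:", regressor.intercept_)

# The prediction is y = m1*x1 + m2*x2 + ... + mn*xn + c for every row.
manual_pred = X_test @ regressor.coef_ + regressor.intercept_

# R squared: 1 - (sum of squared residuals / total sum of squares).
score = r2_score(y_test, y_pred)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print("R2 score:", round(score, 4))
```

A score of 1.0 would mean the predictions match the test values exactly, while a score near 0 would mean the model does no better than always predicting the mean.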
[34:11]Which means success! Yay! We successfully trained our model with certain predictors and estimated the profit of the companies using linear regression. So, now that we have a successful linear regression model, let's look back at what we went over today and at our key takeaways. First, we had an introduction to machine learning, where we talked about some general setup and predicting crops from the weather. We saw that numerical values are things like age, salary, and so on, and categorical values are things like color. When we did our actual regression model, we saw that we had numbers, which were dollar amounts, and we had a location, such as Florida or New York, which was categorical and which we had to convert. Next, we had applications of the linear regression model, where we showed some different things you could use it for. We had our multiple linear regression model, so you could see the math behind it. We had our use case implementation of linear regression, where we dug in deep and showed how it is set up. And finally, prediction using the regression line: we showed you how to predict values with a regression line, along with the actual scripting and code. That concludes our demo today. I want to thank you for joining the Simplilearn team. Remember, if you have any questions on the video, or you wish to request a copy of the CSV file we used to generate this linear regression, feel free to comment down below, and we'll get back to you as soon as we can. Thank you very much.



