
[05x03] Logistic Regression | Classification | Supervised Learning | Machine Learning [Julia]

doggo dot jl

35m 32s · 3,547 words · ~18 min read

[0:00]Last week, we got an introduction to supervised learning. Specifically, we got an introduction to regression, which is a subfield of supervised learning. Today, we'll continue our exploration of supervised learning with an introduction to the other major subfield of supervised learning, called classification. There are many different classification algorithms used in machine learning. In today's tutorial, we'll learn about the logistic regression algorithm. Despite having the term regression in its name, logistic regression is not used to solve regression problems. The logistic regression algorithm is widely used in machine learning to solve classification problems. So, what is classification? And how is it different from regression? Well, let's find out. Welcome to Julia for Talented Amateurs, where I make wholesome Julia tutorials for talented amateurs everywhere. I am your host, the dabbling doggo. The workflow for solving classification problems using a machine learning approach is almost identical to the workflow that we learned last week for solving regression problems using the linear regression algorithm. As a result, you'll be hearing a lot of the same terms that I introduced in last week's tutorial. As a reminder, supervised learning is when the user provides both the inputs and the outputs, and then the computer tries to estimate the function that determines the relationship between the inputs and the outputs. Supervised learning may be split further into regression and classification. In today's tutorial, we'll learn about classification. So how is classification different from regression? Last week, we learned that regression is used to understand the relationship between some independent variable X and a dependent variable Y in order to predict the value of Y. For a regression problem, that relationship is said to be continuous, since for every value of X there's a corresponding value of Y. Classification is used to predict a discrete value of Y, given a value of some independent variable X. The output in classification problems is not continuous. Some examples of classification problems include identifying whether an email is spam or not spam, or identifying whether a tumor is malignant or benign. In these examples, there are only two possible outcomes for Y: one or zero. One represents a positive outcome, meaning that the email is spam, or that the tumor is malignant. Zero represents a negative outcome, meaning that the email is not spam, or that the tumor is benign. Here, positive and negative do not have the same connotations as they might have in everyday English.

[3:42]In machine learning, the terms positive and negative are used in a statistical sense.

[4:48]By the end of this tutorial, you'll know how to build your own logistic regression machine learning classification algorithm to make predictions for a binary classification problem. Knowledge of Julia and VS Code is required for this tutorial. I'm also assuming that you've watched last week's episode, episode 05x02, which introduced many of the machine learning concepts that we'll be applying today. Okay, let's fire up VS Code and get started. In your VS Code Explorer panel, create a new folder for this tutorial. In the tutorial folder, create a new file called SL_regression.jl. Launch the Julia REPL by using Alt+J, then Alt+O. Maximize the terminal panel. Change the present working directory to your tutorial directory. Enter the package REPL by pressing the closing square bracket ]. Activate your tutorial directory. Add the CSV and Plots packages. Type in status to confirm the version numbers. Exit the package REPL by pressing backspace. Minimize the terminal panel.

[6:24]Before getting to the machine learning algorithm, let's take a look at the logistic curve. As you can see in this Wikipedia article, the logistic curve has a distinct elongated S-shape, starting from zero on the Y-axis when X is negative infinity, then crossing the Y-axis at 0.5, and then going to one on the Y-axis when X is positive infinity. This logistic curve is also known as the sigmoid curve. Unlike the straight line that we used to solve linear regression problems, the logistic curve will always have an output that falls between zero and one for any input. The script e in the equation for the logistic curve, f(x) = 1 / (1 + e^(-x)), is Euler's number, which is like the less popular version of Pi in the world of mathematical constants. Euler's number has a value of approximately 2.718. While everyone celebrates Pi Day on March 14th, sadly, no one celebrates Euler Day on February 7th. The logistic function has been around since the 1800s. It was introduced by Pierre-François Verhulst, a Belgian mathematician who used it to explain the stages of population growth, with the initial stage growing exponentially and the later stage slowing down, until the population levels off at some saturation value. It's a mystery as to why he used the term logistic to describe this curve, but the term has survived and is still being used today. Let's use the Plots package to generate our own plot of the logistic curve, so that we can become more familiar with it.
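In Julia, a minimal sketch of that plot might look like the following (the helper name sigmoid and the plotting range are my own choices, not necessarily what appears in the video):

```julia
using Plots

# The standard logistic (sigmoid) function: f(x) = 1 / (1 + e^(-x))
sigmoid(x) = 1 / (1 + exp(-x))

x = range(-10, 10, length = 200)
plot(x, sigmoid.(x), label = "logistic curve", xlabel = "x", ylabel = "y")
```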

[8:30]This is the same version of the logistic curve that was shown in the Wikipedia article. exp is a built-in Julia function that raises Euler's number e to the power of its argument. Here, we're raising e to the power of negative X.

[9:16]So great, we were able to create our own logistic curve. Now what? Last week, we used the formula for a line, Y = MX + B, to find a straight line that was a best fit for our data. We did that by changing the values for the Y-intercept and the slope of that line. This week, we want to do something similar with the logistic curve. We want to modify the shape of this curve to best fit our data so that we can use it to make classification predictions. But how do we modify this curve? If you look at the formula, there are no parameters for anything like the Y-intercept or the slope. So, some clever folks realized that not only can you plug a discrete value in for X in this equation, you can also plug in a function for X, which is what we're going to do. So, what function are we going to use? The function that we're going to use is the equation for a straight line, Y = MX + B.

[10:46]These are the same parameters that we used last week. For the logistic regression algorithm, Theta_0 is similar in concept to the Y-intercept used in linear regression, and Theta_1 is similar in concept to the slope used in linear regression. Last week, I initialized both values at zero. This week, I'm setting Theta_0 to zero and Theta_1 to one. These values will generate the default logistic curve that we just plotted. But note that these values do not correspond to the actual Y-intercept or the slope in the default logistic curve. The default logistic curve crosses the Y-axis at 0.5, and there's no equivalent straight line slope, since this is a curve. Instead, for the logistic regression algorithm, you can think of the parameters Theta_0 and Theta_1 as dials that your computer can turn in order to adjust the shape of the curve. Let's take a look at how to do that.

[12:05]You should recognize this function as the formula for a straight line Y = MX + B, which is what we used as the hypothesis function last week.

[12:24]This week, we're going to use the logistic curve as our hypothesis function, but we're going to replace the X in the equation with a formula for a straight line.
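A sketch of what that hypothesis function could look like, with the straight line θ0 + θ1·X substituted for X (the function name h and the variable names are my own, not necessarily the ones used in the video):

```julia
# Hypothesis: the logistic curve with a straight line θ0 + θ1*x in place of x
h(x, θ0, θ1) = 1 / (1 + exp(-(θ0 + θ1 * x)))

θ0, θ1 = 0.0, 1.0   # these defaults reproduce the standard logistic curve

x = range(-10, 10, length = 200)
plot(x, h.(x, θ0, θ1), label = "hypothesis", xlabel = "x", ylabel = "y")
```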

[12:45]As you can see, this plot looks exactly the same as the default logistic curve. The difference is that we now have the parameters Theta_0 and Theta_1 that we can change in order to modify this curve. Change the value of Theta_0 from zero to one and see what happens.

[13:09]Increasing the value of Theta_0 moves the Y-intercept up along the Y-axis. Since the values along the Y-axis must fall between zero and one, the entire curve does not shift up. Instead, the curve shifts to the left. Try changing the value of Theta_0 to negative one and see what happens.

[13:36]Decreasing the value of Theta_0 moves the Y-intercept down along the Y-axis, thus shifting the curve to the right. Change the value of Theta_0 back to zero, and then try changing the value of Theta_1 from one to 0.5.

[14:00]Decreasing the value of Theta_1 decreases the quote-unquote slope of the curve. So it makes it flatter as the curve transitions from zero to one. Of course, it's not really the slope, but you can think of it that way since it behaves the same way. Now, try changing the value of Theta_1 from 0.5 to negative 0.5.

[14:28]Using a negative value of Theta_1 results in a downward sloping curve, just like it would with a straight line. Pretty cool, right? Go ahead and play around with the settings until you get a feel for how changing the values of Theta_0 and Theta_1 will modify the curve. Now that we have a curve that we can modify, let's take a look at our data.
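One way to experiment is to overlay several parameter settings on the same plot, assuming the hypothetical h function sketched above:

```julia
x = range(-10, 10, length = 200)
plot(x,  h.(x,  0.0,  1.0), label = "θ0 = 0, θ1 = 1")
plot!(x, h.(x,  1.0,  1.0), label = "θ0 = 1, θ1 = 1 (shifted left)")
plot!(x, h.(x, -1.0,  1.0), label = "θ0 = -1, θ1 = 1 (shifted right)")
plot!(x, h.(x,  0.0,  0.5), label = "θ0 = 0, θ1 = 0.5 (flatter)")
plot!(x, h.(x,  0.0, -0.5), label = "θ0 = 0, θ1 = -0.5 (downward)")
```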

[14:55]For this tutorial, you will need a data set that you can download from my GitHub repository. There's a link to it in the description below. You can save it by right-clicking on it. When you download the file, save it to your tutorial directory. It's a CSV file called wolfspider.csv, and it contains a collection of input-output pairs, with the size of grains of sand as the input or feature, and a binary output of either present or absent, indicating a positive or negative class for the presence of a wolf spider in the sand. This data is from a 2006 research paper by Suzuki, Tsurusaki, and Kodama, titled "Distribution of an endangered burrowing spider in the San-in Coast of Honshu, Japan." I apologize for my pronunciation.

[15:59]In the interest of conservation, the researchers surveyed habitats along various beaches in Japan, looking for the presence of an endangered burrowing spider, commonly known as the wolf spider. We're going to use a simplified version of the data presented in their article as our motivating example in order to build a logistic regression model that we can use to predict the probability of the presence of wolf spiders based solely on the size of the grains of sand on a beach in Japan. Now that we have our data stored on disk, let's load it into memory. We'll use the CSV package to import the data from the CSV file.

[16:44]This is the input data, which measures the size of the grains of sand in millimeters. This is the output data, which is a string that is either present, for a positive class, or absent, for a negative class. In order for us to use this data, we need to convert the strings into either a one for a positive class or a zero for a negative class.
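A sketch of what the loading and conversion steps might look like (the column names Feature and Class are my assumptions; check the actual headers in wolfspider.csv):

```julia
using CSV

data = CSV.File("wolfspider.csv")

# Assumed column names; adjust to match the actual CSV headers.
x_data = [row.Feature for row in data]                          # grain size in mm
y_data = [row.Class == "present" ? 1.0 : 0.0 for row in data]   # present → 1, absent → 0
```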

[17:22]Okay, now that we have our data loaded into memory, let's plot it and take a look at it.
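A possible scatter plot of the raw data, assuming the x_data and y_data vectors from the sketch above:

```julia
scatter(x_data, y_data,
        xlabel = "grain size (mm)",
        ylabel = "absent (0) / present (1)",
        label  = "observations")
```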

[17:51]As you can see, all of the data points are either zero or one along the Y-axis. Interestingly, a wolf spider seems to be present regardless of the size of the grains of sand, but is only absent when the grains are relatively small. Just by looking at this data, it's not obvious how you might use the size of the grains of sand in order to predict the presence or absence of the spider, since there's a lot of overlapping data. Instead, what we want to do is fit a logistic curve to this data in order to generate a model that we can use to predict the probability of the presence of the spider given the grain size. Again, just by looking at the data, it's not obvious how to find a best-fit logistic curve for this data. That's why we need machine learning.

[18:49]Even though we're trying to solve a classification problem rather than a regression problem, we're going to follow the exact same machine learning workflow that we used last week. As a reminder, that workflow includes these steps: One, initialize the parameters. Two, define a hypothesis function. Three, define a cost function. Four, define an optimization algorithm. Five, initialize the hyperparameters. Six, change the values of the parameters. Seven, recalculate the cost. Eight, iterate until an optimal value is found for the cost. While the overall workflow is the same, the actual functions that we will be using will be slightly different. Let's start by initializing the parameters.

[19:56]This week, let's track the values of both Theta_0 and Theta_1 so that we can see how they change over time.

[20:13]Next, we need to define our hypothesis function.

[20:27]Let's add this initial logistic curve to our plot.
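One way to overlay the initial curve on the scatter plot of the data, using the hypothetical h function with θ0 = 0 and θ1 = 1:

```julia
xs = range(0, 1.2, length = 100)   # grain sizes in this data roughly span 0–1 mm
plot!(xs, h.(xs, 0.0, 1.0), label = "initial hypothesis")
```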

[20:36]So this green curve is the initial logistic curve. The reason why it doesn't look like the S-shaped curve that we saw earlier is because we're only seeing the range where X is between zero and one. Somehow, we need to get our computer to learn how to modify this curve so that it does a better job of fitting our data. In order to do that, we need to define a cost function.

[21:26]Let's pause right here. This cost function looks a lot different than the cost function that we used last week. It looks really scary, since it has the log function in it. Just like last week, I'm using the equations from Andrew Ng. Without getting into the math, here's how you can think about this cost function. When defining any cost function in machine learning, you want a function that rewards correct predictions, and punishes incorrect predictions. With this cost function, if your computer predicts a class of one when the actual class in the data set is zero, then your computer will pay a large penalty, meaning that the cost will be high. Conversely, if your computer predicts a class of one and the actual class in the data set is one, then your computer will not pay a penalty at all, meaning that the cost will be zero. Based on how this function is designed, all your computer needs to do is adjust the parameters until it finds a minimum value for this cost function. Just like we did last week, let's keep track of this cost over time.
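Without claiming this is the exact code from the video, the cross-entropy cost from Andrew Ng's course might be written in Julia roughly like this (reusing the hypothetical h function from earlier):

```julia
# Cross-entropy (log loss) cost, averaged over the m training examples:
# J(θ) = -(1/m) * Σ [ y * log(h(x)) + (1 - y) * log(1 - h(x)) ]
function cost(x_data, y_data, θ0, θ1)
    m = length(x_data)
    total = 0.0
    for (x, y) in zip(x_data, y_data)
        ŷ = h(x, θ0, θ1)
        total += y * log(ŷ) + (1 - y) * log(1 - ŷ)
    end
    return -total / m
end
```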

[22:49]Next, we need to define our optimization algorithm. Just like we did last week, we'll be using the batch gradient descent algorithm. But because our cost function is different, the partial derivative function will also be different.
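For batch gradient descent with this cost, the partial derivatives take the familiar forms below. This is a sketch under the same assumptions as the earlier snippets (the function name gradients is my own):

```julia
# ∂J/∂θ0 = (1/m) * Σ (h(x) - y)
# ∂J/∂θ1 = (1/m) * Σ (h(x) - y) * x
function gradients(x_data, y_data, θ0, θ1)
    m = length(x_data)
    dθ0 = sum(h(x, θ0, θ1) - y       for (x, y) in zip(x_data, y_data)) / m
    dθ1 = sum((h(x, θ0, θ1) - y) * x for (x, y) in zip(x_data, y_data)) / m
    return dθ0, dθ1
end
```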

[23:34]Next, let's initialize our hyperparameters. Unlike last week, when I cheated in setting the learning rates, this week I'm using the default learning rate of 0.01 for all parameters. Last week, we did everything manually, so we didn't have that many epochs. But this week, we're going to go through many iterations, so this number will be more significant. Next, we need to set up the stubs needed for our iteration. These steps are nearly identical to the steps that we used last week.

[24:30]So let's pause here and take a look at the numbers. The value of Theta_0 increased from zero to 0.015, meaning that the Y-intercept shifted up slightly, which caused the curve to shift slightly to the left. The value of Theta_1 increased from one to 1.015, meaning that the slope of the curve increased slightly, so that it's slightly steeper as it transitions between zero and one. The other thing to note is that with a learning rate of 0.01, these incremental changes are very small, so we will need to repeat this process many times. Let's keep track of the values of these parameters.

[25:26]And now, let's recalculate the cost to see if our cost improved at all. So the value of our cost function decreased from 0.601 to 0.599. It's not a very big improvement, but it is an improvement. Let's keep track of this value, and then add the new curve to our plot.

[26:16]So, other than adding the epochs to the title, it doesn't look like anything has changed on this plot. Unfortunately, the change is too small to be noticeable. Let's set up a for loop so that we can iterate this process 1,000 times.
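A sketch of what the training loop might look like, wrapped in a function to keep the variables local. The learning rate of 0.01 and the starting parameters match the values mentioned in the video; the function name train and the cost and gradients helpers are from the hypothetical sketches above:

```julia
function train(x_data, y_data; θ0 = 0.0, θ1 = 1.0, α = 0.01, epochs = 1000)
    cost_history = Float64[]
    θ0_history   = Float64[]
    θ1_history   = Float64[]
    for _ in 1:epochs
        # batch gradient descent update
        dθ0, dθ1 = gradients(x_data, y_data, θ0, θ1)
        θ0 -= α * dθ0
        θ1 -= α * dθ1
        # track the cost and the parameters over time
        push!(cost_history, cost(x_data, y_data, θ0, θ1))
        push!(θ0_history, θ0)
        push!(θ1_history, θ1)
    end
    return θ0, θ1, cost_history, θ0_history, θ1_history
end

θ0, θ1, cost_history, θ0_history, θ1_history = train(x_data, y_data; epochs = 1000)
```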

[28:04]It looks like there's been some improvement, but the improvement is definitely starting to slow down.

[28:31]Yeah, it's definitely slowing down. Let's run it 1,000 more times until we get to 4,000 epochs.

[28:40]Okay, so this time, the difference is barely noticeable, so let's stop here and talk about this plot.

[28:53]So, the way to read this plot is that the darkest blue line is the model that best predicts the presence of a wolf spider based on the size of the grains of sand on the beach. So the larger the grain size, the higher the probability, and the smaller the grain size, the lower the probability. Notice that the probability never gets close to zero. That's because based on this data, there's always some chance that there may be a wolf spider present, regardless of the grain size. Let's plot a learning curve to see our computer's journey as it went through this process.
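A learning-curve plot can be as simple as plotting the recorded costs against the epoch number, using the hypothetical cost_history from the training sketch above:

```julia
plot(cost_history, xlabel = "epoch", ylabel = "cost", label = "learning curve")
```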

[29:52]The values along the Y-axis are not important, but the overall shape of the curve is. This learning curve shows that our computer was able to make significant improvements to the cost in the early iterations, but the improvement slowed down significantly after 2,000 epochs. This is consistent with what we saw visually.

[30:16]Next, let's plot the journey that the parameters went through during this process.

[30:41]As a reminder, Theta_0 is like the Y-intercept, and Theta_1 is like the slope. The parameters started their journey in the upper left corner with Theta_0 equal to zero and Theta_1 equal to one. Their journey ended in the lower right corner. Just like we saw visually, this plot shows that the Y-intercept increased initially, but eventually began to move down. On the other hand, the slope continued to increase with each iteration.

[31:19]Now that we have a working model, let's use it to make some predictions. Based on grain sizes of 0.25 mm, 0.5 mm, 0.75 mm, and 1 mm, what is the probability of a wolf spider being present? The output is in the REPL.
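Using the hypothetical h function and the trained θ0 and θ1 from the sketches above, the predictions might look like this:

```julia
grain_sizes   = [0.25, 0.5, 0.75, 1.0]     # grain sizes in mm
probabilities = h.(grain_sizes, θ0, θ1)    # predicted probability that a spider is present
```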

[31:46]So, based on this machine learning generated model, the probability of a wolf spider being present when the grain size is 0.25 mm is around 40%. At the other extreme, when the grain size is 1 mm, the probability of a wolf spider being present is around 97%. It's useful to see the actual probabilities, but since this is a classification problem, you may want to set up some rules to define the decision boundary for the classes. For example, you may want to say that if the probability is greater than or equal to 50%, then the result is a positive class, otherwise, the result is a negative class. So in our example, the class is negative when the grain size is 0.25 mm, and the class is positive for the larger sizes. Pretty cool, right?
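Applying a 0.5 decision boundary to those hypothetical probabilities is then a one-liner:

```julia
classes = [p >= 0.5 ? "present" : "absent" for p in probabilities]
```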

[32:50]Today, we continued to explore supervised learning by getting an introduction to the other major subfield of supervised learning called classification. Specifically, we learned about the logistic regression algorithm. Now that we know the basics of linear regression and logistic regression, what are some of the key takeaways? Well, the machine learning workflow for solving both linear regression problems and logistic regression problems is the same. The only things we needed to change were the hypothesis function and the cost function. In terms of outputs, with linear regression, you're trying to create a model that predicts a continuous output. But with logistic regression, you're trying to create a model that predicts discrete outputs. Also, linear regression problems may be solved with or without machine learning. But you need the machine learning workflow in order to solve logistic regression problems. This is what machine learning brings to the table, and it's one of the reasons why it's become so popular. There are certain problems that can only be solved using machine learning. Hopefully, you're beginning to understand that machine learning is not some optional skill that's just nice to know. Instead, machine learning is an essential skill that everyone will be expected to know in the near future. Because machine learning is such a powerful approach to solving classification problems, there are many different classification algorithms in addition to logistic regression. Over the next few weeks, we'll explore several more classification algorithms, so stay tuned for more exciting adventures in machine learning.

[34:56]Well, that's all for today. If you made it this far, congratulations!

[35:04]If you enjoyed this video, and you feel like you learned something new, please give it a thumbs up. For more wholesome Julia tutorials, please be sure to subscribe and hit that bell. If you like what I do, then please consider joining and becoming a channel member. New tutorials are posted on Sundays/Mondays. Thanks for watching, and I'll see you in the next video.
