[0:05]The same classification trees we introduced in our previous videos can have problems with even the simplest datasets. So, let me just sketch one out so you'll see what I mean.
[0:26]The tree constructed for this dataset is relatively large. Remember, trees can only cut the feature space vertically or horizontally, so for a dataset like this one, the decision boundary ends up looking like a staircase. A better method for such datasets would be one that can find an arbitrary straight line to form the decision boundary. And there is such a method: it's called logistic regression. So, let me first use the logistic regression widget from the Educational add-on, because it can also display decision boundaries in two dimensions. Remember, to install a new add-on, just find Add-ons in the Options menu. My Educational add-on, for example, is already installed. The logistic regression widget we're looking for is called Polynomial Classification. And don't worry about the name, we'll explain it when we get to linear regression in a couple of episodes.

Okay, I'll add this widget to the output of Paint Data. You can see the widget plots a straight line that represents my decision boundary. Excellent. Along the decision boundary, the probability of a point belonging to either class is equal. All the other lines you can see indicate where the probability predicted by logistic regression is constant but leaning to one side, say 0.8 over here or 0.2 over on this side. Now, notice that these probability lines in our contour plot are packed very close together. This is because the data I've drawn is clearly separated by a straight line, so the logistic regression model can be confident in predicting one class or the other even when a point is just barely off the decision boundary.

And I can change this by adding a few extra points to my dataset. I'll arrange the Paint Data and Polynomial Classification widgets side by side, then draw a couple of blue points in the red area. Now you can see how the contour lines grow further apart. Let me do this again, but this time, pay close attention to the plot. First, I'll undo the changes I just made by pressing Ctrl+Z. Now the separation between the two classes is clear, and the contours are packed together again. Then, again, I'll add a few extra points like before, and maybe some more red points in the blue area, like this. Okay, the contours are increasingly separated, but the decision boundary, that is, the line with the class probability of 0.5, stays in about the same place. So, as we add more and more noise, logistic regression gets cautious, lowering the certainty of its predictions near the decision boundary.

And this is great. It means logistic regression not only finds a decision boundary, but also assesses the class probabilities around it. For a dataset with a clear boundary, the model will be fairly confident in its predictions, and the probabilities will be high. But the murkier the dataset, the more unsure logistic regression will be, and the predicted probabilities will be somewhat lower. Let's now compare the classification accuracy of logistic regression to that of classification trees on the employee attrition dataset I introduced last time. Remember, the class variable in this dataset tells us whether an employee has left the company, and our classifier should predict attrition from other information about the employees, such as age, department, level of education, hourly rate, etcetera. Now, I'll estimate the accuracy of both methods using ten-fold cross-validation.
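(As an aside, before we look at the results: if you'd like to replay this contour behavior outside of Orange, here is a minimal sketch in Python with scikit-learn. The two point clouds, the noise points, and the probe point are all made up; it's a rough stand-in for the painted data, not the widget's actual code.)

```python
# A rough stand-in for the Paint Data workflow: two made-up point clouds
# fed to scikit-learn's LogisticRegression (not the widget's actual code).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two clearly separated classes: "blue" (0) on the left, "red" (1) on the right.
blue = rng.normal(loc=(-2.0, 0.0), scale=0.4, size=(50, 2))
red = rng.normal(loc=(2.0, 0.0), scale=0.4, size=(50, 2))
X = np.vstack([blue, red])
y = np.array([0] * 50 + [1] * 50)

clean = LogisticRegression().fit(X, y)

# Just off the decision boundary, the model fit on clean data leans clearly red.
probe = [[0.5, 0.0]]
print(clean.predict_proba(probe))

# "Paint" a few blue points into the red area (label noise) and refit.
X_noisy = np.vstack([X, rng.normal(loc=(2.0, 0.0), scale=0.4, size=(5, 2))])
y_noisy = np.append(y, [0] * 5)
noisy = LogisticRegression().fit(X_noisy, y_noisy)

# Same probe, lower certainty: the probability contours have spread apart,
# while the 0.5 boundary stays in roughly the same place.
print(noisy.predict_proba(probe))
```

The second printed probability should sit closer to 0.5 than the first, mirroring the widening contours we just saw in the plot.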
[4:32]And wow, logistic regression actually performs better than our tree. In fact, it performs much better if we take a look at AUC. Now, I know we still need to explain what exactly AUC is, but I promise we'll get to that in a little bit. Let's also compare it to a random forest. Random forests are among the most powerful classifiers, yet logistic regression seems to perform better on this dataset. And the truly amazing thing is that logistic regression is probably the simplest classifier you can think of. It just draws a separating line between two classes; that is, in two dimensions. In higher-dimensional spaces, it constructs a separating hyperplane. It's a linear classifier that defines its decision boundary as a weighted sum of features. And don't worry if that sounds a little technical, I'll explain what it means in my next video. Also, because of the simplicity of a weighted sum, I'll show you just how easy it is to explain the models constructed by logistic regression.
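(If you want to reproduce a comparison like this in code, a sketch along these lines should work, assuming the attrition data lives in a CSV with an "Attrition" column holding "Yes"/"No" labels; the file name and column layout here are my assumptions, not a fixed part of the dataset.)

```python
# A sketch of the same ten-fold comparison with scikit-learn; the CSV file
# name and the "Attrition" target column are assumptions about the data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("attrition.csv")  # hypothetical path
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns="Attrition"))  # one-hot encode categoricals

models = {
    "tree": DecisionTreeClassifier(),
    # Scaling helps logistic regression converge on mixed-scale features.
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```

The exact numbers will differ a little from Orange's, since preprocessing and model defaults are not identical between the two tools.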

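(And for the curious, a one-line preview of that next video: the weighted sum behind logistic regression looks like this, with weights w_1 through w_n and an intercept b. The decision boundary from our contour plots is exactly the set of points where this sum is zero, that is, where the predicted probability is 0.5.)

```latex
P(\text{red} \mid x) = \sigma\left(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```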


