[0:00]If you feel confused... don't sweat it!!! StatQuest is here!!! StatQuest!!!
Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to cover another machine learning fundamental, the confusion matrix, and it's going to be clearly explained.
Imagine that we have this medical data. We've got some clinical measurements like chest pain, good blood circulation, blocked arteries, and weight. And we want to apply a machine learning method to them to predict whether or not someone will develop heart disease. To do this, we could use logistic regression, or K-Nearest Neighbors, or a random forest, or some other method. There are tons to choose from. How do we decide which one works best with our data?
We start by dividing the data into training and testing sets. NOTE: This would be an excellent opportunity to use Cross Validation. And if you're not familiar with that, well, check out the StatQuest. Then we train all of the methods we're interested in with the training data, and then test each method on the testing set. Now we need to summarize how each method performed on the testing data. One way to do this is by creating a confusion matrix for each method.
The rows in a confusion matrix correspond to what the machine learning algorithm predicted, and the columns correspond to the known truth. Since there are only two categories to choose from, "Has Heart Disease" or "Does Not Have Heart Disease", the top left corner contains True Positives. These are the patients that had heart disease that were correctly identified by the algorithm. The True Negatives are in the bottom right-hand corner. These are the patients that did not have heart disease that were correctly identified by the algorithm. The bottom left-hand corner contains the False Negatives. False Negatives are when a patient has heart disease, but the algorithm said they didn't. Lastly, the top right-hand corner contains the False Positives.
False Positives are patients that do not have heart disease, but the algorithm says they do.
For example, when we applied the Random Forest to the testing data, there were 142 True Positives, patients with heart disease that were correctly classified, and 110 True Negatives, patients without heart disease that were correctly classified. However, the algorithm misclassified 29 patients that did have heart disease by saying that they did not (False Negatives), and the algorithm misclassified 22 patients that did not have heart disease by saying that they did (False Positives). The numbers along the diagonal (the Green Boxes) tell us how many times the samples were correctly classified. The numbers not on the diagonal (the Red Boxes) are samples the algorithm messed up.
Now we can compare the Random Forest's Confusion Matrix to the Confusion Matrix we get when we use K-Nearest Neighbors. K-Nearest Neighbors was worse than the Random Forest at predicting patients with Heart Disease (107 vs 142), and worse at predicting patients without Heart Disease (79 vs 110), so if we had to choose between using the Random Forest and K-Nearest Neighbors, we would choose the Random Forest. BAM!!!
Lastly, we can apply Logistic Regression to the Testing Dataset and create a Confusion Matrix. These two Confusion Matrices are very similar and make it hard to choose which machine learning method is a better fit for this data. We'll talk about more sophisticated metrics, like Sensitivity, Specificity, ROC and AUC, that can help us make a decision in the next StatQuests.
Now that we have the basic confusion matrix figured out, let's look at a more complicated one. Here's a new data set. Now the question is, based on what people think of these movies, Jurassic Park 3, Run for Your Wife, Out Cold, spelled with a K, and Howard the Duck, can we use a machine learning method to predict their favorite movie?
If the only options for favorite movie were Troll 2, Gore Police or Cool As Ice, then the confusion matrix would have 3 rows and 3 columns. But just like before, the diagonal (the Green Boxes) is where the machine learning algorithm did the right thing, and everything else is where the algorithm messed up. In this case, the machine learning algorithm didn't do very well, but can you blame it? These are all terrible movies! BAM.
Ultimately, the size of the confusion matrix is determined by the number of things we want to predict. In the first example, we were only trying to predict two things, if someone had heart disease or if they didn't, and that gave us a confusion matrix with 2 rows and 2 columns. In the second example, we had three things to choose from, and a confusion matrix with 3 rows and 3 columns. If we had 4 things to choose from, we would get a confusion matrix with 4 rows and 4 columns, and if we had 40 things to choose from, we would get a confusion matrix with 40 rows and 40 columns. Double BAM!!!!
In summary, a Confusion Matrix tells you what your machine learning algorithm did right and what it did wrong.
Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, well, consider buying one or two of my original songs. All right, until next time, quest on.
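The layout described in the transcript (rows are what the algorithm predicted, columns are the known truth) can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the video: the `confusion_matrix` helper, the label strings, and the reconstructed label lists are assumptions made for the example. The Random Forest counts (142, 22, 29, 110) are the ones quoted in the transcript; the three-movie lists are made-up toy data used only to show the matrix growing to 3 rows and 3 columns.

```python
from collections import Counter


def confusion_matrix(predicted, actual, labels):
    """Count (predicted, actual) pairs.

    Rows follow the predicted label and columns follow the known truth,
    matching the layout described in the transcript.
    """
    counts = Counter(zip(predicted, actual))
    # Counter returns 0 for pairs that never occurred.
    return [[counts[(p, a)] for a in labels] for p in labels]


# Hypothetical label names for the heart disease example.
labels = ["has", "not"]

# Rebuild (predicted, actual) pairs that match the Random Forest counts
# quoted in the transcript.
pairs = (
    [("has", "has")] * 142    # True Positives
    + [("has", "not")] * 22   # False Positives
    + [("not", "has")] * 29   # False Negatives
    + [("not", "not")] * 110  # True Negatives
)
predicted = [p for p, _ in pairs]
actual = [a for _, a in pairs]

matrix = confusion_matrix(predicted, actual, labels)

# The diagonal holds the correctly classified samples.
correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)

# Made-up toy data (not from the video): three categories give a
# matrix with 3 rows and 3 columns.
movie_labels = ["Troll 2", "Gore Police", "Cool As Ice"]
toy_predicted = ["Troll 2", "Gore Police", "Cool As Ice", "Troll 2"]
toy_actual = ["Troll 2", "Cool As Ice", "Cool As Ice", "Gore Police"]
movie_matrix = confusion_matrix(toy_predicted, toy_actual, movie_labels)

if __name__ == "__main__":
    for row in matrix:
        print(row)
    print(f"correct: {correct} of {total}")
    for row in movie_matrix:
        print(row)
```

Library implementations may use the opposite orientation. For instance, scikit-learn's `confusion_matrix` puts the known truth on the rows and the predictions on the columns, so it is worth checking which convention a tool follows before reading off False Positives and False Negatives.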

Machine Learning Fundamentals: The Confusion Matrix
StatQuest with Josh Starmer
7m 6s · 898 words · ~5 min read
Transcript source
YouTube auto captions
This transcript was extracted from YouTube's auto-generated caption track.


