Machine Learning Fundamentals: The Confusion Matrix

StatQuest with Josh Starmer

7m 6s, 898 words, ~5 min read

YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:00] If you feel confused, don't sweat it!!! StatQuest is here!!! StatQuest!!!

Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to cover another machine learning fundamental, the confusion matrix, and it's going to be clearly explained.

Imagine that we have this medical data. We've got some clinical measurements like chest pain, good blood circulation, blocked arteries, and weight. And we want to apply a machine learning method to them to predict whether or not someone will develop heart disease. To do this, we could use logistic regression, or K-Nearest Neighbors, or a random forest, or some other method. There are tons to choose from. How do we decide which one works best with our data?

We start by dividing the data into training and testing sets. NOTE: This would be an excellent opportunity to use Cross Validation. And if you're not familiar with that, well, check out the StatQuest. Then we train all of the methods we're interested in with the training data, and then test each method on the testing set.

Now we need to summarize how each method performed on the testing data. One way to do this is by creating a confusion matrix for each method. The rows in a confusion matrix correspond to what the machine learning algorithm predicted, and the columns correspond to the known truth.

Since there are only two categories to choose from, "Has Heart Disease" or "Does Not Have Heart Disease", the top left corner contains True Positives. These are the patients that had heart disease that were correctly identified by the algorithm. The True Negatives are in the bottom right-hand corner. These are the patients that did not have heart disease that were correctly identified by the algorithm. The bottom left-hand corner contains the False Negatives. False Negatives are when a patient has heart disease, but the algorithm said they didn't. Lastly, the top right-hand corner contains the False Positives.
False Positives are patients that do not have heart disease, but the algorithm says they do.

For example, when we applied the Random Forest to the testing data, there were 142 True Positives, patients with heart disease that were correctly classified, and 110 True Negatives, patients without heart disease that were correctly classified. However, the algorithm misclassified 29 patients that did have heart disease by saying that they did not (False Negatives), and the algorithm misclassified 22 patients that did not have heart disease by saying that they did (False Positives). The numbers along the diagonal (the Green Boxes) tell us how many times the samples were correctly classified. The numbers not on the diagonal (the Red Boxes) are samples the algorithm messed up.

Now we can compare the Random Forest's Confusion Matrix to the Confusion Matrix we get when we use K-Nearest Neighbors. K-Nearest Neighbors was worse than the Random Forest at predicting patients with Heart Disease (107 vs 142), and worse at predicting patients without Heart Disease (79 vs 110), so if we had to choose between using the Random Forest and K-Nearest Neighbors, we would choose the Random Forest. BAM!!!

Lastly, we can apply Logistic Regression to the Testing Dataset and create a Confusion Matrix. These two Confusion Matrices are very similar and make it hard to choose which machine learning method is a better fit for this data. We'll talk about more sophisticated metrics, like Sensitivity, Specificity, ROC and AUC, that can help us make a decision in the next StatQuests.

Now that we have the basic confusion matrix figured out, let's look at a more complicated one. Here's a new data set. Now the question is, based on what people think of these movies, Jurassic Park 3, Run for Your Wife, Out Cold, spelled with a K, and Howard the Duck, can we use a machine learning method to predict their favorite movie?
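Stepping back to the heart disease example for a moment, here is a minimal Python sketch of the Random Forest confusion matrix using the four counts quoted in the transcript. The variable names are illustrative and not from the video; the layout follows the video's convention of predictions on the rows and the known truth on the columns.

```python
# Counts quoted in the transcript for the Random Forest on the testing data.
true_positives = 142   # had heart disease, predicted heart disease
false_positives = 22   # no heart disease, predicted heart disease
false_negatives = 29   # had heart disease, predicted no heart disease
true_negatives = 110   # no heart disease, predicted no heart disease

# Rows are what the algorithm predicted, columns are the known truth,
# matching the layout described in the video.
confusion = [
    [true_positives, false_positives],   # predicted: has heart disease
    [false_negatives, true_negatives],   # predicted: does not have heart disease
]

# The diagonal holds the correct classifications; everything else is an error.
correct = sum(confusion[i][i] for i in range(2))
total = sum(sum(row) for row in confusion)
errors = total - correct

print(f"correct: {correct}, errors: {errors}, total: {total}")
```

The diagonal sum is the count of patients in the Green Boxes, and the remainder is the count in the Red Boxes.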
If the only options for favorite movie were Troll 2, Gore Police or Cool As Ice, then the confusion matrix would have 3 rows and 3 columns. But just like before, the diagonal (the Green Boxes) are where the machine learning algorithm did the right thing, and everything else is where the algorithm messed up. In this case, the machine learning algorithm didn't do very well, but can you blame it? These are all terrible movies! BAM.

Ultimately, the size of the confusion matrix is determined by the number of things we want to predict. In the first example, we were only trying to predict two things, if someone had heart disease or if they didn't, and that gave us a confusion matrix with 2 rows and 2 columns. In the second example, we had three things to choose from, and a confusion matrix with 3 rows and 3 columns. If we had 4 things to choose from, we get a confusion matrix with 4 rows and 4 columns, and if we had 40 things to choose from, we get a confusion matrix with 40 rows and 40 columns. Double BAM!!!!

In summary, a Confusion Matrix tells you what your machine learning algorithm did right, and what it did wrong.

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, well, consider buying one or two of my original songs. All right, until next time, quest on.
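The point about matrix size can be sketched in a few lines of Python. This is only an illustration: `build_confusion_matrix` is a hypothetical helper, not something from the video, and the toy predictions below are invented, since the transcript does not give the movie example's counts.

```python
def build_confusion_matrix(predicted, truth, labels):
    """Count (predicted, truth) pairs; rows are predictions, columns are truth."""
    index = {label: i for i, label in enumerate(labels)}
    size = len(labels)
    matrix = [[0] * size for _ in range(size)]
    for p, t in zip(predicted, truth):
        matrix[index[p]][index[t]] += 1
    return matrix


labels = ["Troll 2", "Gore Police", "Cool As Ice"]

# Invented toy data for illustration only; these are not the video's counts.
truth = ["Troll 2", "Troll 2", "Gore Police",
         "Gore Police", "Cool As Ice", "Cool As Ice"]
predicted = ["Troll 2", "Gore Police", "Gore Police",
             "Cool As Ice", "Cool As Ice", "Troll 2"]

matrix = build_confusion_matrix(predicted, truth, labels)

# As in the 2 x 2 case, the diagonal counts the correct predictions.
correct = sum(matrix[i][i] for i in range(len(labels)))

for label, row in zip(labels, matrix):
    print(f"{label:>12}: {row}")
print(f"{len(matrix)} x {len(matrix[0])} matrix, {correct} of {len(truth)} correct")
```

Passing 40 labels to the same helper would give a matrix with 40 rows and 40 columns. If you use a library instead, check its orientation first: scikit-learn's `confusion_matrix`, for example, puts the known truth on the rows and the predictions on the columns, the transpose of the layout described in the video.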
