[0:00]In the next 17 minutes, I will give you an overview of the most important machine learning algorithms to help you decide which one is right for your problem. My name is Tim and I have been a data scientist for over 10 years and taught all of these algorithms to hundreds of students in real life machine learning boot camps. There is a simple strategy for picking the right algorithm for your problem. In 17 minutes you will know how to pick the right one for any problem and get a basic intuition of each algorithm and how they relate to each other. My goal is to give as many of you as possible an intuitive understanding of the major machine learning algorithms to make you stop feeling overwhelmed. According to Wikipedia, machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions. Much of the recent advancements in AI are driven by neural networks, which I hope to give you an intuitive understanding of by the end of this video. Let's divide machine learning into its sub fields. Generally machine learning is divided into two areas, supervised learning and unsupervised learning. Supervised learning is where we have a data set with any number of independent variables, also called features or input variables, and a dependent variable, also called target or output variable that is supposed to be predicted. We have a so-called training data set, where we know the true values for the output variable, also called labels, that we can train our algorithm on, to later predict the output variable for new unknown data. Examples could be predicting the price of a house, the output variable, based on features of the house, say square footage, location, year of construction, etcetera. Categorizing an object as a cat or a dog, the output variable or label, based on features of the object, say height, weight, size of the ears, color of the eyes, etcetera. Unsupervised learning is basically any learning problem that is not supervised, so where no truth about the data is known. So where a supervised algorithm would be like showing a little kid what a typical cat looks like and what a typical dog looks like, and then giving it a new picture and asking it what animal it sees. An unsupervised algorithm would be giving a kid with no idea of what cats and dogs are, a pile of pictures of animals and asking it to group by similarity without any further instructions. Examples of unsupervised problems might be to sort all of your emails into three unspecified categories, which you can then later inspect and name as you wish. The algorithm will decide on its own, how it will create those categories, also called clusters. Let's start with supervised learning, arguably the bigger and more important branch of machine learning. There are broadly two sub categories. In regression, we want to predict a continuous numeric target variable for a given input variable. Using the example from before, it could be predicting the price of a house, given any number of features of a house, and determining their relationship to the final price of the house. We might, for example, find out that square footage is directly proportional to the price, linear dependence, but that the age of the house has no influence on the price of the house. In classification, we try to assign a discrete categorical label, also called a class, to a data point. For example, we may want to assign the label spam or no spam to an email based on its content, sender, and so on. But we could also have more than two classes, for example, junk, primary, social promotions and updates, as Gmail does by default. Now let's dive into the actual algorithms, starting with the mother of all machine learning algorithms, linear regression. In general, supervised learning algorithms try to determine the relationship between two variables. We try to find the function that maps one to the other. Linear regression, in its simplest form, is trying to determine a linear relationship between two variables, namely the input and the output. We want to fit a linear equation to the data by minimizing the sum of squares of the distances between data points and the regression line. This simply minimizes the average distance of the real data to our predictive model, in this case, the regression line, and should therefore minimize prediction errors for new data points. A simple example of a linear relationship might be the height and shoe size of a person, where the regression fit might tell us that for every one unit of shoe size increase, the person will be on average 2 inches taller. You can make your model more complex and fit multi-dimensional data to an output variable. In the example of the shoe size, you might for example, want to include the gender, age, and ethnicity of the person to get an even better model. Many of the very fancy machine learning algorithms, including neural networks, are just extensions of this very simple idea, as I will show you later in the video. Logistic regression is a variant of linear regression and probably the most basic classification algorithm. Instead of fitting a line to two numerical variables, with a presumably linear relationship, you now try to predict a categorical output variable using categorical or numerical input variables. Let's look at an example. We now want to predict one of two classes, for example, the gender of a person based on height and uh weight. So a linear regression wouldn't make much sense anymore. Instead of fitting a line to the data, we now fit a so-called sigmoid function to the data, which looks like this.
[4:39]The equation will now not tell us about a linear relationship between two variables, but will now conveniently tell us the probability of a data point falling into a certain class, given the value of the input variable. So, for example, the likelihood of an adult person with a height of 180 centimeters being a man would be 80%. This is completely made up, of course. The K nearest neighbors algorithm or KNN, is a very simple and intuitive algorithm that can be used for both regression and classification. It is a so-called non-parametric algorithm. The name means that we don't try to fit any equations and thus find any parameters of a model. So no true model fitting is necessary. The idea of KNN is simply that for any given new data point, we will predict the target to be the average of its K nearest neighbors. While this might seem very simple, this is actually a very powerful predictive algorithm. Especially when relationships are more complicated than a simple linear relationship. In a classification example, we might say that the gender of a person will be the same as the majority of the five people closest in weight and height to the person in question. In a regression example, we might say that the weight of a person is the average weight of the three people closest in height and of chest circumference. This makes a ton of intuitive sense. You might realize that the number three seems a bit arbitrary and it is. K is called a hyper parameter of the algorithm and choosing the right K is an art. Choosing a very small number of K, say one or two, will lead to your model predicting your training data set very well, but not generalizing well to unseen data. This is called overfitting. Choosing a very large number, say 1,000, will lead to a worse fit overall, this is called under fitting. The best number is somewhere in between and depends a lot on the problem at hand. Methods for finding the right hyper parameters include cross validation but are beyond the scope of this video. Support vector machine is a supervised machine learning algorithm, originally designed for classification tasks, but it can also be used for regression tasks. The core concept of the algorithm is to draw a decision boundary between data points that separates data points of the training set as well as possible. As the name suggests, a new unseen data point will be classified according to where it falls with respect to the decision boundary. Let's take this arbitrary example of trying to classify animals by their weight and the length of their nose. In this simple case of trying to classify cats and elephants, the decision boundary is a straight line. The SGM algorithm tries to find the line that separates the classes with the largest margin possible. That is maximizing the space between the different classes. This makes the decision boundary generalize well and less sensitive to noise and outliers in the training data. The so-called support vectors are the data points that sit on the edge of the margin. Knowing the support vectors is enough to classify new data points, which often makes the algorithm very memory efficient. One of the benefits of SVM is that it is very powerful in high dimensions. That is if the number of features is large compared to the size of the data. In those higher dimensional cases, the decision boundary is called a hyperplane. Another feature that makes SVMs extremely powerful is the use of so-called kernel functions, which allow for the identification of highly complex non-linear decision boundaries. Kernel functions are an implicit way to turn your original features into new, more complex features using the so-called kernel trick, which is beyond the scope of this video. This allows for efficient creation of non-linear decision boundaries by creating complex new features such as weight divided by height squared, also called the BMI. This is called implicit feature engineering. Neural networks take the idea of implicit feature engineering to the next level, as I will explain later. Possible kernel functions for SVMs are the linear, the polynomial, the RBF, and the sigmoid kernel. Another fairly simple classifier is the Naive Bayes Classifier that gets its name from Bayes Theorem, which looks like this. I believe it's easiest to understand Naive Bayes with an example use case that it is often used for spam filters. We can train our algorithm with a number of spam and non-spam emails and count the occurrences of different words in each class, and thereby calculate the probability of certain words appearing in spam emails and non-spam emails. We can then quickly classify a new email based on the words it contains by using Bayes Theorem. We simply multiply the different probabilities of all words in the email together. This algorithm makes the false assumption that the probabilities of the different words appearing are independent of each other, which is why we call this classifier naive. This makes it very computationally efficient, while still being a good approximation for many use cases such as spam classification and other text-based classification tasks. Decision trees are the basis of a number of more complex supervised learning algorithms. In its simplest form, a decision tree looks somewhat like this. The decision tree is basically a series of yes-no questions that allow us to partition a data set in several dimensions. Here is an example. Decision tree for classifying people into high and low risk patients for heart attacks. The goal of the decision tree algorithm is to create so-called leaf nodes at the bottom of the tree, that are as pure as possible, meaning instead of randomly splitting the data, we try to find splits that lead to the resulting groups, or leaves, to be as pure as possible, which is to say that as few data points as possible are misclassified. While this might seem like a very basic and simple algorithm, which it is, we can turn it into a very powerful algorithm by combining many decision trees together. Combining many simple models to a more powerful complex model is called an ensemble algorithm. One form of assembling is bagging, where we train multiple models on different subsets of the training data using a method called bootstrapping. A famous version of this idea is called a random forest, where many decision trees vote on the classification of your data by majority vote of the different trees in the random forest. Random forests are very powerful estimators that can be used both for classification and regression. The randomness comes from randomly excluding features for different trees in the forest, which prevents over fitting and makes it much more robust because it removes correlation between the trees. Another type of ensemble method is called boosting, where instead of running many decision trees in parallel like for random forests, we train models in sequence, where each model focuses on fixing the errors made by the previous model. Boosted trees often get to higher accuracies than random forests, but are also more prone to over fitting.
[10:14]Its sequential nature makes it slower to train than random forests.
[10:21]Famous examples of boosted trees are AdaBoost, gradient boosting and XG Boost, the details of which are beyond the scope of this video. Now let's get to the reigning king of AI, neural networks. To understand neural networks, let's look at logistic regression again. Say we have a number of features and are trying to predict a target class. The features might be pixel intensities of a digital image and the target might be classifying the image as one of the digits from 0 to 9. Now, for this particular case, you might see why this might be difficult to do with logistic regression, because say the number one doesn't look the same when different people write it, and even if the same person writes it several times, it will look slightly different each time, and it won't be the exact same pixels illuminated for every instance of the number one. All of the instances of the number one have commonalities. However, they all have a dominating vertical line and usually no crossing lines as other digits might have, and usually there are no circular shapes in the number one, as there would be in the number eight or or nine. However, the computer doesn't initially know about these more complex features, but only the pixel intensities. We could manually engineer these features by measuring some of these things and explicitly adding them as new features. But artificial neural networks, similarly to using a kernel function with a support vector machine, are designed to implicitly and automatically design these features for us, without any guidance from humans. We do this by adding additional layers of unknown variables between the input and output variables. In its simplest form, this is called a single layer perceptron, which is basically just a multi-feature regression task. Now, if we add a hidden layer, the hidden variables in the middle layer represent some hidden unknown features, and instead of predicting the target variable directly, we try to predict these hidden features with our input features, and then try to predict the target variables with our new hidden features. In our specific example, we might be able to say that every time several pixels are illuminated next to each other, they represent a horizontal line, which can be a new feature to try and predict the digit in question, even though we never explicitly defined a feature called horizontal line. This is a much simplified view of what is actually going on, but hopefully this gets the point across. We don't usually know what the hidden features represent, we just train the neural network to predict the final target as well as possible. The hidden features we can design this way are limited in the case of the single hidden layer. But what if we add a layer and have the hidden layer predict another hidden layer? What if we now had even more layers? This is called deep learning and can result in very complex hidden features, so that might represent all kinds of complex information in the pictures, like the fact that there is a face in the picture. However, we will usually not know what the hidden features mean. We just know that they result in good predictions. All we have talked about so far is supervised learning, where we wanted to predict a specific target variable using some input variables. However, sometimes we don't have anything specific to predict and just want to find some underlying structure in our data, that's where unsupervised learning comes in. A very common unsupervised problem is clustering. It's easy to confuse clustering with classification, but they are conceptually very different. Classification is when we know the classes we want to predict and have training data with true labels available, shown as colors here, like pictures of cats and dogs. Clustering is when we don't have any labels and want to find unknown clusters just by looking at the overall structure of the data and trying to find potential clusters in the data. For example, we might look at a two dimensional data set that looks like this. Any human will probably easily see three clusters here. But it's not always as straightforward as your data might also look like this. We don't know how many clusters there are because the problem is unsupervised. The most famous clustering algorithm is called K-means clustering. Just like for KNN, K is a hyper parameter and stands for the number of clusters you are looking for. Finding the right number of clusters again is an art and has a lot to do with your specific problem and some trial and error and domain knowledge might be required. This is beyond the scope of this video. K-means is very simple. You start by randomly selecting centers for your K clusters and assigning all data points to the cluster center closest to them. The clusters here are shown in blue and green, you then recalculate the cluster centers based on the data points now assigned to them. You can see the centers moving closer to the actual clusters. You then assign the data points again to the new cluster centers, followed by recalculating the cluster centers. You repeat this process until the centers of the clusters have stabilized. While K-means is the most famous and most common clustering algorithm, other algorithms exist, including some where you don't need to specify the number of clusters, like hierarchical clustering and DB scan, which can find clusters of arbitrary shape, but I won't discuss them here. The last type of algorithm I will leave you with is dimensionality reduction. The idea of dimensionality reduction is to reduce the number of features or dimensions of your data set, keeping as much information as possible. Usually this group of algorithms does this by finding correlations between existing features and removing potentially redundant dimensions without losing much information. For example, do you really need a picture in high resolution to recognize the airplane in the picture or can you reduce the number of pixels in the image? As such, dimensionality reduction will give you information about the relationships within your existing features and it can also be used as a pre-processing step in your supervised learning algorithm to reduce the number of features in your data set and make the algorithm more efficient and robust. An example algorithm is principal component analysis or PCA. Let's say we are trying to predict types of fish based on several features like length, height, color, and number of teeth. When looking at the correlations of the different features, we might find that height and length are strongly correlated and including both won't help the algorithm much and might in fact hurt it by introducing noise. We can simply include a shape feature that is a combination of the two. This is actually extremely common in large data sets and allows us to reduce the number of features dramatically and still get good results. PCA does this by finding the directions in which most variance in the data set is retained. In this example, the direction of most variance is a diagonal, this is called the first principal component or PC and can become our new shape feature. The second principal component is orthogonal to the first and only explains a small fraction of the variance of the data set and can thus be excluded from our data set in this case. In large data sets, we can do this for all features and rank them by explained variance and exclude any principal components that don't contribute much to the variance and thus wouldn't help much in our ML model. This was all common machine learning algorithms explained. If you are overwhelmed and don't know which algorithm you need, here is a great cheat sheet by psychit learn that will help you decide which algorithm is right for which type of problem. If you want a road map on how to learn machine learning, check out my video on that.



