I recommend reading the Pocket Guide to Linear Regression before proceeding here.
Classification is a class of supervised learning tasks in which we want to assign a target variable to a “class”, or “category”. This covers many types of tasks. For example, in Linear Regression we looked at the relationship between “age” and “height”; assigning the data points to categories such as “sex” or “nationality” instead is called labeling.
[Graph]
Do you notice how some classes (like age) are easier to separate than others? If you can place a straight line directly between the classes, we call the dataset linearly separable.
Here is a simpler example: imagine we are making a drink. Depending on how many spoons of salt and sugar we put in, the drink will be either salty or sweet. Try to “fit” the line so that the data points land in the correct area:
The line that separates our classes is intuitively called the “Decision Boundary”.
To classify the points we could decide that everything above the line is classified as $1$ and everything below as $0$.
$1$ means our glass becomes salty, $0$ means our glass becomes sweet.

The output of our linear regression model is continuous. To get classification outputs, we’ll wrap it in a “Step” function that outputs $1$ if $z > \theta$ and $0$ otherwise, where $\theta$ is a threshold we’ll set or learn to minimize the number of misclassifications. The threshold can be anything, it doesn’t have to be zero. You can decide how you want to deal with points that fall directly onto the threshold; some people choose a random assignment there, since the uncertainty is maximal.
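Here is a small sketch of the step function in code (the function name and the default threshold are just illustrative choices):

```python
def step(z, theta=0.0):
    """Hard threshold: output 1 ("salty") if z > theta, else 0 ("sweet")."""
    return 1 if z > theta else 0

# z would be the continuous output of our linear model,
# e.g. z = w * sugar_spoons + b
print(step(0.7))   # 1 -> salty
print(step(-0.3))  # 0 -> sweet
```

Note that with a strict `>`, a point sitting exactly on the threshold lands in the $0$ branch; as discussed above, how you break that tie is up to you.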
An issue with the step function is that we can’t use local optimization strategies (like Gradient Descent, which you’ll learn about in another guide), because the gradient is zero everywhere except at the threshold, where it is undefined.
Additionally, a hard threshold is very restrictive: there is no uncertainty. Maybe not everything is so black and white, and we want a probabilistic output, or a confidence score if you will.
To achieve this we swap the “Step” function for a special exponential function called the “Sigmoid”, which ensures the output stays in the interval $0$ to $1$ but is continuous. An exponential function is simply a function of the form $a^b$, where $a$ is raised to the power of $b$, meaning we multiply $a$ by itself $b$ times. Usually, as in our case, $a$ is Euler’s number $e \approx 2.718$. So for example $e^2 = e \cdot e \approx 7.39$.
We now let $z$ be the output of our model, i.e. the sigmoid becomes

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
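A minimal sketch of the sigmoid in code, using Python’s standard `math` module:

```python
import math

def sigmoid(z):
    """Squash any real number z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # exactly 0.5
print(sigmoid(10.0))   # very close to 1
print(sigmoid(-10.0))  # very close to 0
```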
Do you notice something as $z$ grows larger? If you make $z$ big enough, the output gets arbitrarily close to $1$; make it negative enough, and it gets arbitrarily close to $0$.
Try using the logistic regression model below to correctly classify as many points as possible:
[Interactive graph: choose a label for each point]
As with Linear Regression we’ll want to quantify how well the different models fit, and we can do so again with least squares using the following loss function:

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
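As a quick sketch, this loss could be computed like so (the function name is illustrative; `y_true` are the labels, `y_pred` the model outputs):

```python
def squared_error(y_true, y_pred):
    """Mean squared error, averaged over all data points."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Small errors on each point give a small average loss:
print(squared_error([1, 0, 1], [0.9, 0.2, 0.8]))  # 0.03
```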
Next we’ll see a longer equation than any so far. I promise to explain it, and that it will be easy even if you are seeing these terms for the first time. We will define the error in this case with the Log Loss:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$
To understand the Log Loss, all you have to know is that $\log(x)$ is $0$ if the input is $1$, and goes to negative infinity the closer the input gets to zero. We want the loss to be positive (so we can minimize it with gradient descent), therefore we negate the function. Here is a graph:
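Here is a sketch of the Log Loss in code. The small clipping value `eps` is an assumption on my part, added so we never evaluate $\log(0)$:

```python
import math

def log_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy (Log Loss), averaged over all data points.
    y_true: labels in {0, 1}; y_pred: sigmoid outputs in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss...
print(log_loss([1, 0], [0.9, 0.1]))
# ...confident but wrong ones a large loss.
print(log_loss([1, 0], [0.1, 0.9]))
```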
Because of redundancy, the Sigmoid function can classify two classes implicitly: if there are only two classes A and B, and each point has to be in exactly one class, then you know that if a point isn’t in class A (probability $\hat{y}$), it has to be in class B (probability $1 - \hat{y}$). Explicitly, we can also classify two (or more) classes by normalizing the scores so they sum to one.
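That normalization idea is commonly written as the softmax function. Here is a small sketch (the score values are made up):

```python
import math

def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # sums to 1 (up to floating point)
```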
If we have more than two labels, i.e. multiple classes, the problem can take two forms: (1) every data point belongs to exactly one class, or (2) a data point may belong to several classes at once.
For problem one (1), imagine expanding our age categories from child and adult to include retirees; then a data point should be labelled with exactly one of the three classes.
For problem two (2), imagine there were overlap between the categories. We would be a little more liberal and label a person with every category that applies.
We can solve the multiclass problem simply by having 3 Logistic Regression models, one per class, each predicting “this class” versus “the rest”. It’s straightforward and a perfectly fine way to do it.
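Here is a sketch of that one-model-per-class approach (often called “one-vs-rest”), with made-up 1-D age data and plain gradient descent; all names, hyperparameters, and the age cutoffs are illustrative:

```python
import math

def sigmoid(z):
    z = max(min(z, 60.0), -60.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_binary(xs, ys, lr=0.01, epochs=2000):
    """Fit one logistic regression (w, b) with plain stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x  # gradient of the Log Loss w.r.t. w
            b -= lr * (p - y)      # ...and w.r.t. b
    return w, b

def train_one_vs_rest(xs, labels, classes):
    """One binary model per class: target 1 for 'this class', 0 for the rest."""
    return {c: train_binary(xs, [1 if l == c else 0 for l in labels])
            for c in classes}

def predict(models, x):
    """Pick the class whose binary model is most confident."""
    return max(models, key=lambda c: sigmoid(models[c][0] * x + models[c][1]))

# Made-up data: age -> child / adult / retiree
ages = [5, 10, 30, 45, 70, 80]
labels = ["child", "child", "adult", "adult", "retiree", "retiree"]
models = train_one_vs_rest(ages, labels, ["child", "adult", "retiree"])
print(predict(models, 8))
print(predict(models, 75))
```

At prediction time we simply take the class whose model outputs the highest probability, which resolves ties between the three independent models.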