I recommend reading the Pocket Guide to Linear Regression before proceeding here.

Classification

Classification is a supervised learning task in which we want to assign a target variable $y$ to a “class”, or “category”. Many kinds of problems take this form. For example, in Linear Regression we looked at the relationship between “age” and “height”; assigning those data points to categories such as “Sex” or “Nationality” is called labeling.

GRAPH

Do you notice how some classes are easier to separate than others (like age)? If you can place a line directly between the classes, we call the dataset linearly separable.

Here is a simpler example: imagine we are making a drink, and depending on how many spoons of salt and sugar we put in, the drink will be either salty or sweet. Try to “fit” the line such that the data points are in the correct area:

The line that separates our classes is intuitively called the “Decision Boundary”.

Making Decisions

To classify the points we could decide that everything above the line is classified as $1$ and everything below as $0$.

$0$ means our glass becomes salty. $1$ means our glass becomes sweet.

The output of our linear regression model is continuous, so to get classification outputs we’ll wrap it in a “Step” function that outputs $1$ if $y > t$ and $0$ otherwise, where $t$ is a threshold we set or learn to minimize the number of misclassifications. The threshold can be anything, it doesn’t have to be zero. You can also decide how to deal with points that fall directly onto the threshold; some people choose a random assignment there, since such a point is maximally uncertain.
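In code, a minimal sketch of this could look as follows (the inputs and the threshold are made up for illustration):

```python
import numpy as np

def linear_model(x, m, b):
    # The familiar linear regression model: y = m * x + b
    return m * x + b

def step(y, t=0.0):
    # 1 if the model output is above the threshold t, 0 otherwise.
    # Points exactly on the threshold get 0 here; you could also assign
    # them at random, as mentioned above.
    return np.where(y > t, 1, 0)

# Made-up inputs, e.g. "spoons of sugar minus spoons of salt".
x = np.array([-2.0, -0.5, 0.3, 1.5])
print(step(linear_model(x, m=1.0, b=0.0)))  # [0 0 1 1]
```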

An issue with the step function is that we can’t use local optimization strategies (like Gradient Descent, which you’ll learn about in another guide), because its gradient is zero everywhere except at the threshold, where it is undefined.

Additionally, a hard threshold is very restrictive: there is no notion of uncertainty. Maybe not everything is so black and white, and we want a probabilistic output, or a confidence score if you will.

The Logistic Sigmoid Function

To achieve this we swap the “Step” function for a special exponential function called the “Sigmoid” to ensure the output stays in the interval $0$ to $1$ but is continuous. An exponential function is simply a function of the form $a^b$, where $a$ is raised to the power of $b$, meaning we multiply $a$ by itself $b$ times. Usually, as in our case, $a$ is Euler’s number $e$. So for example $2^3 = 2 \cdot 2 \cdot 2 = 8$.

$\sigma(x) = \frac{1}{1 + e^{-x}}$

We now plug the output of our linear model in as $x$, i.e. the sigmoid becomes

$\sigma(mx + b) = \frac{1}{1 + e^{-(mx + b)}}$

Do you notice something as $m$ grows larger? If you make $m$ big enough, $\sigma(mx + b) \to \text{step}(mx + b)$.
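Here is a small sketch of that (the values of $m$ and the inputs are arbitrary, just to show the trend):

```python
import numpy as np

def sigmoid(z):
    # The logistic sigmoid: squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-1.0, -0.1, 0.1, 1.0])
b = 0.0
for m in [1, 10, 100]:
    print(m, np.round(sigmoid(m * x + b), 3))

# m=1   -> [0.269 0.475 0.525 0.731]
# m=10  -> [0.    0.269 0.731 1.   ]
# m=100 -> [0.    0.    1.    1.   ]
# The larger m gets, the closer the outputs are to the hard 0/1 step outputs.
```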

Try using the logistic regression model below to correctly classify as many points as possible:

Graph, let people choose the label.

Measuring Error

As with Linear Regression we’ll want to quantify how well the different models fit, and we can do so again with least squares, using the following loss function:

$L(m, b) = \frac{1}{n}\sum_{i}\left(\sigma(mx_{i} + b) - y_{i}\right)^2$

Finding the weights that reduce the output of the least squares function above is what we call Logistic “Regression”: it’s linear regression with a logistic function. However, we commonly use a different loss, because it finds the minimum faster and is a little more “gradient descent friendly”. We call it the $\log$ loss, and it is also often referred to as the Cross Entropy. The $\log$ is the inverse of the exponential function $a^b$ (for a fixed base $a$). An inverse of a function “f” undoes what “f” did. For example, if “f” maps 2 to 5, then its inverse “g” maps 5 back to 2.
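A minimal sketch of the least squares loss above, with made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_loss(m, b, x, y):
    # Mean squared difference between the sigmoid outputs and the 0/1 labels.
    return np.mean((sigmoid(m * x + b) - y) ** 2)

x = np.array([-2.0, -1.0, 1.0, 2.0])  # made-up inputs
y = np.array([0, 0, 1, 1])            # made-up 0/1 labels
print(squared_loss(m=1.0, b=0.0, x=x, y=y))   # a good fit gives a small loss
print(squared_loss(m=-1.0, b=0.0, x=x, y=y))  # a bad fit gives a larger loss
```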

Log Loss

Next we’ll see a longer equation than we have so far. I promise to explain it and that it will be easy, even if you are seeing these terms for the first time just now. We will define the error in this case as

$L(m, b) = -\frac{1}{n}\sum_{i}\left(y_{i}\log{\sigma(mx_{i} + b)} + (1 - y_{i})\log\left(1 - \sigma(mx_{i} + b)\right)\right)$
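And a matching sketch of the log loss (same made-up data as before):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(m, b, x, y):
    # Cross Entropy between the 0/1 labels and the predicted probabilities.
    p = sigmoid(m * x + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])
print(log_loss(m=1.0, b=0.0, x=x, y=y))   # correct, confident predictions -> small loss
print(log_loss(m=-1.0, b=0.0, x=x, y=y))  # wrong, confident predictions -> large loss
```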

To understand the Log Loss, all you have to know is that $\log$ is $0$ if the input is $1$, and $\log$ goes to negative infinity the closer its input gets to zero. We want the loss to be positive (so we can minimize it with gradient descent), therefore we negate the $\log$ function. Here is a graph:

Beyond Binary Classification

The sigmoid function can classify two classes implicitly because of redundancy: if there are only two classes A and B, and each point has to be in exactly one of them, then any point that isn’t class A ($1$) has to be class B ($0$). Explicitly, we can also classify two (or more) classes by normalizing the outputs.

From step to sign and sigmoid to softmax.
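Softmax is one way to do that normalization: exponentiate each class score and divide by the sum, so the outputs are positive and add up to $1$. A minimal sketch (the scores are arbitrary):

```python
import numpy as np

def softmax(scores):
    # Normalize a vector of class scores into probabilities that sum to 1.
    # Subtracting the max first is a standard trick for numerical stability.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, -1.0])  # one score per class
print(softmax(scores))               # roughly [0.705 0.259 0.035], sums to 1
```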

If we have more than two labels, i.e. multiple classes, the problem can take two forms:

  1. A datapoint can belong to exactly one of $n$ classes.
  2. A datapoint can belong to multiple classes at once.

For problem one (1), imagine expanding our age categories from child and adult to include retirees. Then a datapoint should be labelled:

  • 0 = Ages 0 - 17
  • 1 = Ages 18 - 64
  • 2 = Ages >= 65

There is no overlap, and no person can be both an adult and a child, or a child and a retiree.

For problem two (2), imagine there were overlap. We will be a little more liberal and label a person:

  • 0 if they are between the ages 0 and 18
  • 1 if they are between the ages 18 and 65
  • 2 if they are 65 or older

Now a person can belong to classes 0 and 1, or 1 and 2, simultaneously.

Multiple Logistic Regression Models

We can solve the multiclass problem simply by training 3 logistic regression models, one per class. It’s straightforward and a perfectly fine way to do it.
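Here is a minimal sketch of that idea (the class names and weights are made up; in practice the weights would be learned). Each model scores a point, and we either pick the most probable class (problem one) or keep every class above a threshold (problem two):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One logistic regression model (m, b) per class. Each model answers its own
# question: "how likely is it that this point belongs to my class?"
# The weights are hand-picked for illustration, not learned.
models = {"A": (-2.0, 1.0), "B": (2.0, -1.0), "C": (4.0, -10.0)}

def predict(x):
    probs = {name: sigmoid(m * x + b) for name, (m, b) in models.items()}
    one_of_n = max(probs, key=probs.get)                          # problem 1: pick the single best class
    multi_label = [name for name, p in probs.items() if p > 0.5]  # problem 2: keep every likely class
    return probs, one_of_n, multi_label

print(predict(0.0))  # class A is the clear winner
print(predict(3.0))  # B wins, but C is also above 0.5 -> overlapping labels
```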