Neural Networks

Neural Networks have started the recent AI revolution. Yet, they are conceptually old: the first mathematical model of a neuron dates back to the 1940s, and the perceptron (something like a single neuron, see the Logistic Regression lesson) to the late 1950s.

You’ll get by just fine, but I do recommend going through the Pocket Guide to Logistic Regression before proceeding. We will start with the basic building blocks and look at how a neural network computes outputs (forward pass) and then how it uses those outputs to improve itself (backpropagation).

Basic Building Blocks

A neural network consists of three basic building blocks: nodes (neurons), weights, and activation functions. A node receives inputs x and multiplies them with weights w. An activation function takes the node’s output y and “activates” if that output passes some rule.

Imagine you want to bake a cake and want to calculate how many grams of sugar y you put in the bowl in total. You have differently sized spoons x and need to keep track of how often you use them w.

Behold the mini neural network to help you:

y = w \cdot x

[Interactive: a single node with input x = 1, weight w = 1, and output y = 1.0]
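If you prefer code to pictures, here is a minimal sketch of this one-node network in Python; the numbers are just the example values from the figure:

```python
# A single node with one input and one weight, and no activation yet: y = w * x.
def neuron(x, w):
    return w * x

# One spoon holding 1 gram of sugar (x), used once (w) -> 1 gram in the bowl.
print(neuron(x=1.0, w=1.0))  # 1.0
```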

Activation

To keep it simple, we used a linear activation, which means that a(y) = y: what goes in comes out.

However, the magic of neural networks is in the non-linear activation functions.

Below are a Step function and a ReLU activation that you can choose from. The step function outputs 1 if the input is greater than 0 and 0 otherwise. The ReLU function outputs x if x > 0 and 0 otherwise.
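Here is a small sketch of both activations as plain Python functions, so you can check the rules against a few values yourself:

```python
def step(y):
    """Outputs 1 if the input is greater than 0, and 0 otherwise."""
    return 1.0 if y > 0 else 0.0

def relu(y):
    """Outputs y if y > 0, and 0 otherwise."""
    return y if y > 0 else 0.0

for y in (-2.0, 0.0, 0.5, 3.0):
    print(y, step(y), relu(y))
```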

Imagine the activation functions representing different scenarios:

We want to know if our cake is sweet.

Step

Assume we need at least 1 gram of sugar to make our cake sweet, and that all levels of sweetness taste the same to our tastebuds (a sweetness of 1). We can remove some sugar (negative w), which makes our cake less sweet up to the point where there is no sugar left and the sweetness of our cake equals 0.

ReLU

Let’s continue to assume we need at least 1 gram of sugar to make our cake sweet; however, we are more realistic now. 0 spoons of sugar still means 0 sweetness, but we assume a linear relationship from then on, i.e. every w spoons with x grams of sugar increase the sweetness by just as much.
We can still remove some sugar (negative w), but as before, all negative amounts count the same, because once there is no sugar left, trying to remove more will not change anything.

Use the graph below and switch between the activation functions to see how they influence our neural network.

[Interactive: the same node with a selectable activation; with x = 1 and w = 1, y = 1.0 and a(y) = 1.0]

Let’s spice things up by having two inputs. The equation of the function the neural network represents is now:

a(y) = a(w_1 \cdot x_1 + w_2 \cdot x_2)

Imagine x_1 is still sugar and x_2 is salt. Our neural network then calculates the perfect balance to determine if our cake will be sweet or not and how sweet (linear and ReLU).

[Interactive: two inputs x1 = 1 and x2 = 1 with weights w1 = 1 and w2 = 1, giving y = 2.0 and a(y) = 2.0]
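As a quick sketch, the two-input node is just a weighted sum passed through the activation (reusing the ReLU from the earlier snippet):

```python
def relu(y):
    return y if y > 0 else 0.0

def neuron2(x1, x2, w1, w2, activation=relu):
    # a(y) = a(w1 * x1 + w2 * x2)
    return activation(w1 * x1 + w2 * x2)

# Sugar (x1) and salt (x2), one spoon of each with weights of 1: a(y) = 2.0
print(neuron2(x1=1.0, x2=1.0, w1=1.0, w2=1.0))  # 2.0
```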

This can calculate some fancy functions. Yet, our neural network can become even fancier by adding a “hidden” layer. The math representing this neural network becomes a little longer, the logic stays the same though:

a(y) = a(w_{h1} \cdot h_1 + w_{h2} \cdot h_2)
where,
h_1 = a(w_{11} \cdot x_1 + w_{21} \cdot x_2), \quad h_2 = a(w_{12} \cdot x_1 + w_{22} \cdot x_2)

basically two copies of our earlier network.

[Interactive: inputs x1 = 1 and x2 = 1, hidden nodes h1 = 2.0 and h2 = 2.0, output y = 4.0 with a(y) = 4.0]
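The hidden-layer version is a direct translation of the equations above; here is a rough sketch with every input and weight set to 1, as in the figure:

```python
def relu(y):
    return y if y > 0 else 0.0

def forward(x1, x2, w11, w21, w12, w22, wh1, wh2, a=relu):
    h1 = a(w11 * x1 + w21 * x2)    # first hidden node
    h2 = a(w12 * x1 + w22 * x2)    # second hidden node
    return a(wh1 * h1 + wh2 * h2)  # output node

# With every input and weight equal to 1: h1 = h2 = 2.0 and the output is 4.0.
print(forward(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))  # 4.0
```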

If you know a little linear algebra and want some math, below I show you how to represent the network in matrix form.

Feel free to skip this part; none of the guide (except the follow-up math sections) depends on you understanding it. If you are supplementing your studies, I do highly recommend it, in the hope that it makes some of the often poorly presented theory more intuitive.

A little math

We can simplify the whole neural network layer as:

a\left( W^{\intercal} X + b \right)
where W is the matrix of weights: column j holds the weights feeding hidden neuron j and the rows correspond to the inputs in x (matching the w_{ij} subscripts above), while b is a vector of biases (zero in our running example).

The activation a() just means that the activation is applied element-wise to the output vector.

I cleaned up the graph to give you an overview and some intuition; refer to the same network above if you need more detail. Use your mouse to see which cells in the matrix influence which part of your neural network and vice versa.

[Interactive: the matrix view, with W a 2 by 2 matrix of ones and x = (1, 1); the hidden layer is (2.0, 2.0) and the output y = 4.0]
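If you want to see the matrix form run, here is a minimal NumPy sketch of the same layer, with b set to zero since our running example has no bias:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

W = np.array([[1.0, 1.0],   # row 1: weights leaving input x1
              [1.0, 1.0]])  # row 2: weights leaving input x2
x = np.array([1.0, 1.0])
b = np.zeros(2)

h = relu(W.T @ x + b)         # hidden layer: [2.0, 2.0]
w_out = np.array([1.0, 1.0])  # weights of the output node
y = relu(w_out @ h)           # output: 4.0
print(h, y)
```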

I hope that by now you have an intuitive feeling for how the neurons are connected, and how neurons in earlier layers influence the output of the next layer through their weights w and activation functions.

This is the world of Neural Networks

[Interactive: a larger network with several layers, its weights W, and the node outputs]

But how do we optimize and update those weights (parameters) of our network? We simply break our steps down and retrace them as we would when trying to make the perfect cake:

  1. add w_1 sugar
  2. add w_2 salt
  3. add w_3 dough
  4. mix sugar + salt + dough
  5. bake(mix) until t ready
  6. taste

After going through the steps and cataloging the input to each node, we taste the cake, score how good it tastes (our loss function) and log it, then try again by adjusting the amounts (weights) w_1, w_2 and w_3 and logging the difference in taste. We repeat and iteratively improve our cake in this fashion.
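To make the analogy concrete, here is a toy "bake, taste, adjust" loop. This is not backpropagation yet, just trial and error; the "perfect" recipe and the random tweaks are made up for illustration:

```python
import random

def taste(w):
    # Loss: squared distance to a hypothetical perfect recipe (sugar, salt, dough).
    perfect = [2.0, 0.5, 5.0]
    return sum((wi - pi) ** 2 for wi, pi in zip(w, perfect))

w = [1.0, 1.0, 1.0]          # starting amounts w1, w2, w3
best = taste(w)
for _ in range(1000):
    candidate = [wi + random.uniform(-0.1, 0.1) for wi in w]
    loss = taste(candidate)
    if loss < best:          # keep the change only if the cake got better
        w, best = candidate, loss

print(w, best)               # w drifts toward the perfect recipe
```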

Backpropagation

This is more or less what we do with neural networks and backpropagation. To understand it intuitively, we will make use of a computation graph, which illustrates the concept well and makes it easy to grasp. I promise.

Computation Graph

Let’s start with a simple computation graph to understand the concept. Our computation graph is still a neural network, and it computes the function:

2 \cdot (3 \cdot x + x)

To simplify things we name our nodes:

a = 3 \cdot x
b = a + x
y = 2 \cdot b
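In code, the graph is just three lines, one per node (a small sketch):

```python
def forward(x):
    a = 3 * x   # node a
    b = a + x   # node b
    y = 2 * b   # node y
    return a, b, y

print(forward(1.0))  # (3.0, 4.0, 8.0)
```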

Some of you might notice that node b is redundant, after all 3 \cdot x + x = 4 \cdot x, and you are right. Play along though; we’ll try something harder below, and we’ll use this graph to build intuition for how backpropagation updates our network.

[Interactive: the computation graph x → a → b → y]

What is important is to find out how each node changes in relation to (mathematicians say with respect to) the input node’s value. Let’s look at node a.

a = 3 \cdot x
Play around with x. Can you tell that for each unit increase in x, the value of a increases by 3? It’s a little easier to see this relationship when plotted in a graph:

What also helped me is to simply write out a small table.

x a
1 3
2 6
3 9
4 12
5 15

I hope you are convinced that, indeed, with a 1 unit increase in x, a changes by 3.

If you haven’t had calculus before, this is the rate of change. It’s how sensitive a is to changes in x. We symbolize it like this:

\frac{\partial a}{\partial x}

Rise over run.
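You can also check this rate of change numerically: nudge x by a tiny amount and divide the rise by the run (a quick sketch):

```python
def a(x):
    return 3 * x

x, h = 2.0, 1e-6
print((a(x + h) - a(x)) / h)  # approximately 3.0
```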

How about b?

How does b change with x?

b = a + x

That’s a trick question, because b depends not only directly on x but also indirectly through a. So \frac{\partial b}{\partial x} is the sum of the change caused directly by x and the change that arrives indirectly through a.

The first one is easy: x changes with x one to one, so \frac{\partial x}{\partial x} = 1.

Chain rule

For the second one we need the chain rule: when a variable b depends on x indirectly through another variable a, we first find the rate of change of b with respect to a and then multiply it by the rate of change of a with respect to x.
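To make this concrete, here is the worked computation for our simple graph, using the node definitions above:

\frac{\partial b}{\partial x} = \frac{\partial b}{\partial a} \cdot \frac{\partial a}{\partial x} + \frac{\partial x}{\partial x} = 1 \cdot 3 + 1 = 4

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial b} \cdot \frac{\partial b}{\partial x} = 2 \cdot 4 = 8

which agrees with y = 2 \cdot (3 \cdot x + x) = 8 \cdot x: a one unit increase in x increases y by 8.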

Say our neural network computes the complicated looking function:

f(x) = y = 2 \cdot (\sin(x) \cdot x^2 + x)

To make the graph less cluttered we’ll name our nodes:

a = \sin(x)
b = x^2
c = a \cdot b
d = c + x
y = 2 \cdot d
[Interactive: the computation graph with nodes x, a, b, c, d, y]
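Here is a sketch of a forward pass and a backward pass through this graph, applying the chain rule one node at a time:

```python
import math

def forward_backward(x):
    # forward pass, one line per node
    a = math.sin(x)
    b = x ** 2
    c = a * b
    d = c + x
    y = 2 * d

    # backward pass: rate of change of y with respect to each node
    dy_dd = 2                  # y = 2 * d
    dy_dc = dy_dd * 1          # d = c + x
    dy_da = dy_dc * b          # c = a * b
    dy_db = dy_dc * a
    # x influences y through a, through b, and directly through d
    dy_dx = dy_da * math.cos(x) + dy_db * 2 * x + dy_dd * 1
    return y, dy_dx

print(forward_backward(1.0))
```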

The universal approximation theorem tells us that a sufficiently large neural network with a single hidden layer can approximate any continuous function to arbitrary precision. So, …

Why Deep Neural Networks?

There are functions we can compute with a small neural network of only a few hidden layers for which we would need a very large neural network with a single layer. Imagine the XOR (exclusive or) function of 3 inputs: it should output 1 if an odd number of the inputs equal 1 and 0 otherwise. https://www.coursera.org/learn/neural-networks-deep-learning/lecture/rz9xJ/why-deep-representations
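A small sketch of the idea: the XOR of several inputs can be built from a short chain of pairwise XORs, i.e. a few layers of the same tiny unit, whereas a single-hidden-layer network essentially has to enumerate the input patterns (needing far more hidden units):

```python
def xor(a, b):
    return (a + b) % 2        # pairwise XOR of 0/1 inputs

def xor3(a, b, c):
    return xor(xor(a, b), c)  # two "layers" of the same small unit

for bits in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    print(bits, xor3(*bits))
```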

Excursion: proof of the chain rule

Here is a visual

[Interactive: a visual illustration of the chain rule]