Neural networks started the recent AI revolution. Yet they are conceptually old: artificial-neuron models date back to the 1940s, and the perceptron (something like a single neuron; see the Pocket Guide to Logistic Regression) to the late 1950s.
You’ll get by just fine, but I do recommend going through the Pocket Guide to Logistic Regression before proceeding. We will start with the basic building blocks and look at how a neural network computes outputs (forward pass) and then how it uses those outputs to improve itself (backpropagation).
A neural network consists of three basic building blocks: nodes (neurons), activation functions, and weights. A node receives inputs and multiplies them by weights. An activation function takes the node's output and “activates” if that output passes some rule.
Imagine you want to bake a cake and want to calculate how many grams of sugar in total you put in the bowl. You have differently sized spoons and need to keep track of how often you use each of them.
Behold the mini neural network to help you:
To keep it simple, we used a linear activation, which means that $f(x) = x$: what goes in comes out.
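To make the bookkeeping concrete, here is a minimal Python sketch of that mini network. The spoon sizes act as weights, the spoon counts act as inputs, and the linear activation just passes the weighted sum through; the specific sizes and counts below are made up, not taken from the figure.

```python
def linear(z):
    """Linear (identity) activation: what goes in comes out."""
    return z

# Assumed example values: a 5 g teaspoon used 3 times, a 15 g tablespoon used 2 times.
spoon_sizes_g = [5.0, 15.0]   # weights
spoon_counts  = [3, 2]        # inputs

weighted_sum  = sum(w * x for w, x in zip(spoon_sizes_g, spoon_counts))
total_sugar_g = linear(weighted_sum)

print(total_sugar_g)  # 45.0 grams of sugar in the bowl
```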
However, the magic of neural networks is in the non-linear activation functions.
Below are a step function and a ReLU activation that you can choose from. The step function outputs $1$ if the input is greater than $0$ and $0$ otherwise. The ReLU function outputs $x$ if $x > 0$ and $0$ otherwise.
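If you prefer code to formulas, the two activations are a couple of lines each; this sketch just mirrors the definitions above.

```python
def step(z):
    """Step activation: 1 if the input is greater than 0, else 0."""
    return 1 if z > 0 else 0

def relu(z):
    """ReLU activation: pass positive inputs through, clip everything else to 0."""
    return z if z > 0 else 0

for z in (-2, 0, 3):
    print(z, step(z), relu(z))
# -2 -> 0, 0
#  0 -> 0, 0
#  3 -> 1, 3
```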
Imagine the activation functions representing different scenarios:
We want to know if our cake is sweet.
Assume we need at least some sugar to make our cake sweet, and that all levels of sweetness taste the same to our tastebuds. We can remove some sugar (negative $x$), which will make our cake less sweet, up to the point where there is no sugar in it anymore and the sweetness of our cake is equal to $0$.
Let’s continue to assume we need at least some sugar to make our cake sweet, but now we’re more realistic: no sugar still means no sweetness, but we assume a linear relationship from then on, i.e. every additional spoonful of sugar increases the sweetness by just as much.
We can still remove some sugar (negative $x$), but as before, all negative values count the same, because once there is no more sugar, trying to remove more will not change anything.
Use the graph below and switch between the activation functions to see how they influence our neural network.
Let’s spice things up by having two inputs. The equation of the function the neural network represents is now:

$$y = f(w_1 x_1 + w_2 x_2)$$
Imagine $x_1$ is still sugar and $x_2$ is salt. Our neural network then calculates the right balance to determine whether our cake will be sweet or not, and how sweet (linear and ReLU).
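Here is a small sketch of that two-input node, with invented weights (sugar adds sweetness, salt takes some away); the interactive figure may use different numbers.

```python
def relu(z):
    return z if z > 0 else 0

def neuron(x1, x2, w1, w2, activation):
    """A single node with two inputs: weighted sum, then activation."""
    return activation(w1 * x1 + w2 * x2)

# Assumed weights for illustration only.
w_sugar, w_salt = 1.0, -2.0

print(neuron(30, 5, w_sugar, w_salt, relu))   # 20 -> sweet, and this is how sweet
print(neuron(10, 8, w_sugar, w_salt, relu))   # 0  -> not sweet at all
```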
This can already compute some fancy functions. Yet our neural network can become even fancier by adding a “hidden” layer. The math representing this neural network becomes a little longer, but the logic stays the same:
basically two copies of our earlier network.
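As a hedged sketch of the forward pass with one hidden layer: two copies of the two-input node above feed a single output node. All the weights here are invented for illustration.

```python
def relu(z):
    return z if z > 0 else 0

def node(inputs, weights, activation=relu):
    """Weighted sum of the inputs, followed by the activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)))

x = [30, 5]                  # sugar and salt in grams (made-up values)

# Hidden layer: two copies of our earlier two-input node.
h1 = node(x, [1.0, -2.0])
h2 = node(x, [0.5,  1.5])

# Output layer: one node reading the two hidden activations.
y = node([h1, h2], [1.0, 0.2])
print(h1, h2, y)             # 20.0 22.5 24.5
```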
If you know a little linear algebra and want some math, below I show you how you can represent the network in matrix form.
Feel free to skip this part; none of the guide (except the follow-up math sections) depends on you understanding it. But if you are supplementing your studies, I do highly recommend it, hoping it can make some of the often horribly presented theory more intuitive.
We can simplify the whole neural network layer as:

$$\mathbf{h} = f(W\mathbf{x})$$

The activation $f(\cdot)$ just means that the activation is applied to all of the elements in the output vector.
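In code, the matrix form turns each layer into one matrix-vector product followed by an element-wise activation. A NumPy sketch with the same invented weights as above:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)   # applied element-wise to the whole vector

x = np.array([30.0, 5.0])                 # inputs: sugar and salt

W_hidden = np.array([[1.0, -2.0],         # row 1: weights of hidden node 1
                     [0.5,  1.5]])        # row 2: weights of hidden node 2
W_output = np.array([[1.0, 0.2]])         # weights of the output node

h = relu(W_hidden @ x)    # hidden layer: matrix-vector product, then activation
y = relu(W_output @ h)    # output layer
print(h, y)               # [20.  22.5] [24.5]
```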
I cleaned up the graph to give a better overview and some intuition; refer to the full network above if you need more detail. Use your mouse to see which cells in the matrix influence which parts of your neural network and vice versa.
I hope that by now you have an intuitive feeling for how the neurons are connected, and how neurons in earlier layers influence the output of the next layer through their weights and activation functions.
This is the world of Neural Networks
But how do we optimize and update those weights (parameters) of our network? We simply break our steps down and retrace them as we would when trying to make the perfect cake:
After going through the steps and cataloging the input of each of the 5 nodes, we taste the cake, judge how good it tastes and log it (our loss function), then try again by adjusting the weights (the amounts) and logging the difference in taste. We repeat and iteratively improve our cake in this fashion.
This is more or less what we do with neural networks and backpropagation. To understand it intuitively, we will make use of a computation graph, which illustrates the concept well and makes it easy to understand. I promise.
Let’s start with a simple computation graph to understand the concept. Our computation graph is still a neural network, and it computes the function:

$$f(x) = 3x + x$$

To simplify things we name our nodes: $a = 3x$ and $b = a + x$.

Some of you might notice that node $a$ is redundant, after all $3x + x = 4x$, and you are right. Play along, we’ll try something harder below, and use this to get some intuition for how to use the graph in updating our network using backpropagation.
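As a quick sketch, the same computation graph written out node by node:

```python
def forward(x):
    """Forward pass of the toy computation graph."""
    a = 3 * x      # node a: multiply the input by 3
    b = a + x      # node b: add the input back in
    return b

print([forward(x) for x in range(1, 6)])   # [4, 8, 12, 16, 20]
```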
What is important is to find out how each node changes in relation to (mathematicians say “with respect to”) the value of its input nodes. Let’s look at node $a$.
What also helped me is to simply write out a small table.
| x | a |
|---|---|
| 1 | 3 |
| 2 | 6 |
| 3 | 9 |
| 4 | 12 |
| 5 | 15 |
I hope you are convinced that indeed, with a 1 unit increase in $x$, $a$ changes by $3$.
If you haven’t had calculus before, this is the rate of change. It’s how sensitive $a$ is to changes in $x$. We symbolize it like this:

$$\frac{\partial a}{\partial x} = 3$$
Rise over run.
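Here is a tiny numerical check of “rise over run” for the table above, using a unit step in $x$:

```python
def a(x):
    return 3 * x

# Rise over run: how much a changes per unit change in x.
for x in range(1, 5):
    rise = a(x + 1) - a(x)
    run  = 1
    print(x, rise / run)   # always 3.0
```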
How does $b$ change with $x$?

That’s a trick question, because $b$ depends not only on $x$ but also on $a$. So $\frac{\partial b}{\partial x}$ is the sum of the change directly influenced by $x$ and the change indirectly influenced through $a$.

The first one is easy: $b$ changes one to one with $x$, so the direct contribution is $1$.

For the second one we need the chain rule: when a variable $b$ indirectly depends on $x$ through another variable $a$, we first find the rate of change of $b$ with respect to $a$ and then multiply it by the rate of change of $a$ with respect to $x$.
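As a numerical sketch of that decomposition for our graph ($a = 3x$, $b = a + x$): the indirect path through $a$ (chain rule) plus the direct path, checked against a unit step in $x$.

```python
def forward(x):
    a = 3 * x
    b = a + x
    return a, b

# Local rates of change on the graph.
da_dx        = 3    # a = 3x
db_da        = 1    # b changes one to one with a
db_dx_direct = 1    # ...and one to one with x directly

# Chain rule for the indirect path, plus the direct path.
db_dx = db_da * da_dx + db_dx_direct   # 1 * 3 + 1 = 4

# Sanity check: increase x by one unit and watch b.
_, b_before = forward(2)
_, b_after  = forward(3)
print(db_dx, b_after - b_before)   # 4 4
```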
Say our neural network computes the following complicated-looking function:
To make the graph less cluttered we’ll name our nodes:
The universal approximation theorem tells us that a sufficiently large neural network with a single hidden layer can approximate any continuous function arbitrarily well. So, …
There are functions we can compute with a small neural network with only a few hidden layers for which we would need a very large neural network with a single layer. Imagine the XOR (exclusive or) function of several inputs. Here, an XOR function should output $1$ if exactly one input equals $1$ and the others are $0$. https://www.coursera.org/learn/neural-networks-deep-learning/lecture/rz9xJ/why-deep-representations
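For reference, here is a plain-Python sketch of that XOR-of-several-inputs function exactly as described above (output $1$ when exactly one input is $1$ and the rest are $0$):

```python
def xor_n(inputs):
    """Assumes binary (0/1) inputs; 1 if exactly one input is 1, else 0."""
    return 1 if sum(inputs) == 1 else 0

print(xor_n([0, 1, 0, 0]))   # 1
print(xor_n([1, 1, 0, 0]))   # 0
print(xor_n([0, 0, 0, 0]))   # 0
```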
Here is a visual: