An Interactive Guide to the Wonderful World of Neural Networks

Hello!

I gave myself the challenge to learn as much as I could about neural networks within a week. I've noticed that the process of understanding them can be much easier than mine was: you could learn what I learned in one to two days of what took me a week, while only knowing high school maths.

During my week I constantly reread Michael Nielsen's first two chapters and rewatched 3Blue1Brown's explanation with the same book as source material. While their explanations are great, there were a few things that made the process a lot longer than it should've been. I hope with this post to be able to make the introduction a bit more gentle by comparing some of the different mathematical notation they use and by using visualized interactive neural nets!

So what are artificial neural networks?

A machine learning model that is loosely inspired by the real thing -- a biological brain. Researchers were trying to mathematically model the brain back in the sixties. Instead of recreating an organism-like brain, they ended up creating a new technique for classifying things such as cats or dogs!

So like the real thing, the artificial version has:

  1. neurons
  2. and axons which link the neurons

And they are used to predict stuff! Such as whether there is a cat in a picture.

A trained artificial neural network does this well and an untrained one does this poorly. So in order to know what artificial neural networks are, one really needs to know (1) how they predict and (2) how they are trained.

Training an artificial neural network is, mathematically speaking, the harder part. For now we will start with the easier part: how does a neural network predict something?

Showing how a neural network predicts by modeling them as logic gates

Let's look at the smallest and simplest neural network possible. We are going to use raw (non-normalized) values to make it even simpler, and we'll tack on complexity as we go. For now, the idea is to show the whole process with the simplest 'network' I can think of. Once you understand the main ideas behind the simplest network, other things are much easier to understand.

The smallest network possible has two layers: one input layer and one output layer. The input layer contains 1 neuron and the output layer contains 1 neuron. Try to make the output neuron display a -5. Now try to make the output neuron display a -5 while \(x = 1\).

Input Layer
Output Layer

Did you get it? Do you know what process is happening for it to produce a -5?

This architecture is able to model \(f(x) = wx\), so while it is not very useful, it is very simple to understand. More formally, the two sliders determine the value of the input neuron \(x\) and the weight \(w\), which sits between \(x\) and the output neuron \(a\). My slider values were \(w = -5\) and \(x = 1\).

Code your own super simple neural network

We are going to code the feedforward function of our own neural network.

The goal is to implement the return statement in JSFiddle. Look at the model to see what it should be.

Automatically training the weights

The answer to the previous exercise was: return w * x.
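In case you skipped the JSFiddle, a minimal stand-alone version of that feedforward function might look like this (the function name is my own choice, not from the exercise):

```javascript
// Feedforward for the smallest possible network:
// one input neuron x, one weight w, one output neuron a = w * x.
function feedforward(w, x) {
  return w * x;
}

console.log(feedforward(-5, 1)); // -5, the output we tuned the sliders for
```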

Let's talk about how this neural network learns things automatically. So far, we trained our small, humble neural network by fumbling around with the parameters ourselves. We played with the weight \(w\) of \(x\), and we trained it on one training example: \(x = 1\).

Before we talk about how neural networks automatically learn through example data, let's quickly look at how I would learn a new accent through imitating another person's accent from a YouTube video. Learning by example!

I listen to how the person I want to imitate is pronouncing the word and then listen to how I am saying it. I then think about what the difference is, adjust as much as I possibly can and then imitate again. When I try again, the difference between what I am doing and what I'm trying to imitate is hopefully smaller. I iterate on this process as much as possible until I do it well enough.

With neural networks, the output neuron has a certain value. That certain value has to match a desired value. By analogy with the previous paragraph, it would make sense that a neural network needs to learn more when that output is nowhere close to the desired value than when it is quite close.

More formally this means that there are three topics that need to be discussed:

  1. Defining how to calculate the difference between what is desired and whatever the neural network is doing. This is called the loss or cost function.
  2. An algorithm for updating the parameters of the network (in this case only the weight \(w\)) by finding the lowest value for the cost function. This is called gradient descent.
  3. A specialized form of gradient descent that constantly applies the chain rule. This is called backpropagation.

Training a Network: Cost Functions (also known as: knowing how wrong your prediction is)

Let's start with defining the cost function. The difference between the actual output value of a neuron and the desired value is called the error. For example, if the output neuron \(a = 0.2\) and the desired value \(y = 1\), then the error \(e = 0.8\). Therefore: $$ e = (y - a) \textrm{ and } a = wx $$


Now that we have the error defined, we can define a cost function. A cost function is essentially the same thing as calculating the error, with the difference that it is easy to use for calculating derivatives (this is needed for the gradient descent part). The idea of the cost function is the same as calculating an error: it should give us an indication of how wrong we are. Because of this, there is no single right formula for defining a cost function. It's a mathematician's playground really; they get to do whatever they want as long as it makes sense for (1) knowing how wrong you are and (2) finding a way to become less wrong. In other words, all formulas are intended to minimize some error that we find important to reduce. This means that $$ C = (a - y) $$ can be seen as a cost function. It is not a very good one, however: we need to calculate its partial derivative later, and a quadratic function will be more informative there. Another one I've seen is this: $$ C = (y - a)^2 $$

Intuitively this makes the most sense to look at if we are talking about finding the squared error. However, I prefer the following cost function because the partial derivative is cute and I like cute. $$ C = \frac{1}{2} (y - a)^2 $$
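To make the three candidate cost functions concrete, here is a sketch in JavaScript (the function names are mine, not standard terminology):

```javascript
// Three ways to measure how wrong the prediction a is compared to the desired y.
function costLinear(y, a) {
  return a - y;                     // signed error, awkward to minimize
}
function costSquared(y, a) {
  return Math.pow(y - a, 2);        // squared error
}
function costHalfSquared(y, a) {
  return 0.5 * Math.pow(y - a, 2);  // same shape, but with the cute derivative
}

console.log(costHalfSquared(1, 0.2)); // 0.32, for the a = 0.2, y = 1 example
```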

This is how the cost function looks graphically for the only training example we have: \(x = 1\). The y-axis is the cost \(C\), the x-axis shows possible values for the weight \(w\). The desired value has been set to \(y = -5\) in this graph.

[Interactive graph: the cost \(C\) plotted against the weight \(w\)]

From this we can infer that the weight that matches the desired value as closely as possible is \(w = -5\) (normally y-values and w-values don't coincide, but in this case they're both -5). Woo! It's working! We now have an idea of how wrong we are for every possible value of the weight \(w\)!

But how did we see that? Well, we use our eyes and see that \(w = -5\) is simply the lowest point. Computers need to do a similar thing. They somehow need to 'perceive' that \(w = -5\) is the lowest point. One way of doing this is with an algorithm called gradient descent. This is especially handy in a high dimensional space where it is impossible to plot the graph.

With a cost function defined, we can use it to calculate the error, which helps us adjust the weight automatically! For our tiny network we could still find the best weight analytically, but in a higher-dimensional space we can't. One way of doing it anyway is gradient descent. There are other algorithms, but gradient descent is a basic algorithm used in many machine learning techniques, so the knowledge transfers quite well to other methods.


Training a Network: Gradient Descent (also known as: knowing how to automatically tune your weights so that you're almost always right)

Derivatives and partial derivatives tell us how a function changes when we give its input a very gentle nudge in either the positive or negative direction. Consider the parabola \(f(x) = x^2\) as our hypothetical cost function. Then \(f'(x) = 2x\). If you want to go 'downhill', all you need to do is repeatedly subtract a small fraction of \(2x\). The size of that fraction is called the learning rate \(\eta\) (pronounced eta), so each step you subtract $$2x\eta$$ $$\textrm{as such: }x \leftarrow x - 2x\eta$$ Or in programming terms:

x = x - 2 * x * eta;

Another way to look at it is: $$ x \leftarrow x - f'(x) \eta $$
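This update can be sketched as a loop for \(f(x) = x^2\); the starting point and number of steps below are arbitrary choices of mine:

```javascript
// Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
const eta = 0.01; // learning rate
let x = 3;        // arbitrary starting point

for (let i = 0; i < 1000; i++) {
  x = x - 2 * x * eta; // x ← x - f'(x)·η
}

console.log(x); // very close to 0, the minimum of x^2
```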

In the following example this is what we are going to see for the function \(f(x) = x^2\). It will be a very slowed down version of gradient descent. The update that will be repeatedly applied is \(x \leftarrow x - 2x\eta\), with the learning rate \(\eta = 0.01\).

[Interactive animation: gradient descent on \(f(x) = x^2\), with \(\eta = 0.01\)]


So we can just use gradient descent on our previous cost function, right? And then we're done! Well, no. It's mathematically a bit more involved because our cost function isn't as simple as \(x^2\). If it were, then yes. Yes, indeed my friend. Unfortunately, that isn't the case, so we need to move onward.

Training a Network: Backpropagation (also known as: constantly applying the chain rule in order to perform gradient descent, but not telling anyone by giving it a fancy name)

Fully written out, the cost function we use is: $$C = \frac{1}{2}(y - a)^2$$ $$\textrm{for which: } a = wx$$ $$\textrm{such that: } C = \frac{1}{2}(y - wx)^2$$ This looks like \(x^2\) because something is squared, but it has more variables in it with all kinds of meanings: \(w\), \(x\) and \(y\). We want to find the partial derivative of \(C\) with respect to \(w\). Why with respect to \(w\)? Because that's the weight we want to automatically train! Differentiating this function needs the chain rule, because there are secretly two functions in need of differentiation.

The first one is the \(\frac{1}{2}(...)^2\) part which is written as the partial derivative of \(C\) with respect to \(a\) which is \(\frac{\partial C}{\partial a}\).

The second one is the \(wx\) part which is the partial derivative of \(a\) with respect to \(w\) which is \(\frac{\partial a}{\partial w}\).

Combined, it can be written as $$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial a}\frac{\partial a}{\partial w}$$ Let's take a look at both functions and then find the derivative.

$$C = \frac{1}{2}(y - a)^2, \textrm{ so } \frac{\partial C}{\partial a} = (a - y)$$ $$a = wx, \textrm{ and } \frac{\partial a}{\partial w} = x$$ $$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial a}\frac{\partial a}{\partial w} = (a - y)x$$
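If you want to convince yourself that \((a - y)x\) is correct, you can compare it against a numerical estimate: nudge \(w\) by a tiny amount and measure how much \(C\) changes. A sketch, with the concrete values chosen by me:

```javascript
// Compare the chain-rule gradient (a - y)·x with a finite-difference estimate.
const x = 1, y = -5, w = 2; // arbitrary example values

function cost(w) {
  const a = w * x;                 // feedforward
  return 0.5 * Math.pow(y - a, 2); // C = ½(y - a)²
}

const analytic = (w * x - y) * x; // (a - y)·x
const h = 1e-6;
const numeric = (cost(w + h) - cost(w - h)) / (2 * h); // central difference

console.log(analytic, numeric); // both ≈ 7 for these values
```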

And that is what backpropagation is: finding the partial derivative of \(C\) with respect to all the weights in a network (in this case one). This means you have calculated the gradient (partial derivative) for weight \(w\) by using the chain rule and now you can use the rest of the gradient descent algorithm in order to tune the weight.
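Putting the three pieces together (feedforward, the chain-rule gradient, and the gradient descent update), the whole training loop for our one-weight network can be sketched as follows. This is my own stand-alone version, not code from the exercises:

```javascript
// Train w so that the network a = w·x outputs y = -5 for the input x = 1.
const x = 1;      // our single training example
const y = -5;     // desired output
const eta = 0.01; // learning rate
let w = 0;        // start with an untrained weight

for (let i = 0; i < 2000; i++) {
  const a = w * x;                 // feedforward
  const gradient = (a - y) * x;    // backpropagation: ∂C/∂w
  w = w - eta * gradient;          // gradient descent update
}

console.log(w); // very close to -5, just like the sliders
```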

Here is a graphical example of our particular network for which we have one training example \(x=1\).

[Interactive animation: gradient descent tuning the weight \(w\), with \(\eta = 0.01\)]




It is important to note that I have omitted many things about neural networks, but this is in essence the nitty-gritty overview of how a neural network predicts and trains itself. In the next chapter I will create a network that has an activation function, multiple training examples (2, yes, a 100% increase) and a bias neuron.
