Before we get to neural networks, you must have a very basic understanding of how information flows forwards and backwards through an equation.
Calculus
Calculus is absurd, and for some bizarre reason it seems to be the language of the universe. It is the ultimate order: the closer you zoom into the world, the smoother it becomes. Imagine a triangle, then a square, then a pentagon, then a hexagon, and now imagine an infinitygon, with infinitely many sides. What would the difference be between the infinitygon and a circle? Does the circle have infinitely many walls or no walls at all? Some people say that underneath it all there is pure chaos, and that this is the true face of our world, because it seems the more we zoom in, the weirder things get. The fact that circles exist is absurd.
Calculus was developed independently by Newton and Leibniz in the late 17th century as an attempt to understand change: how things change, and how that change affects their relationships. There are two main operations in calculus: differentiation, which determines the rate of change, and integration, which accumulates change.
I will try to give you a small intuition about how change flows through an equation, and how it flows backwards. Our ultimate goal is to understand how exactly each input parameter affects the output.
Think for a second about the following equation, and how changing a and b affects c:
c = a + b
If you increase a just a bit, let's say by 1, then c (the output) will increase by 1; if you increase b by 1, then c will also increase by 1.
c = a * b
However, if we do multiplication, when we increase a by 1, c will increase by b. Imagine c = 3 * 6; if we increase a to 4, so c = 4 * 6, then c will increase by 6. And if we increase b by 1, then c will increase by a.
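You can check this with a couple of lines of Python (the numbers are just ones I picked):

a, b = 3, 6

# addition: bumping a by 1 bumps c by exactly 1
print(((a + 1) + b) - (a + b))   # 1

# multiplication: bumping a by 1 bumps c by b, and bumping b by 1 bumps c by a
print(((a + 1) * b) - (a * b))   # 6, which is b
print((a * (b + 1)) - (a * b))   # 3, which is a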
Now, if c = a + b and d = e * f and g = c * d, then how would changing a affect the output g? Let's break it down:
a --.
+ --> c --.
b --' \
`- * --> g
e --. /
* --> d --'
f --'
So, a + b produces c, and e * f produces d, then c * d produces g. Now put some imaginary values everywhere but leave a as a variable; imagine it as a knob that you can rotate.
a --.
+ --> c --.
b 3 --' \
`- * --> g
e 6 --. /
* --> d --'
f 4 --' 24
Let's imagine some initial value for it, e.g. 5, so a = 5:
a 5 --. 8
+ --> c --.
b 3 --' \ 192
`- * --> g
e 6 --. /
* --> d --'
f 4 --' 24
Now, if we rotate the knob to the right a little bit, increasing a by 1, c will increase by 1, from 8 to 9, and then g will increase by d, from 192 to 216. And if we decrease it a bit, we will go from 192 to 168.
So you can see how sensitive g is to a. Now let's do e; again we will initialize the knob, this time at 6.
a 5 --. 8
+ --> c --.
b 3 --' \ 192
`- * --> g
e 6 --. /
* --> d --'
f 4 --' 24
If we increase e by 1, d will increase by 4, and then g will increase by c * 4, or in our case 32. So turning the knob a bit on e increases g by 32.
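Here is a minimal Python sketch of the same two experiments (the function name is mine):

def g(a, b, e, f):
    c = a + b
    d = e * f
    return c * d

base = g(5, 3, 6, 4)                 # 192
print(g(6, 3, 6, 4) - base)          # 24: turning the a knob moves g by d
print(g(5, 3, 7, 4) - base)          # 32: turning the e knob moves g by c * f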
The equation is still too small for you to see the power of these relations. So far we have c = a + b; d = e * f; g = c * d. Let's add one more level: k = m * p; r = k * g
a 5 --. 8
+ --> c --.
b 3 --' \ 192
* --> g --.
e 6 --. / \
* --> d --' \ 2304
f 4 --' 24 `- * --> r
/
/
m 4 --. 12 /
* --> k -------------'
p 3 --'
Now if we increase e by 1, from 6 to 7, how is r going to change? Just walk through it: how would d change, how would that affect g, and how would that affect r? d will increase by 4, from 24 to 28, then g will increase by c * 4, or 32, and then this will increase r by 32 * k, or 384, so r will become 2688. Let's verify:
with e = 7:
(5 + 3) * (7 * 4) * (4 * 3) = 2688
and with e = 6:
(5 + 3) * (6 * 4) * (4 * 3) = 2304
The interesting part is that the value of d is not important, its change is important: g will increase by [the change of d] * c. And if we go up a level, the value of g is not important either: r will increase by [the change of g] * k.
The change in r with respect to e is the change in r with respect to g (which is k), times the change of g with respect to d (which is c), times the change in d with respect to e (which is f). You see, at each step we do not actually care about anything besides how a node affects its output, and how it is affected by its inputs.
As put by George F. Simmons: "If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man."
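The same multiplying of speeds happens in our equation; here is a small numeric check (the function name is mine):

def r(a, b, e, f, m, p):
    c = a + b    # 8 with our values
    d = e * f    # 24
    g = c * d    # 192
    k = m * p    # 12
    return g * k

# chain rule: the change in r per unit of e is k * c * f = 12 * 8 * 4 = 384
print(r(5, 3, 7, 4, 4, 3) - r(5, 3, 6, 4, 4, 3))   # 2688 - 2304 = 384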
This allows us to go backwards and know the "strength" at each node: how we should change it in order to get the output to do what we want. For example, imagine we want to "teach" an equation to always produce the number 1. We give it 3 inputs a, b, c and we want the output to always be 1.
a
\
b - [ black magic ] -> 1
/
c
We can start with a simple black magic: (a * w1 + b * w2 + c * w3) * w4
a --.
* -- aw1 -.
w1 --' \ w4 --.
+ -- aw1bw2 --. \
b --. / \ * -- result
* -- bw2 -' \ /
w2 --' + -- aw1bw2cw3 --'
/
c --. /
* -- cw3 ------------------'
w3 --'
I added the intermediate nodes, like aw1bw2, so we can talk about them; but we can only change w1, w2, w3 and w4, nothing else, as we don't control the input.
In order to teach our black magic we will have lots of examples, like a=3, b=4, c=6 and we expect 1; a=1, b=2, c=3, we expect 1; a=3, b=4, c=1, we expect 1. We will initialize w1, w2, w3, w4 all with some random value; let's pick the very random value of 0.5.
3 a --. 1.5
* -- aw1 -.
0.5 w1 --' \ 3.5 0.5 w4 --.
+ -- aw1bw2 --. \
4 b --. / \ * -- r
* -- bw2 -' \ /
0.5 w2 --' 2 + -- aw1bw2cw3 --'
/ 6.5
6 c --. /
* -- cw3 ------------------'
0.5 w3 --' 3
We will use the first example, where a=3, b=4 and c=6. The result is 3.25, (3*0.5 + 4*0.5 + 6*0.5) * 0.5, but we expected 1, so our black magic has betrayed us. We must go backwards and turn the knobs on w4, w3, w2, w1 so that next time we do better. We know we have overshot our expected value, so we must turn the knobs in such a way that our output gets smaller. Let's start turning!
If we change w4 a bit, r will change by aw1bw2cw3. Our r is 3.25 and we want 1, and aw1bw2cw3 is 6.5; we will use a "step" of 0.1, so 6.5 * 0.1 is 0.65, and we will decrease w4 by 0.65: 0.5 - 0.65 = -0.15. The new value for w4 will be -0.15.
If we change w3 a bit, how will that affect r? Well, the change in r with respect to w3 is the change in r with respect to aw1bw2cw3 (which is w4) times the change in aw1bw2cw3 with respect to w3 (which is c). You might have noticed we just jumped over the +; that is because + simply passes the change through. Since w4 is 0.5 and c is 6, r will change by 0.5 * 6 = 3 for each unit change in w3. Using our step of 0.1, we adjust w3 by 0.1 * 3 = 0.3, so w3's new value will be 0.5 - 0.3 = 0.2.
For w2 we do the same: changing w2 affects aw1bw2cw3 by b (which is 4), so changing w2 by 1 changes r by w4 * 4 = 0.5 * 4 = 2. With our 0.1 step: 2 * 0.1 = 0.2, so w2's new value is 0.5 - 0.2 = 0.3.
And finally w1: changing w1 affects aw1bw2cw3 by a (which is 3), so changing w1 by 1 changes r by w4 * 3 = 0.5 * 3 = 1.5. With our 0.1 step: 1.5 * 0.1 = 0.15, so w1's new value is 0.5 - 0.15 = 0.35.
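The same backward walk as a short Python sketch (the dr_ names are mine); it lands on exactly these four new values:

a, b, c = 3, 4, 6
w1 = w2 = w3 = w4 = 0.5
step = 0.1

s = a * w1 + b * w2 + c * w3   # aw1bw2cw3 = 6.5
r = s * w4                     # 3.25, but we wanted 1

# how much r moves per unit change of each knob
dr_dw4 = s         # 6.5
dr_dw3 = w4 * c    # 3.0 (the + passes the change straight through)
dr_dw2 = w4 * b    # 2.0
dr_dw1 = w4 * a    # 1.5

# turn each knob against its influence, scaled by our step
w4 -= step * dr_dw4   # -0.15
w3 -= step * dr_dw3   # 0.2
w2 -= step * dr_dw2   # 0.3
w1 -= step * dr_dw1   # 0.35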
a = 3
b = 4
c = 6
w1 = 0.35
w2 = 0.3
w3 = 0.2
w4 = -0.15
r = (a * w1 + b * w2 + c * w3) * w4
r is now -0.5175; we overshot our goal in the other direction! But we are a bit closer to 1 than we were before. Now we take another example, a=1, b=2, c=3, and we try again, adjusting the parameters a bit to get us closer to the expected result. Given enough examples (called a training set), expected results (called labels), and a way to compare the expected result to the actual result (called a loss function), we can teach the black magic box to "learn" any pattern, and even "reason"; we can teach it to count, or to sort things; we can teach it to speak, or to listen, to read and to write, to understand us and to understand itself.
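A toy version of that whole loop in Python; the step size, the number of passes, and the error measure are my own choices, just to sketch the idea:

examples = [(3, 4, 6), (1, 2, 3), (3, 4, 1)]   # our training set
target = 1.0                                   # the label for every example
w1 = w2 = w3 = w4 = 0.5                        # the very random value again
step = 0.01

for _ in range(1000):               # many passes over the examples
    for a, b, c in examples:
        s = a * w1 + b * w2 + c * w3
        r = s * w4
        err = r - target            # how far off we are (our loss signal)
        # walk the error backwards through * and +, as we did by hand
        d1, d2, d3, d4 = w4 * a, w4 * b, w4 * c, s
        w1 -= step * err * d1
        w2 -= step * err * d2
        w3 -= step * err * d3
        w4 -= step * err * d4

for a, b, c in examples:
    print((a * w1 + b * w2 + c * w3) * w4)   # should be close to 1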
This is the very core of how we teach machines: the way information flows backwards, how the + routes the change to all its input nodes, and how the * switches it from one input to the other. There is only one missing part: at the moment our black box can only learn linear things, straight lines; it is not possible for it to learn a circle. We just have to allow it to express itself. There is a function called ReLU (rectified linear unit), which is:
def relu(x):
    if x < 0:
        return 0
    return x
If its input is < 0 it returns 0, otherwise it returns the input. This simple function allows the network to selectively kill the change flow and 'turn off' certain paths, which is what lets it learn infinitely complex patterns.
This kind of function is called an 'activation function'. There are many like it: sigmoid, tanh, gelu, etc. It doesn't matter which; its purpose is to allow the network to express itself.
^
10| /
| /
| /
| /
| /
| /
| /
| /
| /
| /
=========================+----------------------
-10 0 10
After 0 the function is a line; before 0 the function is a line; but at 0, where it switches from 0 to x, is where the nonlinearity happens.
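Going backwards, relu acts like a gate: where its input was negative, no change flows back; everywhere else the change passes through untouched. A tiny sketch (the backward function name is mine):

def relu_backward(x, change):
    # where relu's input was negative, the path was turned off going
    # forward, so it routes no change back through it
    if x < 0:
        return 0
    return change   # otherwise the change passes through untouched

print(relu_backward(-2.0, 5.0))  # 0: this path is off
print(relu_backward(3.0, 5.0))   # 5.0: the change flows through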
I have not named things with their proper names, and that is ok; just think about + and *, and what they mean forwards and backwards.