As we saw in the previous post, adding any number of layers with any number of nodes can only model linear functions. If we want our network to also learn non-linear functions, we have to introduce some non-linearity into the network. One way of doing this is to use a so-called activation function: we pass the output of each layer through a (non-linear) function. Our examples so far can be considered to have used a linear activation function; in particular, they all used the identity activation function.
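To see the difference numerically, here is a tiny illustration (a sketch with made-up coefficients): stacking two linear layers collapses into a single linear map, which obeys the midpoint property of affine functions, while passing the same layer through a sigmoid breaks it.

```julia
# Two stacked linear "layers" collapse into one linear function:
lin(x) = 3 * (2x + 1) - 4        # W₂(W₁x + b₁) + b₂ = 6x - 1

σ(z) = 1 / (1 + exp(-z))         # a non-linear activation
nonlin(x) = σ(2x + 1)            # the same layer, now passed through σ

# lin obeys the affine midpoint property, nonlin does not:
lin(0.0) + lin(2.0) == 2 * lin(1.0)          # true
nonlin(0.0) + nonlin(2.0) ≈ 2 * nonlin(1.0)  # false
```

No matter how many linear layers we stack, the first check stays true; the sigmoid breaks it immediately.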

Setup

Similar to the last post, we will build a network with two neurons and a single hidden layer.

[Network diagram: the input x = a⁰ feeds the hidden layer through weight W¹ and bias b¹; the hidden layer feeds the output y through weight W² and bias b².]

The hidden layer first computes the intermediate value

$$z^1 = W^1 a^0 + b^1$$

Now we introduce a non-linear function σ and pass this intermediate value through it.

$$a^1 = \sigma(z^1)$$

In this notation, our previous posts had been using the identity activation function. The second layer similarly becomes

$$z^2 = W^2 a^1 + b^2 \qquad a^2 = \sigma(z^2)$$

So the final output of the network as a function of the input x becomes

$$y = a^2 = \sigma\left(W^2 \sigma(W^1 x + b^1) + b^2\right)$$
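Written out in code, the full forward pass is just this composition (a sketch; the parameter values and the choice of sigmoid here are arbitrary placeholders):

```julia
σ(z) = 1 / (1 + exp(-z))                 # sigmoid as an example non-linear activation

# arbitrary placeholder parameters
W¹, b¹ = 1.15, 0.0
W², b² = 0.35, 0.0

# y = σ(W²σ(W¹x + b¹) + b²) from the equation above
forward(x) = σ(W² * σ(W¹ * x + b¹) + b²)
```

Whatever σ we pick, `forward` is exactly the composed expression above.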

Depending on the choice of the function σ, this can be made either linear or non-linear in x, so we should be able to model any linear or non-linear function with this network. Like before, let us try to learn a linear model through this primitive network.

Let's again declare the loss function

$$L(y, a^2) = (y - a^2)^2$$

where y is the true output and a² is the output from the network. Again we want the loss function to be minimized as the model parameters approach their correct values, and we follow the same approach as before to get there.

The Algorithm (again)

The rate of change of the loss function with respect to W² can be calculated as before, but due to the presence of the activation function σ we get a different expression.

$$\frac{\partial L}{\partial W^2} = \frac{\partial}{\partial W^2}(y - a^2)^2 = 2(y - a^2)\frac{\partial a^2}{\partial W^2}$$

But from our expression we have

$$\frac{\partial a^2}{\partial W^2} = \frac{\partial}{\partial W^2}\sigma(z^2) = \sigma'(z^2)\frac{\partial z^2}{\partial W^2} = \sigma'(z^2)\,a^1$$

Thus

$$\frac{\partial L}{\partial W^2} = 2(y - a^2)\,\sigma'(z^2)\,a^1$$
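We can sanity-check this expression against a finite difference (a sketch with arbitrary placeholder values; the gradient is written here with the factor (a² − y), the sign convention that the implementation's δL = a² − ŷ uses further below):

```julia
σ(z)  = 1 / (1 + exp(-z))
σ′(z) = σ(z) * (1 - σ(z))

# arbitrary placeholder values for illustration
W¹, b¹, W², b², x, y = 1.15, 0.0, 0.35, 0.0, 0.7, 0.9

# loss as a function of the second-layer weight only
loss_at(w2) = (y - σ(w2 * σ(W¹ * x + b¹) + b²))^2

a¹ = σ(W¹ * x + b¹)
z² = W² * a¹ + b²
analytic = 2 * (σ(z²) - y) * σ′(z²) * a¹                    # ∂L/∂W²
numeric  = (loss_at(W² + 1e-6) - loss_at(W² - 1e-6)) / 2e-6  # central difference
```

The two values agree to numerical precision.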

Finding the rate of change of the loss function L with respect to W¹ is very similar, using the chain rule:

$$\begin{aligned}
\frac{\partial L}{\partial W^1} &= \frac{\partial}{\partial W^1}(y - a^2)^2 = 2(y - a^2)\frac{\partial a^2}{\partial W^1} \\
\frac{\partial a^2}{\partial W^1} &= \frac{\partial \sigma(z^2)}{\partial W^1} = \sigma'(z^2)\frac{\partial z^2}{\partial W^1} = \sigma'(z^2)\frac{\partial\left(W^2 \sigma(z^1) + b^2\right)}{\partial W^1} \\
&= \sigma'(z^2)\,W^2\,\frac{\partial \sigma(z^1)}{\partial W^1} = \sigma'(z^2)\,W^2\,\sigma'(z^1)\frac{\partial z^1}{\partial W^1} = \sigma'(z^2)\,W^2\,\sigma'(z^1)\,a^0 \\
\frac{\partial L}{\partial W^1} &= 2(y - a^2)\,\sigma'(z^2)\,W^2\,\sigma'(z^1)\,a^0 \\
\frac{\partial L}{\partial b^2} &= 2(y - a^2)\,\sigma'(z^2) \\
\frac{\partial L}{\partial b^1} &= 2(y - a^2)\,\sigma'(z^2)\,W^2\,\sigma'(z^1)
\end{aligned}$$

In all four of the derivatives above, a term like $2(y - a^2)\,\sigma'(z^2)$ appears. It is worthwhile to name that expression. Since it is a derivative and we are going to decrease the parameters W², W¹, b², b¹ proportionally to this quantity, let's call it the error term and denote it by δ².

$$\delta^2 = 2(y - a^2)\,\sigma'(z^2) \tag{1}$$

We also see that when we take the derivative of the loss function with respect to the first set of parameters, W¹ and b¹, the next weight W² is present. We can define

$$\delta^1 = \sigma'(z^1)\,\delta^2\,W^2 \tag{2}$$

Using these two definitions we can write the four derivatives as

$$\frac{\partial L}{\partial W^2} = \delta^2 a^1 \qquad \frac{\partial L}{\partial b^2} = \delta^2 \qquad \frac{\partial L}{\partial W^1} = \delta^1 a^0 \qquad \frac{\partial L}{\partial b^1} = \delta^1$$

If you compare these to the expressions for the derivatives in the previous post, they are identical.

The important thing to note here is that with any number of layers and any activation function, we can start from the last layer, calculate the error there, and propagate the error backward through the network.
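As a sketch of that idea (a hypothetical generalisation for illustration, not code used later in this post), the backward pass over any number of layers of scalar weights could look like:

```julia
# Ws: per-layer weights; zs[l] = z^l; as[l] = a^(l-1), so as[end] = a^L (the output).
# Uses the δ = 2(a^L - y)σ′(z^L) sign convention matching δL = a² - ŷ below.
function backprop(Ws, zs, as, y; σ′)
    nlayers = length(Ws)
    δ = 2 * (as[end] - y) * σ′(zs[end])         # error at the last layer
    gW = zeros(nlayers); gb = zeros(nlayers)
    for l in nlayers:-1:1
        gW[l] = δ * as[l]                       # ∂L/∂W^l = δ^l a^(l-1)
        gb[l] = δ                               # ∂L/∂b^l = δ^l
        l > 1 && (δ = δ * Ws[l] * σ′(zs[l-1]))  # δ^(l-1) = σ′(z^(l-1)) δ^l W^l
    end
    return gW, gb
end
```

For two layers this reproduces exactly the four derivatives above, and the same loop works unchanged for deeper networks.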

Implementation

Let's implement this algorithm in Julia. First let's define the parameters

W¹, W², b¹, b² = 1.15, 0.35, 0.0, 0.0
β = 0.1   # learning rate (a value assumed here; it is not given in this snippet)

These are supposed to be random initial values to weights and biases. I have put constants here for the reproducibility of the result.

The loss function can be written as

L(y, a²) = (y - a²)^2

For the activation σ we use the sigmoid function

$$S(z) = \frac{1}{1 + e^{-z}}$$

S(z) = 1 ./ (1 .+ exp.(-z))

whose derivative is

$$S'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

S′(z) = exp.(-z) ./ (1 .+ exp.(-z)).^2
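As an aside, S′ can also be written as S(z)(1 − S(z)), which gives a quick consistency check on the implementation (a side check, not part of the original code):

```julia
S(z)  = 1 ./ (1 .+ exp.(-z))
S′(z) = exp.(-z) ./ (1 .+ exp.(-z)).^2

# The closed form S(z)(1 - S(z)) matches S′ everywhere:
all(isapprox(S′(z), S(z) * (1 - S(z)); atol=1e-12) for z in -5:0.25:5)  # true
```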

The sigmoid function looks like

using Plots
x = range(-5, 5, 100);
y = S.(x)
plot(x, y)

I would like to write a very compact function implementing the feedforward and the back propagation.

function run(x, y; σ=S, σ′=S′)
    ŷ = y
    # Feed forward
    a⁰ = x
    z¹ = W¹ * a⁰ + b¹
    a¹ = σ(z¹)
    z² = W² * a¹ + b²
    a² = σ(z²)

    # Calculate error
    δL = (a² - ŷ)
    # Back propagate error
    δ² = δL * σ′(z²)
    δ¹ = δ² * σ′(z¹) * W²
    # update biases
    global b² -= β * δ²
    global b¹ -= β * δ¹
    # update weights
    global W² -= β * δ² * a¹
    global W¹ -= β * δ¹ * a⁰
    return a²
end

The code is an exact one-to-one translation of the mathematical expressions we have defined. As explained in the last post, the weights and biases are updated in the direction opposite to the gradient.

Let's generate some training samples

x_train = hcat(0:12...)
y_train = f.(x_train)
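Note that f, the linear target function, is carried over from the previous post and is not defined in this snippet. Any linear map whose outputs stay inside the sigmoid's (0, 1) range works for the experiment; a hypothetical stand-in (the coefficients are made up):

```julia
# hypothetical linear target; the actual f is defined in the previous post
f(x) = 0.05x + 0.2

x_train = hcat(0:12...)   # 1×13 row matrix of inputs
y_train = f.(x_train)     # targets in (0.2, 0.8), reachable by the sigmoid output
```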

And train the network

using Printf

epoch = 1800
errs = []
for n in 1:epoch
    for (i, (x, y)) in enumerate(zip(x_train, y_train))
        ŷ = run(x, y)
        error = L(y, ŷ)
        push!(errs, error)
        (n % 50 == 0) && @printf("\rEpoch %2d: %02d/%02d   error = %.3e", n, i, length(x_train), error)
    end
    (n % 50 == 0) && println("")
end
Epoch 50: 16/16   error = 7.246e-03
Epoch 100: 16/16   error = 6.995e-03
Epoch 150: 16/16   error = 6.750e-03
Epoch 200: 16/16   error = 6.509e-03
Epoch 250: 16/16   error = 6.273e-03
⋮
Epoch 1650: 16/16   error = 1.587e-03
Epoch 1700: 16/16   error = 1.483e-03
Epoch 1750: 16/16   error = 1.383e-03
Epoch 1800: 16/16   error = 1.286e-03

We see that the error goes down at the end of each epoch, telling us that the back propagation works. But the error now goes down much more slowly.

Conclusion

It is interesting to note that after many more epochs the error is not much improved over the early ones. I am guessing this is because we are trying to model a linear function with a non-linear neural network. This tells us that a dataset following one functional form might not be learned correctly by a network with a different functional non-linearity.