MCB111: Mathematics in Biology (Spring 2018)


week 08:

Neural Networks - Learning as Inference

In this lecture, I follow David Mackay very closely. In particular his lectures 15 and 16, which correspond to Chapters 39, 41 and 42 of his book Information Theory, Inference, and Learning algorithms.

Feedforward networks

A single neuron

A single neuron (Figure 1) has

The neuron adds up the weighted sum of the inputs into a variable called the activation ,


Figure 1. One neuron (from D. Mackay's chapter 39).

where called the bias is the activation in the absence of inputs.

The activity of the neuron is a function of the activation . Several commonly used forms for the activity are

Depending in how we combine many single neurons, there is essentially two types of networks: feedforward networks, where all the information flows in one direction, and feedback networks where all nodes are connected (Figure 2).


Figure 2. Two type of networks (from D. Mackay's Lecture 15).

Parts of a single neuron

As a recap, the basic concepts of a neuron are

and , so we can add the bias in the same equation.


Figure 3. Neuron activity for a neuron with two inputs as a function of the inputs. The values of the weights are: w0 = -15, w1 = 2 w2 = 1. We use the logistic function to describe the activity of the neuron.

The space of weights

Consider the simple situation of only two input to the neuron , and only two weights and no bias. The neuron’s activation is given by

and we can plot it as a function of the inputs as in Figure 3.


Figure 4. Effect on neuron activity of a zero weight.

If a weight is zero, there is no activity change associated with changes in the corresponding input, as described in Figure 4.


Figure 5. Effect on neuron activity of the bias term.

The effect of changing the value of the bias can be observed in Figure 5.

The effect of changing the value of the weights can be observed in Figure 6.


Figure 6. Effect on neuron activity of changes in one of the weights.

We can also use contour plots, in which we represent in the inputs space lines corresponding to different values of the activity . Using contour plots we observe in Figure 7 the effect of making the weight larger or smaller in absolute magnitude. Doubling the weights makes the contour lines closer to each other, while halving the weight make the same contour lines wider.


Figure 7. Effect on neuron activity of scaling the weights. The contours correspond to the values a = 0.0, 1, and -1. Arrows point in the direction of the weight vector.

What a single neuron can learn: to be a binary classifier

Imagine that you want to separate apples from oranges from a bucket in which they are all mixed up. Would you trust a single neuron to do that?

The learning rule

The central idea of supervised learning is this: given a number of examples of the input vector and their target output , we hope the neuron will learn their relationship (whichever that is).

Training the network requires finding the values of the weights that best fit the training data. If the neuron is well trained, given an input will produce an output which is very close to the target .

Data:

Outputs:

Error: where we expect these errors to be small.

Thus “learning” is equivalent to adjusting the parameters (weights) of the network such that the output of the network is close to the target for all examples.

Have we done this before in a different context?

In our apples and oranges example, let’s give an assignment of for an apple, and for an orange. In a well trained neuron will produce assignments similar to these:

Orange_1  y(Orange_1, w) = 0.01
Apple_1   y(Apple_1,  w) = 0.91
Orange_2  y(Orange_2, w) = 0.09
Orange_3  y(Orange_3, w) = 0.05
Apple_2   y(Apple_1,  w) = 0.97

How to do that? We find a function to optimize.

The error function

For each input , we introduce an objective function that will measure how close the neuron output is to . That objective function is also called the error function.

We introduce the error function

where .


Figure 8. Contribution to the error function G(w) for one data point as a function of y(n) for the two possible cases t(n)=0 and t(n)=1. (From D. Mackay's video lecture 15.)

</div>

For each data point , its contribution to the error function is given by Figure 8.

As a binary classification problem, such as sorting apples from oranges, we can interpret and (both between zero and one) as the probabilities of the two possible events: apple () or orange () for an input given the weights of the neuron, then is the “information content” (remember w01 lectures?) of the data .

Backpropagation

The training process is an exercise of minimizing , that is, of adjusting the weights so that reaches it lowest value.

Notice that is bound by below by zero,

and it is zero only when .

To minimize, we take the derivative of respect to one of the weights given by

Introducing

we obtain,

Taking all derivative together we construct the gradient vector,

The quantity is referred to as the error.

Backpropagation in the neural network community referees to given the errors, to calculate the gradient

Backpropagation differentiation.

Then we can implement a “gradient descent” method in which we iteratively update the weights by a quantity in the opposite direction to the gradient,

The parameter in the neural network community is referred to as the learning rate. And it is a free parameter that one has to set somehow.

Depending on whether we update the weights by looking to all data point at the time, or each one of them independently, we can distinguish two different algorithms, the batch learning algorithm and the on-line learning algorithm.

The batch gradient descent learning algorithm for a feedforward network

For a data set ,

We can repeat this two steps for a fixed and large number of iterations, or until the errors for all data points are smaller than a desired small number.

The on-line stochastic gradient descent learning algorithm

Alternative, we could update all weights, by taking one data point at the time. Start with a set of arbitrary weights

Then one can go back to select another point from the data set and repeat the process.

While batch learning is a gradient descent algorithm, on-line learning algorithm is a stochastic gradient descent algorithm.


Figure 9. Labeled data that we use as training set to learn the three weights of the neuron. Apples are represented in purple, and have target value t=1. Oranges are represented in green, and have target value t=0.

How well does the batch learning algorithm do?

Let’s go back to our apple/oranges classification task, using a neuron with two inputs and . Perhaps could be a measure of the skin color of the fruit, and could be a measure of the roughness of the skin for each fruit example.

There are then three weights , including the bias . The activity rule is given by


Figure 10. Parameter evolution as a function of the number of iterations in the batch-learning algorithm.

We assume we have labeled data, that is examples in which we have measured the value of the two variables and , and we know whether they are apples, A(t=1) or oranges, O(t=0). The labeled data is given in Figure 9.

We perform batch gradient descent with learning rate . Figure 10 describes the changes in the parameter values as a function of the number of iterations. Figure 11 describes contour plots for different number of iterations.

Regularization: beyond descent on the error function


Figure 11. Contour plots for different number of iterations

We have seen that if the data is in fact linearly separable, the algorithm works. But as the number of iterations becomes larger and larger, the weights also become larger in magnitude, and the logistic function becomes steeper and steeper (Figure 11). That is an undesirable behavior named overfitting.

Why is it undesirable?

A way to avoid having weights arbitrarily large and overfitting to the data has the name of regularization. The idea consists of adding a term to the optimization function such that it penalizes large values of the weights. We introduce the modified objective function

where

The weight update rule in the presence of this regularization becomes

and the parameter is called the weight decay regularizer.


Figure 12. Effect of regularization for a linear neuron.

In Figure 12, we show examples of different weight-decay regularizer values and its effect on the learned weights.

Learning as a communication channel


Figure 13. Learning as communication. Figure extracted from MacKay's Chapter 40, combined with the cover of his lectures.

We can think of learning as a communication method. From a large number of inputs , we learn parameters (weights) which normally live in a much lower dimensional space than the input data , . Then I could give you the learned weights and just the inputs, and you should be able to approximate the label inputs by using the activity of the neural network

which in a well trained network should be similar to the original labels .

The processes of estimating the parameters receives the name of coding. Coding is what in a probabilistic framework, we have been calling estimating the posterior probability of the parameters . (See Figure 13.)

The process of reconstructing the labels from the weights receives the name of decoding. Decoding is what in a probabilistic framework, we have been callingin this course the probability of the data given a set of parameters (aka the likelihhod of the parameters) . (See Figure 13.)

Learning as inference (Probabilistic interpretation of learning)

Every time someone is optimizing a function, you can take it, exponentiate it, and try to interpret it as a probability distribution. Here we are going to do just that for our one neuron network.

For the single neuron network, we find the weights of the neuron by optimizing the function of the weights,

is the error function that depends on the data , and the inputs . On the other hand, is the regularization term that does not depend on the data. This should remind you of the posterior probability of the parameters

where depends on the data, but the prior does not.

We can interpret that the posterior probability of the weights as given by

such that, the probability of the data in terms of the error function

and the prior in terms of the regularization term

Because of expression for , we can see that for each individual label

or

For the regularization function , then the prior probability, is a Gaussian distribution

with mean , and . Therefore .

Making further inferences

With a probabilistic interpretation of learning in hand, now we can use our neuron not just to learn the parameters, and fit the given data, but also to make inference about the probability of a future label corresponding to a set of new inputs as

that is, we can calculate the neuronal output being , by summing to the probability of the neuronal output being for all possible values of the parameters weighted by the probability of the parameters given the data so far.

For our neuron with two inputs

Calculating the posterior probability can be a bear, even for a single neuron with two inputs. We need to use some approximate method.

Monte Carlo implementation for a single neuron

We are going to use an approximate sampling method, the Monte Carlo approach, in which we take a number of samples of values of the weights, according to the posterior distribution, and then we approximate

How to obtain the representative parameter samples ?

Here I am going to use the Metropolis-Hastings Monte Carlo method (there are others). We start with one arbitrary set of parameters (weights) , and calculate .

These two steps get repeated for a large number of iterations.


Figure 14. Parameter evolution as a function of the number of iterations in the Metropolis-Hastings Monte Carlo algorithm. The total number of iterations is 100,000. The learning rate is 0.01.

We can use the Metropolis-Hastings Monte Carlo algorithm to calculate , for our problem in Figure 9 of separating apples from oranges, which we can compare with the batch-learning algorithm we implemented before.


Figure 15. Contour plots for different number of iterations of the MC algorithm. The contours correspond to y=0.5, y=0.27 and y=0.73.

In Figure 14, we observe how the weights evolve with the iterations of the Monte Carlo algorithm, and how that compares to Figure 10 for the batch-learning algorithm. The Monte Carlo approach introduces a lot more variability in the weights space.

In Figure 15, we describe several contours from MC iterations, starting after iteration 10,000.


Figure 16. (Left) Contours obtained after averaging 30 samples of weights sampled from the posterior using the MC algorithm. (Right) contours obtained by taking the most probable parameters under the batch-learning algorithm.

In Figure 16, we show the result of estimating the using Bayesian inference, compared to the same calculation using a fixed value of the weights obtained by backpropagation.

What do you think of the difference? I think the Bayesian approach of taking several samples from the posterior distribution (Figure 16, left) is a big improvement over just taking one value of the parameters (Figure 16, right). Can you tell why?

Feedback networks

Our one neuron network belongs to the category of feedforward networks, in which arrows flow in one direction. Another type of networks are those in which all neurons are connected to all other neurons, those are called feedback networks (Figure 1).

Now we are going to study of a particular type of feedback network in the context of a “memory problem”.


Figure 17. Memory representation for several letters.

Content addressable memories

We would like to design a network to mimic how human memory works. In particular, we want our network to memorize letter. I will describe letters with a grid. Figure 17 has some examples.


Figure 18. Noisy memory representations of an "A".

Some basic properties of human memory that we would like our memory network to reproduce are:

For instance, if the network has learned “A”, it should be able to recognize as “A’s” all three variants shown in Figure 18.

We are going to build this memory network using a Hopfield network, which an example of a feedback network.

Hopfield Network Definition


Figure 19. Architecture of a Hopfield feedback network (from MacKay's Chapter 42).

In a Hopfield network, there are N neurons all connected to each other.

Now we have all we need to know about a Hopfield feedback network.

If we want to build a Hopfield network for one of our letters, say “A”, we need neurons connected by weights. Letter “A” (Figure 17) is represented by

 "A" = -1, -1, 1, -1, -1, -1, 1, -1, 1, -1, -1, 1, 1, 1, -1, -1, 1, -1, 1, -1, 1, -1, -1, -1, 1.

Using Hebb’s rule for , we can build the weight of the network which are given by

 0  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
 1  0 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
-1 -1  0 -1 -1 -1  1 -1  1 -1 -1  1  1  1 -1 -1  1 -1  1 -1  1 -1 -1 -1  1 
 1  1 -1  0  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
 1  1 -1  1  0  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
 1  1 -1  1  1  0 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
-1 -1  1 -1 -1 -1  0 -1  1 -1 -1  1  1  1 -1 -1  1 -1  1 -1  1 -1 -1 -1  1 
 1  1 -1  1  1  1 -1  0 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
-1 -1  1 -1 -1 -1  1 -1  0 -1 -1  1  1  1 -1 -1  1 -1  1 -1  1 -1 -1 -1  1 
 1  1 -1  1  1  1 -1  1 -1  0  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
 1  1 -1  1  1  1 -1  1 -1  1  0 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  1 -1 
-1 -1  1 -1 -1 -1  1 -1  1 -1 -1  0  1  1 -1 -1  1 -1  1 -1  1 -1 -1 -1  1 
-1 -1  1 -1 -1 -1  1 -1  1 -1 -1  1  0  1 -1 -1  1 -1  1 -1  1 -1 -1 -1  1 
-1 -1  1 -1 -1 -1  1 -1  1 -1 -1  1  1  0 -1 -1  1 -1  1 -1  1 -1 -1 -1  1 
 1  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  0  1 -1  1 -1  1 -1  1  1  1 -1 
 1  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  0 -1  1 -1  1 -1  1  1  1 -1 
-1 -1  1 -1 -1 -1  1 -1  1 -1 -1  1  1  1 -1 -1  0 -1  1 -1  1 -1 -1 -1  1 
 1  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  0 -1  1 -1  1  1  1 -1 
-1 -1  1 -1 -1 -1  1 -1  1 -1 -1  1  1  1 -1 -1  1 -1  0 -1  1 -1 -1 -1  1 
 1  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  0 -1  1  1  1 -1 
-1 -1  1 -1 -1 -1  1 -1  1 -1 -1  1  1  1 -1 -1  1 -1  1 -1  0 -1 -1 -1  1 
 1  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  0  1  1 -1 
 1  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  0  1 -1 
 1  1 -1  1  1  1 -1  1 -1  1  1 -1 -1 -1  1  1 -1  1 -1  1 -1  1  1  0 -1 
-1 -1  1 -1 -1 -1  1 -1  1 -1 -1  1  1  1 -1 -1  1 -1  1 -1  1 -1 -1 -1  0 .

The memory challenge

Can the network trained on “A”, recognize an imperfect A?

Using this network trained on “A”, we can see how well the network fixes imperfect memories. We provided to the network imperfect “A”s with variable number of wrong pixels, and we found:

Average error for A: before   after
                      5.24%   0.00%
		  8.38%   0.00%
		 12.52%   0.00%
                     16.39%   0.00%
		 20.10%   0.08%

That is the network corrects most of the erroneous versions of “A”.

What happens if we add another letter to the network?

It is easy to see how to the Hebb’s rule allows to add new memories to the network. If we now add the letters “C” and “Z”, the weights change to:

 0  1 -1  1  1 -1 -1  1  1  1 -1 -1  1 -1  1 -1  1  1 -1  1  1  1  1  1 -1 
 1  0  1  3  3  1 -3 -1 -1 -1  1 -3 -1 -3 -1  1 -1 -1 -3 -1 -1  3  3  3  1
-1  1  0  1  1 -1 -1 -3  1 -3 -1 -1  1 -1 -3 -1  1 -3 -1 -3  1  1  1  1  3 
 1  3  1  0  3  1 -3 -1 -1 -1  1 -3 -1 -3 -1  1 -1 -1 -3 -1 -1  3  3  3  1 
 1  3  1  3  0  1 -3 -1 -1 -1  1 -3 -1 -3 -1  1 -1 -1 -3 -1 -1  3  3  3  1 
-1  1 -1  1  1  0 -1  1 -3  1  3 -1 -3 -1  1  3 -3  1 -1  1 -3  1  1  1 -1 
-1 -3 -1 -3 -3 -1  0  1  1  1 -1  3  1  3  1 -1  1  1  3  1  1 -3 -3 -3 -1 
 1 -1 -3 -1 -1  1  1  0 -1  3  1  1 -1  1  3  1 -1  3  1  3 -1 -1 -1 -1 -3 
 1 -1  1 -1 -1 -3  1 -1  0 -1 -3  1  3  1 -1 -3  3 -1  1 -1  3 -1 -1 -1  1 
 1 -1 -3 -1 -1  1  1  3 -1  0  1  1 -1  1  3  1 -1  3  1  3 -1 -1 -1 -1 -3 
-1  1 -1  1  1  3 -1  1 -3  1  0 -1 -3 -1  1  3 -3  1 -1  1 -3  1  1  1 -1 
-1 -3 -1 -3 -3 -1  3  1  1  1 -1  0  1  3  1 -1  1  1  3  1  1 -3 -3 -3 -1 
 1 -1  1 -1 -1 -3  1 -1  3 -1 -3  1  0  1 -1 -3  3 -1  1 -1  3 -1 -1 -1  1 
-1 -3 -1 -3 -3 -1  3  1  1  1 -1  3  1  0  1 -1  1  1  3  1  1 -3 -3 -3 -1 
 1 -1 -3 -1 -1  1  1  3 -1  3  1  1 -1  1  0  1 -1  3  1  3 -1 -1 -1 -1 -3 
-1  1 -1  1  1  3 -1  1 -3  1  3 -1 -3 -1  1  0 -3  1 -1  1 -3  1  1  1 -1 
 1 -1  1 -1 -1 -3  1 -1  3 -1 -3  1  3  1 -1 -3  0 -1  1 -1  3 -1 -1 -1  1 
 1 -1 -3 -1 -1  1  1  3 -1  3  1  1 -1  1  3  1 -1  0  1  3 -1 -1 -1 -1 -3 
-1 -3 -1 -3 -3 -1  3  1  1  1 -1  3  1  3  1 -1  1  1  0  1  1 -3 -3 -3 -1 
 1 -1 -3 -1 -1  1  1  3 -1  3  1  1 -1  1  3  1 -1  3  1  0 -1 -1 -1 -1 -3 
 1 -1  1 -1 -1 -3  1 -1  3 -1 -3  1  3  1 -1 -3  3 -1  1 -1  0 -1 -1 -1  1 
 1  3  1  3  3  1 -3 -1 -1 -1  1 -3 -1 -3 -1  1 -1 -1 -3 -1 -1  0  3  3  1 
 1  3  1  3  3  1 -3 -1 -1 -1  1 -3 -1 -3 -1  1 -1 -1 -3 -1 -1  3  0  3  1 
 1  3  1  3  3  1 -3 -1 -1 -1  1 -3 -1 -3 -1  1 -1 -1 -3 -1 -1  3  3  0  1 
-1  1  3  1  1 -1 -1 -3  1 -3 -1 -1  1 -1 -3 -1  1 -3 -1 -3  1  1  1  1  0 .

After adding three letter to the network, it is still capable of fixing most of the bad representations of A

Average error for A: before   after
                      5.24%   0.29%
		  8.38%   0.45%
                     12.52%   0.86%
                     16.39%   1.70%
		 20.10%   2.87%

By the time we have added 5 memories, the network becomes more and more imperfect.

Average error for A: before    after
                      5.24%     3.51%
                      8.38%     6.11%
		  12.52%    8.25%
                      16.39%   10.94%
                      20.10%   13.65%

Similarly, we could observe the behavior of the network if we corrupt some of the weights in the network.