MCB111: Mathematics in Biology (Spring 2018)
 Feedforward networks
 Feedback networks
week 08:
Neural Networks  Learning as Inference
In this lecture, I follow David Mackay very closely. In particular his lectures 15 and 16, which correspond to Chapters 39, 41 and 42 of his book Information Theory, Inference, and Learning algorithms.
Feedforward networks
A single neuron
A single neuron (Figure 1) has
 The inputs ,
 One output which is called the activity,
 Parameters , usually called the weights.
The neuron adds up the weighted sum of the inputs into a variable called the activation ,
Figure 1. One neuron (from D. Mackay's chapter 39).
where called the bias is the activation in the absence of inputs.
The activity of the neuron is a function of the activation . Several commonly used forms for the activity are

The linear logistic function

The sigmoid (tanh) function

The step function
Depending in how we combine many single neurons, there is essentially two types of networks: feedforward networks, where all the information flows in one direction, and feedback networks where all nodes are connected (Figure 2).
Figure 2. Two type of networks (from D. Mackay's Lecture 15).
Parts of a single neuron
As a recap, the basic concepts of a neuron are

The Architecture. A single neuron has a number of inputs , and one output . Each input has associated a weight , for .

The Activation. In response to the inputs, the neuron computes the activation
and , so we can add the bias in the same equation.

The Activity rule. The neuron output is set as a logistic/step function of the activation and the inputs .
We are going to study a neuron in which the output is between (0,1), such that the activity is given by the logistic function
In the sections, we will discuss some motivation for the linear logistic function.
The activity can be seen as the probability according to the neuron that the input deserves a response, and the neuron fires () or that the input is not worth a response and the neuron does not fire ().
Figure 3. Neuron activity for a neuron with two inputs as a function of the inputs. The values of the weights are: w0 = 15, w1 = 2 w2 = 1. We use the logistic function to describe the activity of the neuron.
The space of weights
Consider the simple situation of only two input to the neuron , and only two weights and no bias. The neuron’s activation is given by
and we can plot it as a function of the inputs as in Figure 3.
Figure 4. Effect on neuron activity of a zero weight.
If a weight is zero, there is no activity change associated with changes in the corresponding input, as described in Figure 4.
Figure 5. Effect on neuron activity of the bias term.
The effect of changing the value of the bias can be observed in Figure 5.
The effect of changing the value of the weights can be observed in Figure 6.
Figure 6. Effect on neuron activity of changes in one of the weights.
We can also use contour plots, in which we represent in the inputs space lines corresponding to different values of the activity . Using contour plots we observe in Figure 7 the effect of making the weight larger or smaller in absolute magnitude. Doubling the weights makes the contour lines closer to each other, while halving the weight make the same contour lines wider.
Figure 7. Effect on neuron activity of scaling the weights. The contours correspond to the values a = 0.0, 1, and 1. Arrows point in the direction of the weight vector.
What a single neuron can learn: to be a binary classifier
Imagine that you want to separate apples from oranges from a bucket in which they are all mixed up. Would you trust a single neuron to do that?
The learning rule
The central idea of supervised learning is this: given a number of examples of the input vector and their target output , we hope the neuron will learn their relationship (whichever that is).
Training the network requires finding the values of the weights that best fit the training data. If the neuron is well trained, given an input will produce an output which is very close to the target .
Data:
Outputs:
Error: where we expect these errors to be small.
Thus “learning” is equivalent to adjusting the parameters (weights) of the network such that the output of the network is close to the target for all examples.
Have we done this before in a different context?
In our apples and oranges example, let’s give an assignment of for an apple, and for an orange. In a well trained neuron will produce assignments similar to these:
Orange_1 y(Orange_1, w) = 0.01
Apple_1 y(Apple_1, w) = 0.91
Orange_2 y(Orange_2, w) = 0.09
Orange_3 y(Orange_3, w) = 0.05
Apple_2 y(Apple_1, w) = 0.97
How to do that? We find a function to optimize.
The error function
For each input , we introduce an objective function that will measure how close the neuron output is to . That objective function is also called the error function.
We introduce the error function
where .
Figure 8. Contribution to the error function G(w) for one data point as a function of y(n) for the two possible cases t(n)=0 and t(n)=1. (From D. Mackay's video lecture 15.)
</div>
For each data point , its contribution to the error function is given by Figure 8.
As a binary classification problem, such as sorting apples from oranges, we can interpret and (both between zero and one) as the probabilities of the two possible events: apple () or orange () for an input given the weights of the neuron, then is the “information content” (remember w01 lectures?) of the data .
Backpropagation
The training process is an exercise of minimizing , that is, of adjusting the weights so that reaches it lowest value.
Notice that is bound by below by zero,
and it is zero only when .
To minimize, we take the derivative of respect to one of the weights given by
Introducing
we obtain,
Taking all derivative together we construct the gradient vector,
The quantity is referred to as the error.
Backpropagation in the neural network community referees to given the errors, to calculate the gradient
Backpropagation differentiation.
Then we can implement a “gradient descent” method in which we iteratively update the weights by a quantity in the opposite direction to the gradient,
The parameter in the neural network community is referred to as the learning rate. And it is a free parameter that one has to set somehow.
Depending on whether we update the weights by looking to all data point at the time, or each one of them independently, we can distinguish two different algorithms, the batch learning algorithm and the online learning algorithm.
The batch gradient descent learning algorithm for a feedforward network
For a data set ,
 Start with a set of arbitrary weights , and use the activity rule to calculate for each data point
 Use backpropagation to calculate the next set of values for the weights as
We can repeat this two steps for a fixed and large number of iterations, or until the errors for all data points are smaller than a desired small number.
The online stochastic gradient descent learning algorithm
Alternative, we could update all weights, by taking one data point at the time. Start with a set of arbitrary weights
 Select one data point , and use the activity rule to calculate
 Use backpropagation to calculate the next set of values for the weights as
Then one can go back to select another point from the data set and repeat the process.
While batch learning is a gradient descent algorithm, online learning algorithm is a stochastic gradient descent algorithm.
Figure 9. Labeled data that we use as training set to learn the three weights of the neuron. Apples are represented in purple, and have target value t=1. Oranges are represented in green, and have target value t=0.
How well does the batch learning algorithm do?
Let’s go back to our apple/oranges classification task, using a neuron with two inputs and . Perhaps could be a measure of the skin color of the fruit, and could be a measure of the roughness of the skin for each fruit example.
There are then three weights , including the bias . The activity rule is given by
Figure 10. Parameter evolution as a function of the number of iterations in the batchlearning algorithm.
We assume we have labeled data, that is examples in which we have measured the value of the two variables and , and we know whether they are apples, A(t=1) or oranges, O(t=0). The labeled data is given in Figure 9.
We perform batch gradient descent with learning rate . Figure 10 describes the changes in the parameter values as a function of the number of iterations. Figure 11 describes contour plots for different number of iterations.
Regularization: beyond descent on the error function
Figure 11. Contour plots for different number of iterations
We have seen that if the data is in fact linearly separable, the algorithm works. But as the number of iterations becomes larger and larger, the weights also become larger in magnitude, and the logistic function becomes steeper and steeper (Figure 11). That is an undesirable behavior named overfitting.
Why is it undesirable?
A way to avoid having weights arbitrarily large and overfitting to the data has the name of regularization. The idea consists of adding a term to the optimization function such that it penalizes large values of the weights. We introduce the modified objective function
where
The weight update rule in the presence of this regularization becomes
and the parameter is called the weight decay regularizer.
Figure 12. Effect of regularization for a linear neuron.
In Figure 12, we show examples of different weightdecay regularizer values and its effect on the learned weights.
Learning as a communication channel
Figure 13. Learning as communication. Figure extracted from MacKay's Chapter 40, combined with the cover of his lectures.
We can think of learning as a communication method. From a large number of inputs , we learn parameters (weights) which normally live in a much lower dimensional space than the input data , . Then I could give you the learned weights and just the inputs, and you should be able to approximate the label inputs by using the activity of the neural network
which in a well trained network should be similar to the original labels .
The processes of estimating the parameters receives the name of coding. Coding is what in a probabilistic framework, we have been calling estimating the posterior probability of the parameters . (See Figure 13.)
The process of reconstructing the labels from the weights receives the name of decoding. Decoding is what in a probabilistic framework, we have been callingin this course the probability of the data given a set of parameters (aka the likelihhod of the parameters) . (See Figure 13.)
Learning as inference (Probabilistic interpretation of learning)
Every time someone is optimizing a function, you can take it, exponentiate it, and try to interpret it as a probability distribution. Here we are going to do just that for our one neuron network.
For the single neuron network, we find the weights of the neuron by optimizing the function of the weights,
is the error function that depends on the data , and the inputs . On the other hand, is the regularization term that does not depend on the data. This should remind you of the posterior probability of the parameters
where depends on the data, but the prior does not.
We can interpret that the posterior probability of the weights as given by
such that, the probability of the data in terms of the error function
and the prior in terms of the regularization term
Because of expression for , we can see that for each individual label
or
For the regularization function , then the prior probability, is a Gaussian distribution
with mean , and . Therefore .
Making further inferences
With a probabilistic interpretation of learning in hand, now we can use our neuron not just to learn the parameters, and fit the given data, but also to make inference about the probability of a future label corresponding to a set of new inputs as
that is, we can calculate the neuronal output being , by summing to the probability of the neuronal output being for all possible values of the parameters weighted by the probability of the parameters given the data so far.
For our neuron with two inputs
Calculating the posterior probability can be a bear, even for a single neuron with two inputs. We need to use some approximate method.
Monte Carlo implementation for a single neuron
We are going to use an approximate sampling method, the Monte Carlo approach, in which we take a number of samples of values of the weights, according to the posterior distribution, and then we approximate
How to obtain the representative parameter samples ?
Here I am going to use the MetropolisHastings Monte Carlo method (there are others). We start with one arbitrary set of parameters (weights) , and calculate .

Modify the weights by a random amount,
and calculate the new function .

Accept the new weights under one of these two conditions:

The new weights have better probability than the old ones
, that is accept if
This condition is like the gradient descent of backpropagation, in which for sure the new weights increase the probability.

Even if the new weights have worse probability than the old one , accept the new parameterization provided that
, that is, accept if .
where is a random number between 0 and 1.
Under this second condition condition, we are introducing some noise, hoping that the optimization will not get stuck in a local minimum.

These two steps get repeated for a large number of iterations.
Figure 14. Parameter evolution as a function of the number of iterations in the MetropolisHastings Monte Carlo algorithm. The total number of iterations is 100,000. The learning rate is 0.01.
We can use the MetropolisHastings Monte Carlo algorithm to calculate , for our problem in Figure 9 of separating apples from oranges, which we can compare with the batchlearning algorithm we implemented before.
Figure 15. Contour plots for different number of iterations of the MC algorithm. The contours correspond to y=0.5, y=0.27 and y=0.73.
In Figure 14, we observe how the weights evolve with the iterations of the Monte Carlo algorithm, and how that compares to Figure 10 for the batchlearning algorithm. The Monte Carlo approach introduces a lot more variability in the weights space.
In Figure 15, we describe several contours from MC iterations, starting after iteration 10,000.
Figure 16. (Left) Contours obtained after averaging 30 samples of weights sampled from the posterior using the MC algorithm. (Right) contours obtained by taking the most probable parameters under the batchlearning algorithm.
In Figure 16, we show the result of estimating the using Bayesian inference, compared to the same calculation using a fixed value of the weights obtained by backpropagation.
What do you think of the difference? I think the Bayesian approach of taking several samples from the posterior distribution (Figure 16, left) is a big improvement over just taking one value of the parameters (Figure 16, right). Can you tell why?
Feedback networks
Our one neuron network belongs to the category of feedforward networks, in which arrows flow in one direction. Another type of networks are those in which all neurons are connected to all other neurons, those are called feedback networks (Figure 1).
Now we are going to study of a particular type of feedback network in the context of a “memory problem”.
Figure 17. Memory representation for several letters.
Content addressable memories
We would like to design a network to mimic how human memory works. In particular, we want our network to memorize letter. I will describe letters with a grid. Figure 17 has some examples.
Figure 18. Noisy memory representations of an "A".
Some basic properties of human memory that we would like our memory network to reproduce are:

A desired memory stays (desired memories are attractive minima)

Can add one memory at the time without losing previous ones, and without requiring major changes in the network architecture.

Noisy versions of existing memories are identified as such.

Robust to memory impediments (e.g. you have been drinking).
For instance, if the network has learned “A”, it should be able to recognize as “A’s” all three variants shown in Figure 18.
We are going to build this memory network using a Hopfield network, which an example of a feedback network.
Hopfield Network Definition
Figure 19. Architecture of a Hopfield feedback network (from MacKay's Chapter 42).
In a Hopfield network, there are N neurons all connected to each other.

Architecture
neurons fully connected with bidirectional connections (weights) that are symmetric
Neurons do not have a selfconnection, that is .
We denote as the output of neuron .

Activity
Every neuron’s output is an input for all other neurons:

Activity
Uses a step function

Learning rule
For a Hopfield network with neurons, there are weights to determine.
The rule to set the weights is the Hebb’s rule. If we want to learn letters (or patterns), we assign
The Hebb rule has a biological motivation. If two neurons are correlated, it makes sense that an increase in the activity of one of the neurons would increase the activity of the other neuron. Hebb’s rule introduced in 1949 by Donald Hebb does exactly that.
Now we have all we need to know about a Hopfield feedback network.
If we want to build a Hopfield network for one of our letters, say “A”, we need neurons connected by weights. Letter “A” (Figure 17) is represented by
"A" = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.
Using Hebb’s rule for , we can build the weight of the network which are given by
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 .
The memory challenge
Can the network trained on “A”, recognize an imperfect A?
Using this network trained on “A”, we can see how well the network fixes imperfect memories. We provided to the network imperfect “A”s with variable number of wrong pixels, and we found:
Average error for A: before after
5.24% 0.00%
8.38% 0.00%
12.52% 0.00%
16.39% 0.00%
20.10% 0.08%
That is the network corrects most of the erroneous versions of “A”.
What happens if we add another letter to the network?
It is easy to see how to the Hebb’s rule allows to add new memories to the network. If we now add the letters “C” and “Z”, the weights change to:
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 1 3 3 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 3 3 1
1 1 0 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 1 3 1 1 1 1 3
1 3 1 0 3 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 3 3 1
1 3 1 3 0 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 3 3 1
1 1 1 1 1 0 1 1 3 1 3 1 3 1 1 3 3 1 1 1 3 1 1 1 1
1 3 1 3 3 1 0 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 3 3 1
1 1 3 1 1 1 1 0 1 3 1 1 1 1 3 1 1 3 1 3 1 1 1 1 3
1 1 1 1 1 3 1 1 0 1 3 1 3 1 1 3 3 1 1 1 3 1 1 1 1
1 1 3 1 1 1 1 3 1 0 1 1 1 1 3 1 1 3 1 3 1 1 1 1 3
1 1 1 1 1 3 1 1 3 1 0 1 3 1 1 3 3 1 1 1 3 1 1 1 1
1 3 1 3 3 1 3 1 1 1 1 0 1 3 1 1 1 1 3 1 1 3 3 3 1
1 1 1 1 1 3 1 1 3 1 3 1 0 1 1 3 3 1 1 1 3 1 1 1 1
1 3 1 3 3 1 3 1 1 1 1 3 1 0 1 1 1 1 3 1 1 3 3 3 1
1 1 3 1 1 1 1 3 1 3 1 1 1 1 0 1 1 3 1 3 1 1 1 1 3
1 1 1 1 1 3 1 1 3 1 3 1 3 1 1 0 3 1 1 1 3 1 1 1 1
1 1 1 1 1 3 1 1 3 1 3 1 3 1 1 3 0 1 1 1 3 1 1 1 1
1 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 0 1 3 1 1 1 1 3
1 3 1 3 3 1 3 1 1 1 1 3 1 3 1 1 1 1 0 1 1 3 3 3 1
1 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 1 0 1 1 1 1 3
1 1 1 1 1 3 1 1 3 1 3 1 3 1 1 3 3 1 1 1 0 1 1 1 1
1 3 1 3 3 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 0 3 3 1
1 3 1 3 3 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 0 3 1
1 3 1 3 3 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 3 0 1
1 1 3 1 1 1 1 3 1 3 1 1 1 1 3 1 1 3 1 3 1 1 1 1 0 .
After adding three letter to the network, it is still capable of fixing most of the bad representations of A
Average error for A: before after
5.24% 0.29%
8.38% 0.45%
12.52% 0.86%
16.39% 1.70%
20.10% 2.87%
By the time we have added 5 memories, the network becomes more and more imperfect.
Average error for A: before after
5.24% 3.51%
8.38% 6.11%
12.52% 8.25%
16.39% 10.94%
20.10% 13.65%
Similarly, we could observe the behavior of the network if we corrupt some of the weights in the network.