# Week 8 section¶

In this notebook we mainly cover Hopfield networks, which we didn't get to discuss much in class. This is one option for implementing the w08 homework.

The other option is to build a feedforward one-neuron network to distinguish between two classes of letters. If you choose this option, we have a few tips at the end of this notebook.

## Hopfield networks¶

For this week's homework, one option is to implement a Hopfield network that can hopefully distinguish letters. As always, the first hurdle is the data.

From the homework's website, here are lists populated with URLs that point to the reference and handwritten characters we are going to use throughout:

In a similar fashion to what we've done in previous weeks, we are going to define a function that can read these letters from the URLs. We know that these letters are going to be 5x5 grids populated with certain characters, and we might have more than one example of the letter at each URL:
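A minimal sketch of such a loader follows. The file format here is an assumption (each example as 5 lines of 5 characters, with `#` marking an ON pixel); the real files on the homework site may use different markers:

```python
import numpy as np

def load_letter(text):
    # Parse a text dump of 5x5 letters into an (n, 5, 5) array of +1/-1.
    # Assumed format: each example is 5 lines of 5 characters, with '#'
    # marking an ON pixel; blank lines between examples are ignored.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    chars = np.array([list(ln[:5]) for ln in lines]).reshape(-1, 5, 5)
    # np.where maps the character grid onto the +1/-1 pixel convention
    return np.where(chars == '#', 1, -1)
```

The `reshape(-1, 5, 5)` lets one file hold any number of stacked 5-line examples.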

Let's test it and see how we can visualize these letters:
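Since each letter is just a 5x5 array of +1/-1, a quick text rendering works even without matplotlib. A sketch (the A pattern below is the one consistent with the stored A used later in this notebook):

```python
import numpy as np

def show_letter(img):
    # Render a 5x5 array of +1/-1 as text: '#' for ON, '.' for OFF.
    # (plt.imshow(img) is the graphical alternative.)
    return "\n".join("".join('#' if v > 0 else '.' for v in row) for row in img)

A = np.array([
    [-1, -1, +1, -1, -1],
    [-1, +1, -1, +1, -1],
    [-1, +1, +1, +1, -1],
    [-1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1],
])
print(show_letter(A))
# prints:
# ..#..
# .#.#.
# .###.
# .#.#.
# #...#
```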

There seem to be 9 examples of letter A in this data, let's go ahead and see all of them:

How about the first example of each letter?

What about the handwritten examples for letter A?

There seem to be 2000 examples of the handwritten character A; let's look at just a couple of them:

The handwritten letters aren't as clean as the reference ones, but we will implement the Hopfield network and see how many of them can be recognized as A's.

How do we implement the Hopfield network?

From MacKay's textbook:

In this chapter we will discuss a fully connected feedback network called the Hopfield network. The weights in the Hopfield network are constrained to be symmetric, i.e., the weight from neuron i to neuron j is equal to the weight from neuron j to neuron i.

Hopfield networks have two applications. First, they can act as associative memories.

It then goes on to describe Hebbian learning in a way that is very intuitive. If you are struggling to understand what's going on conceptually, I recommend reading MacKay's section 42.1.

Let's break down a couple of things that you are going to want to be able to do in order to implement the weight matrix.

### The weight matrix¶

Hopfield networks are comprised of fully connected neurons. That means every neuron is connected to every other neuron. These connections describe the flow of information through the network using weights. In Hopfield networks, the connections are symmetric; therefore $w_{1,2} = w_{2,1}$. In this notation, $w_{m,n}$ corresponds to the $m$th row and $n$th column of $\mathbf{W}$, the weight matrix. In other words:

$$\forall m,n: \quad w_{n,m} = w_{m,n}$$

The neurons do not have self-connections: the weight for the connection between a neuron and itself is zero. In other words:

$$w_{n,n} = 0$$

This already gives us a pretty good idea of what such a weight matrix has to look like: zeros on the main diagonal and symmetric along that axis.
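Both constraints are easy to build and check in code. A minimal sketch, using a random symmetric matrix as a stand-in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((25, 25))   # random stand-in for trained weights
W = (M + M.T) / 2                   # symmetrize: w_{nm} = w_{mn}
np.fill_diagonal(W, 0)              # no self-connections: w_{nn} = 0

assert np.allclose(W, W.T)          # symmetric along the main diagonal
assert np.all(np.diag(W) == 0)      # zeros on the main diagonal
```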

Now we want to encode our 5 by 5 image into a format that the network can take as input. We can achieve this by linearizing the array in which we are storing the information, using NumPy's flatten() method. For example:

Notice that this is just a 1-dimensional representation of the same image: we simply take the first row, then the second, and so on.

Given a 25-dimensional input vector $y$, and the 25 $\times$ 25 dimensional weight matrix $w$, then the activation $a$ of the $n$ th neuron, $a_n$, is

$$\begin{aligned} a_n &= \sum_{m=1}^{25} w_{nm} y_m \\ &= w_{n,1} y_1 + w_{n,2} y_2 + \dots + w_{n,25} y_{25} \end{aligned}$$

Let's say we trained our connectivity matrix by storing a perfect A. As seen in the lecture notes, the first row of $w$ is

0 1 -1 1 1 1 -1 1 -1 1 1 -1 -1 -1 1 1 -1 1 -1 1 -1 1 1 1 -1

Now let's say we provide as input to the network a perfect A -- we better get a perfect A as the output of the network!

Let's just focus on the first row of our 5 $\times$ 5 digit representation. Our perfect A input is $y = [-1, -1, +1, -1, -1, \dots ]$.

To get a perfect A output, we basically need this same vector: the first pixel needs to be off, as does the second, the third pixel needs to be on, and pixels 4 and 5 need to be off.

Let's look at the output of the network when providing the above $y$ as input.

$$\begin{aligned} a_1 &= w_{1,1} y_1 + w_{1,2} y_2 + \dots + w_{1,25} y_{25} \\ &= 0(-1) +1(-1) -1(+1) +1(-1) +1(-1) + \dots \\ &= -24 \end{aligned}$$

Ignoring the first position (because neuron 1 can't talk to itself), every single time $y$ has a $+1$ in an entry, the corresponding connectivity matrix entry is a $-1$, and vice versa. This makes sense -- if we want a perfect A as an output, we need the first pixel to be off. By storing the perfect A in the network, we've encoded the information that when the first pixel is off, so are the second pixel and pixels 4 and 5, but pixel 3 is on. The above operation yields a very negative number -- that is the activation $a_1$. We get an activity $y_n$ from an activation $a_n$ via

$$y(a) = \begin{cases} +1 & a \geq 0 \\ -1 & a < 0 \end{cases}$$

This is where np.where() might come in handy: it checks a condition on every entry of a vector (for example, whether each entry is greater than or equal to 0) and replaces each value with +1 or -1 accordingly. We use this function above in the load_letter function.
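For example, applying the threshold rule to a hypothetical vector of activations:

```python
import numpy as np

a = np.array([-24.0, 3.0, 0.0, -1.5, 7.0])   # hypothetical activations
y = np.where(a >= 0, 1, -1)                   # +1 where a >= 0, else -1
print(y)                                      # [-1  1  1 -1  1]
```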

So by this rule, $y_1 = y(a_1) = y(-24) = -1$.

Now, what if we passed in a slightly corrupted version of an A? Again let's just restrict ourselves to the top row of the image. Say that we had the second pixel on, when ideally it should be off: $y_{corrupt} = [-1, +1, +1, -1, -1, \dots ]$.

$$\begin{aligned} a_1 &= w_{1,1} y_{corrupt, 1} + w_{1,2} y_{corrupt, 2} + \dots + w_{1,25} y_{corrupt, 25} \\ &= 0(-1) +1(+1) -1(+1) +1(-1) +1(-1) + \dots \\ &= -22 \end{aligned}$$

A-ha: we have a single mismatch at the second index, contributing a $+1$ to our summation. But there are so many matches at all the other pixels that the activation is still very negative, so the activity $y_1$ after passing in the imperfect A is still $-1$ and the pixel is off.
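Both numbers are easy to verify numerically. A sketch, building $w$ by storing the single perfect A (the outer-product construction used here is Hebb's rule, covered in the next section; the A pattern is the one consistent with the row of $w$ shown above):

```python
import numpy as np

A = np.array([                 # perfect A, +1 = ON
    [-1, -1, +1, -1, -1],
    [-1, +1, -1, +1, -1],
    [-1, +1, +1, +1, -1],
    [-1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1],
])
y = A.flatten()
W = np.outer(y, y)             # store the perfect A
np.fill_diagonal(W, 0)         # no self-connections

print(W[0] @ y)                # a_1 for the perfect A: -24

y_corrupt = y.copy()
y_corrupt[1] = +1              # turn the second pixel on
print(W[0] @ y_corrupt)        # a_1 for the corrupted A: -22
```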

This is the fundamental property of the Hopfield network -- by storing patterns and the pairwise interactions between pixels, then imperfect versions of the patterns that get passed in to the network can get "corrected" by all the stored pairwise information between pixels -- you just need enough pixels to be on when they should be on, and off when they should be off, to fix imperfect patterns.

### But how do we actually set the weights?¶

For that we are going to use Hebb's rule: we create a matrix of the appropriate size and populate it according to the rule. For a single pattern, the rule is:

$$w_{nm} = y_n y_m$$

Let's say we wanted to add the memory A to the network. Here is how the first row would look:
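For a single stored pattern this is just an outer product, with the diagonal zeroed out afterwards. A sketch, storing the perfect A and printing the first row of the weight matrix (it should match the row of $w$ shown earlier):

```python
import numpy as np

A = np.array([                 # perfect A, +1 = ON
    [-1, -1, +1, -1, -1],
    [-1, +1, -1, +1, -1],
    [-1, +1, +1, +1, -1],
    [-1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1],
])
y = A.flatten()
W = np.outer(y, y)        # w_{nm} = y_n * y_m
np.fill_diagonal(W, 0)    # w_{nn} = 0
print(W[0])               # first row of the weight matrix
```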

If you want to store $K$ patterns, you add up the products at the $n$, $m$ entry over all patterns:

$$w_{nm} = y^{(1)}_{n} y^{(1)}_{m} + y^{(2)}_{n} y^{(2)}_{m} + \dots + y^{(K)}_{n} y^{(K)}_{m} = \sum_{k=1}^K y_n^{(k)} y_m^{(k)}$$

In your own implementation, it would be tedious writing the above out by hand -- think about how you could use a for loop to build the weight matrix instead.
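One way to sketch that loop (the function name and the (K, 25) pattern layout are assumptions):

```python
import numpy as np

def hebbian_weights(patterns):
    # patterns: shape (K, 25) array of +1/-1 memories to store
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)    # add y_n^{(k)} y_m^{(k)} for this pattern
    np.fill_diagonal(W, 0)     # keep the no-self-connection constraint
    return W
```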

## Tips for feedforward neuron model¶

• Choose two letters -- say A and B, or some other pair, and collect all of the imperfect As and Bs -- let's call this your data $X$, and their labels $t$ (whether each imperfect letter is an A or a B -- you need to pick one letter to be +1 and the other to be -1).
• It is good practice to split up your data into a training set, which you use to train a model like your one-neuron network, and then a testing set, so that you can evaluate the model's performance on unseen data.
• To do this, if you have 2000 examples of A and 2000 examples of B, it would be good to randomize the order of the data. If $N=4000$ (the total number of examples, both As and Bs), then you can get a random ordering via np.random.choice(np.arange(N), N, replace=False). Then, set some number of training points N_train, and use X_train = X_shuffled[:N_train] as your training examples. Here's an example of making a random shuffling:
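A sketch of the shuffle-and-split step. The names `X`, `t`, and `N_train` follow the bullets above; the toy random data just stands in for the real letters:

```python
import numpy as np

N = 4000                                        # 2000 As + 2000 Bs
X = np.random.choice([-1, 1], size=(N, 25))     # toy stand-in for the letters
t = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])  # +1 = A, -1 = B

order = np.random.choice(np.arange(N), N, replace=False)  # a random ordering
X_shuffled, t_shuffled = X[order], t[order]     # shuffle data and labels together

N_train = 3000
X_train, t_train = X_shuffled[:N_train], t_shuffled[:N_train]
X_test, t_test = X_shuffled[N_train:], t_shuffled[N_train:]
```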
• The neuron model actually has 26 parameters -- the activation $a$ is $w_0 + w_1 x_1 + \dots + w_{25} x_{25}$. The easiest way to implement this is to add a column of 1s to your $X$ data (think of it as a hidden $x_0=1$ that gets multiplied by $w_0$). Here's how you can append a column of 1s to an array:
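A sketch, using a small toy array:

```python
import numpy as np

X = np.random.choice([-1, 1], size=(8, 25))        # toy data: 8 examples
X_bias = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the hidden x_0 = 1
# X_bias has shape (8, 26); column 0 multiplies the bias weight w_0
```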
• The key thing to implement is the update of the weight vector:
• If you use stochastic gradient descent, then in each step of training, you
• pick a random data point, $X^{(m)}$,
• predict its label via the current state of the model, $y^{(m)} = \frac{1}{1+e^{-\sum_j w_{old, j} X^{(m)}_j }}$,
• then get its true label $t^{(m)}$, and
• update your weights via $w_{new} = w_{old} + \eta (t^{(m)} - y^{(m)}) X^{(m)}$
• If you use batch gradient descent, then in each step of training you pass through the entire dataset, so each weight update uses all of the data: $w_{new} = w_{old} + \eta \sum_n (t^{(n)} - y^{(n)}) X^{(n)}$
• It would be good to compare what the results look like with stochastic gradient descent and the batch version -- which method takes longer to run? By how much?
• When just starting out, it could be good to set some reasonable max number of steps that you know won't take too long to run, just to get a sense of the performance of the model. After that, think about convergence conditions (the loss function, the weights, etc.)
• KEY: look at the resulting weights of the network!!! Especially the weights tied to pixels of the letters (i.e. not $w_0$, the bias). Here's one way to link the 25 weights back to our 5x5 representation:
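A sketch (the weights here are random stand-ins for a trained model):

```python
import numpy as np

w = np.random.standard_normal(26)     # toy trained weights; w[0] is the bias
pixel_weights = w[1:].reshape(5, 5)   # map the 25 pixel weights onto the grid
# plt.imshow(pixel_weights) would then show which pixels vote for which letter
```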
• The weights in the one-neuron model have a direct interpretation. Let's say you said that letter A is "+1" and letter B is "-1", and you want to know whether a particular corrupted letter is more likely to be an A or a B. Each of the 25 non-bias weights in the network is tied to a pixel, and each relates to how likely it is that the letter is an A or B if the letter has that pixel ON. For instance, B's have flat edges along the top, so the first pixel, in the top left corner, is usually going to be ON in examples of B's. In contrast, A's have pointy tops and very rarely have lit-up top left corners. So you'd expect the weight of the first pixel to be negative (when you see an example with a lit-up first pixel, then that should clue you in to a B, which is "-1" in our scheme, so the contribution from that first pixel should be negative).
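The two update rules described in the tips above can be sketched side by side. Function and variable names here are assumptions, not a required API:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(w, X, t, eta=0.01):
    # Stochastic gradient descent: update from one random data point.
    m = np.random.randint(len(X))          # pick a random data point
    y_m = sigmoid(X[m] @ w)                # predict with current weights
    return w + eta * (t[m] - y_m) * X[m]   # w_new = w_old + eta (t - y) X

def batch_step(w, X, t, eta=0.01):
    # Batch gradient descent: every data point contributes to the update.
    y = sigmoid(X @ w)                     # predictions for the whole dataset
    return w + eta * X.T @ (t - y)         # vectorized sum_n (t_n - y_n) X^{(n)}
```

Running many such steps in a loop, and timing both variants, gives the comparison suggested above.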