Mini-Review: Neural Network Models

(1) McCulloch & Pitts (1943)

Let the incoming signals be x_i (each taking the value 0 or 1) and the outgoing signal be y.  Almost all neural network models assume:

        y = f( Σ_i w_i x_i - s )

where s is a threshold value, w_i is the synaptic strength, and the function f(x) is 0 for very small x and 1 for large x.  For example, we can use a step function h(x).  McCulloch and Pitts showed that within this framework, any logical function y(x_1, x_2, ..., x_n) can be realized by properly choosing the constants w_i, s, and the connections of the "neurons".

"Logical circuit" (h(x) is the step function, h(x) = 0 if x < 0, and h(x) =1 if x >= 0)

(a) inverter:   y = h(-2x + 1)

(b) logical AND: y = h(x_1 + x_2 - 1.5)
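As a small illustration, here is a minimal Python sketch of such a unit with the step function h, wired as the inverter and the logical AND above (NumPy is assumed; the names mp_unit and h are ours, chosen for this sketch):

    import numpy as np

    def h(x):
        # step function: 0 for x < 0, 1 for x >= 0
        return np.where(x >= 0, 1, 0)

    def mp_unit(w, s, x):
        # McCulloch-Pitts unit: y = h( sum_i w_i x_i - s )
        return h(np.dot(w, x) - s)

    # (a) inverter: y = h(-2x + 1), i.e. w = [-2], threshold s = -1
    for x in (0, 1):
        print("NOT", x, "->", mp_unit([-2], -1, [x]))

    # (b) logical AND: y = h(x1 + x2 - 1.5), i.e. w = [1, 1], s = 1.5
    for x1 in (0, 1):
        for x2 in (0, 1):
            print("AND", (x1, x2), "->", mp_unit([1, 1], 1.5, [x1, x2]))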

Question: How would you build a logical OR using the inverter and logical AND only?  How would you build an exclusive OR?

McCulloch & Pitts model does not address the issue of learning, and in addition, the logical operation is not error tolerant. 

(2) Hebb's Hypothesis

In his famous book "The Organization of Behavior", Hebb proposed the idea that the connection between two neurons is plastic: the synaptic strength w changes in proportion to the correlation between the activities of the pre-synaptic and post-synaptic cells.  A simple mathematical model is to change w according to

        w_i <- w_i + ε y(x) x_i
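A minimal numerical sketch of this update (Python/NumPy; the step size ε = 0.1 and the activity values are purely illustrative):

    import numpy as np

    def hebb_update(w, x, y, eps=0.1):
        # Hebb rule: w_i <- w_i + eps * y * x_i
        # the synapse strengthens when pre- and post-synaptic activities coincide
        return w + eps * y * x

    w = np.zeros(3)                 # initial synaptic strengths
    x = np.array([1.0, 0.0, 1.0])   # pre-synaptic activities
    y = 1.0                         # post-synaptic activity
    w = hebb_update(w, x, y)
    print(w)                        # only the synapses driven by active inputs grow: [0.1 0.  0.1]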

(3) Perceptron (Rosenblatt 1958)

Hebb's idea can be used to solve a classification problem.  This consists of two phases.  First, the network is trained on examples.  After the training session, the synaptic strengths are fixed, and the output produced from a given input is then taken as the result of the "neural computation".

In a single-layer neural network, let the column vector x (real-valued, of dimension N) be the input and the column vector y (a binary pattern, of dimension M) be the output.  The M by N matrix W is updated on each training example as

        W <- W + ε (y_corr - y) x^T

where x^T is the transpose of x, i.e. a row vector, so that (y_corr - y) x^T is an M by N matrix.  If a solution exists, it can be shown that this algorithm converges in a finite number of steps for sufficiently small ε.  It turns out that the single-layer network has severe limitations.  For example, an exclusive OR cannot be realized.  Thus multi-layer networks are introduced.
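A sketch of the training phase described above (NumPy; the data, the appended constant input playing the role of a threshold, and ε are illustrative choices):

    import numpy as np

    def step(u):
        return np.where(u >= 0, 1, 0)

    def train_perceptron(X, Y, eps=0.5, epochs=20):
        # X: p inputs of dimension N; Y: p binary targets of dimension M
        p, N = X.shape
        M = Y.shape[1]
        W = np.zeros((M, N))
        for _ in range(epochs):
            for x, y_corr in zip(X, Y):
                y = step(W @ x)                      # current output
                W += eps * np.outer(y_corr - y, x)   # W <- W + eps (y_corr - y) x^T
        return W

    # logical AND is linearly separable and is learned; an exclusive OR would not be.
    # A constant 1 is appended to each input to play the role of a threshold.
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
    Y = np.array([[0], [0], [0], [1]], dtype=float)
    W = train_perceptron(X, Y)
    print(step(X @ W.T).ravel())    # [0 0 0 1]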

(4) Associative Memory

Memory plays an extremely important role in brain function.  The associative memory network is very much like the classification problem.  Consider that p pairs of "information" (x_1, y_1), (x_2, y_2), ..., (x_p, y_p) are stored in the network.  Then if x_k is presented to the system, y_k results.  This is done in a robust and error-tolerant way.  In a general framework, we can think of solving the learning problem by error minimization, for example,

        min_W  Σ_{k=1,2,...,p} | y_k - f(W x_k) |^2

where | ... | is the 2-norm (Euclidean distance) and W is the matrix of synaptic strengths.  The problem becomes easy if the function f is linear, e.g. f(x) = x: it is then the standard linear least-squares problem, and standard numerical techniques can be applied.  The Widrow-Hoff gradient-descent learning rule

        W <- W + ε (y - Wx) x^T

is a special case when the network is linear. 
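A sketch contrasting the direct least-squares solution with the iterative Widrow-Hoff updates for a linear network (NumPy; the dimensions, the random data, and ε are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # p stored pairs (x_k, y_k) generated by a linear map y = W_true x
    N, M, p = 4, 2, 50
    X = rng.normal(size=(p, N))
    W_true = rng.normal(size=(M, N))
    Y = X @ W_true.T

    # direct solution of min_W sum_k | y_k - W x_k |^2  (standard linear least squares)
    W_ls = np.linalg.lstsq(X, Y, rcond=None)[0].T

    # Widrow-Hoff (LMS) rule: W <- W + eps (y - W x) x^T
    W = np.zeros((M, N))
    eps = 0.05
    for _ in range(200):
        for x, y in zip(X, Y):
            W += eps * np.outer(y - W @ x, x)

    print(np.max(np.abs(W_ls - W_true)))   # essentially zero: exact fit of the noiseless data
    print(np.max(np.abs(W - W_true)))      # the iterative rule approaches the same solution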

 

(5) Back-Propagation Algorithm

The idea of minimizing error can be pursued further for the general multi-layer network.  A two-layer network with input vector x and mid-layer (hidden) signal z produces output y.  Thus we have

        z = f(W_1 x),    y = f(W_2 z)

It is understood that f(x) is a vector if x is a vector: the ith component of f(x) is just [f(x)]_i = f(x_i).  Let the target vector be t (denoted y_corr previously); the error that we wish to minimize is

        e^2 = | t - y |^2 = Σ_i (t_i - y_i)^2

Note that e depends on W implicitly through y and then through z, since y = f(W_2 f(W_1 x)).  The change to W is made according to steepest descent

        w_ij <- w_ij - ε ∂e^2/∂w_ij

where ∂ denotes the partial derivative and ε is the learning rate.

Question:  Show that the above change decreases the error for a sufficiently small learning rate ε.

Depending on whether we take derivatives with respect to W_1 or W_2, we obtain different expressions for the partial derivatives.  After some algebra, we find for the output layer (the last layer)

        ∂e^2/∂w_ij = - 2 (t - y)_i [f'(W_2 z)]_i z_j = - δ_i z_j

and for the hidden layer

        ∂e^2/∂w_ij = - [δ^T W_2]_i [f'(W_1 x)]_i x_j

This last equation can be generalized to any number of layers.
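The following sketch checks the two formulas numerically for a small two-layer network and shows that one steepest-descent step lowers the error (NumPy; the sigmoid choice of f, the random weights, the input, and ε are illustrative):

    import numpy as np

    def f(u):                          # an illustrative choice: the sigmoid
        return 1.0 / (1.0 + np.exp(-u))

    def fprime(u):
        s = f(u)
        return s * (1.0 - s)

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 2))       # hidden-layer weights
    W2 = rng.normal(size=(1, 3))       # output-layer weights
    x = np.array([0.2, -0.7])          # input vector
    t = np.array([1.0])                # target vector

    def error2(W1, W2):
        y = f(W2 @ f(W1 @ x))
        return np.sum((t - y) ** 2)    # e^2 = | t - y |^2

    # analytic gradients from the two formulas above
    z = f(W1 @ x)
    y = f(W2 @ z)
    delta = 2 * (t - y) * fprime(W2 @ z)                 # delta_i = 2 (t - y)_i [f'(W_2 z)]_i
    gW2 = -np.outer(delta, z)                            # output layer: -delta_i z_j
    gW1 = -np.outer((delta @ W2) * fprime(W1 @ x), x)    # hidden layer: -[delta^T W_2]_i [f'(W_1 x)]_i x_j

    # check the output-layer formula against numerical differentiation
    num = np.zeros_like(W2)
    h = 1e-6
    for i in range(W2.shape[0]):
        for j in range(W2.shape[1]):
            Wp = W2.copy()
            Wp[i, j] += h
            num[i, j] = (error2(W1, Wp) - error2(W1, W2)) / h
    print(np.allclose(gW2, num, atol=1e-5))              # analytic and numerical gradients agree

    # one steepest-descent step with a small learning rate lowers the error
    eps = 0.01
    print(error2(W1 - eps * gW1, W2 - eps * gW2) < error2(W1, W2))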

Question:  What do the above two equations tell you if the network is linear?