Let the incoming signals be x_{i} (each being 0 or 1) and the
outgoing signal be y. Almost all neural network models assume:

y = f(S w_{i} x_{i} - s)

where s is a threshold value, w_{i} is the synaptic strength, and the
function f(x) is 0 for very small x and 1 for large x. For example, we
can use a step function h(x). McCulloch and Pitts showed that within this
framework, any logical function y(x_{1}, x_{2}, ..., x_{n})
can be realized by properly choosing the constants w, s, and the connections of the
"neurons".

"Logical circuit" (h(x) is the step function: h(x) = 0 if x < 0, and h(x) = 1 if x >= 0)

(a) inverter: y = h(-2x + 1)

(b) logical AND: y = h(x_{1} + x_{2} - 1.5)
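The two circuits above can be sketched directly in code. This is a minimal illustration (Python is my choice here; the function names are mine, not from the text):

```python
def h(x):
    """Step function: h(x) = 0 if x < 0, and h(x) = 1 if x >= 0."""
    return 1 if x >= 0 else 0

def inverter(x):
    # (a) inverter: y = h(-2x + 1)
    return h(-2 * x + 1)

def logical_and(x1, x2):
    # (b) logical AND: y = h(x1 + x2 - 1.5)
    return h(x1 + x2 - 1.5)

# Truth tables: inverter maps 0 -> 1, 1 -> 0; AND fires only on (1, 1).
print([inverter(x) for x in (0, 1)])
print([logical_and(x1, x2) for x1 in (0, 1) for x2 in (0, 1)])
```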

Question: how to build a logical OR using inverter and logical AND only? How to build exclusive OR?

The McCulloch & Pitts model does not address the issue of learning; in addition, the logical operation is not error tolerant.

In his famous book "The Organization of Behavior", Hebb proposed the idea that the connection between two neurons is plastic: the synaptic strength w changes in proportion to the correlation between the activities of the pre-synaptic and post-synaptic cells. A simple mathematical model is to change w according to

w_{i} <- w_{i} + e y(x) x_{i}

Hebb's idea can be used to solve a classification problem. This consists of two phases. First, the network is trained on examples. After the training session, the synaptic strengths are fixed; the output for a given input is then taken as the result of the "neural computing".
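The Hebbian update w_{i} <- w_{i} + e y x_{i} can be sketched as follows. The learning rate and training pairs below are illustrative choices of mine, not values from the text:

```python
import numpy as np

def hebb_train(inputs, outputs, e=0.1):
    """Accumulate w_i in proportion to the pre/post activity product y * x_i."""
    w = np.zeros(inputs.shape[1])
    for x, y in zip(inputs, outputs):
        w += e * y * x          # Hebb: strengthen where input and output co-fire
    return w

# The output here happens to correlate with the first input component,
# so the first weight grows the most.
X = np.array([[1, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
Y = np.array([1, 1, 0, 1], dtype=float)
w = hebb_train(X, Y)
print(w)  # -> [0.3 0.1]
```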

In a single-layer neural network, let the column vector x (real-valued, dimension N) be the input and the column vector y (a binary pattern, dimension M) be the output. The M by N matrix W is updated on each training example as

W <- W + e (y^{corr} - y) x^{T}

where x^{T} is the transpose of x, i.e. a row vector, so that (y^{corr} - y) x^{T} is an M by N outer product. If a solution exists, it can be shown that for sufficiently small e this algorithm converges in a finite number of steps. It turns out that the single-layer network has severe limitations. For example, an exclusive OR cannot be
realized. Thus the multi-layer network is introduced.
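A single-layer sketch of the update W <- W + e (y^{corr} - y) x^{T}, specialized to one output unit with a step nonlinearity. Training it on AND (a linearly separable problem, with a bias input appended) converges; the learning rate and epoch count are my illustrative choices:

```python
import numpy as np

def h(a):
    # step nonlinearity
    return 1.0 if a >= 0 else 0.0

def train(X, t, e=0.5, epochs=50):
    """Perceptron-style training: w <- w + e (t - y) x per example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, tk in zip(X, t):
            y = h(w @ x)
            w += e * (tk - y) * x   # correction only when the output is wrong
    return w

# AND with a constant bias input appended as the third component.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t_and = np.array([0, 0, 0, 1], dtype=float)
w = train(X, t_and)
print([h(w @ x) for x in X])  # -> [0.0, 0.0, 0.0, 1.0]
```

Running the same loop on exclusive-OR targets never settles, illustrating the limitation mentioned above.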

Memory plays an extremely important role in brain function. The associative memory network is very much like the classification problem. Suppose p pairs of "information" (x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{p}, y_{p}) are stored in the network. Then if x_{k} is presented to the system, y_{k} results. This is done in a robust and error-tolerant way. In a general framework, we can think of solving the learning problem by error minimization, for example,

min_{w} S_{k=1,2,...,p} | y_{k} - f(w x_{k}) |^{2}

where | ... | is the 2-norm (Euclidean distance) and w is the synaptic strength vector. The problem becomes easy if the function f is linear, e.g. f(x) = x: it is then the standard linear least-squares problem, and standard numerical techniques can be applied. The Widrow-Hoff gradient descent learning rule

W <- W + e (y - Wx) x^{T}

is a special case when the network is linear.
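A sketch of the Widrow-Hoff (LMS) rule for a linear network, recovering a known linear map from examples. The target matrix, data, and learning rate below are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = np.array([[2.0, -1.0],
                   [0.5,  3.0]])       # demo target mapping (not from the text)
X = rng.normal(size=(200, 2))          # training inputs
Y = X @ W_true.T                       # targets from the linear map y = W_true x

W = np.zeros((2, 2))
e = 0.05                               # learning rate
for x, y in zip(X, Y):
    W += e * np.outer(y - W @ x, x)    # LMS correction: e (y - Wx) x^T
print(np.round(W, 2))                  # close to W_true after one pass
```

Since the network is linear, each step moves W down the gradient of |y - Wx|^2 for the current example, which is why this is the special case of the error-minimization view.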

The idea of minimizing error can be pursued further for a general multi-layer network. A two-layer network, with input vector x and mid-layer signal z, produces output y. Thus we have

z = f(W^{1}x),
y = f(W^{2}z)

It is understood that f(x) is a vector if x is a vector;
the ith component of f(x) is just [f(x)]_{i} = f(x_{i}).
Let the target vector be t (denoted by y^{corr} previously); the error that we wish to minimize is

e^{2} = | t - y |^{2} = S_{i} (t_{i} - y_{i})^{2}

Note that e^{2} depends on W implicitly through y and then through z, since y = f(W^{2}f(W^{1}x)). The change to W is made according to steepest descent

w_{ij} <- w_{ij} - e de^{2}/dw_{ij}

where d denotes the partial derivative, and e here is the learning rate (not to be confused with the error e above).

Question: Show that above change decreases the error for sufficiently small learning rate e.

Depending on whether we are taking derivatives with respect to W^{1} or W^{2}, we have different expressions for the partial derivatives. After some algebra, we find for the output layer (last layer)

de^{2}/dw_{ij} = - 2 (t - y)_{i} [f'(W^{2}z)]_{i} z_{j} = - d_{i} z_{j}

and for the hidden layer

de^{2}/dw_{ij} = - [d^{T} W^{2}]_{i} [f'(W^{1}x)]_{i} x_{j}

This last equation can be generalized to any number of layers.
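The two gradient formulas can be checked numerically. The sketch below takes f to be the logistic sigmoid (one common choice; the text does not fix f), evaluates both analytic gradients, and compares one entry of each against a finite difference of e^{2}. All sizes and random values are illustrative:

```python
import numpy as np

def f(a):  return 1.0 / (1.0 + np.exp(-a))   # logistic sigmoid as f
def fp(a): return f(a) * (1.0 - f(a))        # its derivative f'

def err2(W1, W2, x, t):
    """e^2 = |t - y|^2 with y = f(W^2 f(W^1 x))."""
    y = f(W2 @ f(W1 @ x))
    return np.sum((t - y) ** 2)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))                 # hidden-layer weights W^1
W2 = rng.normal(size=(2, 4))                 # output-layer weights W^2
x = rng.normal(size=3)                       # input vector
t = np.array([1.0, 0.0])                     # target vector

z = f(W1 @ x)                                # z = f(W^1 x)
y = f(W2 @ z)                                # y = f(W^2 z)
d = 2.0 * (t - y) * fp(W2 @ z)               # d_i = 2 (t - y)_i [f'(W^2 z)]_i
grad_W2 = -np.outer(d, z)                    # output layer: de^2/dw_ij = -d_i z_j
grad_W1 = -np.outer((d @ W2) * fp(W1 @ x), x)  # hidden: -[d^T W^2]_i [f'(W^1 x)]_i x_j

# Finite-difference check of one entry of each gradient.
h = 1e-6
W2p = W2.copy(); W2p[0, 0] += h
num2 = (err2(W1, W2p, x, t) - err2(W1, W2, x, t)) / h
W1p = W1.copy(); W1p[0, 0] += h
num1 = (err2(W1p, W2, x, t) - err2(W1, W2, x, t)) / h
print(abs(num2 - grad_W2[0, 0]), abs(num1 - grad_W1[0, 0]))  # both tiny
```

Training then consists of applying w_{ij} <- w_{ij} - e de^{2}/dw_{ij} with these gradients, i.e. backpropagation.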

Question: What do the above two equations tell you if the network is linear?