Principal Components Analysis

 

(1) Hebbian learning rule and Oja's learning rule

Let us have a closer look at the Hebbian learning rule:

        $w_i \leftarrow w_i + \epsilon\, y(x)\, x_i$

We shall consider only a linear network, where $y(x) = \sum_i w_i x_i = x\, w^T$; here $x$ and $w$ are row vectors.  In matrix notation, we can write the update compactly as

        $w \leftarrow w + \epsilon\, (w\, x^T)\, x$    (for one neuron, $N$ inputs, one scalar output)

However, this updating rule for $w$ is not stable.  For example, if it happens that $x > 0$ and $y > 0$, then repeated application of the above formula will make $w$ grow without bound.  A way to get around this is to limit the norm of the vector $w$, by requiring that its length be fixed at $|w| = 1$, where $|w|^2 = \sum_i w_i^2$.  The update then consists of two steps: (1) $w \leftarrow w + \epsilon\, (w\, x^T)\, x$, (2) $w \leftarrow w / |w|$.  Assuming the learning rate is small (say $10^{-3}$), we can combine the two steps into one.  The following modified Hebbian rule is known as Oja's rule:

        $w \leftarrow w + \epsilon\, y\, (x - y\, w)$,    where $y = x\, w^T$.            (3)

The additional term, proportional to $-y^2\, w$, causes $w$ to decay, so it will not diverge to infinity.  In fact, it converges to $|w| = 1$.
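
A minimal numerical sketch of this behaviour, assuming standard-normal two-dimensional inputs and an illustrative learning rate and step count:

    import numpy as np

    rng = np.random.default_rng(0)
    eta = 1e-3                           # learning rate epsilon
    w_hebb = rng.normal(size=2)          # weights under the plain Hebbian rule
    w_oja = w_hebb.copy()                # weights under Oja's rule

    for _ in range(20000):
        x = rng.normal(size=2)           # iid training sample
        y = w_hebb @ x
        w_hebb += eta * y * x            # plain Hebb: the norm keeps growing
        y = w_oja @ x
        w_oja += eta * y * (x - y * w_oja)   # Oja's rule, Eq. (3): the norm settles near 1

    print("|w| under plain Hebb :", np.linalg.norm(w_hebb))
    print("|w| under Oja's rule :", np.linalg.norm(w_oja))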

[Work out the steps from equations (1) & (2) to (3), correct to leading order in $\epsilon$.]
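
One way to carry this out, assuming $|w| = 1$ before the update and dropping all terms of order $\epsilon^2$: after step (1),

        $w' = w + \epsilon\, y\, x, \qquad |w'|^2 = |w|^2 + 2\epsilon\, y^2 + O(\epsilon^2) = 1 + 2\epsilon\, y^2 + O(\epsilon^2),$

and step (2) then gives

        $w'' = w'/|w'| \approx (w + \epsilon\, y\, x)(1 - \epsilon\, y^2) \approx w + \epsilon\, y\, (x - y\, w),$

which is Eq. (3).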

(2) Asymptotic limit of Oja's learning rule

What would be the result if we apply (3) repeatedly?  Of course, that will depend on the training samples $x$.  To characterize the set of possible training samples, we assume that $x$ is drawn from a fixed probability distribution $P(x)$, and that the samples are independently and identically distributed (iid).

In the limit of many training samples with small $\epsilon$, we expect that $w$ fluctuates about a mean $\langle w \rangle$, which we shall still write as $w$.  At this fixed point the correction term proportional to $\epsilon$ must vanish on average.  Expanding that average term,

        $\langle y\, (x - y\, w) \rangle = w\, \langle x^T x \rangle - (w\, \langle x^T x \rangle\, w^T)\, w$        (4)

where $x$ is a row vector and $x^T$ a column vector, and $w$ is treated as a constant with respect to the average denoted by $\langle \cdots \rangle$.  Let us define $C = \langle x^T x \rangle$; it is known as the correlation matrix, and in component form $C_{ij} = \langle x_i x_j \rangle$.  Setting Eq. (4) to zero, we can write it as

        $w\, C - (w\, C\, w^T)\, w = 0.$

If $w\, C = \lambda\, w$, this becomes $\lambda\, w - \lambda\, |w|^2\, w = 0$, so we must have $|w| = 1$.   The equation

        $w\, C = \lambda\, w$

is an eigenvalue problem ($\lambda$ is called an eigenvalue and $w$ an eigenvector); since $C$ is symmetric, it is equivalent to the usual column form $C\, w^T = \lambda\, w^T$.   In the limit of many training samples drawn from the distribution $P(x)$, the weight vector approaches one of the eigenvectors of the correlation matrix.   It can be shown that $w$ in fact approaches the eigenvector with the largest eigenvalue; the other eigenvector directions are unstable against small perturbations.
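
A minimal numerical check of this limit, assuming an illustrative $2 \times 2$ correlation matrix, learning rate, and sample count:

    import numpy as np

    rng = np.random.default_rng(1)
    C = np.array([[3.0, 1.0],            # correlation matrix C = <x^T x> (illustrative choice)
                  [1.0, 2.0]])
    L = np.linalg.cholesky(C)            # x = L z with z ~ N(0, I) has <x^T x> = C

    eta = 1e-3                           # learning rate epsilon
    w = rng.normal(size=2)
    for _ in range(100000):
        x = L @ rng.normal(size=2)       # iid training sample from P(x)
        y = x @ w                        # y = x w^T
        w += eta * y * (x - y * w)       # Oja's rule, Eq. (3)

    vals, vecs = np.linalg.eigh(C)
    v_top = vecs[:, np.argmax(vals)]     # eigenvector of C with the largest eigenvalue
    print("learned w       :", w / np.linalg.norm(w))
    print("top eigenvector :", v_top)    # should agree up to an overall sign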

(3) Generalized Hebbian algorithms and principal components

Equation (3) picks out the eigenvector corresponding to the largest eigenvalue.  A simple generalization can be used to pick out the first $M$ eigenvectors with $M$ linear neurons.   This is known as Sanger's learning rule:

        $w_k \leftarrow w_k + \epsilon\, y_k \left( x - \sum_{j=1}^{k} y_j\, w_j \right)$

Note that for $k = 1$ it is the same as Eq. (3).  For $k = 2$, we have

        $w_2 \leftarrow w_2 + \epsilon\, y_2\, (x - y_1\, w_1 - y_2\, w_2)$

When convergence is reached, all the $w_k$ vectors are orthogonal to each other and normalized to 1.  Each $w_k$ is then an eigenvector of $C$ with eigenvalue $\lambda_k$, and the eigenvalues are in decreasing order with the index $k$.
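
A sketch of Sanger's rule in the same spirit, assuming an illustrative $3 \times 3$ correlation matrix, $N = 3$ inputs, and $M = 2$ linear neurons:

    import numpy as np

    rng = np.random.default_rng(2)
    C = np.array([[4.0, 1.0, 0.5],       # correlation matrix of the inputs (illustrative)
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
    L = np.linalg.cholesky(C)

    N, M = 3, 2
    eta = 1e-3
    W = 0.1 * rng.normal(size=(M, N))    # row k of W is w_k

    for _ in range(200000):
        x = L @ rng.normal(size=N)
        y = W @ x                        # y_k = w_k x^T
        # Sanger's rule: dw_k = eta * y_k * (x - sum_{j<=k} y_j w_j);
        # the lower-triangular part of (y y^T) W implements the sum over j <= k
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]       # eigenvalues in decreasing order
    print("learned rows of W:\n", W)
    print("top-M eigenvectors of C (as rows, up to sign):\n", vecs[:, order[:M]].T)
    print("W W^T (should be close to the identity):\n", W @ W.T)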

The set of values $y_k = w_k\, x^T$ are called the principal components of $x$.   The directions $w_k$ are the principal directions in the sample space of $x$.   The variance of $y$ over directions $w$ forms a high-dimensional "ellipsoid" (assume $\langle y \rangle = 0$):

        $\langle y^2 \rangle = w\, C\, w^T$

The direction $w_1$ corresponds to the direction of largest variation, with variance $\lambda_1$; $w_2$ gives the second largest variance, in a direction perpendicular to $w_1$; $w_3$ is the direction of third largest variation, perpendicular to both $w_1$ and $w_2$; and so on.

[Figure: schematic illustration of the principal directions of the variance "ellipsoid".]
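
In particular, using $w_k\, C = \lambda_k\, w_k$ and $|w_k| = 1$, the variance along the $k$-th principal direction is just the corresponding eigenvalue:

        $\langle y_k^2 \rangle = \langle (x\, w_k^T)^2 \rangle = w_k\, \langle x^T x \rangle\, w_k^T = w_k\, C\, w_k^T = \lambda_k\, |w_k|^2 = \lambda_k.$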

 

(4) An image compression example

What is a good use of the above learning algorithm?  Such a network can be used as a pre-processing layer in a multi-layer network for data reduction, which makes the whole network faster to train.   In such applications, the number $N$ of input signals is much larger than the number $M$ of output signals, and the mapping $y = W\, x^T$, where the $k$-th row of $W$ is $w_k$, can be viewed as a form of data compression.

One particularly interesting application is image data compression.  Since the $w_k$'s form a new set of orthogonal axes, we can expand any arbitrary input data $x$ along these axes,

        $x = b_1\, w_1 + b_2\, w_2 + b_3\, w_3 + \cdots$

Taking the scalar product of both sides of the equation with $w_k$, we find

        $b_k = w_k\, x^T = y_k$

In order to recover $x$ from the $b$'s without loss of information, we need all $N$ components of $b$.  However, if we keep only the first $M$ components, they represent a compressed version of the data.  We test this idea with the Mathematica code example.
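
The Mathematica notebook itself is not reproduced here; the following sketch illustrates the idea on a synthetic image, using a batch eigendecomposition of the correlation matrix in place of the online learning rule (the image, the $8 \times 8$ patch size, and $M = 8$ are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(3)
    # synthetic 128 x 128 "image" standing in for real image data
    u = np.linspace(0.0, 1.0, 128)
    img = np.outer(np.sin(8 * np.pi * u), np.cos(6 * np.pi * u)) + 0.1 * rng.normal(size=(128, 128))

    # cut the image into 8 x 8 patches; each patch is one input vector x with N = 64
    P = 8
    patches = img.reshape(128 // P, P, 128 // P, P).swapaxes(1, 2).reshape(-1, P * P)

    # principal directions from the correlation matrix C = <x^T x>
    C = patches.T @ patches / len(patches)
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]
    W = vecs[:, order].T                 # row k is w_k, eigenvalues in decreasing order

    M = 8                                # keep only the first M of the N = 64 components
    B = patches @ W[:M].T                # b_k = w_k x^T for every patch (the compressed data)
    recon = B @ W[:M]                    # x ~ b_1 w_1 + ... + b_M w_M

    err = np.mean((patches - recon) ** 2) / np.mean(patches ** 2)
    print(f"kept {M}/{P * P} components per patch, relative reconstruction error = {err:.4f}")

Training the Sanger-rule network above on the same patches should recover (approximately) the same rows of $W$; the batch eigendecomposition is used here only as a shortcut.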