Simulated Annealing


A class of difficult optimization problems can be formulated as neural-network problems.  Approximate solution techniques for such problems are closely related to statistical physics.  Such techniques, simulated annealing being the most important one, are used very successfully in the field of neural networks.  The topics covered in this chapter are closely related to, or shall we say borrowed from, statistical mechanics, so it is instructive to begin with a very brief introduction to statistical mechanics.

Statistical Mechanics in a Nutshell

Statistical mechanics deals with systems that can take a discrete or continuous set of states X, known as the state space.  In the context of a neural network such as the Hopfield model, X specifies the neural state or spin state of all the neurons involved.  A state appears at random with some probability distribution.  One of the most convenient probability distributions for computation is the Boltzmann or Gibbs (canonical) distribution.  It is given by the formula

        P(X) = exp(-E(X)/(kT))/Z

where the normalization factor

        Z = Σ_X exp(-E(X)/(kT))

is called the partition function.  The symbol k denotes the Boltzmann constant, a physical constant relating energy to temperature; T is the temperature on the absolute scale; and E(X) is the energy of the state X.  In statistical mechanics, we are interested in statistical averages, which can be measured experimentally and compared with theory.  For example, we can compute the average energy,
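To make these definitions concrete, here is a minimal Python sketch that evaluates P(X) and Z for a toy system of four discrete states; the energy values are invented for illustration, and units are chosen so that k = 1.

```python
import math

# Toy system: four discrete states with made-up energies E(X) (units with k = 1).
energies = [0.0, 1.0, 1.0, 2.0]
T = 1.0   # temperature

# Partition function Z = Σ_X exp(-E(X)/(kT))
Z = sum(math.exp(-E / T) for E in energies)

# Boltzmann probability of each state, P(X) = exp(-E(X)/(kT)) / Z
P = [math.exp(-E / T) / Z for E in energies]
```

By construction the probabilities sum to 1, and the lowest-energy state is the most probable one.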

        U = <E> = Σ_X E(X) P(X)

The derivative of the energy with respect to temperature is the heat capacity; with some algebra it can be shown that

        C = dU/dT = ( <E^2> - <E>^2 )/(kT^2).                                      (1)
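Equation (1) can be checked numerically.  The sketch below (again with k = 1 and made-up energy levels) compares the fluctuation formula against a finite-difference estimate of dU/dT.

```python
import math

def boltzmann_moments(energies, T):
    """Return <E> and <E^2> under the Boltzmann distribution (units with k = 1)."""
    w = [math.exp(-E / T) for E in energies]
    Z = sum(w)
    U = sum(E * wi for E, wi in zip(energies, w)) / Z
    E2 = sum(E * E * wi for E, wi in zip(energies, w)) / Z
    return U, E2

energies = [0.0, 1.0, 2.0]   # hypothetical state energies
T = 1.5

# Heat capacity from the fluctuation formula (1): C = (<E^2> - <E>^2)/(kT^2)
U, E2 = boltzmann_moments(energies, T)
C_fluct = (E2 - U * U) / T**2

# Heat capacity from a central finite difference, C = dU/dT
h = 1e-5
U_plus, _ = boltzmann_moments(energies, T + h)
U_minus, _ = boltzmann_moments(energies, T - h)
C_diff = (U_plus - U_minus) / (2 * h)
```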

Another important quantity is the entropy of the system

        S = (U - F)/T = - k <ln P(X)>,

where F = -kT ln Z is known as the free energy.
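The two expressions for the entropy can likewise be verified numerically; in the sketch below (with k = 1), S computed as (U - F)/T agrees with -<ln P(X)>.

```python
import math

def thermodynamics(energies, T):
    """U, F, and the entropy S (two ways) for a discrete Boltzmann system, k = 1."""
    w = [math.exp(-E / T) for E in energies]
    Z = sum(w)
    P = [wi / Z for wi in w]
    U = sum(E * p for E, p in zip(energies, P))   # average energy <E>
    F = -T * math.log(Z)                          # free energy F = -kT ln Z
    S_thermo = (U - F) / T                        # S = (U - F)/T
    S_average = -sum(p * math.log(p) for p in P)  # S = -k <ln P(X)>
    return U, F, S_thermo, S_average

U, F, S1, S2 = thermodynamics([0.0, 1.0, 2.0], 2.0)
```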

Simulated Annealing

Annealing is a process used in metallurgy to treat materials for better structure and mechanical properties.   The metal or other material is heated to a high temperature, and the temperature is then reduced slowly to room temperature.   The opposite of annealing is quenching, in which the temperature is reduced suddenly to a very low value.

Simulated annealing, of course, refers to the annealing process carried out on a computer by simulation.   The parameter T in the model is reduced slowly in order to obtain information about the T = 0 behavior, which is difficult to get at directly.

To motivate the method, let us consider the Weighted Match Problem.  Consider a set of N points (N even) with a known ``distance'' d_ij between each pair.   The problem is to link the points together in pairs, with each point linked to exactly one other point, so as to minimize the total length of the links.

We formulate the problem in terms of a stochastic McCulloch-Pitts network using 0 and 1 as the values each neuron can take.  We assign a unit (neuron) n_ij, i < j, to each pair of points in the problem, making N(N-1)/2 units in all.  Each candidate solution of the problem corresponds to a state of the network, with n_ij = 1 if there is a link connecting point i and point j and n_ij = 0 if there is not.  Our problem is to find a way of specifying the values of the connection strengths w_ij,kl between the units.

The quantity that we wish to minimize is

        L = Σ_{i<j} d_ij n_ij

subject to the constraint that

        Σ_j n_ij = 1 for all i.

This constraint says that each point should be connected to exactly one other point; we define n_ij = n_ji when j < i and take n_ii = 0.  It is difficult to enforce the constraint rigidly from the beginning.  Our strategy is instead to add a penalty term to the energy which is minimized when the constraint is satisfied.  Thus we write down a total effective energy, or cost function:

        E[n] = Σ_{i<j} d_ij n_ij + (g/2) Σ_i (Σ_j n_ij - 1)^2.
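This cost function can be coded directly from the formula.  The helper `cost` below is an illustrative sketch, with n and d stored as symmetric N x N matrices (zero diagonal).

```python
def cost(n, d, g):
    """Effective energy E[n] = Σ_{i<j} d_ij n_ij + (g/2) Σ_i (Σ_j n_ij - 1)^2.

    n and d are symmetric N x N matrices; n[i][j] is 0 or 1, n[i][i] = 0.
    """
    N = len(d)
    # total length of the links, summed once per pair (i < j)
    length = sum(d[i][j] * n[i][j] for i in range(N) for j in range(i + 1, N))
    # penalty: zero exactly when every point has exactly one link
    penalty = sum((sum(n[i][j] for j in range(N)) - 1) ** 2 for i in range(N))
    return length + 0.5 * g * penalty
```

For a valid matching the penalty term vanishes and the cost is just the total link length; removing a link makes two points unmatched and the penalty kicks in.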

We can now simulate this model with the Metropolis algorithm or the heat-bath (also known as Gibbs sampling) algorithm.   There are two parameters which we have to adjust carefully.  The first is g: the penalty parameter g should be taken to infinity so that the constraint is strictly enforced in the final result.  We also take the temperature T to 0, starting from a higher temperature.
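A minimal Metropolis-style annealing loop for this energy might look as follows.  The parameter values (g, T0, the number of steps, the cooling factor) are illustrative guesses, not tuned, and the energy is recomputed from scratch at each step for clarity rather than speed.

```python
import math
import random

def metropolis_anneal(d, g=10.0, T0=2.0, steps=5000, cooling=0.999):
    """Sketch of Metropolis annealing for the weighted-match energy (k = 1).

    A move flips one unit n_ij and is accepted with probability
    min(1, exp(-dE/T)); T is multiplied by `cooling` after each step.
    """
    N = len(d)
    n = [[0] * N for _ in range(N)]

    def energy():
        length = sum(d[i][j] * n[i][j] for i in range(N) for j in range(i + 1, N))
        penalty = sum((sum(n[i][k] for k in range(N)) - 1) ** 2 for i in range(N))
        return length + 0.5 * g * penalty

    E = energy()
    T = T0
    for _ in range(steps):
        i, j = random.sample(range(N), 2)        # pick a random pair i != j
        n[i][j] = n[j][i] = 1 - n[i][j]          # propose flipping unit n_ij
        E_new = energy()
        dE = E_new - E
        if dE <= 0 or random.random() < math.exp(-dE / T):
            E = E_new                            # accept the move
        else:
            n[i][j] = n[j][i] = 1 - n[i][j]      # reject: undo the flip
        T *= cooling                             # cool according to the schedule
    return n, E
```

In a serious implementation one would compute dE incrementally (only the rows i and j change) rather than re-evaluating the full energy.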

We'll discuss the possible algorithms in the next section, but let us say a few more words about simulated annealing.  This refers to the process of taking T to 0, starting from some larger value.   It can be proved that if we let T approach 0 sufficiently slowly, we reach the global minimum of the problem, which is the desired result.   However, the rate of cooling required to guarantee the correct answer is impractically slow.   Typically, we get only approximations to the exact solution, and there is no good way of checking whether we have reached a global minimum.  As an example, we can decrease T by a power law:

        T = T0/t^a,

where T0 is the starting temperature at step t = 1 and a is a constant (e.g. a = 1).  Another possibility is to decrease T by a constant factor, T <- 0.95 T; this amounts to an exponential decrease in temperature.   The way in which T is decreased is known as the annealing schedule.
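Both schedules are easy to express as generators; the sketch below is illustrative.

```python
def power_law_schedule(T0, a):
    """Power-law cooling: T(t) = T0 / t**a for t = 1, 2, 3, ..."""
    t = 1
    while True:
        yield T0 / t ** a
        t += 1

def geometric_schedule(T0, factor=0.95):
    """Exponential cooling: T <- factor * T after each step."""
    T = T0
    while True:
        yield T
        T *= factor
```

For example, the first three temperatures of `power_law_schedule(10.0, 1.0)` are 10, 5, and 10/3, while `geometric_schedule(10.0)` gives 10, 9.5, 9.025, ...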

Question:  (A) In the above problem, why don't we use a zero-temperature simulation, T = 0, directly?  (B) Derive Equation (1), which relates the heat capacity to energy fluctuations.

Reading: Chapter 11 of S. Haykin.