Links to other weeks notes [ 2 ][ 3 ][ 4 ][ 5 ][ 6 ][ 7 ] [ 8 ]( 10 )[ 11 ][ 12 ]


Week 10, 15/18 September, 1997

Floating Point

Floating point numbers are very important for scientific computation. Floating point numbers are known as reals in mathematics. Recall the concepts of integer, rational number, and real number in mathematics. Here are some example of reals:

   pi = 3.14159265 ...
   e = 2.71828 ...
   0.000000001 = 1.0 x 10^{-9}
   3155760000 = 3.15576 x 10^9
Real numbers above are written in base 10. So the number pi really means
   3.14159265 = 3x10^0 + 1x10^{-1} + 4x10^{-2} + 1x10^{-3} + 5x10^{-4} ...
or a different number
   103.042 = 1x10^2 + 0x10^1 + 3x10^0 + 0x10^{-1} + 4x10^{-2} + 2x10^{-3}
The digits before the decimal point contribute positive power of 10, while the digits after the point contribute to negative power of 10.

Real numbers in base 2 (binary)

The same idea can be generalized to base 2. An arbitrary real number x can be represented as a sum of powers of 2 with positive and negative integer power.

   x = ... + b_2 2^2 + b_1 2^1 + b_0 2^0 + b_{-1} 2^{-1} + b_{-2} 2^{-2} + ...

where "_" denotes subscript, and "^" for superscript.

Example: Convert binary real number 101.01011 into decimal notation.

     101.01011
   = 2^2 + 2^0 + 2^{-2} + 2^{-4} + 2^{-5}
   = 4 + 1 + 1/4 + 1/16 + 1/32
   = 5 + 0.25 + 0.0625 + 0.03125
   = 5.34375

Converting decimal fraction to binary

For a real number like 15.34375 to binary, the integer part 15 can be converted to binary by the method of successive division by 2. For the fractional part. The method is successive multiplication by 2. Here is an example:
   0.34375 x 2 = 0.6875        ->  0
   0.6875  x 2 = 1.375         ->  1
   0.375   x 2 = 0.75          ->  0
   0.75    x 2 = 1.5           ->  1
   0.5     x 2 = 1.0           ->  1
   0.0     x 2 = 0.0            
So the binary digits for 15.34375 is 1111.01011.

Why the method works? Imagining that the number is written as a sum of powers of 2, each time we multiply by 2, the bits get shift up to the left. If b_0 is 0, the number must be small than 1, if b_0 is 1, the number must be greater than 1. But b0 is one of the binary digits that get shifted up. In the case of division method for the integer part, each division by 2 shift the bit down to the right.

Example: Write the number pi = 3.14159265 ... in binary.

First, the integer part is 3 or 11 in binary. The fractional part 0.14159265 can be find by the method of multiplying 2.

   0.14159265 x 2 = 0.28318531        -> 0 
   0.28318531 x 2 = 0.56637061        -> 0 
   0.56637061 x 2 = 1.13274123        -> 1 
   0.13274123 x 2 = 0.26548246        -> 0 
   0.26548246 x 2 = 0.53096491        -> 0 
   0.53096491 x 2 = 1.06192983        -> 1 
   0.06192983 x 2 = 0.12385966        -> 0 
   0.12385966 x 2 = 0.24771932        -> 0 
   0.24771932 x 2 = 0.49543864        -> 0 
   0.49543864 x 2 = 0.99087728        -> 0 
   0.99087728 x 2 = 1.98175455        -> 1 
   0.98175455 x 2 = 1.96350910        -> 1 
   0.96350910 x 2 = 1.92701821        -> 1 
Thus we have
   pi = 11.0010010000111..._two

Convert binary real number 11.0010010000111... into decimal.

Integer part 11_two is 3_ten. Fractional part

   0.0010010000111 = 2^{-3} + 2^{-6} + 2^{-11} + 2^{-12} + 2^{-13}
   = 0.125 + 0.015625 + 0.00048826 + 0.00024414 + 0.0001220703125
   = 0.14147 ...
So in decimal, it is 3.141...

Problem: What is decimal number 10.1_ten in binary?

Normalized scientific notation

In decimal when the number is too big or too small, we usually use the scientific notation with a power of 10. A generalization to binary scientific notation is of (normalized) form

   1.xxxxx times 2^{yyyy}
We could write
   pi = 1.1001001000011 x 2^1  (normalized form)
      = 11.001001000011 x 2^0
      = 1100.1001000011 x 2^{-2}
      = 0.00011001001000011 x 2^5

We have used decimal notation for the power. This example also show the origin of the name floating point: where to put the (binary) point depends on the value of the power. A normalize form is the one that is always "1.x".

Floating Point Number Format (IEEE 754 Standard)

Two types of floating points formats are used on MIPS and other computers. The single precision number takes one word (32 bits) and double precision number takes two words (64 bits). They correspond to the data type float and double in C. The formats store binary real numbers in normalized scientific notation. Single precision format is like this
          s     exponent     significand
        1 bit    8 bits        23 bits
For double precision, 11 bits is used for exponent and 52 bits are for significand. The meaning of each bit field is

Value of a single precision number

Given a bit pattern
    31  30 29 ... 23  22  21 ...        ...   1   0        (bit position)
   | s |  exponent   | b_1 b_2               b_22 b_23 |   (name)
             e                     f                       (notation)
  1 bit    8 bits               23 bits
where s is a one bit number taking 0 (positive) or 1 (negative) for the sign of the floating point number; e (exponent) is a 8-bit unsigned number varies from 0 to 255. The rest of the bits we call them b_1, b_2, ..., b_23. f is defined as
   f = b_1 2^{-1} + b_2 2^{-2} + ... + b_22 2^{-22} + b_23 2^{-23}
The value of the bit pattern interpreted as a single precision floating point number is
   (-1)^s x ( 1 + f ) x 2^{e-127}
The number 127 in the exponent is called a bias. (The exponent takes 11 bits and the bias is 1023 in double precision).

Here are some examples:

What do the bit patterns represent?

   a = 0 01111111 00000000000000000000000
   b = 1 10000010 01000000000000000000000

(a), s = 0, e = 01111111 = 127, f = b_1 2^{-1} + ... = 0, so the number is

   (-1)^0 x ( 1 + 0 ) x 2^{127-127} = 1
Namely, unity.

(b), s = 1, e = 10000010 = 128 + 2 = 130, f = 0 x 2^{-1} + 1 x 2^{-2} = 0.25, so the value is

   (-1)^1 x ( 1 + 0.25 ) x 2^{130-127} = - 1.25 x 2^3 = - 10.0

Problem: What is the number just slightly bigger than 1 in IEEE single precision? What is the largest and smallest normalized number in IEEE single precision representation?

The number 1 is represented as

   0 01111111 00000000000000000000000
Next number that is just slightly bigger is
   0 01111111 00000000000000000000001
Namely
   (-1)^0 x (1 + 2^{-23}) x 2^{127 - 127} = 1 + 2^{-23} = 1.0000001192...

The largest number is when e = 254 (not 255), f = 0.999.., or

   (-1)^0 x( 1 + 0.999...) x 2^(254-127) = 2^128 = 3.4 x 10^{38}
The smallest number is when e = 1 (not 0), f = 0, or
   (-1)^0 x ( 1 + 0 ) x 2^(1-127) = 2^{-126} = 1.2 x 10^{-38}
They are very large and very small indeed.

Example: Representation of pi in IEEE single precision format. Since

   pi = 3.1415926...               <- decimal
      = 11.001001000011...         <- binary
      = 1.1001001000011... x 2^1   <- binary, normalized scientific
we have s = 0 (positive), e - 127 = 1, or e = 128, and significand f = 0.100100100011... So the bit pattern is
   0 10000000 1001001000011...

Note that the leading 1 in (pi=1.1001... x 2^1) is implicit, and is not represented in the bit pattern. f contains only the fractional part.

Floating-Point Addition

To illustrate the necessary steps for machine calculation of floating-point addition, we consider adding

   9.999 x 10^1 + 1.610 x 10^{-1}
assuming 4 decimal digits storage for the significand and 2 decimal digits for the exponents.

Step 1: Shift decimal point of the smaller number, so that the power is the same as the larger number.

   1.610 x 10^{-1} = 0.01610 x 10^1 = 0.016 x 10^1

Step 2: Add the significands

   9.999 + 0.016 = 10.015

Step 3: Shift decimal point to normalized scientific notation.

   10.015 x 10^1 = 1.0015 x 10^2 
Check for overflow or underflow of the exponent.

Step 4: Round the number to 4 digits, so the final result is 1.002 x 10^2.

The steps used by computer are similar to what we did above. Let's look at another example: Try adding the numbers 0.5 and -0.4375 in binary using 4-bit precision.

Answer: write the numbers in normalized scientific notation.

   0.5 = 2^{-1} = 1.000 x 2^{-1}
  -0.4375 = -7/16 = -0.0111 = -1.110 x 2^{-2}

Step 1: Binary point of smaller number is shifted left to match the exponent of the large number. -1.110x2^{-2} = -0.111 x 2^{-1}.

Step 2: Add the significands 1.000 + (-0.111) = 0.001.

Step 3: Normalize the sum, check for possible overflow or underflow. 0.001 x 2^{-1} = 1.000 x 2^{-4}. Overflow means the exponent is too large; underflow exponent too small.

Step 4: Round the sum to 4 bits. Not needed in this case. 1.000 x 2^{-4} = 1/2^4 = 0.0625.

Problem: Compute 1.5 + 10.0, follow the floating-point addition steps in binary. Keep 4 bits accuracy.

Floating-Point Multiplication

Again, we use decimal number multiplication in scientific notation as an example. Multiply decimal numbers 1.110 x 10^{10} and 9.200 x 10^{-5}. Assume 4 digits of significand and 2 digits of the exponents.

Step 1: Addition of the exponents, so the new exponent is 10+(-5) = 5.

Step 2: Multiplication of the significands 1.110 x 9.200 = 10.21200.

Step 3: Normalize the result 10.21200 x 10^5 = 1.021200 x 10^6. Also check for overflow (exponent to large) or underflow (exponent too small).

Step 4: Rounding to 4 digits, after rounding, the result is 1.021 x 10^6.

Step 5: Compute the sign of the product. In this case, it is positive.

Example in binary. Multiplying 0.5 and -0.4375, with 4 bits for significand. 0.5 = 1.000 x 2^{-1}, -0.4375 = 1.110 x 2^{-2}.

Step 1: Adding the exponents (-1) + (-2) = -3.

Step 2: Multiplying the significands 1.000 x 1.110 = 1.110000.

Step 3: Normalize the result. 1.110000 x 2^{-3}. Already normalized.

Step 4: Rounding the product 1.110 x 2^{-3}.

Step 5: Compute the sign (+1) x (-1) = -1. So the final result is -1.110 x 2^{-3} = -0.21875.

Suggested Reading

Textbook, Computer Organization & Design, pages 225-238.


[ previous week ][ next week ]