Links to other weeks notes [ 2 ][ 3 ][ 4 ][ 5 ][ 6 ][ 7 ] [ 8 ]( 10 )[ 11 ][ 12 ]
Floating point numbers are very important for scientific computation. Floating point numbers are known as reals in mathematics. Recall the concepts of integer, rational number, and real number in mathematics. Here are some example of reals:
pi = 3.14159265 ... e = 2.71828 ... 0.000000001 = 1.0 x 10^{-9} 3155760000 = 3.15576 x 10^9Real numbers above are written in base 10. So the number pi really means
3.14159265 = 3x10^0 + 1x10^{-1} + 4x10^{-2} + 1x10^{-3} + 5x10^{-4} ...or a different number
103.042 = 1x10^2 + 0x10^1 + 3x10^0 + 0x10^{-1} + 4x10^{-2} + 2x10^{-3}The digits before the decimal point contribute positive power of 10, while the digits after the point contribute to negative power of 10.
The same idea can be generalized to base 2. An arbitrary real number x can be represented as a sum of powers of 2 with positive and negative integer power.
x = ... + b_2 2^2 + b_1 2^1 + b_0 2^0 + b_{-1} 2^{-1} + b_{-2} 2^{-2} + ...
where "_
" denotes subscript, and "^
" for
superscript.
Example: Convert binary real number 101.01011 into decimal notation.
101.01011 = 2^2 + 2^0 + 2^{-2} + 2^{-4} + 2^{-5} = 4 + 1 + 1/4 + 1/16 + 1/32 = 5 + 0.25 + 0.0625 + 0.03125 = 5.34375
0.34375 x 2 = 0.6875 -> 0 0.6875 x 2 = 1.375 -> 1 0.375 x 2 = 0.75 -> 0 0.75 x 2 = 1.5 -> 1 0.5 x 2 = 1.0 -> 1 0.0 x 2 = 0.0So the binary digits for 15.34375 is 1111.01011.
Why the method works? Imagining that the number is written as a sum of powers of 2, each time we multiply by 2, the bits get shift up to the left. If b_0 is 0, the number must be small than 1, if b_0 is 1, the number must be greater than 1. But b0 is one of the binary digits that get shifted up. In the case of division method for the integer part, each division by 2 shift the bit down to the right.
Example: Write the number pi = 3.14159265 ... in binary.
First, the integer part is 3 or 11 in binary. The fractional part 0.14159265 can be find by the method of multiplying 2.
0.14159265 x 2 = 0.28318531 -> 0 0.28318531 x 2 = 0.56637061 -> 0 0.56637061 x 2 = 1.13274123 -> 1 0.13274123 x 2 = 0.26548246 -> 0 0.26548246 x 2 = 0.53096491 -> 0 0.53096491 x 2 = 1.06192983 -> 1 0.06192983 x 2 = 0.12385966 -> 0 0.12385966 x 2 = 0.24771932 -> 0 0.24771932 x 2 = 0.49543864 -> 0 0.49543864 x 2 = 0.99087728 -> 0 0.99087728 x 2 = 1.98175455 -> 1 0.98175455 x 2 = 1.96350910 -> 1 0.96350910 x 2 = 1.92701821 -> 1Thus we have
pi = 11.0010010000111..._two
Convert binary real number 11.0010010000111... into decimal.
Integer part 11_two is 3_ten. Fractional part
0.0010010000111 = 2^{-3} + 2^{-6} + 2^{-11} + 2^{-12} + 2^{-13} = 0.125 + 0.015625 + 0.00048826 + 0.00024414 + 0.0001220703125 = 0.14147 ...So in decimal, it is 3.141...
Problem: What is decimal number 10.1_ten in binary?
In decimal when the number is too big or too small, we usually use the scientific notation with a power of 10. A generalization to binary scientific notation is of (normalized) form
1.xxxxx times 2^{yyyy}We could write
pi = 1.1001001000011 x 2^1 (normalized form) = 11.001001000011 x 2^0 = 1100.1001000011 x 2^{-2} = 0.00011001001000011 x 2^5
We have used decimal notation for the power. This example also show the origin of the name floating point: where to put the (binary) point depends on the value of the power. A normalize form is the one that is always "1.x".
float
and double
in C.
The formats store binary real numbers in normalized scientific
notation. Single precision format is like this
s exponent significand 1 bit 8 bits 23 bitsFor double precision, 11 bits is used for exponent and 52 bits are for significand. The meaning of each bit field is
31 30 29 ... 23 22 21 ... ... 1 0 (bit position) | s | exponent | b_1 b_2 b_22 b_23 | (name) e f (notation) 1 bit 8 bits 23 bitswhere s is a one bit number taking 0 (positive) or 1 (negative) for the sign of the floating point number; e (exponent) is a 8-bit unsigned number varies from 0 to 255. The rest of the bits we call them b_1, b_2, ..., b_23. f is defined as
f = b_1 2^{-1} + b_2 2^{-2} + ... + b_22 2^{-22} + b_23 2^{-23}The value of the bit pattern interpreted as a single precision floating point number is
(-1)^s x ( 1 + f ) x 2^{e-127}The number 127 in the exponent is called a bias. (The exponent takes 11 bits and the bias is 1023 in double precision).
Here are some examples:
What do the bit patterns represent?
a = 0 01111111 00000000000000000000000 b = 1 10000010 01000000000000000000000
(a), s = 0, e = 01111111 = 127, f = b_1 2^{-1} + ... = 0, so the number is
(-1)^0 x ( 1 + 0 ) x 2^{127-127} = 1Namely, unity.
(b), s = 1, e = 10000010 = 128 + 2 = 130, f = 0 x 2^{-1} + 1 x 2^{-2} = 0.25, so the value is
(-1)^1 x ( 1 + 0.25 ) x 2^{130-127} = - 1.25 x 2^3 = - 10.0
Problem: What is the number just slightly bigger than 1 in IEEE single precision? What is the largest and smallest normalized number in IEEE single precision representation?
The number 1 is represented as
0 01111111 00000000000000000000000Next number that is just slightly bigger is
0 01111111 00000000000000000000001Namely
(-1)^0 x (1 + 2^{-23}) x 2^{127 - 127} = 1 + 2^{-23} = 1.0000001192...
The largest number is when e = 254 (not 255), f = 0.999.., or
(-1)^0 x( 1 + 0.999...) x 2^(254-127) = 2^128 = 3.4 x 10^{38}The smallest number is when e = 1 (not 0), f = 0, or
(-1)^0 x ( 1 + 0 ) x 2^(1-127) = 2^{-126} = 1.2 x 10^{-38}They are very large and very small indeed.
Example: Representation of pi in IEEE single precision format. Since
pi = 3.1415926... <- decimal = 11.001001000011... <- binary = 1.1001001000011... x 2^1 <- binary, normalized scientificwe have s = 0 (positive), e - 127 = 1, or e = 128, and significand f = 0.100100100011... So the bit pattern is
0 10000000 1001001000011...
Note that the leading 1 in (pi=1.1001... x 2^1) is implicit, and is not represented in the bit pattern. f contains only the fractional part.
To illustrate the necessary steps for machine calculation of floating-point addition, we consider adding
9.999 x 10^1 + 1.610 x 10^{-1}assuming 4 decimal digits storage for the significand and 2 decimal digits for the exponents.
Step 1: Shift decimal point of the smaller number, so that the power is the same as the larger number.
1.610 x 10^{-1} = 0.01610 x 10^1 = 0.016 x 10^1
Step 2: Add the significands
9.999 + 0.016 = 10.015
Step 3: Shift decimal point to normalized scientific notation.
10.015 x 10^1 = 1.0015 x 10^2Check for overflow or underflow of the exponent.
Step 4: Round the number to 4 digits, so the final result is 1.002 x 10^2.
The steps used by computer are similar to what we did above. Let's look at another example: Try adding the numbers 0.5 and -0.4375 in binary using 4-bit precision.
Answer: write the numbers in normalized scientific notation.
0.5 = 2^{-1} = 1.000 x 2^{-1} -0.4375 = -7/16 = -0.0111 = -1.110 x 2^{-2}
Step 1: Binary point of smaller number is shifted left to match the exponent of the large number. -1.110x2^{-2} = -0.111 x 2^{-1}.
Step 2: Add the significands 1.000 + (-0.111) = 0.001.
Step 3: Normalize the sum, check for possible overflow or underflow. 0.001 x 2^{-1} = 1.000 x 2^{-4}. Overflow means the exponent is too large; underflow exponent too small.
Step 4: Round the sum to 4 bits. Not needed in this case. 1.000 x 2^{-4} = 1/2^4 = 0.0625.
Problem: Compute 1.5 + 10.0, follow the floating-point addition steps in binary. Keep 4 bits accuracy.
Again, we use decimal number multiplication in scientific notation as an example. Multiply decimal numbers 1.110 x 10^{10} and 9.200 x 10^{-5}. Assume 4 digits of significand and 2 digits of the exponents.
Step 1: Addition of the exponents, so the new exponent is 10+(-5) = 5.
Step 2: Multiplication of the significands 1.110 x 9.200 = 10.21200.
Step 3: Normalize the result 10.21200 x 10^5 = 1.021200 x 10^6. Also check for overflow (exponent to large) or underflow (exponent too small).
Step 4: Rounding to 4 digits, after rounding, the result is 1.021 x 10^6.
Step 5: Compute the sign of the product. In this case, it is positive.
Example in binary. Multiplying 0.5 and -0.4375, with 4 bits for significand. 0.5 = 1.000 x 2^{-1}, -0.4375 = 1.110 x 2^{-2}.
Step 1: Adding the exponents (-1) + (-2) = -3.
Step 2: Multiplying the significands 1.000 x 1.110 = 1.110000.
Step 3: Normalize the result. 1.110000 x 2^{-3}. Already normalized.
Step 4: Rounding the product 1.110 x 2^{-3}.
Step 5: Compute the sign (+1) x (-1) = -1. So the final result is -1.110 x 2^{-3} = -0.21875.
Textbook, Computer Organization & Design, pages 225-238.