4.Floating Point
Fractional binary numbers
What is $1011.101_2$
Representation: Bit to right of ‘binary point’ represent fractional powers of 2,
Represents rational number: $\sum^i_{k=i}b_k\cdot2^k$
Examples
value  Representation 

$5\frac{3}{4}$  $101.11_2$ 
$2\frac78$  $10.111_2$ 
$1\frac7{16}$  $1.0111_2$ 
Observations
 Divide by 2 by shifting right (unsigned)
 Multiply by 2 by shifting left
 Numbers of form $0.111111_{...2}$ are just below 1.0(Use notation 1.0  $\varepsilon$)
Limitation #1
Can only exactly represent numbers of the form $\frac{x}{2^k}$,
Other rational numbers have repeating bit representations
Value  Representation 

$\frac13$  $0.0101010101\{01\}_{...2}$ 
$\frac{1}{5}$  $0.001100110011\{0011\}_{...2}$ 
Limitation #2
Just on setting of binary point within the w bits.
There’s only so many bits to the left and the right of the binary point,
if we move the binary point to the left, then we can’t represent as many large numbers,
we can only represent small numbers, but we have more precision to the right of the binary point,
so we can represent more fractional values, just the range of those values will be much smaller.
Numerical Form:
$(1)^sM2^E$

Sign bit s determines whether number is negative or positive

Significand M normally a fractional value in range [1.0, 2.0)

Exponent E weights value by power of two
Encoding
 MSB s is sign bit s
 exp field encodes E(but is not equal to E)
 frac field encodes M(but is not equal to M)
in the single precision 32 bits, it have one sign bit(there’s always a sign bit) and 8 exp bits and 23 frac bits,
int the double precision it have 11 exp and 52 frac bits.
“Normalized” Values
exp\neq000...0 \or111...1
Exponent coded as a biased values: $e=expbias$
exp: unsigned value of exp field
$bias=2^{k1}1$,where k is number of exponent bits
Single precision: 127(exp:1…254, e:126…127)
we’ve already learned about two’s complement, that’s perfectly way to represent positive and negative numbers, we have exponents that are negative and positive, so why not just use a two’s complement in the exp field to represent those positive and negative exponents?
The smallest negative exponent is represented by all zeros, and the largest exponent is represented by 01…111(?), so the number with the smallest exponent, if we were just to compare the bits, using a just some kind of unsigned representation, just comparing the bits, treating it as an unsigned number, by using this biased representation, we can just compare two floatingpint numbers just as unsigned
Significand coded with implied leading 1: $M = 1.xxx...x_2$
 xxx…xxx: bits of frac field, which is all of the numbers to the right of the binary point right, 1 is implied
 Minimum when frac = 000…0(M = 1.0)
 Maximum when frac = 111…1(M = 2.0  $\varepsilon$)
 Get extra leading bit for “free”(1.)
We’re always going to normalize M, no matter what the number we want to represent, we always going to normalize M as 1.xxx.x, then we adjust the exponent accordingly.
Normalized Encoding Example
$v=(1)^s M 2^E$
$E=ExpBias$
value
float F = 15213.0;
$15213_{10}=11101101101101_2 = 1.1101101101101_2 \times 2^{13}$
Significand
$M = 1.1101101101101_2$
$frac = 11011011011010000000000_2$
Exponent
$E = 13$
$Bias = 127$
$Exp = 140=10001100_2$
Result
s  exp  frac 

0  10001100  11011011011010000000000 
Denormalized Values
Normalized values always have this implied one, when we want to represent numbers closer to zero, that limits us.
Denormalized values is charactered by an exp field of all 0.
$v = (1)^sM2^E$
$E=1Bias$
Cases
exp = 000…0, frac = 000…0: Represents zero value, Note distinct values: +0 and 0
exp = 000…0, frac != 000…0, Numbers closest to 0.0, equispaced.
Special Values
Condition
exp = 111…111
Case: exp = 111…1, frac = 000…0
Represents value $\infin$
Operation that overflows
Both positive and negative
E.g., $\frac{1.0}{0.0}=\frac{1.0}{0.0}=+\infin, \frac{1.0}{0.0}=\infin$
Case: exp = 111…1, frac != 000…0
NotaNumber(NaN)
Represents case when no numeric value can be determined
E.g., sqrt(1), $\infin\infin, \infin\times0$
Tiny Floating Point Example
s  exp  frac 

1  4bits  3bits 
8bit Floating Point Representation
the sign bit is in the most significant bit
the next four bits are the exponent, with a bias of 7
the last three bits are the frac.
Same general form as IEEE Format
normalized, denormalized
representation of 0, NaN, infinity
Dynamic Range (Positive Only)
Special Properties of the IEEE Encoding
FP Zero Same as Integer Zero
All bits = 0
Can (Almost) Use Unsigned Integer Comparison
Must first compare sign bits
Must consider 0 = 0
Otherwise OK.
Floating Point Operations: Basic Idea
$x +_fy=Round(x + y)$
$x \times_fy=Round(x\times y)$
basic idea
First compute exact result
Make it fit into desired precision
Possibly overflow if exponent too large
Possibly round to fit into frac.
Rounding
Nearest Even
1.40  1.60  1.50  2.50  1.50 

1  2  2  2  2 
If the value less than half, then round down; if the value more than half, then round up.
if the value is exactly halfway, then round towards the nearest even number
The reason they chose this is that statistically.
if we have a uniform distribution of sort of numbers, they are going to round up or down about 50% of the times.
So there won’t be a statistical bias rounding up or down one way or the other.
Rounding Binary Numbers
Binary Fractional Numbers
“Even” when least significant bit is 0
“Half way” when bits to right of rounding position = $100..._2$ e
Examples
Round to nearest $\frac{1}{4}$(2 bits right of binary point)
Value  Binary  Rounded  Action  Rounded Value 

$2\frac{3}{32}$  $10.00011_2$  $10.00_2$  <1/2 down  2 
$2\frac{3}{16}$  10.00110  10.01  >1/2 up  $2\frac{1}{4}$ 
$2\frac{7}{8}$  10.11100  11.00  =1/2 up  3 
$2\frac{5}{8}$  10.10100  10.10  =1/2 down  $2\frac{1}{2}$ 
FP Multiplication
$(1)^{s_1}M_12^{E_1}\times(1)^{s_2}M_22^{E_2}$
Exact Result
$(1)^sM2^E$
 Sign s: s1 xor s2
 Significand M: M1 x M2
 Exponent E: E1 + E2
Fixing
 if M >= 2, shift M right, increment E
 if E out of range, overflow
 Round M to fit frac precision
Implementation
Biggest chore is multiplying significands.
Floating Point Addition
$(1)^{s_1}M_12^{E_1}+(1)^{s_2}M_22^{E_2}$
Exact Result
$(1)^sM2^E$
 Sign s, significand M: result of signed align & add.
 Exponent E: E1
Fixing
 if M >= 2, shift M right, increment E.
 if M < 1, shift M left k positions, decrement E by k.
 Overflow if E out of range.
 Round M to fit frac precision.
Mathematical Properties of FP Add
This addition is not associate.
(3.14 + 1e10)  1e10 = 0, 3.14 + (1e10 1e10) = 3.14
…
Floating Point in C
C Guarantees Two Levels
 float: single precision
 double: double precision
Conversions/Casting
Casting between int, float, and double changes bit representation.s
double/float to int
 Truncates fractional part.
 Like rounding toward zero
 Not defined when out of range or NaN: Generally sets to TMin.
int to double
 Exact conversion, as long as int has <= 53 bit word size.
int to float
 will round according to rounding mode.