4. Floating Point
Fractional binary numbers
What is
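Representation: bits to the right of the “binary point” represent fractional powers of 2.
Represents the rational number: $\sum_{k=-j}^{i} b_k \times 2^k$, for the bit string $b_i b_{i-1} \cdots b_1 b_0 . b_{-1} b_{-2} \cdots b_{-j}$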
Examples
Value | Representation |
---|---|
5 3/4 | $101.11_2$ |
2 7/8 | $10.111_2$ |
1 7/16 | $1.0111_2$ |
Observations
- Divide by 2 by shifting right (unsigned)
- Multiply by 2 by shifting left
- Numbers of the form $0.111\ldots1_2$ are just below 1.0: $1/2 + 1/4 + 1/8 + \cdots + 1/2^i \to 1.0$ (use the notation $1.0 - \epsilon$)
Limitation #1
Can only exactly represent numbers of the form $x / 2^k$.
Other rational numbers have repeating bit representations.
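Value | Representation |
---|---|
1/3 | $0.0101010101[01]\ldots_2$ |
1/5 | $0.001100110011[0011]\ldots_2$ |
1/10 | $0.0001100110011[0011]\ldots_2$ |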
Limitation #2
Just one setting of the binary point within the w bits.
There are only so many bits to the left and to the right of the binary point.
If we move the binary point to the left, we can no longer represent as many large numbers.
In exchange, we gain precision to the right of the binary point,
so we can represent more fractional values, but over a much smaller range.
Numerical Form: $(-1)^s \times M \times 2^E$
- Sign bit s determines whether the number is negative or positive
- Significand M is normally a fractional value in the range [1.0, 2.0)
- Exponent E weights the value by a power of two
Encoding
- MSB s is the sign bit s
- exp field encodes E (but is not equal to E)
- frac field encodes M (but is not equal to M)
Single precision (32 bits) has 1 sign bit (there is always a sign bit), 8 exp bits, and 23 frac bits.
Double precision (64 bits) has 1 sign bit, 11 exp bits, and 52 frac bits.
“Normalized” Values
Condition: $exp \neq 000\ldots0$ and $exp \neq 111\ldots1$
Exponent coded as a biased value: $E = exp - Bias$
exp: unsigned value of the exp field
$Bias = 2^{k-1} - 1$, where k is the number of exponent bits
Single precision: Bias = 127 (exp: 1…254, E: -126…127)
We have already learned about two’s complement, which is a perfectly good way to represent positive and negative numbers. Exponents can be negative or positive, so why not store a two’s-complement value in the exp field?
The reason is comparison. With the biased representation, the smallest exponent of a normalized value is encoded as 00…01 and the largest as 11…10 (all ones is reserved for special values), so a larger exponent always has a larger unsigned bit pattern. Once the sign bits are dealt with, two floating-point numbers can be compared just by comparing their bits as unsigned integers.
Significand coded with implied leading 1: $M = 1.xxx\ldots x_2$
- xxx…x: bits of the frac field, that is, all the bits to the right of the binary point; the leading 1 is implied
- Minimum when frac = 000…0 (M = 1.0)
- Maximum when frac = 111…1 (M = $2.0 - \epsilon$)
- Get the extra leading bit for “free” (the “1.”)
We always normalize M: whatever number we want to represent, we write M as 1.xxx…x and adjust the exponent accordingly.
Normalized Encoding Example
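Value
float F = 15213.0;
$15213_{10} = 11101101101101_2 = 1.1101101101101_2 \times 2^{13}$
Significand
$M = 1.1101101101101_2$
$frac = 11011011011010000000000_2$
Exponent
$E = 13$, $Bias = 127$
$exp = E + Bias = 140 = 10001100_2$
Result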
s | exp | frac |
---|---|---|
0 | 10001100 | 11011011011010000000000 |
Denormalized Values
Normalized values always have the implied leading 1, which limits us when we want to represent numbers close to zero.
Denormalized values are characterized by an exp field of all 0s. The exponent value is $E = 1 - Bias$ (instead of $0 - Bias$), and the significand is coded with an implied leading 0: $M = 0.xxx\ldots x_2$.
Cases
exp = 000…0, frac = 000…0: represents the zero value. Note the distinct values +0 and -0.
exp = 000…0, frac ≠ 000…0: numbers closest to 0.0, equispaced.
Special Values
Condition: exp = 111…1
Case: exp = 111…1, frac = 000…0
Represents value $\infty$ (infinity)
Operation that overflows
Both positive and negative
E.g., $1.0/0.0 = -1.0/-0.0 = +\infty$, $1.0/-0.0 = -\infty$
Case: exp = 111…1, frac ≠ 000…0
Not-a-Number (NaN)
Represents the case when no numeric value can be determined
E.g., sqrt(-1), $\infty - \infty$, $\infty \times 0$
Tiny Floating Point Example
s | exp | frac |
---|---|---|
1 | 4-bits | 3-bits |
8-bit Floating Point Representation
The sign bit is the most significant bit.
The next four bits are the exp field, with a bias of 7.
The last three bits are the frac field.
Same general form as IEEE Format
normalized, denormalized
representation of 0, NaN, infinity
Dynamic Range (Positive Only)
Special Properties of the IEEE Encoding
FP Zero Same as Integer Zero
All bits = 0
Can (Almost) Use Unsigned Integer Comparison
Must first compare sign bits
Must consider -0 = 0
Otherwise OK.
Floating Point Operations: Basic Idea
First compute exact result
Make it fit into desired precision
Possibly overflow if exponent too large
Possibly round to fit into frac.
Rounding
Nearest Even
Value | 1.40 | 1.60 | 1.50 | 2.50 | -1.50 |
---|---|---|---|---|---|
Nearest even | 1 | 2 | 2 | 2 | -2 |
If the value is less than halfway, round down; if it is more than halfway, round up.
If the value is exactly halfway, round to the nearest even number.
This rule was chosen for statistical reasons:
given a uniform distribution of values, the halfway cases round up about 50% of the time and down about 50% of the time,
so there is no statistical bias toward rounding up or down.
Rounding Binary Numbers
Binary Fractional Numbers
“Even” when least significant bit is 0
“Half way” when bits to right of rounding position = $100\ldots_2$
Examples
Round to nearest (2 bits right of binary point)
Value | Binary | Rounded | Action | Rounded Value |
---|---|---|---|---|
2 3/32 | $10.00011_2$ | $10.00_2$ | <1/2, down | 2 |
2 3/16 | $10.00110_2$ | $10.01_2$ | >1/2, up | 2 1/4 |
2 7/8 | $10.11100_2$ | $11.00_2$ | =1/2, up | 3 |
2 5/8 | $10.10100_2$ | $10.10_2$ | =1/2, down | 2 1/2 |
FP Multiplication
Operands: $(-1)^{s1} M1\, 2^{E1} \times (-1)^{s2} M2\, 2^{E2}$
Exact Result: $(-1)^s M\, 2^E$
- Sign s: s1 xor s2
- Significand M: M1 × M2
- Exponent E: E1 + E2
Fixing
- if M >= 2, shift M right, increment E
- if E out of range, overflow
- Round M to fit frac precision
Implementation
Biggest chore is multiplying significands.
Floating Point Addition
Operands: $(-1)^{s1} M1\, 2^{E1} + (-1)^{s2} M2\, 2^{E2}$, assuming E1 > E2
Exact Result: $(-1)^s M\, 2^E$
- Sign s, significand M: result of signed align & add.
- Exponent E: E1
Fixing
- if M >= 2, shift M right, increment E.
- if M < 1, shift M left k positions, decrement E by k.
- Overflow if E out of range.
- Round M to fit frac precision.
Mathematical Properties of FP Add
Floating-point addition is not associative. For example, in single precision:
(3.14 + 1e10) - 1e10 = 0, but 3.14 + (1e10 - 1e10) = 3.14
…
Floating Point in C
C Guarantees Two Levels
- float: single precision
- double: double precision
Conversions/Casting
Casting between int, float, and double changes the bit representation.
double/float to int
- Truncates fractional part.
- Like rounding toward zero
- Not defined when out of range or NaN: Generally sets to TMin.
int to double
- Exact conversion, as long as int has <= 53 bit word size.
int to float
- will round according to rounding mode.