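Representation: bits to the right of the "binary point" represent fractional powers of 2.
Represents the rational number $\sum_{k=-j}^{i} b_k \times 2^k$, where $b_i \ldots b_0$ are the bits to the left of the binary point and $b_{-1} \ldots b_{-j}$ are the bits to the right.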
- Divide by 2 by shifting right (unsigned)
- Multiply by 2 by shifting left
- Numbers of the form $0.111\ldots1_2$ are just below 1.0 (use the notation $1.0 - \varepsilon$)
Can only exactly represent numbers of the form $x/2^k$.
Other rational numbers have repeating bit representations
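The weighting of bits on either side of the binary point can be sketched in C. This is only an illustration: `frac_binary_value` is a hypothetical helper, not something from the notes, and it assumes its input contains only '0', '1', and at most one '.'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: evaluate a string such as "101.11" as a
 * fractional binary number. Assumes only '0', '1', and one optional '.'. */
double frac_binary_value(const char *s)
{
    assert(s != NULL);
    double value = 0.0;
    double weight = 0.5;   /* weight of the first bit right of the point: 2^-1 */
    int seen_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') {
            seen_point = 1;
        } else if (!seen_point) {
            /* integer part: shift left by one position, then add the new bit */
            value = value * 2.0 + (*s - '0');
        } else {
            /* fractional part: weights 2^-1, 2^-2, ... */
            value += (*s - '0') * weight;
            weight /= 2.0;
        }
    }
    return value;
}
```

For instance, `frac_binary_value("101.11")` evaluates to 5.75 and `frac_binary_value("0.111111")` to 63/64, a value just below 1.0.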
Limitation: just one setting of the binary point within the w bits.
There are only so many bits to the left and right of the binary point. If we move the binary point to the left, we can no longer represent large numbers, only small ones. In exchange we get more precision to the right of the binary point, so we can represent more fractional values, but over a much smaller range.
Numerical form: $(-1)^s \times M \times 2^E$
Sign bit s determines whether the number is negative or positive.
Significand M is normally a fractional value in the range [1.0, 2.0).
Exponent E weights the value by a power of two.
- MSB s is the sign bit s
- exp field encodes E (but is not equal to E)
- frac field encodes M (but is not equal to M)
Single precision (32 bits) has one sign bit (there is always a sign bit), 8 exp bits, and 23 frac bits.
Double precision (64 bits) has one sign bit, 11 exp bits, and 52 frac bits.
Exponent is coded as a biased value: $E = exp - Bias$
exp: unsigned value of the exp field
$Bias = 2^{k-1} - 1$, where k is the number of exponent bits
Single precision: Bias = 127 (exp: 1…254, E: -126…127)
We have already seen two's complement, which is a perfectly good way to represent positive and negative numbers, and exponents can be negative or positive. So why not use two's complement in the exp field?
The reason is comparison. With the biased encoding, the most negative exponent has the smallest exp bit pattern and the largest exponent has the largest one (11…10 for normalized values, since all ones is reserved), so a larger exponent always means a larger unsigned value in the exp field. As a result, two nonnegative floating-point numbers can be compared simply by treating their bits as unsigned integers.
Significand coded with an implied leading 1: $M = 1.xxx\ldots x_2$
- xxx…x: bits of the frac field, that is, all the bits to the right of the binary point; the leading 1 is implied
- Minimum when frac = 000…0 (M = 1.0)
- Maximum when frac = 111…1 ($M = 2.0 - \varepsilon$)
- Get the extra leading bit for “free” (the implied 1.)
We always normalize M: whatever number we want to represent, we write M as 1.xxx…x and adjust the exponent accordingly.
float F = 15213.0;
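Working this example by hand: 15213 = 11101101101101 in binary = 1.1101101101101 × 2^13, so frac is 1101101101101 followed by ten zeros, and exp = 13 + 127 = 140 = 10001100 in binary. The sketch below checks the fields in C. It assumes `float` is IEEE 754 single precision, and the helper names (`float_bits`, `exp_field`, and so on) are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bits of a float into a 32-bit integer.
 * Assumes float is IEEE 754 single precision. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined way to reinterpret the bits */
    return u;
}

/* Field extraction for the 1 / 8 / 23 bit single-precision layout. */
unsigned sign_field(uint32_t u) { return u >> 31; }
unsigned exp_field(uint32_t u)  { return (u >> 23) & 0xFF; }
uint32_t frac_field(uint32_t u) { return u & 0x7FFFFF; }

/* E = exp - Bias, valid for normalized values only (Bias = 127). */
int exponent_value(uint32_t u)  { return (int)exp_field(u) - 127; }
```

For `15213.0f` the bit pattern is `0x466DB400`: sign 0, exp 140, and E = 13.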
Normalized values always have the implied leading 1, which limits how close to zero they can get.
Denormalized values are characterized by an exp field of all 0s. The exponent value is E = 1 - Bias (instead of E = 0 - Bias), and the significand is coded with an implied leading 0: M = 0.xxx…x.
exp = 000…0, frac = 000…0: represents the zero value. Note the distinct values +0 and -0.
exp = 000…0, frac != 000…0: numbers closest to 0.0, equispaced.
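Special values are characterized by exp = 111…1.
exp = 111…1, frac = 000…0: represents infinity.
- Operation that overflows
- Both positive and negative
exp = 111…1, frac != 000…0: Not-a-Number (NaN).
- Represents the case when no numeric value can be determined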
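The case analysis on the exp and frac fields can be sketched as a small classifier. This is an illustration only: it assumes `float` is IEEE 754 single precision, and `classify_float` and `float_from_bits` are hypothetical names, not standard functions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build a float from a raw 32-bit pattern. */
float float_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Classify a single-precision value by inspecting its exp and frac fields. */
const char *classify_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    uint32_t exp  = (u >> 23) & 0xFF;   /* 8 exponent bits  */
    uint32_t frac = u & 0x7FFFFF;       /* 23 fraction bits */

    if (exp == 0)                       /* exp = 000...0 */
        return frac == 0 ? "zero" : "denormalized";
    if (exp == 0xFF)                    /* exp = 111...1 */
        return frac == 0 ? "infinity" : "NaN";
    return "normalized";
}
```

Both +0 and -0 are classified as "zero" because only the sign bit differs between them.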
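Consider a tiny 8-bit floating-point format:
The sign bit is the most significant bit.
The next four bits are the exponent, with a bias of 7.
The last three bits are the frac.
It has the same general form as the IEEE format, including the representations of 0, NaN, and infinity.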
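A decoder for this 8-bit layout might look like the sketch below. The function name `tiny_to_double` is hypothetical, and the code simply applies the normalized, denormalized, and special-value rules from the earlier sections to a 1 / 4 / 3 bit layout with a bias of 7.

```c
#include <math.h>     /* only for the INFINITY and NAN macros */
#include <stdint.h>

/* Hypothetical decoder for the tiny format: 1 sign bit, 4 exp bits,
 * 3 frac bits, Bias = 7. */
double tiny_to_double(uint8_t b)
{
    const int bias = 7;
    unsigned s    = b >> 7;
    unsigned exp  = (b >> 3) & 0xF;
    unsigned frac = b & 0x7;
    double sign   = s ? -1.0 : 1.0;

    if (exp == 0xF)                       /* special values */
        return frac == 0 ? sign * INFINITY : NAN;

    /* Denormalized: M = 0.xxx, E = 1 - Bias. Normalized: M = 1.xxx, E = exp - Bias. */
    double m = (exp == 0) ? frac / 8.0 : 1.0 + frac / 8.0;
    int e    = (exp == 0) ? 1 - bias   : (int)exp - bias;

    double scale = 1.0;                   /* scale = 2^e, built by repeated doubling or halving */
    for (int i = 0; i < e; i++) scale *= 2.0;
    for (int i = 0; i > e; i--) scale /= 2.0;

    return sign * m * scale;
}
```

Under these rules the smallest positive value is `0x01` (1/512), the smallest normalized value is `0x08` (8/512), and the largest finite value is `0x77` (240).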
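Floating-point zero is the same as integer zero: all bits = 0.
Floating-point values can (almost) be compared as unsigned integers:
- Must first compare sign bits
- Must consider -0 = 0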
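A less-than test built on these properties could be sketched as follows. It assumes `float` is IEEE 754 single precision and that neither argument is NaN; `float_less_bits` is a hypothetical name used only here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: x < y using only integer operations on the bits.
 * Assumes IEEE 754 single precision and that neither argument is NaN. */
int float_less_bits(float x, float y)
{
    uint32_t ux, uy;
    memcpy(&ux, &x, sizeof ux);
    memcpy(&uy, &y, sizeof uy);

    /* -0 and +0 compare equal, so neither is less than the other. */
    if (((ux | uy) & 0x7FFFFFFFu) == 0)
        return 0;

    uint32_t sx = ux >> 31, sy = uy >> 31;
    if (sx != sy)               /* compare sign bits first */
        return sx > sy;         /* negative < nonnegative  */

    /* Same sign: unsigned order matches for positives and is reversed for negatives. */
    return sx == 0 ? ux < uy : ux > uy;
}
```

The final comparison is the payoff of the biased exponent: for values of the same sign, ordering by magnitude is just unsigned ordering of the bit patterns.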
First compute exact result
Make it fit into desired precision
Possibly overflow if exponent too large
Possibly round to fit into frac.
If the value is less than halfway, round down; if it is more than halfway, round up.
If the value is exactly halfway, round toward the nearest even number.
This default was chosen for statistical reasons: for uniformly distributed values, halfway cases round up about 50% of the time and down about 50% of the time, so there is no statistical bias in either direction.
“Even” when the least significant bit is 0
“Half way” when the bits to the right of the rounding position = $100\ldots_2$
Round to nearest (2 bits right of binary point)
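Round-to-nearest-even on a bit pattern can be sketched with integer operations. In this illustration the value is held as an unsigned fixed-point integer, and `round_half_even` (a hypothetical name) drops the low `drop` bits. With five fractional bits in and `drop = 3`, it rounds to 2 bits right of the binary point.

```c
#include <assert.h>

/* Hypothetical sketch: drop the low `drop` bits of v, rounding to nearest
 * and breaking exact ties toward an even result (LSB = 0). */
unsigned round_half_even(unsigned v, unsigned drop)
{
    assert(drop >= 1 && drop < 32);
    unsigned q    = v >> drop;                  /* bits kept                 */
    unsigned rem  = v & ((1u << drop) - 1u);    /* bits being discarded      */
    unsigned half = 1u << (drop - 1);           /* pattern 100...0 = halfway */

    if (rem > half || (rem == half && (q & 1u)))
        q++;                                    /* above half, or a tie with an odd LSB */
    return q;
}
```

For example, 10.00011 (2 3/32) rounds down to 10.00, 10.00110 (2 3/16) rounds up to 10.01, the tie 10.11100 (2 7/8) rounds up to the even 11.00, and the tie 10.10100 (2 5/8) rounds down to the even 10.10.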
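For operands $(-1)^{s_1} M_1 2^{E_1}$ and $(-1)^{s_2} M_2 2^{E_2}$, the exact result $(-1)^s M 2^E$ has:
- Sign s: s1 xor s2
- Significand M: M1 × M2
- Exponent E: E1 + E2
- If M >= 2, shift M right, increment E
- If E out of range, overflow
- Round M to fit frac precision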
Biggest chore is multiplying significands.
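The multiplication steps can be sketched on single-precision bit patterns. This is a simplified illustration, not a full implementation: it assumes IEEE 754 single precision, that both inputs and the result are normalized (no zeros, denormals, infinities, NaNs, overflow, or underflow), and round-to-nearest-even. The names `float_bits` and `fp_mul_bits` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bits of a float (assumes IEEE 754 single precision). */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical sketch: multiply two normalized floats given as bit patterns. */
uint32_t fp_mul_bits(uint32_t a, uint32_t b)
{
    uint32_t s  = (a ^ b) >> 31;                      /* s = s1 xor s2 */
    uint32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    assert(ea != 0 && ea != 0xFF && eb != 0 && eb != 0xFF);

    uint64_t ma = (a & 0x7FFFFF) | 0x800000;          /* restore the implied leading 1 */
    uint64_t mb = (b & 0x7FFFFF) | 0x800000;
    uint64_t prod = ma * mb;                          /* M1 x M2, scaled by 2^46 */
    int32_t  e = (int32_t)ea + (int32_t)eb - 127;     /* biased form of E1 + E2  */

    unsigned drop = 23;
    if (prod >> 47) { drop = 24; e++; }               /* M >= 2: shift right, increment E */

    /* Round M to 24 significant bits, ties to even. */
    uint64_t q    = prod >> drop;
    uint64_t rem  = prod & ((1ull << drop) - 1);
    uint64_t half = 1ull << (drop - 1);
    if (rem > half || (rem == half && (q & 1)))
        q++;
    if (q >> 24) { q >>= 1; e++; }                    /* rounding carried out to 2.0 */

    assert(e > 0 && e < 0xFF);                        /* this sketch does not handle overflow or underflow */
    return (s << 31) | ((uint32_t)e << 23) | ((uint32_t)q & 0x7FFFFF);
}
```

On a platform where `float` arithmetic is IEEE single precision, the result should match the bits of the hardware product for normalized inputs, for example 1.5 × 2.5 = 3.75.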
For operands $(-1)^{s_1} M_1 2^{E_1}$ and $(-1)^{s_2} M_2 2^{E_2}$, assume $E_1 > E_2$:
- Sign s, significand M: result of signed align and add.
- Exponent E: E1
- If M >= 2, shift M right, increment E.
- If M < 1, shift M left k positions, decrement E by k.
- Overflow if E out of range.
- Round M to fit frac precision.
Floating-point addition is not associative.
In single precision, (3.14 + 1e10) - 1e10 = 0, but 3.14 + (1e10 - 1e10) = 3.14.
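The two groupings can be checked in C. This is a small sketch with hypothetical function names, and it assumes `float` is IEEE 754 single precision. The `volatile` temporaries are there to force each intermediate result to be rounded to `float`, in case the compiler would otherwise keep extra precision.

```c
/* Hypothetical sketch: the same three operands, grouped two ways. */

/* (a + b) - b: a is absorbed when b is much larger than a. */
float add_then_sub(float a, float b)
{
    volatile float t = a + b;   /* rounded to single precision here */
    return t - b;
}

/* a + (b - b): the large values cancel first, so a survives. */
float sub_then_add(float a, float b)
{
    volatile float t = b - b;
    return a + t;
}
```

Near 1e10 adjacent single-precision values are 1024 apart, so adding 3.14 rounds back to 1e10 and the first grouping returns 0, while the second returns 3.14.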
- float: single precision
- double: double precision
Casting between int, float, and double changes the bit representation.
double/float → int:
- Truncates fractional part.
- Like rounding toward zero.
- Not defined when out of range or NaN: generally sets to TMin.
int → double:
- Exact conversion, as long as int has <= 53-bit word size.
int → float:
- Will round according to rounding mode.
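These conversion rules can be observed with a few casts. The sketch below assumes a 32-bit `int`, IEEE 754 `float` and `double`, and the default round-to-nearest-even mode. The function names are hypothetical.

```c
/* Hypothetical sketch of the conversion rules. */

/* double -> int: the fractional part is discarded (rounds toward zero).
 * The caller must keep d within the range of int; otherwise the result
 * is undefined in C. */
int truncate_to_int(double d)
{
    return (int)d;
}

/* int -> float -> int: float has only 24 significant bits, so large
 * values may be rounded on the way in. */
int round_trip_via_float(int i)
{
    volatile float f = (float)i;   /* volatile forces rounding to float */
    return (int)f;
}

/* int -> double -> int: exact for a 32-bit int, since double has 53
 * significant bits. */
int round_trip_via_double(int i)
{
    volatile double d = (double)i;
    return (int)d;
}
```

For example, 16777217 (2^24 + 1) is not representable as a `float` and comes back as 16777216, while the same value survives the trip through `double` unchanged.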
- 1. Fractional binary numbers
- 2. “Normalized” Values
- 3. Normalized Encoding Example
- 4. Denormalized Values
- 5. Special Values
- 6. Tiny Floating Point Example
- 7. Special Properties of the IEEE Encoding
- 8. Floating Point Operations: Basic Idea
- 9. Rounding
- 10. Rounding Binary Numbers
- 11. FP Multiplication
- 12. Floating Point Addition
- 13. Mathematical Properties of FP Add
- 14. Floating Point in C