## Fractional binary numbers

### What is $1011.101_2$

Representation: Bit to right of ‘binary point’ represent fractional powers of 2,

Represents rational number: $\sum^i_{k=-i}b_k\cdot2^k$

### Examples

value Representation
$5\frac{3}{4}$ $101.11_2$
$2\frac78$ $10.111_2$
$1\frac7{16}$ $1.0111_2$

### Observations

• Divide by 2 by shifting right (unsigned)
• Multiply by 2 by shifting left
• Numbers of form $0.111111_{...2}$ are just below 1.0(Use notation 1.0 - $\varepsilon$)

### Limitation #1

Can only exactly represent numbers of the form $\frac{x}{2^k}$,

Other rational numbers have repeating bit representations

Value Representation
$\frac13$ $0.0101010101\{01\}_{...2}$
$\frac{1}{5}$ $0.001100110011\{0011\}_{...2}$

### Limitation #2

Just on setting of binary point within the w bits.

There’s only so many bits to the left and the right of the binary point,

if we move the binary point to the left, then we can’t represent as many large numbers,

we can only represent small numbers, but we have more precision to the right of the binary point,

so we can represent more fractional values, just the range of those values will be much smaller.

### Numerical Form:

$(-1)^sM2^E$

• Sign bit s determines whether number is negative or positive

• Significand M normally a fractional value in range [1.0, 2.0)

• Exponent E weights value by power of two

### Encoding

• MSB s is sign bit s
• exp field encodes E(but is not equal to E)
• frac field encodes M(but is not equal to M)

in the single precision 32 bits, it have one sign bit(there’s always a sign bit) and 8 exp bits and 23 frac bits,

int the double precision it have 11 exp and 52 frac bits.

## “Normalized” Values

exp\neq000...0 \or111...1

### Exponent coded as a biased values: $e=exp-bias$

exp: unsigned value of exp field

$bias=2^{k-1}-1$,where k is number of exponent bits

Single precision: 127(exp:1…254, e:-126…127)

we’ve already learned about two’s complement, that’s perfectly way to represent positive and negative numbers, we have exponents that are negative and positive, so why not just use a two’s complement in the exp field to represent those positive and negative exponents?

The smallest negative exponent is represented by all zeros, and the largest exponent is represented by 01…111(?), so the number with the smallest exponent, if we were just to compare the bits, using a just some kind of unsigned representation, just comparing the bits, treating it as an unsigned number, by using this biased representation, we can just compare two floating-pint numbers just as unsigned

### Significand coded with implied leading 1: $M = 1.xxx...x_2$

• xxx…xxx: bits of frac field, which is all of the numbers to the right of the binary point right, 1 is implied
• Minimum when frac = 000…0(M = 1.0)
• Maximum when frac = 111…1(M = 2.0 - $\varepsilon$)
• Get extra leading bit for “free”(1.)

We’re always going to normalize M, no matter what the number we want to represent, we always going to normalize M as 1.xxx.x, then we adjust the exponent accordingly.

## Normalized Encoding Example

$v=(-1)^s M 2^E$

$E=Exp-Bias$

### value

float F = 15213.0;

$15213_{10}=11101101101101_2 = 1.1101101101101_2 \times 2^{13}$

### Significand

$M = 1.1101101101101_2$

$frac = 11011011011010000000000_2$

### Exponent

$E = 13$

$Bias = 127$

$Exp = 140=10001100_2$

### Result

s exp frac
0 10001100 11011011011010000000000

## Denormalized Values

Normalized values always have this implied one, when we want to represent numbers closer to zero, that limits us.

Denormalized values is charactered by an exp field of all 0.

$v = (-1)^sM2^E$

$E=1-Bias$

### Cases

exp = 000…0, frac = 000…0: Represents zero value, Note distinct values: +0 and -0

exp = 000…0, frac != 000…0, Numbers closest to 0.0, equispaced.

## Special Values

exp = 111…111

### Case: exp = 111…1, frac = 000…0

Represents value $\infin$

Operation that overflows

Both positive and negative

E.g., $\frac{1.0}{0.0}=\frac{-1.0}{-0.0}=+\infin, \frac{1.0}{-0.0}=-\infin$

### Case: exp = 111…1, frac != 000…0

Not-a-Number(NaN)

Represents case when no numeric value can be determined

E.g., sqrt(-1), $\infin-\infin, \infin\times0$

## Tiny Floating Point Example

s exp frac
1 4-bits 3-bits

### 8-bit Floating Point Representation

the sign bit is in the most significant bit

the next four bits are the exponent, with a bias of 7

the last three bits are the frac.

### Same general form as IEEE Format

normalized, denormalized

representation of 0, NaN, infinity

## Special Properties of the IEEE Encoding

All bits = 0

### Can (Almost) Use Unsigned Integer Comparison

Must first compare sign bits

Must consider -0 = 0

Otherwise OK.

## Floating Point Operations: Basic Idea

$x +_fy=Round(x + y)$

$x \times_fy=Round(x\times y)$

### basic idea

First compute exact result

Make it fit into desired precision

Possibly overflow if exponent too large

Possibly round to fit into frac.

## Rounding

### Nearest Even

1.40 1.60 1.50 2.50 -1.50
1 2 2 2 -2

If the value less than half, then round down; if the value more than half, then round up.

if the value is exactly halfway, then round towards the nearest even number

The reason they chose this is that statistically.

if we have a uniform distribution of sort of numbers, they are going to round up or down about 50% of the times.

So there won’t be a statistical bias rounding up or down one way or the other.

## Rounding Binary Numbers

### Binary Fractional Numbers

“Even” when least significant bit is 0

“Half way” when bits to right of rounding position = $100..._2$ e

### Examples

Round to nearest $\frac{1}{4}$(2 bits right of binary point)

Value Binary Rounded Action Rounded Value
$2\frac{3}{32}$ $10.00011_2$ $10.00_2$ <1/2 down 2
$2\frac{3}{16}$ 10.00110 10.01 >1/2 up $2\frac{1}{4}$
$2\frac{7}{8}$ 10.11100 11.00 =1/2 up 3
$2\frac{5}{8}$ 10.10100 10.10 =1/2 down $2\frac{1}{2}$

## FP Multiplication

$(-1)^{s_1}M_12^{E_1}\times(-1)^{s_2}M_22^{E_2}$

### Exact Result

$(-1)^sM2^E$

• Sign s: s1 xor s2
• Significand M: M1 x M2
• Exponent E: E1 + E2

### Fixing

• if M >= 2, shift M right, increment E
• if E out of range, overflow
• Round M to fit frac precision

### Implementation

Biggest chore is multiplying significands.

$(-1)^{s_1}M_12^{E_1}+(-1)^{s_2}M_22^{E_2}$

### Exact Result

$(-1)^sM2^E$

• Sign s, significand M: result of signed align & add.
• Exponent E: E1

### Fixing

• if M >= 2, shift M right, increment E.
• if M < 1, shift M left k positions, decrement E by k.
• Overflow if E out of range.
• Round M to fit frac precision.

## Mathematical Properties of FP Add

(3.14 + 1e10) - 1e10 = 0, 3.14 + (1e10 -1e10) = 3.14

## Floating Point in C

### C Guarantees Two Levels

• float: single precision
• double: double precision

### Conversions/Casting

Casting between int, float, and double changes bit representation.s

#### double/float to int

• Truncates fractional part.
• Like rounding toward zero
• Not defined when out of range or NaN: Generally sets to TMin.

#### int to double

• Exact conversion, as long as int has <= 53 bit word size.

#### int to float

• will round according to rounding mode.