Fractional binary numbers

What is 1011.10121011.101_2

Representation: Bit to right of ‘binary point’ represent fractional powers of 2,

Represents rational number: k=iibk2k\sum^i_{k=-i}b_k\cdot2^k


value Representation
5345\frac{3}{4} 101.112101.11_2
2782\frac78 10.111210.111_2
17161\frac7{16} 1.011121.0111_2


  • Divide by 2 by shifting right (unsigned)
  • Multiply by 2 by shifting left
  • Numbers of form 0.111111...20.111111_{...2} are just below 1.0(Use notation 1.0 - ε\varepsilon)

Limitation #1

Can only exactly represent numbers of the form x2k\frac{x}{2^k},

Other rational numbers have repeating bit representations

Value Representation
13\frac13 0.0101010101{01}...20.0101010101\{01\}_{...2}
15\frac{1}{5} 0.001100110011{0011}...20.001100110011\{0011\}_{...2}

Limitation #2

Just on setting of binary point within the w bits.

There’s only so many bits to the left and the right of the binary point,

if we move the binary point to the left, then we can’t represent as many large numbers,

we can only represent small numbers, but we have more precision to the right of the binary point,

so we can represent more fractional values, just the range of those values will be much smaller.

Numerical Form:


  • Sign bit s determines whether number is negative or positive

  • Significand M normally a fractional value in range [1.0, 2.0)

  • Exponent E weights value by power of two


  • MSB s is sign bit s
  • exp field encodes E(but is not equal to E)
  • frac field encodes M(but is not equal to M)

in the single precision 32 bits, it have one sign bit(there’s always a sign bit) and 8 exp bits and 23 frac bits,

int the double precision it have 11 exp and 52 frac bits.

“Normalized” Values

exp\neq000...0 \or111...1

Exponent coded as a biased values: e=expbiase=exp-bias

exp: unsigned value of exp field

bias=2k11bias=2^{k-1}-1,where k is number of exponent bits

Single precision: 127(exp:1…254, e:-126…127)

we’ve already learned about two’s complement, that’s perfectly way to represent positive and negative numbers, we have exponents that are negative and positive, so why not just use a two’s complement in the exp field to represent those positive and negative exponents?

The smallest negative exponent is represented by all zeros, and the largest exponent is represented by 01…111(?), so the number with the smallest exponent, if we were just to compare the bits, using a just some kind of unsigned representation, just comparing the bits, treating it as an unsigned number, by using this biased representation, we can just compare two floating-pint numbers just as unsigned

Significand coded with implied leading 1: =

  • xxx…xxx: bits of frac field, which is all of the numbers to the right of the binary point right, 1 is implied
  • Minimum when frac = 000…0(M = 1.0)
  • Maximum when frac = 111…1(M = 2.0 - ε\varepsilon)
  • Get extra leading bit for “free”(1.)

We’re always going to normalize M, no matter what the number we want to represent, we always going to normalize M as, then we adjust the exponent accordingly.

Normalized Encoding Example

v=(1)sM2Ev=(-1)^s M 2^E



float F = 15213.0;

1521310=111011011011012=1.11011011011012×21315213_{10}=11101101101101_2 = 1.1101101101101_2 \times 2^{13}


M=1.11011011011012M = 1.1101101101101_2

frac=110110110110100000000002frac = 11011011011010000000000_2


E=13E = 13

Bias=127Bias = 127

Exp=140=100011002Exp = 140=10001100_2


s exp frac
0 10001100 11011011011010000000000

Denormalized Values

Normalized values always have this implied one, when we want to represent numbers closer to zero, that limits us.

Denormalized values is charactered by an exp field of all 0.

v=(1)sM2Ev = (-1)^sM2^E



exp = 000…0, frac = 000…0: Represents zero value, Note distinct values: +0 and -0

exp = 000…0, frac != 000…0, Numbers closest to 0.0, equispaced.

Special Values


exp = 111…111

Case: exp = 111…1, frac = 000…0

Represents value \infin

Operation that overflows

Both positive and negative

E.g., 1.00.0=1.00.0=+,1.00.0=\frac{1.0}{0.0}=\frac{-1.0}{-0.0}=+\infin, \frac{1.0}{-0.0}=-\infin

Case: exp = 111…1, frac != 000…0


Represents case when no numeric value can be determined

E.g., sqrt(-1), ,×0\infin-\infin, \infin\times0

Tiny Floating Point Example

s exp frac
1 4-bits 3-bits

8-bit Floating Point Representation

the sign bit is in the most significant bit

the next four bits are the exponent, with a bias of 7

the last three bits are the frac.

Same general form as IEEE Format

normalized, denormalized

representation of 0, NaN, infinity

Dynamic Range (Positive Only)

Special Properties of the IEEE Encoding

FP Zero Same as Integer Zero

All bits = 0

Can (Almost) Use Unsigned Integer Comparison

Must first compare sign bits

Must consider -0 = 0

Otherwise OK.

Floating Point Operations: Basic Idea

x+fy=Round(x+y)x +_fy=Round(x + y)

x×fy=Round(x×y)x \times_fy=Round(x\times y)

basic idea

First compute exact result

Make it fit into desired precision

Possibly overflow if exponent too large

Possibly round to fit into frac.


Nearest Even

1.40 1.60 1.50 2.50 -1.50
1 2 2 2 -2

If the value less than half, then round down; if the value more than half, then round up.

if the value is exactly halfway, then round towards the nearest even number

The reason they chose this is that statistically.

if we have a uniform distribution of sort of numbers, they are going to round up or down about 50% of the times.

So there won’t be a statistical bias rounding up or down one way or the other.

Rounding Binary Numbers

Binary Fractional Numbers

“Even” when least significant bit is 0

“Half way” when bits to right of rounding position = 100...2100..._2 e


Round to nearest 14\frac{1}{4}(2 bits right of binary point)

Value Binary Rounded Action Rounded Value
23322\frac{3}{32} 10.00011210.00011_2 10.00210.00_2 <1/2 down 2
23162\frac{3}{16} 10.00110 10.01 >1/2 up 2142\frac{1}{4}
2782\frac{7}{8} 10.11100 11.00 =1/2 up 3
2582\frac{5}{8} 10.10100 10.10 =1/2 down 2122\frac{1}{2}

FP Multiplication


Exact Result


  • Sign s: s1 xor s2
  • Significand M: M1 x M2
  • Exponent E: E1 + E2


  • if M >= 2, shift M right, increment E
  • if E out of range, overflow
  • Round M to fit frac precision


Biggest chore is multiplying significands.

Floating Point Addition


Exact Result


  • Sign s, significand M: result of signed align & add.
  • Exponent E: E1


  • if M >= 2, shift M right, increment E.
  • if M < 1, shift M left k positions, decrement E by k.
  • Overflow if E out of range.
  • Round M to fit frac precision.

Mathematical Properties of FP Add

This addition is not associate.

(3.14 + 1e10) - 1e10 = 0, 3.14 + (1e10 -1e10) = 3.14

Floating Point in C

C Guarantees Two Levels

  • float: single precision
  • double: double precision


Casting between int, float, and double changes bit representation.s

double/float to int

  • Truncates fractional part.
  • Like rounding toward zero
  • Not defined when out of range or NaN: Generally sets to TMin.

int to double

  • Exact conversion, as long as int has <= 53 bit word size.

int to float

  • will round according to rounding mode.