Due to practical considerations (finite memory and finite precision), computers cannot do exact arithmetic on real numbers and instead use floating point arithmetic
In most programming situations, this is not an issue, but our calculations in numerical methods can be affected
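As a quick illustration (a small Python sketch, not part of the original notes), even a simple decimal sum is not exact in binary floating point:

```python
# 0.1, 0.2, and 0.3 have no exact binary representation,
# so the computed sum differs slightly from the mathematical one
a = 0.1 + 0.2
print(a)             # 0.30000000000000004
print(a == 0.3)      # False
print(abs(a - 0.3))  # ~5.6e-17: tiny, but not zero
```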
IEEE 754 (1985) is the modern floating point number specification
A floating point number $x \in \mathbb{F}$ has the form $x = \pm (1.d_1 d_2 \cdots d_{p-1})_2 \times 2^E$, where each digit $d_i \in \{0, 1\}$, $p$ is the precision, and $E_{\min} \le E \le E_{\max}$
In these equations, $d_1 d_2 \cdots d_{p-1}$ refers to concatenation of binary digits, not multiplication
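Since Python floats are IEEE 754 doubles, the sign, exponent, and digit fields can be pulled out of the raw bits; the sketch below (illustrative only, using the standard library `struct` module) decodes $-6.25 = -(1.1001)_2 \times 2^2$:

```python
import struct

x = -6.25  # = -(1.1001)_2 * 2**2 in the form above

# Reinterpret the 64 bits of the double as an integer
bits = int.from_bytes(struct.pack(">d", x), "big")

sign     = bits >> 63              # 1 bit: 1 means negative
exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
fraction = bits & ((1 << 52) - 1)  # 52 bits: the digits d_1 d_2 ... after the leading 1

print(sign)             # 1
print(exponent - 1023)  # 2, the exponent E
print(bin(fraction)[2:].zfill(52)[:8])  # 10010000, i.e. the digits 1001 then zeros
```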
The smallest positive (normalized) number that can be stored is $2^{E_{\min}}$ and the largest is $(2 - 2^{1-p}) \cdot 2^{E_{\max}}$
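For double precision these bounds are exposed in Python through `sys.float_info` (a small sketch for illustration):

```python
import sys

print(sys.float_info.max)      # ~1.7976931348623157e+308, largest double
print(sys.float_info.min)      # ~2.2250738585072014e-308, smallest positive normalized double
print(sys.float_info.max * 2)  # inf: exceeding the largest value overflows
```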
How do we round a real number $x$ to a floating point number? We take $\mathrm{fl}(x)$ to be the closest number in $\mathbb{F}$; if $x$ is exactly equidistant between two, we take whichever has an even (zero) final digit, and if $x$ is outside the representable bounds we get an overflow or underflow
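A sketch of round-to-nearest, ties-to-even in action: $2^{53} + 1$ sits exactly halfway between two representable doubles, and the tie is broken toward the one with an even last digit:

```python
# 2**53 and 2**53 + 2 are both exactly representable doubles,
# but 2**53 + 1 lies exactly halfway between them
x = 2**53 + 1

# The tie is broken toward the neighbor whose last significand bit is even (2**53)
print(float(x) == float(2**53))      # True
print(float(x) == float(2**53 + 2))  # False
```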
Suppose $\hat{x}$ approximates $x$; the absolute error is $|\hat{x} - x|$ and the relative error is $\dfrac{|\hat{x} - x|}{|x|}$
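For example (an arbitrary approximation of $\pi$ by $22/7$, chosen just to show the two definitions):

```python
import math

x_true = math.pi   # value being approximated
x_hat  = 22 / 7    # the approximation

abs_err = abs(x_hat - x_true)    # absolute error
rel_err = abs_err / abs(x_true)  # relative error

print(abs_err)  # ~1.26e-03
print(rel_err)  # ~4.02e-04
```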
We define machine accuracy $\epsilon_{\text{mach}}$ (also known as unit roundoff) as the smallest positive floating point number that, when added to 1, produces a result different from 1
The standard format is double precision, which takes 64 bits: $p = 53$, $E_{\min} = -1022$, $E_{\max} = 1023$, and $\epsilon_{\text{mach}} = 2^{-52} \approx 2.2 \times 10^{-16}$
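Under the definition above, $\epsilon_{\text{mach}}$ for double precision can be found experimentally (a rough sketch; Python floats are IEEE 754 doubles, and the result matches `sys.float_info.epsilon`):

```python
import sys

# Halve eps until adding it to 1.0 no longer changes the result
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2

print(eps)                            # 2.220446049250313e-16
print(eps == 2**-52)                  # True
print(eps == sys.float_info.epsilon)  # True
```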
Theorem: If $x \in \mathbb{R}$ and $|x|$ lies between the smallest and largest positive numbers in $\mathbb{F}$, then $\dfrac{|\mathrm{fl}(x) - x|}{|x|} \le \epsilon_{\text{mach}}$, or in other words, the relative error of rounding is less than $\epsilon_{\text{mach}}$
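A small numerical check of this bound (a sketch, not a proof), using exact rational arithmetic via `fractions.Fraction` to measure the true rounding error of a few values:

```python
from fractions import Fraction

eps = Fraction(1, 2**52)  # machine accuracy for double precision

# Check |fl(x) - x| / |x| <= eps for a few exactly known rational values
for x_exact in [Fraction(1, 3), Fraction(22, 7), Fraction(10**20, 3)]:
    fl_x = Fraction(float(x_exact))  # fl(x), recovered exactly as a rational
    rel_err = abs(fl_x - x_exact) / abs(x_exact)
    print(float(rel_err), rel_err <= eps)  # bound holds: True in every case
```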
How do we do operations within $\mathbb{F}$? Our floating point numbers are not closed under the usual arithmetic operations, so we must define floating point versions of them
Essentially, we just round the exact result of each operation back into $\mathbb{F}$, and on an individual basis these operations behave well: each one introduces a relative error of at most $\epsilon_{\text{mach}}$
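Because each operation rounds its result back into $\mathbb{F}$, familiar algebraic identities such as associativity can fail (a small illustration):

```python
# Each addition is rounded separately, so the grouping of operations matters
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```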
However, we must be mindful that errors can propagate through a sequence of operations and sometimes become problematic, most famously when two nearly equal numbers are subtracted (catastrophic cancellation), as the sketch below shows
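The sketch uses an illustrative quadratic $x^2 + bx + c = 0$ with values chosen to force cancellation; the rearranged second formula is a standard remedy, not something from these notes:

```python
import math

# Roots of x**2 + b*x + c = 0 with b large and c small:
# one root is near -b, the other near -c/b (here about -1e-08)
b, c = 1e8, 1.0
disc = math.sqrt(b * b - 4 * c)

naive_small_root  = (-b + disc) / 2        # subtracts two nearly equal numbers
stable_small_root = (2 * c) / (-b - disc)  # algebraically equivalent, no cancellation

print(naive_small_root)   # about -7.45e-09: most digits are wrong
print(stable_small_root)  # about -1e-08: close to the true root
```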