Due to practical considerations (finite memory and finite precision), computers cannot do exact arithmetic on real numbers and instead use floating point arithmetic
In most programming situations, this is not an issue, but our calculations in numerical methods can be affected
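As a quick illustration (a small Python sketch, not part of the original notes), even a simple decimal sum is not exact in binary floating point:

```python
# 0.1, 0.2, and 0.3 have no exact binary representation,
# so the computed sum differs slightly from the mathematical one
a = 0.1 + 0.2
print(a)             # 0.30000000000000004
print(a == 0.3)      # False
print(abs(a - 0.3))  # ~5.6e-17: tiny, but not zero
```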
IEEE 754 (1985) is the modern floating point number specification
A floating point number $x \in \mathbb{F}$ has the form $x = \pm (1.d_1 d_2 \cdots d_{p-1})_2 \times 2^E$, where each digit $d_i \in \{0, 1\}$, $p$ is the precision, and $E_{\min} \le E \le E_{\max}$
In these equations, $d_1 d_2 \cdots d_{p-1}$ refers to concatenation of binary digits, not multiplication
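Since Python floats are IEEE 754 doubles, the sign, exponent, and digit fields can be pulled out of the raw bits; the sketch below (illustrative only, using the standard library `struct` module) decodes $-6.25 = -(1.1001)_2 \times 2^2$:

```python
import struct

x = -6.25  # = -(1.1001)_2 * 2**2 in the form above

# Reinterpret the 64 bits of the double as an integer
bits = int.from_bytes(struct.pack(">d", x), "big")

sign     = bits >> 63              # 1 bit: 1 means negative
exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
fraction = bits & ((1 << 52) - 1)  # 52 bits: the digits d_1 d_2 ... after the leading 1

print(sign)             # 1
print(exponent - 1023)  # 2, the exponent E
print(bin(fraction)[2:].zfill(52)[:8])  # 10010000, i.e. the digits 1001 then zeros
```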
The smallest positive (normalized) number that can be stored is $2^{E_{\min}}$ and the largest is $(2 - 2^{1-p}) \cdot 2^{E_{\max}}$
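For double precision these bounds are exposed in Python through `sys.float_info` (a small sketch for illustration):

```python
import sys

print(sys.float_info.max)      # ~1.7976931348623157e+308, largest double
print(sys.float_info.min)      # ~2.2250738585072014e-308, smallest positive normalized double
print(sys.float_info.max * 2)  # inf: exceeding the largest value overflows
```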
How do we round a real number $x$ to a floating point number? We take $\mathrm{fl}(x)$ to be the closest number in $\mathbb{F}$; if $x$ is exactly equidistant between two, we take whichever has an even (zero) final digit, and if $x$ is outside the representable bounds we get an overflow or underflow
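A sketch of round-to-nearest, ties-to-even in action: $2^{53} + 1$ sits exactly halfway between two representable doubles, and the tie is broken toward the one with an even last digit:

```python
# 2**53 and 2**53 + 2 are both exactly representable doubles,
# but 2**53 + 1 lies exactly halfway between them
x = 2**53 + 1

# The tie is broken toward the neighbor whose last significand bit is even (2**53)
print(float(x) == float(2**53))      # True
print(float(x) == float(2**53 + 2))  # False
```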
Suppose $\hat{x}$ approximates $x$; the absolute error is $|\hat{x} - x|$ and the relative error is $\dfrac{|\hat{x} - x|}{|x|}$
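For example (an arbitrary approximation of $\pi$ by $22/7$, chosen just to show the two definitions):

```python
import math

x_true = math.pi   # value being approximated
x_hat  = 22 / 7    # the approximation

abs_err = abs(x_hat - x_true)    # absolute error
rel_err = abs_err / abs(x_true)  # relative error

print(abs_err)  # ~1.26e-03
print(rel_err)  # ~4.02e-04
```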
We define machine accuracy $\epsilon_{\text{mach}}$ (also known as unit roundoff) as the smallest positive floating point number that, when added to 1, produces a result different from 1
The standard format is double precision, which takes 64 bits: $p = 53$, $E_{\min} = -1022$, $E_{\max} = 1023$, and $\epsilon_{\text{mach}} = 2^{-52} \approx 2.2 \times 10^{-16}$
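Under the definition above, $\epsilon_{\text{mach}}$ for double precision can be found experimentally (a rough sketch; Python floats are IEEE 754 doubles, and the result matches `sys.float_info.epsilon`):

```python
import sys

# Halve eps until adding it to 1.0 no longer changes the result
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2

print(eps)                            # 2.220446049250313e-16
print(eps == 2**-52)                  # True
print(eps == sys.float_info.epsilon)  # True
```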
Theorem: If $x \in \mathbb{R}$ and $|x|$ lies between the smallest and largest positive numbers in $\mathbb{F}$, then $\dfrac{|\mathrm{fl}(x) - x|}{|x|} \le \epsilon_{\text{mach}}$, or in other words, the relative error of rounding is less than $\epsilon_{\text{mach}}$
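A small numerical check of this bound (a sketch, not a proof), using exact rational arithmetic via `fractions.Fraction` to measure the true rounding error of a few values:

```python
from fractions import Fraction

eps = Fraction(1, 2**52)  # machine accuracy for double precision

# Check |fl(x) - x| / |x| <= eps for a few exactly known rational values
for x_exact in [Fraction(1, 3), Fraction(22, 7), Fraction(10**20, 3)]:
    fl_x = Fraction(float(x_exact))  # fl(x), recovered exactly as a rational
    rel_err = abs(fl_x - x_exact) / abs(x_exact)
    print(float(rel_err), rel_err <= eps)  # bound holds: True in every case
```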
How do we do operations within $\mathbb{F}$? Our floating point numbers are not closed under the usual arithmetic operations, so we must define floating point versions of them
Essentially, we just round the exact result of each operation back into $\mathbb{F}$, and on an individual basis these operations behave well: each one introduces a relative error of at most $\epsilon_{\text{mach}}$
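Because each operation rounds its result back into $\mathbb{F}$, familiar algebraic identities such as associativity can fail (a small illustration):

```python
# Each addition is rounded separately, so the grouping of operations matters
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```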
However, we must be mindful that errors can propagate through a sequence of operations and sometimes become problematic, most famously when two nearly equal numbers are subtracted (catastrophic cancellation), as the sketch below shows
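The sketch uses an illustrative quadratic $x^2 + bx + c = 0$ with values chosen to force cancellation; the rearranged second formula is a standard remedy, not something from these notes:

```python
import math

# Roots of x**2 + b*x + c = 0 with b large and c small:
# one root is near -b, the other near -c/b (here about -1e-08)
b, c = 1e8, 1.0
disc = math.sqrt(b * b - 4 * c)

naive_small_root  = (-b + disc) / 2        # subtracts two nearly equal numbers
stable_small_root = (2 * c) / (-b - disc)  # algebraically equivalent, no cancellation

print(naive_small_root)   # about -7.45e-09: most digits are wrong
print(stable_small_root)  # about -1e-08: close to the true root
```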