Due to practical considerations, computers cannot do exact arithmetic on real numbers; floating point math is only approximate
In most programming situations, this is not an issue, but our calculations in numerical methods can be affected

IEEE 754 (1985) is the modern floating point number specification. It stores numbers in the normalized binary form

$x = \pm (1.d_1 d_2 \ldots d_{p-1})_2 \times 2^E$

where $d_i \in \{0, 1\}$, $E \in \mathbb{Z}$, and $E_{\min} \le E \le E_{\max}$; we write $\mathbb{F}$ for the set of all such representable numbers

In these equations, $d_1 d_2 \ldots d_{p-1}$ refers to concatenation of binary digits, not multiplication

The smallest positive number that can be stored is $2^{E_{\min}}$ and the largest is $(1.11\ldots1)_2 \times 2^{E_{\max}} = (2 - 2^{1-p}) \times 2^{E_{\max}}$
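
As a concrete illustration, here is a minimal Python sketch that pulls apart the stored bits of a number, assuming the IEEE 754 64-bit layout (1 sign bit, 11 exponent bits stored with bias 1023, and 52 fraction bits); this is the double precision format described below, and the helper name `fp_parts` is just for illustration:

```python
import struct

def fp_parts(x: float) -> tuple[int, int, int]:
    """Split a 64-bit double into its sign bit, biased exponent, and fraction bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern of x
    sign = bits >> 63                                     # 1 sign bit
    exponent = (bits >> 52) & 0x7FF                       # 11 exponent bits (biased by 1023)
    fraction = bits & ((1 << 52) - 1)                     # 52 fraction bits d_1 d_2 ... d_52
    return sign, exponent, fraction

# 6.5 = +(1.101)_2 x 2^2, so the sign is 0, E = 2, and the fraction begins 1010...
s, e, f = fp_parts(6.5)
print(s, e - 1023, format(f, "052b")[:8])   # 0 2 10100000
```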

How do we round a real number $x$ to a floating point number $\mathrm{fl}(x)$? We set it to the closest floating point number, and if $x$ is exactly halfway between two then we pick the one whose last digit is even (round half to even); if $x$ is outside the representable range we get an overflow or underflow
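
A minimal Python check of this rule (assuming, as on essentially every platform, that Python's `float` is an IEEE 754 double): $1 + 2^{-53}$ lies exactly halfway between the neighboring floating point numbers $1$ and $1 + 2^{-52}$, and the tie is broken toward the neighbor whose last digit is even, which is $1$:

```python
# 1 + 2**-53 is exactly halfway between the doubles 1 and 1 + 2**-52;
# round half to even picks 1.0, whose last mantissa bit is 0 (even)
print(1.0 + 2**-53 == 1.0)    # True: the halfway case rounds down to 1.0
print(1.0 + 2**-52 == 1.0)    # False: this sum is exactly representable, no rounding needed

# going past the largest representable double overflows to infinity
print(1.7976931348623157e308 * 2)   # inf
```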

Suppose $\hat{x}$ approximates $x$; the absolute error is $|\hat{x} - x|$ and the relative error is $|\hat{x} - x| / |x|$
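
For example, take the classic approximation $\hat{x} = 22/7$ for $x = \pi$ (the particular choice is just for illustration):

```python
import math

x = math.pi      # the value being approximated
x_hat = 22 / 7   # the approximation
print(abs(x_hat - x))           # absolute error, ~1.26e-03
print(abs(x_hat - x) / abs(x))  # relative error, ~4.02e-04
```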

We define machine accuracy (also known as unit roundoff), denoted $\varepsilon$, as the smallest positive floating point number that, when added to 1, produces a different number
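
We can estimate $\varepsilon$ directly by halving a candidate until adding it to 1 no longer changes the result; this sketch assumes Python's `float` is an IEEE 754 double:

```python
eps = 1.0
while 1.0 + eps / 2 != 1.0:   # keep halving while eps/2 is still visible next to 1
    eps /= 2

print(eps)                    # 2.220446049250313e-16, i.e. 2**-52
print(1.0 + eps == 1.0)       # False: adding eps to 1 produces a different number
print(1.0 + eps / 2 == 1.0)   # True: eps/2 is lost to rounding
```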

The standard choice is called double precision, which takes 64 bits: $p = 53$ (1 sign bit, 11 exponent bits, 52 stored fraction bits), $E_{\min} = -1022$, $E_{\max} = 1023$, and $\varepsilon = 2^{-52} \approx 2.2 \times 10^{-16}$
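
These parameters can be read off at runtime; Python's standard library exposes them through `sys.float_info`:

```python
import sys

fi = sys.float_info
print(fi.mant_dig)            # 53: the precision p (52 stored fraction bits plus the implicit leading 1)
print(fi.epsilon == 2**-52)   # True: machine accuracy
print(fi.min)                 # 2.2250738585072014e-308 == 2**-1022, smallest positive normalized number
print(fi.max)                 # 1.7976931348623157e+308 ~ (2 - 2**-52) * 2**1023, the largest number
```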

Theorem: If $x \in \mathbb{R}$ and $|x|$ lies between the smallest and largest positive numbers in $\mathbb{F}$, then $\mathrm{fl}(x) = x(1 + \delta)$ for some $|\delta| < \varepsilon$, or in other words, the relative error of rounding is less than $\varepsilon$
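
A quick numerical check of the theorem, using exact rational arithmetic for the reference value (the choice $x = 1/10$ is just illustrative, picked because it is not exactly representable in binary):

```python
from fractions import Fraction

eps = 2**-52
x_true = Fraction(1, 10)           # the real number x = 0.1, held exactly
x_fl = Fraction(0.1)               # fl(x): the double nearest to 0.1, as an exact rational
delta = (x_fl - x_true) / x_true   # fl(x) = x * (1 + delta)
print(abs(delta) < eps)            # True
print(float(abs(delta)))           # ~5.55e-17, well below eps ~ 2.22e-16
```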

How do we do operations within $\mathbb{F}$? Our floating point numbers are not closed under regular operations, so we must define floating point operations

Essentially, we just round the result of each exact operation to keep the result within $\mathbb{F}$, i.e. $x \oplus y = \mathrm{fl}(x + y)$ (and similarly for subtraction, multiplication, and division); on an individual basis these operations behave well in terms of error, since by the theorem above each one carries a relative error less than $\varepsilon$
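
For instance, the familiar result that $0.1 + 0.2 \ne 0.3$ in double precision is exactly this rounding at work; the single addition is still accurate to within $\varepsilon$ relative to the exact sum of the two stored operands (a small Python check, using exact rationals for the reference):

```python
from fractions import Fraction

eps = 2**-52
a, b = 0.1, 0.2
s = a + b                              # the floating point operation fl(a + b)
exact = Fraction(a) + Fraction(b)      # exact sum of the two stored operands
delta = (Fraction(s) - exact) / exact  # s = (a + b) * (1 + delta)
print(s == 0.3)                        # False: the rounded sum is not the double closest to 0.3
print(abs(delta) < eps)                # True: the individual operation is still accurate
```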

However, we must be mindful that errors can propagate and sometimes become problematic
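
A classic way this shows up is catastrophic cancellation: subtracting two nearly equal numbers wipes out the leading digits on which they agree and leaves mostly rounding error behind. A small sketch (the function $\sqrt{x + 1} - \sqrt{x}$ and the value of $x$ are just illustrative):

```python
import math

x = 1.0e12   # any sufficiently large x shows the same effect

# naive form: sqrt(x + 1) and sqrt(x) agree in roughly their first 13 digits,
# so the subtraction cancels those digits and amplifies the rounding error
naive = math.sqrt(x + 1) - math.sqrt(x)

# algebraically identical form that avoids subtracting nearly equal numbers
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

print(naive)    # ~5.00004e-07: only about 6 of the 16 digits are correct
print(stable)   # ~5.00000e-07: accurate to nearly full double precision
```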