In this blog post I hope to convince you to turn on -ffast-math
by default and introduce you to the exciting field of error analysis.
The flag -ffast-math for the C compilers GCC and Clang allows the compiler to break compliance with the IEEE-754 standard. This standard describes the rules of floating point computation. For example, $x_1 + x_2 + x_3 + x_4$ must be interpreted as $((x_1 \oplus x_2) \oplus x_3) \oplus x_4$, where $\oplus$ is the floating point approximation of addition. Ignoring this requirement lets us reorder the computation to $(x_1 \oplus x_2) \oplus (x_3 \oplus x_4)$.
The computations $x_1 \oplus x_2$ and $x_3 \oplus x_4$ can be done in parallel, which can massively improve the performance of your code: a factor equal to the SIMD width (eight for AVX2 with single-precision floats) from SIMD parallelism, a further small factor from instruction-level parallelism, and a factor up to the number of cores from multithreading.
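To make the reordering concrete, here is a minimal sketch (my own, not from the post) contrasting the strictly left-to-right sum that IEEE-754 mandates with a reassociated version using two independent partial sums, which is the kind of transformation -ffast-math licenses.

```c
#include <stddef.h>

/* Strict IEEE-754 evaluation: one chain of dependent additions. */
float sum_strict(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];          /* each add must wait for the previous one */
    return s;
}

/* Reassociated evaluation: two independent chains the CPU can overlap. */
float sum_reassoc(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i];         /* these two adds have no data dependency */
        s1 += x[i + 1];
    }
    if (i < n) s0 += x[i];  /* leftover element for odd n */
    return s0 + s1;         /* combine the partial sums */
}
```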
It is a logical first thought that this performance improvement must come at the cost of correctness, but fortunately this worry is unfounded. To have a meaningful discussion on correctness, we first have to define what it means to be correct. Clearly, we do not mean bitwise equality. After all, floating point numbers are approximations of real numbers, so equality is too much to ask. Instead, we should look for error bounds. The relative error between $y$ and an approximation $\hat{y}$ is the scaled difference $|\hat{y} - y| / |y|$, or equivalently the $|\delta|$ in $\hat{y} = y(1 + \delta)$, for $y \neq 0$.
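To make the definition concrete, here is a tiny helper (my own, not from the post) that computes it:

```c
#include <math.h>

/* Relative error |approx - exact| / |exact|; assumes exact is nonzero. */
double rel_err(double exact, double approx) {
    return fabs(approx - exact) / fabs(exact);
}
```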
If we compute the relative error for summing $10^9$ random single-precision numbers in $[0, 1)$ (sum.c) in a standard-compliant way, we get a huge relative error:
$ gcc -O3 sum.c
$ ./a.out 1000000000
Relative error 9.664454e-01
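The post does not show the contents of sum.c, so the following is only a plausible reconstruction: it fills an array with uniform random floats, sums them in single precision, and compares against a double-precision reference. The actual file may differ in details such as the random distribution.

```c
/* sketch of sum.c: sum n random floats and report the relative error */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    size_t n = argc > 1 ? strtoull(argv[1], NULL, 10) : 1000000;
    float *x = malloc(n * sizeof *x);
    if (!x) return 1;

    double exact = 0.0;                       /* double accumulator as reference */
    for (size_t i = 0; i < n; i++) {
        x[i] = rand() / (RAND_MAX + 1.0f);    /* uniform in [0, 1) */
        exact += x[i];
    }

    float sum = 0.0f;                         /* the loop the compiler may vectorize */
    for (size_t i = 0; i < n; i++)
        sum += x[i];

    printf("Relative error %e\n", fabs((double)sum - exact) / fabs(exact));
    free(x);
    return 0;
}
```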
Using -ffast-math
the compiler will generate SIMD instructions.
This is faster, and more accurate.
$ gcc -O3 -ffast-math -mavx2 -mfma sum.c
$ ./a.out 1000000000
Relative error 7.315683e-01
We can explain this as follows. For floating point addition we have $a \oplus b = (a + b)(1 + \delta)$ with $|\delta| \leq u$, the unit roundoff, so the error of an update sum = sum + x[i] is proportional to the magnitude of the variable sum we accumulate the partial sums in. If we let GCC use SIMD instructions, it accumulates partial sums in eight variables, corresponding to the eight float lanes of an AVX2 register, and then sums these accumulators at the end:

$$s_j = x_j \oplus x_{j+8} \oplus x_{j+16} \oplus \cdots, \qquad \mathrm{sum} = s_1 \oplus s_2 \oplus \cdots \oplus s_8.$$

We now only do seven additions on an accumulator larger than an eighth of the sum, instead of $7n/8$ of them.
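In effect the vectorized loop behaves like the following scalar sketch with eight strided accumulators (my illustration of the transformation, not actual compiler output):

```c
#include <stddef.h>

/* What the AVX2-vectorized sum effectively computes:
   eight interleaved partial sums, combined at the end. */
float sum_eight_lanes(const float *x, size_t n) {
    float lane[8] = {0};
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (int j = 0; j < 8; j++)
            lane[j] += x[i + j];   /* lane j sees x[j], x[j+8], x[j+16], ... */
    float sum = 0.0f;
    for (int j = 0; j < 8; j++)
        sum += lane[j];            /* reduce the eight accumulators */
    for (; i < n; i++)
        sum += x[i];               /* remaining tail elements */
    return sum;
}
```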
We can break associativity even further by using multithreading, which gives a far larger jump in accuracy.
$ gcc -O3 -ffast-math -mavx2 -mfma -fopenmp sum.c
$ ./a.out 1000000000
Relative error 1.791976e-06
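With -fopenmp the same idea extends across threads. The post does not show the OpenMP part of sum.c; a sketch assuming it uses a standard reduction clause, where each thread keeps a private partial sum that is combined at the end, looks like this:

```c
#include <stddef.h>

/* Each thread accumulates a private partial sum; OpenMP combines them. */
float sum_openmp(const float *x, size_t n) {
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```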
A classical worst-case error analysis assumes consecutive rounding errors add up. For the previous examples this was the case, because once the accumulator exceeds $2^{24}$ every added term from $[0, 1)$ is less than half an ulp and the addition is rounded down, giving all the rounding errors the same negative sign. For this multithreaded program, however, the accumulators never grow so large and the rounding is much more random. This gives rounding errors the chance to cancel each other out.
Breaking IEEE this way reduces the worst-case error bound. That does not mean it is better on all input. As a counterexample to the SIMD sum, consider the repeating block $1, 1, 1, 1, 1, 1, 1, -7$. Summing this left to right is exact, because the running total never leaves $[0, 7]$, and yields $0$ whenever $n$ is a multiple of eight. With eight SIMD lanes, however, seven of the accumulators see nothing but $1$s and stagnate at $2^{24}$, so we will get a wildly wrong result for $n$ large enough. I will leave it as an exercise to the reader to come up with a counterexample that is actually realistic. (Maths-speak for saying I cannot do it.)
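A tiny self-contained check of this counterexample (again my own illustration, compiled without -ffast-math so both loops run exactly as written): it simulates the eight-lane accumulation in plain C and compares it with the left-to-right sum.

```c
#include <stdio.h>

int main(void) {
    const size_t n = 1u << 30;             /* number of terms, a multiple of 8 */
    /* repeating block 1,1,1,1,1,1,1,-7: the exact sum is 0 */

    float strict = 0.0f, lane[8] = {0};
    for (size_t i = 0; i < n; i += 8) {
        for (int j = 0; j < 7; j++) {
            strict += 1.0f;                /* left-to-right order */
            lane[j] += 1.0f;               /* lane j of the SIMD version */
        }
        strict += -7.0f;
        lane[7] += -7.0f;
    }
    float simd = 0.0f;
    for (int j = 0; j < 8; j++)
        simd += lane[j];                   /* reduce the eight lanes */

    printf("left-to-right: %f\n", strict); /* exact: prints 0 */
    printf("eight lanes:   %f\n", simd);   /* far from 0 for this n */
    return 0;
}
```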
There are also algorithms, such as compensated summation, that exploit properties of floating point arithmetic which -ffast-math breaks. I suggest placing these algorithms in a separate file, compiled without the flag, and linking them in.
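For reference, the classic Kahan compensated summation looks like this. The compensation step (t - sum) - y is exactly the kind of expression that -ffast-math may simplify to zero, so this file must be compiled without the flag.

```c
#include <stddef.h>

/* Kahan compensated summation: tracks the rounding error of each addition
   in c and feeds it back into the next term. Requires strict IEEE semantics;
   with -ffast-math the compiler may simplify (t - sum) - y to 0. */
float sum_kahan(const float *x, size_t n) {
    float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;        /* corrected next term */
        float t = sum + y;         /* low-order bits of y are lost here */
        c = (t - sum) - y;         /* recover the lost part (negated) */
        sum = t;
    }
    return sum;
}
```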
The takeaway is not that -ffast-math
is always more accurate,
but that not using -ffast-math
gives a false sense of security.
If you don't do the error analysis, you're not safe. If you're not safe,
you may as well be fast.