
Why I turn on ffast-math

In this blog post I hope to convince you to turn on -ffast-math by default and introduce you to the exciting field of error analysis.

The flag -ffast-math for the C compilers GCC and Clang allows the compiler to break compliance with the IEEE-754 standard. This standard describes the rules of floating point computation. For example, $a + b + c + d$ must be interpreted as $((a \oplus b) \oplus c) \oplus d$, where $\oplus$ is the floating point approximation of addition. Ignoring this requirement lets us reorder the computation to $(a \oplus b) \oplus (c \oplus d)$. The computations $a \oplus b$ and $c \oplus d$ can be done in parallel, which can massively improve the performance of your code: for example a factor $8$ for SIMD parallelism, a factor $2$ for ILP, and a factor $4$-$8$ for multithreading.
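
To see why the standard pins down one particular order: floating point addition is not associative, so regrouping can change the result. A minimal demonstration (my own example, not from the post):

#include <stdio.h>

int main(void) {
    double left  = (0.1 + 0.2) + 0.3;   /* grouped left to right */
    double right = 0.1 + (0.2 + 0.3);   /* regrouped */
    /* On IEEE-754 doubles this typically prints
     * 0.60000000000000009 and 0.59999999999999998. */
    printf("%.17g\n%.17g\n", left, right);
    return 0;
}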

It is a logical first thought that this performance improvement must come at the cost of correctness, but fortunately this worry is unfounded. To have a meaningful discussion on correctness, we first have to define what it means to be correct. Clearly, we do not mean bitwise equality. After all, floating point numbers are approximations of real numbers, so equality is too much to ask. Instead, we should look for error bounds. The relative error between $x$ and an approximation $\hat{x}$ is the scaled difference $|(x - \hat{x}) / x|$, or equivalently $|\delta|$ for $\hat{x} = (1 + \delta)x$.
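
As a tiny concrete example (mine, not the post's): rounding a real number to single precision gives a relative error of at most $2^{-24}$.

#include <math.h>
#include <stdio.h>

int main(void) {
    double x     = 1.0 / 3.0;            /* stand-in for the real number 1/3 */
    float  x_hat = (float)(1.0 / 3.0);   /* its single-precision approximation */
    /* Prints roughly 3e-08, below 2^-24 which is about 6e-08. */
    printf("Relative error %e\n", fabs((x - (double)x_hat) / x));
    return 0;
}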

Error of Summation

If we compute the relative error for summing $10^9$ random numbers in $[0, 1)$ (sum.c) in a standard-compliant way, we get a huge relative error:


$ gcc -O3 sum.c 
$ ./a.out 1000000000
Relative error 9.664454e-01
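
The post does not include the source of sum.c; the following is a minimal sketch consistent with the output above, and every detail of it is my own reconstruction: fill an array with random floats in $[0, 1)$, sum them with a single float accumulator, and compare against a double-precision reference.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    long n = argc > 1 ? atol(argv[1]) : 1000;
    float *x = malloc(n * sizeof *x);
    for (long i = 0; i < n; i++)
        x[i] = (float)rand() / ((float)RAND_MAX + 1.0f);   /* uniform in [0, 1) */

    float sum = 0.0f;                    /* single-precision accumulator */
    for (long i = 0; i < n; i++)
        sum += x[i];

    double reference = 0.0;              /* higher-precision reference value */
    for (long i = 0; i < n; i++)
        reference += x[i];

    printf("Relative error %e\n", fabs((reference - sum) / reference));
    free(x);
    return 0;
}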

Using -ffast-math the compiler will generate SIMD instructions. This is faster, and more accurate.


$ gcc -O3 -ffast-math -mavx2 -mfma sum.c
$ ./a.out 1000000000
Relative error 7.315683e-01

We can explain this as follows. For floating point addition we have $a \oplus b = (1 + \delta)(a + b)$ with $|\delta| < 2^{-24}$, so the error in an update $\texttt{sum} = \texttt{sum} \oplus x[i]$ is proportional to the variable $\texttt{sum}$ we accumulate the partial sums in. If we let GCC use SIMD instructions, it accumulates partial sums in four variables, corresponding to the SIMD lanes, and then sums these accumulators at the end: $(x[0] \oplus x[4] \oplus \cdots) \oplus (x[1] \oplus x[5] \oplus \cdots) \oplus (x[2] \oplus x[6] \oplus \cdots) \oplus (x[3] \oplus x[7] \oplus \cdots)$. We now only do four additions on an accumulator larger than a quarter of the sum, instead of $\frac{3}{4}n$.
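
In scalar C, the reassociated sum that the vectorizer effectively computes looks like the sketch below. This is only an illustration of the idea, not GCC's actual code generation (assume $n$ is a multiple of 4):

/* Stride-4 summation with four independent accumulators. */
float sum4(const float *x, long n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (long i = 0; i < n; i += 4) {
        s0 += x[i + 0];   /* lane 0: x[0] ⊕ x[4] ⊕ ... */
        s1 += x[i + 1];   /* lane 1: x[1] ⊕ x[5] ⊕ ... */
        s2 += x[i + 2];   /* lane 2: x[2] ⊕ x[6] ⊕ ... */
        s3 += x[i + 3];   /* lane 3: x[3] ⊕ x[7] ⊕ ... */
    }
    /* Each accumulator holds only about a quarter of the total, so each
     * update's error is proportional to a smaller partial sum. */
    return (s0 + s1) + (s2 + s3);
}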

We can break associativity even further by using multithreading, which gives a far larger jump in accuracy.


$ gcc -O3 -ffast-math -mavx2 -mfma -fopenmp sum.c
$ ./a.out 1000000000
Relative error 1.791976e-06
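
For -fopenmp to change anything, the summation loop needs an OpenMP directive; the sum.c sketch above omits it, but presumably the loop carries a reduction pragma along these lines (the exact form is my assumption):

/* Each thread accumulates a private partial sum; OpenMP combines them at the
 * end, so no single accumulator grows anywhere near the full total. */
float parallel_sum(const float *x, long n) {
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i];
    return sum;
}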

A classical worst-case error analysis assumes consecutive rounding errors add up. For the previous examples this was the case, because the additions are rounded down when $\texttt{sum} \gg x[i]$, giving all $\delta$ the same negative sign. For this multithreaded program, however, the accumulators never grow so large and the rounding is much more random. This gives rounding errors the chance to cancel each other out.

Always Better?

Breaking IEEE this way reduces the worst-case error bound. That does not mean it is better on all inputs. As a counterexample to the SIMD sum, consider the repeating sequence $0.5, 1.0, -1.5, 0.0, 0.5, 1.0, -1.5, 0.0, \cdots$. Summing this left to right yields the exact sum $0$, while the SIMD lanes will sum $n$ values to $\min(n / 2, 2^{23})$, $\min(n, 2^{24} - 1)$, $-\min(1.5 n, 2^{25})$, and $0$. We will get a wrong result for $n$ large enough. I will leave it as an exercise to the reader to come up with a counterexample that is actually realistic. (Maths-speak for saying I cannot do it.)
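
A small program (my own, not from the post) makes this concrete: summed left to right the pattern cancels exactly, while four stride-4 accumulators saturate once $n$ is large enough.

#include <stdio.h>

/* Repeating pattern 0.5, 1.0, -1.5, 0.0 whose exact sum is 0. */
static float value(long i) {
    static const float pattern[4] = {0.5f, 1.0f, -1.5f, 0.0f};
    return pattern[i % 4];
}

int main(void) {
    long n = 1L << 28;        /* large enough for the lanes to saturate */

    float serial = 0.0f;
    for (long i = 0; i < n; i++)
        serial += value(i);   /* left to right: partial sums stay small, result is 0 */

    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (long i = 0; i < n; i++)
        lane[i % 4] += value(i);   /* each lane only ever sees one of the four values */
    float reassociated = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    printf("serial %g, reassociated %g\n", serial, reassociated);
    return 0;
}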

There are also algorithms, such as compensated summation, that exploit properties of floating point arithmetic which turning on -ffast-math will break. I suggest placing these algorithms in a separate file, compiled without -ffast-math, and linking them in.
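
Kahan (compensated) summation is the classic example: it reconstructs the rounding error of each addition, and under -ffast-math the compiler is free to simplify the compensation away. A sketch, meant to live in its own file compiled without -ffast-math:

/* kahan.c: compile without -ffast-math, e.g. gcc -O2 -c kahan.c, and link it in. */
float kahan_sum(const float *x, long n) {
    float sum = 0.0f;
    float c = 0.0f;               /* running compensation for lost low-order bits */
    for (long i = 0; i < n; i++) {
        float y = x[i] - c;       /* corrected next term */
        float t = sum + y;        /* low-order bits of y are lost here ... */
        c = (t - sum) - y;        /* ... and recovered here; fast-math may fold this to 0 */
        sum = t;
    }
    return sum;
}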

Conclusion

The takeaway is not that -ffast-math is always more accurate, but that not using -ffast-math gives a false sense of security. If you don't do the error analysis, you're not safe. If you're not safe, you may as well be fast.

Further Reading

(As you may be able to tell, I am a big fan of Nicholas Higham's writing.)