Sensitivity to C math library and mingw-w64 v12

C math library functions, such as exp or sin, are heavily used by R and packages. The C standard doesn’t require these functions to be “precise”. Instead, there is room for performance optimizations causing a reasonable amount of inaccuracy. The results differ between platforms and may change even on a single platform. This happened with mingw-w64 v12, the SDK/runtime used by GCC and LLVM on Windows, which reduced accuracy of many math functions (for performance and maintenance reasons).

Around 20 R CRAN packages ended up failing their tests as a result of this and as these issues are hard to debug, Rtools45 ended up using an older version of mingw-w64. Rtools will have to switch to current mingw-w64 eventually, but staying with v11 for now should allow time to resolve the issues.

This text reports on the initial work on debugging these issues and presents several types of problems encountered in packages. While this work is on Windows, the problems in packages are general and potentially affect all platforms: some packages are clearly too sensitive to even small inaccuracies in math library functions.

Debugging numerical issues

This is not the first time I’ve ran into R packages failing their tests due to unexpected numerical results. Debugging these issues is much harder than good old correctness bugs, where one can relatively easily tell what is correct and what is wrong, and accordingly narrow down the test to a case that is small enough to find the cause and to decide which software component is to blame and needs to be fixed.

With the numerical issues, I often end up being told by the package authors that the compiler (or math library, etc) is wrong and that I should report it as a bug. But such a bug report would have to come with solid argumentation that the behavior is wrong, ideally that it violates a standard the code should follow.

Neither the C standard nor any other linked standard require “precise” (correctly rounded) results of math functions. They don’t even require any accuracy limits for them. Therefore, it is not realistic one would succeed reporting inaccuracies to the provider of math functions, unless the inaccuracies are very big (and this observation is based on experience, discussions with mingw-w64 developers).

Reporting a small inaccuracy as a “non-reproducibility” issue is not working, either: outside the R community, the priorities are set differently and performance improvements within what is allowed by the standards are welcome. From a testing (computer science) viewpoint, it may be actually beneficial if systems change in ways that are allowed within the standards, because it increases the chances true bugs are found in the software.

The common accuracy metric for the math library is ULP (“units in the last place”, “units of least precision”), and one usually looks for the maximum difference from the correctly rounded results across the whole domain of a function. The general understanding is that several ULPs difference is fine, but some functions (and perhaps in some domains) are more expensive to get right than other, so the expectations may also depend on the function. [Gladman et al, 2025] has more details, references and measurements.

If a big inaccuracy was found in some math function, one could report it, but indeed one needs to know which function, how inaccurate, and for which inputs. So, the first goal of my debugging was to find out which math functions cause failures of which packages, and how inaccurate these functions really are for the inputs used.

Mingw-w64 provides a static library, mingwex, with internal implementations of some math functions. The remaining ones are used from UCRT, the Windows C runtime. Mingw-w64 v12 simply reduced the amount of functions with internal implementations, about 90 functions were switched to UCRT.

While testing a release candidate of Rtools45, which were to use mingw-w64 v12, 21 CRAN packages were identified as failing their tests because of mingw-w64 v12, and it seemed likely that most of these because of numerical differences (from the outputs of the failed tests, and also from the commit logs of mingw-w64 v12). These 21 CRAN packages required 430 dependent packages (from CRAN and Bioconductor) with a total library size of about 1G when installed.

I’ve made some initial experiments manually, using debuggers, profilers, reading the code of the failing tests. Also I’ve built R and packages with fdlibm, which provides more accurate implementations of some of the functions. I’ve built R selectively with some math functions from fdlibm, then also with some math functions from mingwex. This way I was able to debug some of the packages, but it was taking way too long.

To avoid frequent re-building of that 1G of code and simplify the debugging, I’ve wrapped the math functions (ones switched in v12 to UCRT) by a dynamic library and made all of R and packages use that library. I’ve then experimented with different implementations of that dynamic library, switching between math function implementations and reporting differences, sometimes making custom changes to the dynamic library e.g. to enter a debugger on some special input value.

So far I’ve been able to confirm that 14 of those 21 packages were failing because of the new math functions in v12 and almost always because of their reduced accuracy. On the other hand, looking into the causes revealed that the inaccuracies involved were never above 1 ULP. In 13 cases, the problems were in the packages (not in the library). One case is debatable (perhaps in the compiler, more below).

The rest of the text covers the problems found in packages. I’m using simplified code examples/pseudocode and intentionally not naming the individual packages - that could have been be unfair given that likely many other have the same problems, but just haven’t been discovered by this exercise. Also, the original code would be too long for presentation.

Requiring an exact numerical result

One of the packages made an assumption that could be narrowed down to:

x1 <- -log(-expm1(-0.5))
x2 <- -log1p(-exp(-x1))/0.5
x2 == 1

This used to hold with mingw-w64 v11, but no longer with v12 (due to reduced accuracy of expm1). The difference was in the very last bit:

> sprintf("%a", x2)  # v11
[1] "0x1p+0"
>  sprintf("%a", x2) # v12
[1] "0x1.0000000000001p+0"

And the error in the involved expm1 result is 1 ULP. The correct result can be computed using mpfr library, from R using package Rmpfr.

The author of the package knows very well that checks for numerical results should come with a reasonable tolerance limit, but has just forgotten in this case, and the package used to be working for many years on three different platforms. This change in mingw-w64 helped to find this bug.

While this input to expm1 lead to 1 ULP error, by random search I found input for which it was 3 ULP. [Gladman et al, 2025] reports 3.06. I learned this is still something regarded as acceptable by mingw-w64 developers. The paper reports inaccuracies also of other UCRT functions (my understanding is that “MSVC” in the paper boils down to UCRT).

Another package used tolerance limits for comparisons of equality, but forgot to use them for less-than/greater-than comparisons of computed results, which has lead to a test failure (due to differences up to 1 ULP in a math function).

Requiring a too precise result

Using a too low tolerance in tests was a common problem and not necessarily surprising. A number of packages were failing tests due to unexpected numerical results, caused by changes in atan2, exp, log, pow, sin and sqrt. But, all these changes compared to v11 were within 1 ULP.

Such problem could be caused either by an unrealistically narrow tolerance limit (the tests should be updated), or by a numerically unsound algorithm/implementation (the code is wrong and should be fixed).

At least two of the packages provided different tests (different expectations and/or tolerances) for different platforms, apparently based on what has been observed on those platforms (e.g. differentiating Windows, macOS, Linux, once even looking at which Linux distribution). I think that instead one should have a single set of expectations and tolerances, which should be used on all platforms. If this makes the testing too weak (the limits are too wide), perhaps the algorithms should be re-visited. Or, put the other way, testing on different platforms is a kind of sensitivity analysis that reveals a lower bound on realistic tolerances.

One package allowed very different results on different platforms - not just differences in some low-order digits, but one test have accepted that the same computation on one platform gave ~600 and on another ~1200. A sign of that the algorithm/implementation is wrong.

Branching based on numerical result

One package included code with conditions based on comparisons of already computed numerical result. While the package carefully rounded final results before comparing them to expected values to allow for some implementation-specific differences, the branches were simple/exact comparisons. The code with this problem could look like

  x <- sin(y)
  if (x > 0.543) z <- 12.123
  else z <- 157.51

Due to small differences, again within 1 ULP from mingw-w64 v11, different branches were takes with v12, providing results that did not satisfy the tests.

In some cases, but not all, the differences were an NA (as a default value) vs NaN as a computed value. One should be aware that NA could become NaN in computations (more in ?is.nan) and be prepared for that in the tests. But, the problem here seems bigger as some differences involved finite values: the algorithm will have to be adjusted.

Failures due to constant folding

This is perhaps the most interesting and surprising problem I’ve ran into, and one where it seems questionable whether the problem is actually in the package. And one that I’ve debugged fully, having to read the x86_64 disassembly of the involved code.

Imagine code like:

double f(double x, double y)
{
  if (x == y)
    z = sqrt(sin(x) - sin(y));
  else
    z = 0;
  return z;
}

The code was indeed more complicated and it took long to debug, but the core problem is visible above. One would expect the result would be 0, but it was NaN.

Function sin() in mingw-w64 v12 is now less precise than it was, but for the argument observed, the difference was within 1 ULP. The function was part of a program where the optimizer was able to figure out the value of y, but not the value of x. At runtime, the values were exactly identical, so the branch computing sqrt() was taken. The value of sin(x) is computed at runtime using UCRT, but as the value of y was known at compile time, the compiler also computed (constant-folded) the value of sin(y).

The compiler (GCC, which we use on Windows/x86_64 in Rtools) used a correctly rounded result it computed using mpfr library. But, the UCRT implementation is less precise and the result happened to be smaller, so the argument of sqrt() became negative, and hence NaN was returned. In the package, this has later lead to a runtime exception being thrown.

Constant-folding using correctly-rounded values makes it possible to constant-fold also when cross-compiling, and hence also to generate the same code when cross-compiling and when compiling natively. Also, even when compiling natively, it could be that the version of the math library would be different from the time when executing (and we would have the same problem). Some also say that programmers probably wouldn’t object to getting more precise values from the compiler than from the runtime library, in principle, but, in this case it matters. One can disable constant folding of the sin function (-fno-builtin-sin), but, how can one know in advance. In principle, one could also disable constant folding of all builtin functions. See the LLVM discussion for more on the difficulties with constant-folding of math functions.

It may not help in all cases, but sometimes a useful approach is to call a function and then diagnose the result, rather than upfront checking the arguments and then assuming the function will succeed. This is not necessarily only the case with math functions, but it can avoid race conditions with some other, say file system operations.

Depending on unspecified behavior

It turned out not to be a numerical problem, but still involved a change in a math function. Several packages ran into seemingly infinite computation with v12. It turned out this was because of a single dependency, a package that included a loop similar to this one:

uint64_t rnd() {
  uint64_t res = 0;
  while (res <= 2) {
    double u = uniform() * UINT64_MAX;
    res = lround(u);
  }
  return res;
}

In the code above, one problem is that math function lround() gives a long value, which is 32-bit on Windows. When the value doesn’t fit, the result is unspecified. So, a double value between 0 and 1 is multiplied by the maximum for unsigned 64-bit integer. When that doesn’t fit to a signed 32-bit integer, which is common, res gets the unspecified value. With v11, the unspecified value happened to be LONG_MIN (which came from the Intel hardware, not even controlled by the internal lround implementation), so res ended up something very large, yet not very random, and the loop finished. With v12, the value returned from lround happened to be 0 (probably controlled by UCRT), so the loop kept running.

Summary

This is still work in progress. Some package failures with mingw-w64 v12 haven’t been analyzed, yet, and some of those may still be caused by numerical differences. Some cases I’ve analyzed weren’t yet confirmed/dealt with by the package authors.

The packages that were analyzed were reported to the authors, with patches making the “easy” fixes or work-arounds, such as extending tolerance limits. Some of the package authors have already fixed their packages, but some would have to spend more time when it needs algorithmic improvements.

Some of the issues were reported to mingw-w64, and I’ve used advice from Martin Storsjo and other mingw-w64 developers in this work.

So far my interpretation is that this work revealed an important problem in R packages: that (some of them) are overly sensitive to even very small variations in inputs, such as (surprisingly) even the math library. And it would be worth pursuing methods and tools to help detecting this more widely - after all, it is true that many tolerances in tests are unreasonable, but the question is how to choose reasonable tolerances. At least some fuzzing of the inputs should be possible and perhaps could be implemented in later stages of this work.

R packages in CRAN and Bioconductor are used as a test suite for changes in R itself and in Rtools. The current practice is that such changes are postponed or given up on if they break too many packages. When they only break a little, the people making these changes (to R, Rtools) provide advice or even patches to package authors to fix their packages.

This requires also debugging all problems found by package checks: without debugging those, one cannot know if the problem is in some of the packages (and which) or if in R, or in Rtools (hence say in the compiler, binutils, some library). It is not an exception when a bug is woken up by an innocuous change somewhere else, which can further extend the scope of the debugging.

This practice makes the R ecosystem much more stable than it would have been otherwise. It helped to find real bugs in external software, including in GCC, which have been reported and fixed upstream. On the other hand, given this practice, problems in tests in packages, where they accidentally or even intentionally capture unspecified behavior (including tests using too narrow tolerances) complicate the maintenance of R itself. In this case, it prevented an upgrade of a core Rtools component.