Skip to content

Conversation

@mathomp4
Copy link
Contributor

@mathomp4 mathomp4 commented Nov 4, 2025

TL;DR: This update avoids an issue between icpx when ESMF is built with ESMF_BOPT=O and ifx when code using this is built with -fpe0.

See GEOS-ESM/MAPL#4159 and GEOS-ESM/MAPL#4162

Crash in MAPL

This update is needed with an issue found by MAPL unit tests when trying out ifx 2025.3. The cause was if we build ESMF with ESMF_BOPT=O but then build MAPL with ifx with our Debug flags which have -fpe0. When we did this, we got a crash:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
libpthread-2.28.s  000014C8D9A4B990  Unknown               Unknown  Unknown
libesmf.so         000014C8EFFA4704  _ZN5ESMCI4BBoxC1E     Unknown  Unknown
libesmf.so         000014C8F044FED0  _ZN5ESMCI16OctSea     Unknown  Unknown
libesmf.so         000014C8F0452032  _ZN5ESMCI9OctSear     Unknown  Unknown
libesmf.so         000014C8F01FE865  _ZN5ESMCI6InterpC     Unknown  Unknown
libesmf.so         000014C8F03C06E8  _ZN5ESMCI6regridE     Unknown  Unknown
libesmf.so         000014C8F03B9527  _Z19ESMCI_regrid_     Unknown  Unknown
libesmf.so         000014C8F033C753  _ZN5ESMCI7MeshCap     Unknown  Unknown
libesmf.so         000014C8F0431FE2  c_esmc_regrid_cre     Unknown  Unknown
libesmf.so         000014C8F0AC829A  esmf_regridmod_mp     Unknown  Unknown
libesmf.so         000014C8F08316FD  esmf_fieldregridm     Unknown  Unknown
libMAPL.base.so    000014C8EB4ED77D  create_route_hand        1534  MAPL_EsmfRegridder.F90
libMAPL.base.so    000014C8EB4E8633  initialize_subcla        1382  MAPL_EsmfRegridder.F90
libMAPL.base.so    000014C8EAF47EDB  initialize_base           976  MAPL_AbstractRegridder.F90
libMAPL.base.so    000014C8EACC64D1  make_regridder_            96  NewRegridderManager.F90
libMAPL.griddedio  000014C8EBE85103  createfilemetadat         156  GriddedIO.F90
libMAPL.history.s  000014C8EE89D43E  initialize               2622  MAPL_HistoryGridComp.F90

where the call ends on an ESMF_FieldRegridStore.

Pure ESMF Reproducer

@bena-nasa then created a reproducer that was outside of MAPL and showed the same crash. With that reproducer, I was able to run with DDT and find that the offending code was:

    double ns = std::sqrt(nr[0]*nr[0]+nr[1]*nr[1]+nr[2]*nr[2]);


    // Only normalize if bigger than 0.0
    if (ns >1.0E-19) {
       nr[0] /= ns; nr[1] /= ns; nr[2] /= ns;
    } else {
      nr[0] =0.0; nr[1] =0.0; nr[2] =0.0;
    }

from

double ns = std::sqrt(nr[0]*nr[0]+nr[1]*nr[1]+nr[2]*nr[2]);
// Only normalize if bigger than 0.0
if (ns >1.0E-19) {
nr[0] /= ns; nr[1] /= ns; nr[2] /= ns;
} else {
nr[0] =0.0; nr[1] =0.0; nr[2] =0.0;
}

I'm guessing it is the divides in the if block.

Remove ESMF

I then asked ye olde Chat GPT for a small Fortran + C++ code that would exercise that chunk, and it seems to trigger it:

> ./doit.sh
+ rm -f test_norm normalize.o
+ icpx -O2 -g -traceback -c normalize.cpp
+ ifx -O0 -g -traceback -fpe0 -o test_norm test_norm.F90 normalize.o
+ ./test_norm
 Case 1: zeros -> expect (0,0,0)
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
libpthread-2.28.s  0000146509148990  Unknown               Unknown  Unknown
test_norm          0000000000404791  normalize3                 11  normalize.cpp
test_norm          0000000000404387  test_norm                  17  test_norm.F90
test_norm          00000000004042AD  Unknown               Unknown  Unknown
libc-2.28.so       0000146508B957E5  __libc_start_main     Unknown  Unknown
test_norm          00000000004041CE  Unknown               Unknown  Unknown
./doit.sh: line 16: 2968692 Aborted                 (core dumped) ./test_norm

Chat GPT does have some thoughts about this code:

Nice repro—this is exactly what I suspected: at -O2 the C++ compiler is hoisting the reciprocal (1.0/ns) (or otherwise speculating the divide) before it proves the branch, so when ns==0 you trip the -fpe0 trap even though the source looks guarded.

Why this happens

icpx (LLVM) assumes, by default, that FP ops don’t trap; with that assumption it’s free to move/transform divides (e.g., form a reciprocal early) across control flow. Your Fortran -fpe0 makes a divide-by-zero trap, but the C++ optimizer doesn’t know you need strict exception semantics unless you tell it.

It suggested either extra flags when compiling with icpx:

  -fp-model precise -ftrapping-math -fno-reciprocal-math

or changing the code:

    // (Optional but numerically nicer than sqrt of squares)
    const double ns = std::hypot(std::hypot(nr[0], nr[1]), nr[2]);


    // Only normalize if bigger than 0.0
    if (ns > 1.0e-19) {
        // Prevent speculative/hoisted reciprocal without vendor pragmas/flags
        volatile double denom = ns;          // portable optimization barrier
        const double inv = 1.0 / denom;      // executed only in the guarded block
        nr[0] *= inv; nr[1] *= inv; nr[2] *= inv;
    } else {
        nr[0] = 0.0; nr[1] = 0.0; nr[2] = 0.0;
    }

for doing the code bit.

I tested this and it does "fix" our issue And this is "nicer" code in that it removes three divides for one single one, etc. but the fear is this could be non-zero-diff.

But in talking with @oehmke, we decided to try a simpler test which is what this PR is where we go from:

    double ns = std::sqrt(nr[0]*nr[0]+nr[1]*nr[1]+nr[2]*nr[2]);

to:

    volatile double ns = std::sqrt(nr[0]*nr[0]+nr[1]*nr[1]+nr[2]*nr[2]);

which seems to prevent the compiler from hoisting the code.

This change seems to fix our issues in MAPL.

I'll do a build with, say, our older production ifort 2021.13 to see if we get zero-diff with this code change. I think we should, but...

This update avoids an issue between `icpx` when ESMF is built with `ESMF_BOPT=O` and `ifx` when code *using* this is built with `-fpe0`.
@mathomp4
Copy link
Contributor Author

mathomp4 commented Nov 5, 2025

I built this change in Baselibs with ifort 2021.13 (our production code) and then did a run of GEOS with and without. The runs were zero-diff.

@theurich
Copy link
Member

theurich commented Nov 6, 2025

I know the volatile work-around suggested here is very similar to what we did under #472 for the same problem with -fpe0 and IFX. However, I don't know how much further we want to push this approach!

Under #472 we were dealing with a section of the ESMF library that is just concerned with high level bookkeeping and labeling - not a performance critical part. However, I am concerned that what is proposed in #502 is actually touching performance critical portions of ESMF. I don't think we want to start adding volatile everywhere to prevent speculative execution, and effectively disable a critical performance technique. Especially in sections where we want performance. In my mind, if the user code really wants to test with -fpe0, then it should use an un-optimized build of ESMF.

@theurich theurich self-requested a review November 6, 2025 16:31
Copy link
Member

@theurich theurich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes themselves look okay. However, I am concerned about performance impact as discussed in my longer comment.

@mathomp4
Copy link
Contributor Author

mathomp4 commented Nov 6, 2025

@theurich Well there is this:

    // (Optional but numerically nicer than sqrt of squares)
    const double ns = std::hypot(std::hypot(nr[0], nr[1]), nr[2]);


    // Only normalize if bigger than 0.0
    if (ns > 1.0e-19) {
        // Prevent speculative/hoisted reciprocal without vendor pragmas/flags
        volatile double denom = ns;          // portable optimization barrier
        const double inv = 1.0 / denom;      // executed only in the guarded block
        nr[0] *= inv; nr[1] *= inv; nr[2] *= inv;
    } else {
        nr[0] = 0.0; nr[1] = 0.0; nr[2] = 0.0;
    }

which works just as well, though it might be non-zero-diff. But, since that (might?) replace 3 divides by one, it could be more performant (though maybe the

But if you insist on not using this, then I can look at trying to add code into MAPL's CMake to prevent the use of -fpe0 in when we build with Debug flags. We'd probably have to do that otherwise we'd need to double our builds of ESMF every time since we know that ESMF_BOPT=g crushes regridding performance.

@oehmke
Copy link
Contributor

oehmke commented Nov 6, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants