ISHMEM on Aurora: Unit test wait_until_all-on_queue-2 hanging #15

@colleeneb

Running the ISHMEM unit tests on Aurora with oneapi/public/2025.3.0 results in one test hanging and then failing with a timeout.

Build info:

    git clone --recurse-submodules https://github.com/Sandia-OpenSHMEM/SOS.git SOS
    cd SOS
    ./autogen.sh
    CC=icx CXX=icpx ./configure --prefix=$PWD/install_sos --with-ofi=/opt/cray/libfabric/1.22.0 --enable-pmi-simple --enable-ofi-mr=basic --disable-ofi-inject --enable-ofi-hmem --disable-bounce-buffers --enable-ofi-manual-progress --enable-mr-endpoint --disable-nonfetch-amo --enable-manual-progress 2>&1 | tee SOS_config.log
    make -j 2>&1 | tee SOS_build.log
    make install
    cd ../

    export LD_LIBRARY_PATH=$PWD/SOS/install_sos/lib/:$LD_LIBRARY_PATH
    export LIBRARY_PATH=$PWD/SOS/install_sos/lib/:$LIBRARY_PATH
    export PATH=$PWD/SOS/install_sos/bin/:$PATH
    export CPATH=$PWD/SOS/install_sos/include:$CPATH

    git clone https://github.com/oneapi-src/ishmem.git
    cd ishmem
    mkdir -p build_sos
    cd build_sos
    CC=icx CXX=icpx cmake .. -DENABLE_OPENSHMEM=ON -DSHMEM_DIR=$PWD/../../SOS/install_sos -DCMAKE_INSTALL_PREFIX=$PWD/../../SOS/install_sos_ishmem -DBUILD_UNIT_TESTS=ON -DBUILD_PERF_TESTS=ON -DBUILD_APPS=ON -DCTEST_LAUNCHER=mpi 2>&1 | tee ISHMEM_config_sos.log
    make -j 2>&1 | tee ISHMEM_build.log
    ctest --test-dir ./test/unit --verbose --timeout 300 --no-tests=error |& tee -a cmake_tests.log

Using these envs:

    export FI_CXI_OPTIMIZED_MRS=0
    export ISHMEM_RUNTIME=OPENSHMEM
    export SHMEM_OFI_PROVIDER="cxi"
    export EnableImplicitScaling=0
    export NEOReadDebugKeys=1

Error from ctest:

    The following tests FAILED:
            161 - wait_until_all-on_queue-2 (Timeout)
    Errors while running CTest

From a backtrace, it looks like it hangs during finalization, inside ishmemi_usm_free:

    Thread 1.1 (Thread 0x148098dc4f80 (LWP 85264) "wait_until_all"):
    #0  0x0000148096903cae in ?? () from /usr/lib64/libze_intel_gpu.so.1
    #1  0x0000148096936609 in ?? () from /usr/lib64/libze_intel_gpu.so.1
    #2  0x0000148096934af0 in ?? () from /usr/lib64/libze_intel_gpu.so.1
    #3  0x00001480969440e2 in ?? () from /usr/lib64/libze_intel_gpu.so.1
    #4  0x0000148096945d27 in ?? () from /usr/lib64/libze_intel_gpu.so.1
    #5  0x000014809655e4de in ?? () from /usr/lib64/libze_intel_gpu.so.1
    #6  0x000000000048197e in ishmemi_usm_free(void*) ()
    #7  0x000000000048c991 in ishmemi_proxy_fini() ()
    #8  0x0000000000439b56 in ishmem_finalize() ()
    #9  0x0000000000423a6d in ishmem_tester::~ishmem_tester() ()
    #10 0x000000000042366d in main ()

Note that the test does not hang with the 2025.2 SDK, but it does with the 2025.3 SDK. I see the behavior with both 1146.31 and 1146.12, although not consistently: it hangs in maybe 50% of runs.
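
Since the hang is intermittent, a small loop can help put a number on how often it happens. The following is only a sketch and not part of the ISHMEM test suite: `count_hangs` is a hypothetical helper name, and it assumes GNU coreutils `timeout` is available, which exits with status 124 when it has to kill the command.

```shell
# count_hangs RUNS LIMIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND RUNS times under `timeout` and counts
# how many runs had to be killed after LIMIT_SECONDS.
# The body runs in a subshell so its variables do not leak into the caller.
count_hangs() (
    runs=$1
    limit=$2
    shift 2
    hung=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        rc=0
        timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
        # GNU timeout reports 124 when the time limit was reached
        if [ "$rc" -eq 124 ]; then
            hung=$((hung + 1))
        fi
        i=$((i + 1))
    done
    echo "hung $hung of $runs runs"
)

# Example, run from build_sos (the run count of 20 is arbitrary; the ctest
# timeout is set above the outer limit so that `timeout` is what fires):
#   count_hangs 20 300 ctest --test-dir ./test/unit -R 'wait_until_all-on_queue-2' --timeout 600
```

Whether processes started through the MPI launcher are cleaned up when `timeout` fires is not something this sketch checks, so stale processes may need to be killed between runs.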

Thanks!
