Description
Running the unit tests on Aurora with oneapi/public/2025.3.0 results in a test failure: the test hangs until it times out.
Build info:
git clone --recurse-submodules https://github.com/Sandia-OpenSHMEM/SOS.git SOS
cd SOS
./autogen.sh
CC=icx CXX=icpx ./configure --prefix=$PWD/install_sos --with-ofi=/opt/cray/libfabric/1.22.0 --enable-pmi-simple --enable-ofi-mr=basic --disable-ofi-inject --enable-ofi-hmem --disable-bounce-buffers --enable-ofi-manual-progress --enable-mr-endpoint --disable-nonfetch-amo --enable-manual-progress 2>&1 | tee SOS_config.log
make -j 2>&1 | tee SOS_build.log
make install
cd ../
export LD_LIBRARY_PATH=$PWD/SOS/install_sos/lib/:$LD_LIBRARY_PATH
export LIBRARY_PATH=$PWD/SOS/install_sos/lib/:$LIBRARY_PATH
export PATH=$PWD/SOS/install_sos/bin/:$PATH
export CPATH=$PWD/SOS/install_sos/include:$CPATH
git clone https://github.com/oneapi-src/ishmem.git
cd ishmem
mkdir -p build_sos
cd build_sos
CC=icx CXX=icpx cmake .. -DENABLE_OPENSHMEM=ON -DSHMEM_DIR=$PWD/../../SOS/install_sos -DCMAKE_INSTALL_PREFIX=$PWD/../../SOS/install_sos_ishmem -DBUILD_UNIT_TESTS=ON -DBUILD_PERF_TESTS=ON -DBUILD_APPS=ON -DCTEST_LAUNCHER=mpi 2>&1 | tee ISHMEM_config_sos.log
make -j 2>&1 | tee ISHMEM_build.log
ctest --test-dir ./test/unit --verbose --timeout 300 --no-tests=error |& tee -a cmake_tests.log
Using these environment variables:
export FI_CXI_OPTIMIZED_MRS=0
export ISHMEM_RUNTIME=OPENSHMEM
export SHMEM_OFI_PROVIDER="cxi"
export EnableImplicitScaling=0
export NEOReadDebugKeys=1
Error from ctest:
The following tests FAILED:
161 - wait_until_all-on_queue-2 (Timeout)
Errors while running CTest
From a backtrace it looks like it hangs here:
Thread 1.1 (Thread 0x148098dc4f80 (LWP 85264) "wait_until_all"):
#0 0x0000148096903cae in ?? () from /usr/lib64/libze_intel_gpu.so.1
#1 0x0000148096936609 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#2 0x0000148096934af0 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#3 0x00001480969440e2 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#4 0x0000148096945d27 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#5 0x000014809655e4de in ?? () from /usr/lib64/libze_intel_gpu.so.1
#6 0x000000000048197e in ishmemi_usm_free(void*) ()
#7 0x000000000048c991 in ishmemi_proxy_fini() ()
#8 0x0000000000439b56 in ishmem_finalize() ()
#9 0x0000000000423a6d in ishmem_tester::~ishmem_tester() ()
#10 0x000000000042366d in main ()
Note that it does not hang with the 2025.2 SDK, but it does with the 2025.3 SDK. I see the behavior with 1146.31 and 1146.12, although it's not consistent -- maybe 50% of runs.
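Because the hang is intermittent, it may help to measure the failure rate over repeated runs of just the failing test. The following is a minimal sketch, not part of ishmem or SOS: `count_failures` is a hypothetical helper name, and the run count of 20 in the usage comment is an arbitrary assumption. The ctest invocation reuses the options from the build steps above and adds `-R` to select the one test by name.

```shell
# Hypothetical helper (not part of ishmem or SOS): run a command a given
# number of times and report how many of those runs exited non-zero.
count_failures() {
    runs=$1; shift
    failed=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only its exit status is counted.
        if ! "$@" > /dev/null 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed/$runs runs failed"
}

# Example usage from the build_sos directory (20 runs is an arbitrary choice):
# count_failures 20 ctest --test-dir ./test/unit -R 'wait_until_all-on_queue-2' --timeout 300
```

With the 300-second timeout each hung run takes five minutes, so a shorter `--timeout` may be preferable when collecting statistics.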
Thanks!