Conversation

@msimberg (Collaborator)

Just for testing builds; I don't know if this will work.

@msimberg (Collaborator, Author)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@msimberg (Collaborator, Author)

hsa-amd-aqlprofile had missing compiler dependencies: spack/spack-packages#2532.
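For context, this class of failure is typically fixed by declaring build-time language dependencies in the recipe. Below is a minimal sketch of what such a change can look like; it is a hypothetical excerpt, not the actual diff from spack/spack-packages#2532:

```python
# Hypothetical sketch of a missing-compiler-dependency fix in a Spack recipe;
# see spack/spack-packages#2532 for the actual change.
from spack.package import *


class HsaAmdAqlprofile(Package):
    """AQL profiling library for ROCm (illustrative stub)."""

    # Declaring the languages tells the concretizer that C/C++ compilers
    # must be available at build time.
    depends_on("c", type="build")
    depends_on("cxx", type="build")
```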

@msimberg (Collaborator, Author)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@msimberg (Collaborator, Author)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@msimberg (Collaborator, Author)

ROCm is a menace: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/551234120955960/1440398897047560/-/jobs/12201564988#L3724. hipblaslt seems to be picking up amdclang++ from the system... needs further investigation.

@msimberg (Collaborator, Author)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@msimberg (Collaborator, Author)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@msimberg (Collaborator, Author)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@jgphpc (Contributor) commented Nov 28, 2025

Hi, could you add rocprofiler_sdk/package.py to the recipe? It's needed for Linaro...

@msimberg (Collaborator, Author)

> Hi, could you add rocprofiler_sdk/package.py to the recipe? It's needed for Linaro...

I'll try it out; hopefully there are no bigger issues (though note that I still have issues with other packages before this is usable, unfortunately).

@msimberg (Collaborator, Author) commented Dec 3, 2025

spack/spack-packages#2287, which adds ROCm 7.1.0, is also currently open. It may help, or it may make things worse... The PR description does mention a change to hipblaslt, which may change something here.

@afzpatel commented Dec 3, 2025

> ROCm is a menace: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/551234120955960/1440398897047560/-/jobs/12201564988#L3724. hipblaslt seems to be picking up amdclang++ from the system... needs further investigation.

Yes, it looks like an older version of ROCm is installed on the system and it's choosing the incorrect version of amdclang++. I'll see if I can reproduce the issue and put in a fix.
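A quick way to confirm that diagnosis is to list every amdclang++ the build could plausibly resolve. This is a diagnostic sketch only, assuming the stray compiler is reachable via PATH:

```python
# Diagnostic sketch: print every amdclang++ visible on PATH, in resolution
# order, to spot a stray system ROCm install (e.g. under /opt/rocm-*).
import os
import shutil

print("first match:", shutil.which("amdclang++"))
for directory in os.environ.get("PATH", "").split(os.pathsep):
    candidate = os.path.join(directory, "amdclang++")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        print("candidate:", candidate)
```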

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@afzpatel

> ROCm is a menace: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/551234120955960/1440398897047560/-/jobs/12201564988#L3724. hipblaslt seems to be picking up amdclang++ from the system... needs further investigation.
>
> Yes, it looks like an older version of ROCm is installed on the system and it's choosing the incorrect version of amdclang++. I'll see if I can reproduce the issue and put in a fix.

I was able to reproduce and fix the previous error with hipblaslt by setting Tensile_COMPILER in the recipe:

```diff
diff --git a/repos/spack_repo/builtin/packages/hipblaslt/package.py b/repos/spack_repo/builtin/packages/hipblaslt/package.py
index eba3fbc25b..e2337352ee 100644
--- a/repos/spack_repo/builtin/packages/hipblaslt/package.py
+++ b/repos/spack_repo/builtin/packages/hipblaslt/package.py
@@ -278,6 +278,11 @@ class Hipblaslt(CMakePackage):
                     "ROCROLLER_ASSEMBLER_PATH", f"{self.spec['llvm-amdgpu'].prefix}/bin/amdclang++"
                 )
             )
+            args.append(
+                self.define(
+                    "Tensile_COMPILER", f"{self.spec['llvm-amdgpu'].prefix}/bin/amdclang++"
+                )
+            )
         if self.spec.satisfies("@7.1:"):
             args.append(self.define("HIPBLASLT_ENABLE_CLIENT", self.run_tests))
         else:
```
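The added define mirrors the ROCROLLER_ASSEMBLER_PATH line directly above it: passing the absolute amdclang++ path from the llvm-amdgpu prefix means Tensile no longer resolves the compiler itself, which is presumably how the copy under /opt/rocm was being picked up.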

@iomaganaris (Collaborator)

> I was able to reproduce and fix the previous error with hipblaslt by setting Tensile_COMPILER in the recipe: [...]

Good suggestion, thank you 👍 Any reason why this was not merged in develop?

@afzpatel

> Good suggestion, thank you 👍 Any reason why this was not merged in develop?

I was trying to get my 7.1.0 PR merged for the past couple of weeks, so I didn't want to add any additional changes that might delay it. We'll add the change with 7.1.1: spack/spack-packages#2782

@iomaganaris (Collaborator)

> I was trying to get my 7.1.0 PR merged for the past couple of weeks, so I didn't want to add any additional changes that might delay it. We'll add the change with 7.1.1: spack/spack-packages#2782

Great! Thank you very much for the clarification and the updates to the ROCm packages 👍

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

Commit: …with llvm-amdgpu to avoid rebuilding python packages
@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@simonpintarelli (Member) commented Dec 23, 2025

The GTL library of cray-mpich@8.1.32 is linked to libamdhip64.so.6. I think we need cray-mpich@8.1.33, which is now in alps-cluster-config. It will only work on the sles15sp6 reservation, as 8.1.33 requires the new glibc.

```
➜  lib git:(main) ✗ readelf -d libmpi_gtl_hsa.so

Dynamic section at offset 0xa1da0 contains 28 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libamdhip64.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libhsa-runtime64.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 ...
```

UPDATE 🤦🏻 cray-mpich 8.1.33 and 9.0.1 are also linked to libamdhip64.so.6 :) We can still try patchelf and replace it with .so.7; not sure this is going to work...
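The patchelf route would look roughly like the following. This is a hypothetical sketch only (written in Python, like the rest of the recipe tooling here): it assumes patchelf is on PATH, and whether the .so.7 ABI is actually compatible is exactly the open question.

```python
# Sketch of the patchelf workaround mentioned above: rewrite the DT_NEEDED
# entry of the GTL library from libamdhip64.so.6 to libamdhip64.so.7.
# Assumes patchelf is installed; ABI compatibility with .so.7 is untested.
import subprocess

subprocess.run(
    [
        "patchelf",
        "--replace-needed",
        "libamdhip64.so.6",
        "libamdhip64.so.7",
        "libmpi_gtl_hsa.so",
    ],
    check=True,
)
```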

@simonpintarelli (Member)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@simonpintarelli (Member)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@iomaganaris (Collaborator)

> The GTL library of cray-mpich@8.1.32 is linked to libamdhip64.so.6. [...]
>
> UPDATE 🤦🏻 cray-mpich 8.1.33 and 9.0.1 are also linked to libamdhip64.so.6 :) We can still try patchelf and replace it with .so.7; not sure this is going to work...

I haven't looked into this issue, but I saw that it's not only libamdhip64.so.6 coming from /opt/rocm-6.1.0/lib but also libamd_comgr.so.2:

```
[beverin][ioannmag@beverin-ln001 ICON4PY]$ ldd /user-environment/linux-zen3/cray-gtl-8.1.32-dykogaehen633o4fzsqlr5hhept7i7j3/lib/libmpi_gtl_hsa.so | grep "opt/rocm"
        libamdhip64.so.6 => /opt/rocm-6.1.0/lib/libamdhip64.so.6 (0x00007f98acf47000)
        libamd_comgr.so.2 => /opt/rocm-6.1.0/lib/libamd_comgr.so.2 (0x00007f98a464a000)
[beverin][ioannmag@beverin-ln001 ICON4PY]$ find /user-environment -name "libamdhip64.so*"
/user-environment/env/._default/ogqp57onxwlinpjjcp4rwqcypo5yqqtb/lib/libamdhip64.so
/user-environment/env/._default/ogqp57onxwlinpjjcp4rwqcypo5yqqtb/lib/libamdhip64.so.7
/user-environment/env/._default/ogqp57onxwlinpjjcp4rwqcypo5yqqtb/lib/libamdhip64.so.7.1.25505
/user-environment/linux-zen3/hip-7.1.0-hlbhwy6epcpcbymr3qpsu6x3nxr2fizf/lib/libamdhip64.so
/user-environment/linux-zen3/hip-7.1.0-hlbhwy6epcpcbymr3qpsu6x3nxr2fizf/lib/libamdhip64.so.7
/user-environment/linux-zen3/hip-7.1.0-hlbhwy6epcpcbymr3qpsu6x3nxr2fizf/lib/libamdhip64.so.7.1.25505
[beverin][ioannmag@beverin-ln001 ICON4PY]$ find /user-environment -name "libamd_comgr.so*"
/user-environment/linux-zen3/comgr-7.1.0-d6zrig6m7vdftmyb4g5onyts7p5vrahc/lib/libamd_comgr.so
/user-environment/linux-zen3/comgr-7.1.0-d6zrig6m7vdftmyb4g5onyts7p5vrahc/lib/libamd_comgr.so.3
/user-environment/linux-zen3/comgr-7.1.0-d6zrig6m7vdftmyb4g5onyts7p5vrahc/lib/libamd_comgr.so.3.0
```

I'm not really familiar with the cray-mpich package, but maybe adding %c,cxx,[email protected] helps. I would try this out in a separate PR or locally though 🙈

@simonpintarelli (Member) commented Jan 7, 2026

@iomaganaris The problem is that the binary RPMs for cray-mpich are linked against libamdhip64.so.6; since there is no libamdhip64.so.6 in the Spack installation, the loader finds the libraries in /opt/rocm-6.1.0 instead. I think we have to wait for HPE to release a new cray-mpich that is built against ROCm 7.

A workaround would be to build without the GTL/HSA library (cray-mpich~rocm).

@msimberg (Collaborator, Author) commented Jan 8, 2026

BTW, it occurred to me that if we want to avoid the issues with HPE's precompiled binaries, we could build an OpenMPI uenv with ROCm 7 instead. I'd make it a separate uenv, but it might be a faster/smoother option at the moment than trying to patch up HPE's binaries. What do you think? If ROCm 7 is otherwise building OK now, I'd expect switching to OpenMPI to be relatively simple. I can try to set that up if you think it's useful. I'm about to deploy #263 anyway.

@simonpintarelli (Member)

@msimberg I built q-e-sirius with openmpi+rocm7: https://github.com/eth-cscs/alps-uenv/pull/288/files

I'm not sure how to run it correctly:

```
srun --mpi=pmix --jobid=205659 --overlap -n2 osu_bw D D
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      nid002920
Framework: psec
Component: munge
```

--mpi=pmi2 worked, but still gave a warning:

```
[beverin][simonpi@nid002920 osu-micro-benchmarks]$ srun --mpi=pmi2 --jobid=205659 --overlap -n2 osu_bw D D
No PMIx server was reachable, but a PMI1/2 was detected.
If srun is being used to launch application,  2 singletons will be started.
```

@msimberg (Collaborator, Author) commented Jan 8, 2026

@simonpintarelli have a look here: https://docs.tds.cscs.ch/301/software/communication/openmpi/#uenv (not merged yet). The munge warning should be harmless if it ran otherwise (but I'm pretty sure you need to set the other variables mentioned on that page). That said, PMIx might be set up differently on beverin, and I haven't tested there.

CXI should be the safe choice. LNX is faster if it works, but it may not work...

@msimberg (Collaborator, Author) commented Jan 8, 2026

I just did a quick test with q-e-sirius/v1.0.2-rocm7:2235068710 and osu_bw works with CXI, but it fails immediately with LNX.

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12

@iomaganaris (Collaborator)

cscs-ci run alps;system=beverin;uarch=mi200;uenv=prgenv-gnu:25.12
