Optimize HPL and MPICH further #70

volyrique · 2025-09-16T00:27:45Z

HPL spends the majority of its execution time inside the BLAS implementation, so these changes affect mainly slower processors such as in-order ones.

Note that link-time optimization (i.e. the `-flto` option) noticeably increases both build time and memory consumption during linking.

As a reminder, the `-march=native` option behaves differently across architectures, for example on AArch64 versus x86-64, especially with older compilers such as GCC up to and including version 14, so ideally we would combine it with `-mtune=native` as futureproofing. However, in my experiments it didn't lead to any further significant performance difference, while all 20 runs that I tried failed the residual check, so I decided to omit it.
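To see what a flag such as `-march=native` actually resolves to on a given machine, one option is to ask GCC for its effective target options. The sketch below assumes `gcc` is on the `PATH` and that `gcc -Q --help=target` prints one option and its value per line; the exact output layout varies between GCC versions and architectures, and the helper names are hypothetical, not part of this repository.

```python
import shutil
import subprocess

# Options whose resolved values we want to inspect (chosen for illustration).
WANTED = ("-march=", "-mcpu=", "-mtune=")


def parse_target_options(text, wanted=WANTED):
    """Extract option/value pairs from `gcc -Q --help=target` style output."""
    resolved = {}
    for line in text.splitlines():
        parts = line.split()
        # Lines of interest look like: "  -march=        <value>"
        if len(parts) == 2 and parts[0] in wanted:
            resolved[parts[0]] = parts[1]
    return resolved


def query_target_options(flags=("-march=native",), compiler="gcc"):
    """Run the compiler and return its resolved target options as a dict."""
    if shutil.which(compiler) is None:
        raise FileNotFoundError(f"{compiler} not found on PATH")
    result = subprocess.run(
        [compiler, *flags, "-Q", "--help=target"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_target_options(result.stdout)
```

Calling `query_target_options()` and `query_target_options(("-march=native", "-mtune=native"))` on the same board and comparing the two dictionaries gives a quick indication of whether the extra flag changes the tuning target at all.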

I benchmarked my changes on a Radxa Orion O6 board by doing 20 runs per revision with the `Qs` parameter set to 12 and `blis_configure_options` set to `cortexa57`. Here are my results:

| Revision | Successful runs | Median Gflops | Standard error | Minimum Gflops | Maximum Gflops |
| --- | --- | --- | --- | --- | --- |
| `2c2d455` | 7 | 88.03 | 0.22 | 87.15 | 89.02 |
| My changes | 10 | 89.52 | 0.20 | 88.46 | 90.49 |

In other words, the median improved by approximately 1.69%. For comparison, on an AMD Ryzen 9 5900X-based machine with 64 GiB RAM there was no significant difference.
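The arithmetic behind the improvement figure can be reproduced from the two reported medians. The following is a minimal sketch: `summarize` is a hypothetical helper showing how the table columns could be derived from per-run Gflops values, and it assumes that "standard error" means the sample standard deviation divided by the square root of the number of successful runs, which the description does not state. The per-run values are not listed in this thread, so only the reported medians are used.

```python
import math
import statistics


def summarize(gflops):
    """Summarize the Gflops results of the successful runs of one revision."""
    n = len(gflops)
    return {
        "runs": n,
        "median": statistics.median(gflops),
        # Assumption: standard error of the mean, from the sample std. dev.
        "stderr": statistics.stdev(gflops) / math.sqrt(n),
        "min": min(gflops),
        "max": max(gflops),
    }


def relative_improvement(baseline, changed):
    """Relative change of `changed` over `baseline`, in percent."""
    return (changed - baseline) / baseline * 100.0


# Medians reported in the table above (2c2d455 vs. the changed revision).
improvement = relative_improvement(88.03, 89.52)
print(f"{improvement:.2f}%")  # prints "1.69%"
```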

templates/benchmark-Make.top500.j2

HPL spends the majority of its execution time inside the BLAS implementation, so these changes affect mainly slower processors such as in-order ones.

Signed-off-by: Anton Kirilov <[email protected]>

volyrique · 2025-09-16T21:24:02Z

The HPL host seems to be intermittently inaccessible (I had the same issue locally), so the CI check failed, and I can't retrigger it.

geerlingguy · 2025-09-17T02:23:30Z

@volyrique - Thanks! Looks like the server is stable now, at least. Merged the changes and please feel free to make any other suggestions, as I'm far from an expert on these clustering tools!

geerlingguy reviewed Sep 16, 2025

templates/benchmark-Make.top500.j2 (outdated, resolved)

Optimize HPL and MPICH further

7b95e10


volyrique force-pushed the opt branch from bfd2bf2 to 7b95e10 Compare September 16, 2025 21:14

geerlingguy merged commit 41fce33 into geerlingguy:master Sep 17, 2025
3 of 4 checks passed

volyrique deleted the opt branch September 17, 2025 02:27

This was referenced Sep 17, 2025:

- Benchmark Radxa Oryon O6 Mini ITX board #54 (Open)
- Use the -march=native option when building HPL and MPICH #71 (Merged)
