Conversation

@ykempf (Contributor) commented Aug 18, 2025

During investigations of diffs in #1099 I was baffled again by the diffs in the testpackage Flowthrough_amr test. What struck me today is that the diffs jump up by factors of 2 to 6 at every change in dt, whereas the "normal" behaviour is diffs growing by about 10% per step or thereabouts.
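
To put the two regimes in perspective, here is a back-of-the-envelope sketch (Python, illustrative numbers only, not Vlasiator code): after 20 steps, 10% growth per step compounds to roughly 7x, while a handful of 2-6x jumps on top of that dominates completely.

# Illustrative only: the growth rates are the rough figures quoted above.
steps = 20
dt_change_steps = {5, 10, 15}  # hypothetical steps at which dt changes

diff_normal = 1.0  # "normal" regime: grows ~10% per step
diff_jumpy = 1.0   # observed regime: additionally jumps 2-6x at dt changes

for step in range(1, steps + 1):
    diff_normal *= 1.10
    diff_jumpy *= 1.10
    if step in dt_change_steps:
        diff_jumpy *= 4.0  # mid-range of the observed 2-6x jumps
    print(f"step {step:2d}: normal {diff_normal:8.2f}  jumpy {diff_jumpy:10.2f}")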

I tested activating

[vlasovsolver]
accelerateMaxwellianBoundaries = 1

which is well-established in production.

In #1099, at today's state (https://github.com/fmihpc/vlasiator/actions/runs/16967516987/job/48291190426), the diffs in Flowthrough_amr are as follows (Carrington CI); note that this test has been known to be a bit unstable and sometimes shows diffs 100x larger:

Comparing file Flowthrough_amr/bulk.0000001.vlsv against reference

 variable                                       absolute diff   relative diff
 ---------------------------------------------  --------------  -------------
 proton/vg_rho_0                                0.0193          9.01e-09
 proton/vg_v_0                                  0.00259         1.19e-08
 proton/vg_v_1                                  0.0081          8.01e-07
 proton/vg_v_2                                  0.00564         3.89e-07
 fg_b_0                                         1.07e-17        9.11e-09
 fg_b_1                                         2.04e-17        9.92e-09
 fg_b_2                                         9.98e-18        4.66e-09
 fg_e_0                                         2.05e-12        8.67e-08
 fg_e_1                                         4.96e-12        1.14e-08
 fg_e_2                                         7.56e-12        1.83e-08
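
For reference, a minimal sketch of how figures like these could be computed from two datasets. This assumes the "0-distance" reported by vlsvdiff is the maximum (infinity-norm) difference, normalised by the reference maximum for the relative figure — an assumption based on the naming, not a confirmed reading of vlsvdiff.

import numpy as np

def abs_rel_diff(a, b):
    # Assumed convention: absolute 0-distance = max |a - b|,
    # relative = absolute normalised by max |b|. Not confirmed
    # against vlsvdiff's actual implementation.
    absolute = float(np.max(np.abs(a - b)))
    relative = absolute / float(np.max(np.abs(b)))
    return absolute, relative

# Hypothetical usage with stand-in arrays (reading real VLSV files
# would go through analysator, omitted here):
rng = np.random.default_rng(0)
ref = rng.normal(1.0e6, 1.0e4, size=1000)             # stand-in for vg_rho
new = ref * (1.0 + 1.0e-8 * rng.standard_normal(1000))
print(abs_rel_diff(new, ref))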

Running dev vs. the branch manually on LUMI-C, on a single task and a single thread, I get:

proton/vg_rho 0    5.18        2.41e-06
proton/vg_v 0      0.945       4.35e-06
fg_b 0             1.03e-15    8.81e-07
fg_e 0             4.99e-10    2.08e-05

If I activate the Maxwellian boundary acceleration, still single task, single thread, I get

proton/vg_rho 0    0.226       1.05e-07
proton/vg_v 0      0.0522      2.39e-07
fg_b 0             1.08e-16    9.19e-08
fg_e 0             8.76e-11    3.69e-06

If I activate the Maxwellian boundary acceleration on 16 tasks x 16 threads I get

proton/vg_rho 0    0.0174      8.07e-09
proton/vg_v 0      0.00789     3.61e-08
fg_b 0             1.89e-18    1.61e-09
fg_e 0             2.4e-12     1.01e-07

In the latter two cases the diffs no longer increase monotonically for all variables, so this doesn't rid us of all diffs. I'm also a bit surprised the 16 x 16 run diverges less than the serial version. But well.

Note that this changes dt "macroscopically" of course, so new reference data will be needed after this, but it seems to make Flowthrough_amr less diffy.

@ykempf (Contributor, Author) commented Aug 18, 2025

Here are the results for similar tests with Flowthrough_damr.

As reference, the latest state of #1099 linked above gives:

Comparing file Flowthrough_damr/bulk.0000004.vlsv against reference
 proton/vg_rho_0                                8.05            2.7e-06
 proton/vg_v_0                                  0.229           9.79e-07
 proton/vg_v_1                                  0.162           1.31e-05
 proton/vg_v_2                                  0.218           1.51e-05
 vg_amr_alpha1_0                                1.41e-06        1.41e-06
 vg_amr_alpha2_0                                9.87e-07        1.14e-06
 fg_b_0                                         2.13e-15        1.92e-06
 fg_b_1                                         3.5e-15         1.08e-06
 fg_b_2                                         2.06e-15        6.27e-07
 fg_e_0                                         2.15e-10        2.71e-05
 fg_e_1                                         4.5e-10         6.23e-07
 fg_e_2                                         5.36e-10        7.55e-07

The runs below were done on LUMI-C.

On 16 tasks x 16 threads, default case (existing testpackage settings), I get:

variable                    file   absolute    relative
proton/vg_rho               5      18.5        6.21e-06
proton/vg_v                 5      0.868       3.72e-06
--meshname=fsgrid fg_b      5      3.58e-15    3.22e-06
--meshname=fsgrid fg_e      5      4e-10       5.11e-05

On 32 tasks x 1 thread each, Maxwellian boundary acc on, comparing the last file, I get (absolute and relative 0-diff listed)

proton/vg_rho               5      7.99        2.68e-06
proton/vg_v                 5      0.149       6.37e-07
--meshname=fsgrid fg_b      5      5.26e-15    4.65e-06
--meshname=fsgrid fg_e      5      2.53e-10    3.36e-05

On 1 task x 64 threads, Maxwellian boundary acc on:

proton/vg_rho               5      9.16        3.07e-06
proton/vg_v                 5      0.312       1.34e-06
--meshname=fsgrid fg_b      5      2.7e-14     2.44e-05
--meshname=fsgrid fg_e      5      6.04e-10    8.36e-05

On 16 tasks x 16 threads, Maxwellian acc on:

proton/vg_rho               4      17.1        5.93e-06   (ERROR Datasets have different size.)
proton/vg_v                 4      0.436       1.87e-06   (ERROR Datasets have different size.)
--meshname=fsgrid fg_b      5      6.28e-11    0.0565
--meshname=fsgrid fg_e      5      4.09e-06    0.52

So for this test the benefit of accelerating these boundary cells is not obvious; it doesn't fix anything.

@ykempf (Contributor, Author) commented Aug 18, 2025

Sneak peek: Flowthrough_damr run serially has slightly larger diffs with accelerated Maxwellian boundaries than without, so it seems this test's diff source isn't affected. I'll let that run overnight.

@ykempf (Contributor, Author) commented Aug 18, 2025

Final state of Flowthrough_damr run serially, for the base case and the Maxwellian-accelerated case, in dev and in the #1099 branch.

Flowthrough_damr_serial
proton/vg_rho
The absolute 0-distance between both datasets is 8.73
The relative 0-distance between both datasets is 2.92e-06
proton/vg_v
The absolute 0-distance between both datasets is 0.139
The relative 0-distance between both datasets is 5.96e-07
--meshname=fsgrid fg_b
The absolute 0-distance between both datasets is 3.88e-15
The relative 0-distance between both datasets is 3.49e-06
--meshname=fsgrid fg_e
The absolute 0-distance between both datasets is 2.49e-10
The relative 0-distance between both datasets is 2.93e-05


Flowthrough_damr_serial_accelerateMaxwellianBoundaries
proton/vg_rho
The absolute 0-distance between both datasets is 13.1
The relative 0-distance between both datasets is 4.39e-06
proton/vg_v
The absolute 0-distance between both datasets is 0.38
The relative 0-distance between both datasets is 1.63e-06
--meshname=fsgrid fg_b
The absolute 0-distance between both datasets is 2.9e-14
The relative 0-distance between both datasets is 2.61e-05
--meshname=fsgrid fg_e
The absolute 0-distance between both datasets is 6.53e-10
The relative 0-distance between both datasets is 7.68e-05

Which is fascinating, as I had been ascribing at least part of these diffs (compare the 16x16 CI values above) to the "well-known" AMR pencil summation + threading. 🤔
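
For context on why threaded pencil summation is a natural suspect, a generic floating-point illustration (not Vlasiator code): addition is not associative in floating point, so summing the same contributions in a thread- or order-dependent sequence gives slightly different totals.

import random

random.seed(42)
# Values spanning many orders of magnitude exaggerate the effect.
values = [random.uniform(-1.0, 1.0) * 10.0**random.randint(-8, 8)
          for _ in range(100000)]

ordered = sum(values)
shuffled_values = values[:]
random.shuffle(shuffled_values)
shuffled = sum(shuffled_values)

print(f"ordered sum:  {ordered!r}")
print(f"shuffled sum: {shuffled!r}")
print(f"difference:   {ordered - shuffled!r}")  # typically nonzero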

@ykempf (Contributor, Author) commented Aug 19, 2025

Here is the vg_rho diff data for a full run of Flowthrough_damr on a single task and a single thread, with the base case and the accelerated Maxwellian boundary case, diffing the #1099 branch vs dev.

[figure: vg_rho diff data over the full Flowthrough_damr run, base case and accelerated Maxwellian boundary case]

@ykempf (Contributor, Author) commented Aug 19, 2025

Now with the base-case diff run on 16 tasks x 16 threads – the errors are in the same ballpark. Note however that the last 2 files are not available, as they end up with a different AMR grid (a known occasional feature of this test...).

[figure: vg_rho diff data for the base case run on 16 tasks x 16 threads]

@ykempf (Contributor, Author) commented Aug 19, 2025

I'd still advocate activating the Maxwellian boundary acceleration as suggested in this branch, given it alleviates some spurious diffs we can do without when working on a branch.

@ykempf (Contributor, Author) commented Aug 19, 2025

Fell again into that rabbit hole.

  • lowering the phase-space density threshold by 2 orders of magnitude
  • disabling coarsening
  • disabling the randomised order of acceleration
  • not using Agner
  • using short pencils
  • compiling with -O0
  • disabling jemalloc
  • disabling fsgrid filtering

all have little to no effect on the errors/diffs appearing in Flowthrough_damr. Add to that the fact that running serially yields the same diffs, and I am still utterly baffled. I can't help but think there must be some subtle use of uninitialised memory somewhere, or some use of an incorrect pencil neighbour/pointer/VDF that's not quite wrong but not quite right either, and unrelated to all of the parameters tested above, notably OpenMP threading and MPI/ghosts. (A sketch of how such a parameter sweep could be scripted follows below.)
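
For completeness, here is the kind of harness such toggling can be scripted with. A hedged sketch only: the cfg file name, the option names, and the launch command are hypothetical placeholders, not actual Vlasiator cfg keys; the real runs went through the testpackage scripts on LUMI-C.

import itertools
import subprocess
from pathlib import Path

BASE_CFG = Path("Flowthrough_damr.cfg")  # assumed base testpackage config

# Each toggle: (label, cfg override lines). Names are illustrative only,
# NOT actual Vlasiator cfg keys.
toggles = [
    ("coarsening_off", ["[amr]", "coarsening = 0"]),
    ("random_acc_off", ["[vlasovsolver]", "randomizeAccelerationOrder = 0"]),
]

for combo in itertools.product([False, True], repeat=len(toggles)):
    label = "_".join(t[0] for t, on in zip(toggles, combo) if on) or "base"
    lines = BASE_CFG.read_text().splitlines() if BASE_CFG.exists() else []
    for (name, overrides), on in zip(toggles, combo):
        if on:
            lines += overrides
    cfg = Path(f"run_{label}.cfg")
    cfg.write_text("\n".join(lines) + "\n")
    # Placeholder launch command; in practice srun / the testpackage wrapper.
    subprocess.run(["./vlasiator", "--run_config", str(cfg)], check=False)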

@ykempf (Contributor, Author) commented Aug 20, 2025

All right: in my current test setup using Flowthrough_damr I see diffs (as commonly observed over the last couple of years), with none of the above making much difference. What I do observe is that the diffs are the same or very similar no matter which LB algorithm I use, when I change the LB frequency while keeping AMR at the same times, and even when running serially.

I therefore suggest activating this Maxwellian acceleration for Flowthrough_amr, as it seemingly "fixes" some of the diffs we have been disliking so far, but I leave the investigation of the cause of the remaining Flowthrough_damr diffs to a later stage or another investigator.
