Chasing the cause of diffs in testpackage Flowthrough_damr #1162

@ykempf

For a year or two, Flowthrough_damr has been yielding relative diffs of order 1e-6 in testpackage runs, even when the branch under test only changes things that do not affect the computations directly; #1099 is a recent example for me. As discussed in #1099 and #1161, I dug deeper. This issue keeps notes of what I observed and how to reproduce it for further investigation.

The key new observation is that I get diffs even when running Flowthrough_damr serially. Our previous interpretation, that the diffs result from threading/summation differences, especially at AMR and MPI domain boundaries, is therefore at best incomplete. Threading etc. may well amplify the diffs, but the setup documented here yields them in serial runs too.
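For reference, the threading/summation effect mentioned above comes from floating-point addition not being associative: a chunked (threaded-style) reduction can round differently from a plain serial loop over the same values. The following is a minimal Python sketch of that effect only. It is not taken from the code under test, and the function names and input values are made up for illustration.

```python
import math

def sum_in_order(values):
    """Plain left-to-right accumulation, as a serial loop would do."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_in_chunks(values, n_chunks):
    """Sum each chunk separately, then add the partial sums.

    This mimics the order of operations of a threaded reduction.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [sum_in_order(values[i:i + size])
                for i in range(0, len(values), size)]
    return sum_in_order(partials)

# Illustrative values with a large dynamic range (hypothetical, not simulation data).
values = [1e16, 1.0, -1e16, 1.0]

serial = sum_in_order(values)       # one accumulation order
chunked = sum_in_chunks(values, 2)  # another accumulation order
reference = math.fsum(values)       # correctly rounded sum of the same values

print(serial, chunked, reference)
```

The three results differ even though the inputs are identical. A purely serial run always uses the same accumulation order, which is why diffs in serial runs point to an additional cause.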

Another key observation is that the diffs jump up significantly (sometimes by more than a factor of 10) when AMR happens, but not at plain load balancing (LB). I confirmed this by setting LB to occur every 5 steps and AMR every 20 steps (so more frequent LB but the same AMR interval as the default Flowthrough_amr). I did not see any strong jumps in diffs other than right after an AMR.
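One way to check this kind of correlation systematically is to scan the series of relative diffs for jumps larger than a factor of 10 and test whether an AMR or LB event fell between the two outputs. The sketch below assumes the diffs have already been collected into plain lists. The helper names are hypothetical, and the numbers in the usage part are placeholders to exercise the functions, not measured values.

```python
def find_jumps(steps, rel_diffs, factor=10.0):
    """Return (previous_step, step, ratio) for every output where the
    relative diff grew by more than `factor` since the previous output."""
    jumps = []
    for i in range(1, len(steps)):
        prev, curr = rel_diffs[i - 1], rel_diffs[i]
        if prev > 0.0 and curr / prev > factor:
            jumps.append((steps[i - 1], steps[i], curr / prev))
    return jumps

def event_between(prev_step, step, interval):
    """True if an event occurring every `interval` steps fell in (prev_step, step]."""
    return step // interval > prev_step // interval

# Placeholder series (NOT measured data): one output every 10 steps.
steps = [10, 20, 30, 40, 50]
rel_diffs = [1e-12, 2e-12, 3e-12, 5e-11, 6e-11]

AMR_INTERVAL = 20  # AMR every 20 steps, as in the test described above
LB_INTERVAL = 5    # LB every 5 steps, as in the test described above

jumps = find_jumps(steps, rel_diffs)
for prev_step, step, ratio in jumps:
    print(f"jump x{ratio:.1f} between steps {prev_step} and {step}: "
          f"AMR in window: {event_between(prev_step, step, AMR_INTERVAL)}, "
          f"LB in window: {event_between(prev_step, step, LB_INTERVAL)}")
```

With outputs every 10 steps, every window contains an LB, so windows with a jump and an AMR need to be compared against windows with LB only.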

Things that had little or no impact on the overall level of diffs include:

  • changing the LB algorithm (not relevant given the serial diffs observation)
  • compiling with O0
  • disabling jemalloc
  • using short pencils
  • using fallback vectorisation
  • disabling the randomised order of acceleration directions
  • disabling fsgrid filtering
  • 1st order space and time

Steps to reproduce and see the diffs:

  • build dev and a branch, e.g. the current state of FsGrid & fieldsolver refactoring #1099
  • run Flowthrough_damr (increase the IO cadence to write a file every 0.8 s, i.e. every 10 steps)
  • run vlsvdiff_DP for proton/vg_rho
  • the relative diffs grow from about 1e-12 at the beginning to about 1e-6 roughly halfway through the test.
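To make the notion of a "relative diff" concrete, here is a minimal sketch of one common definition: the largest absolute difference between two fields, normalised by the largest absolute value of the reference field. This definition is an assumption chosen for illustration. vlsvdiff_DP reports its own set of norms, which may be defined differently, and the function name and input values below are hypothetical.

```python
def relative_max_diff(reference, candidate):
    """Max absolute difference, normalised by the max absolute reference value.

    Assumed definition for illustration; not necessarily what vlsvdiff_DP computes.
    """
    if len(reference) != len(candidate):
        raise ValueError("fields must have the same number of cells")
    scale = max(abs(r) for r in reference)
    if scale == 0.0:
        raise ValueError("reference field is identically zero")
    return max(abs(r - c) for r, c in zip(reference, candidate)) / scale

# Placeholder fields (not simulation output): one cell differs by 4e-6.
rho_dev = [1.0, 2.0, 4.0]
rho_branch = [1.0, 2.0, 4.0 + 4e-6]

diff = relative_max_diff(rho_dev, rho_branch)
print(f"relative diff: {diff:.3e}")
```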

Changing some of the parameters listed above may change, for example, how early the first big jump in diffs occurs, but none of them got rid of that jump or the subsequent increase in diffs. What I would consider acceptable is relative diffs of ~1e-12 throughout, without such massive jumps.

It would seem that short pencils "help" in bringing the abrupt jumps to the fore.
