Only call cuda_synchronize when it's truly necessary #2500
Pull request overview
This PR reduces unnecessary GPU synchronizations before MPI/graph-context communication, improving profiling performance (per #2498) by avoiding cuda_synchronize when no inter-process communication work is required.
Changes:
- Skip `cuda_synchronize(...; blocking=true)` in `weighted_dss_start!` when there are no perimeter elements to communicate.
- Skip `cuda_synchronize` in distributed remapping collection when running with a single process.
- In multi-field `Spaces.weighted_dss!`, synchronize only when at least one involved `DSSBuffer` has non-empty `perimeter_elems`.
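The first two guards listed above can be illustrated with a minimal, self-contained sketch. Every name in it (`MockBuffer`, `mock_synchronize`, `start_exchange!`, `collect_remapped!`, `SYNC_COUNT`) is a hypothetical stand-in and is not taken from the PR, ClimaCore, or CUDA.jl; the synchronization is replaced by a call counter so the example runs without a GPU.

```julia
# Hypothetical stand-ins; none of these names come from the PR or from CUDA.jl.
struct MockBuffer
    perimeter_elems::Vector{Int}
end

# Counts how many times the (mock) synchronization was requested.
const SYNC_COUNT = Ref(0)

# Stand-in for an expensive device synchronization; it only counts calls.
mock_synchronize() = (SYNC_COUNT[] += 1; nothing)

# Synchronize only when the buffer has perimeter elements to communicate.
function start_exchange!(buffer::MockBuffer)
    isempty(buffer.perimeter_elems) && return false
    mock_synchronize()
    return true
end

# Synchronize before a reduction only when more than one process participates.
function collect_remapped!(nprocs::Integer)
    nprocs > 1 || return false
    mock_synchronize()
    return true
end

start_exchange!(MockBuffer(Int[]))      # empty perimeter: sync skipped
start_exchange!(MockBuffer([1, 2, 3]))  # perimeter work: sync performed
collect_remapped!(1)                    # single process: sync skipped
collect_remapped!(4)                    # several processes: sync performed
println(SYNC_COUNT[])
```

Returning early keeps the expensive call off the common path while leaving the communicating path unchanged, which appears to be the intent of the guards described in this overview.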
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/Spaces/dss.jl | Avoids syncing when DSSBuffer.perimeter_elems is empty; minor docstring whitespace adjustments. |
| src/Remapping/distributed_remapping.jl | Avoids syncing before MPI reduction unless nprocs > 1. |
| src/Fields/Fields.jl | Adds a needs_sync guard to avoid syncing unless at least one buffer has perimeter communication work. |
needs_sync =
    dss_buffer1 isa Topologies.DSSBuffer &&
    !isempty(dss_buffer1.perimeter_elems) ||
    any(
        b isa Topologies.DSSBuffer && !isempty(b.perimeter_elems)
        for (_, b) in field_buffer_pairs
    )
needs_sync relies on &&/|| precedence across line breaks; adding explicit parentheses around the boolean expression would make the intended logic unambiguous and easier to maintain.
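To make the precedence point concrete: in Julia, `&&` binds more tightly than `||`, so `a && b || c` parses as `(a && b) || c`. The sketch below shows one way the guard could be written with the grouping made explicit through a small helper. `MockDSSBuffer`, `has_work`, and this `needs_sync` function are hypothetical names for illustration only; just the `perimeter_elems` field name mirrors the snippet above.

```julia
# Hypothetical stand-in type; only the field name mirrors the reviewed snippet.
struct MockDSSBuffer
    perimeter_elems::Vector{Int}
end

# True only for a real buffer that has perimeter elements to communicate.
has_work(b) = (b isa MockDSSBuffer) && !isempty(b.perimeter_elems)

# Explicit grouping: sync if the first buffer, or any other buffer, has work.
function needs_sync(first_buffer, other_buffers)
    return has_work(first_buffer) || any(has_work(b) for b in other_buffers)
end

# `&&` binds tighter than `||`, so these two groupings differ:
implicit = (false && true || true)    # parsed as (false && true) || true
regrouped = (false && (true || true)) # a different, explicit grouping

println(implicit, " ", regrouped)
println(needs_sync(nothing, [MockDSSBuffer(Int[]), MockDSSBuffer([7])]))
```

Factoring the per-buffer check into a helper removes the mixed `&&`/`||` chain entirely, so the result no longer depends on the reader remembering operator precedence.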
imreddyTeja left a comment
looks good to me, other than the one spelling comment left by copilot. Thanks!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Can you think of any other reasons these syncs might have been necessary that would not be covered by the tests?
I can't, but I am not very familiar with the DSS code.
cuda_synchronize(device; blocking = true)
needs_sync =
    dss_buffer1 isa Topologies.DSSBuffer &&
    !isempty(dss_buffer1.perimeter_elems) ||
When is the perimeter_elems array empty? Isn't there always a boundary between elements when we have a DSSBuffer, or do we allocate empty buffers even when they aren't needed? I'm curious which examples showed a speedup when profiling.
Resolves #2498
This speeds up `nsys` profiling on `clima` by about 10% but doesn't affect SYPD.