Skip to content

Only call cuda_synchronize when it's truly necessary#2500

Open
petebachant wants to merge 8 commits into
mainfrom
pb/dss-sync
Open

Only call cuda_synchronize when it's truly necessary#2500
petebachant wants to merge 8 commits into
mainfrom
pb/dss-sync

Conversation

@petebachant
Copy link
Copy Markdown
Member

Resolves #2498

This speeds up nsys profiling on clima by about ~10% but doesn't affect SYPD.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces unnecessary GPU synchronizations before MPI/graph-context communication, improving profiling performance (per #2498) by avoiding cuda_synchronize when no inter-process communication work is required.

Changes:

  • Skip cuda_synchronize(...; blocking=true) in weighted_dss_start! when there are no perimeter elements to communicate.
  • Skip cuda_synchronize in distributed remapping collection when running with a single process.
  • In multi-field Spaces.weighted_dss!, synchronize only when at least one involved DSSBuffer has non-empty perimeter_elems.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/Spaces/dss.jl Avoids syncing when DSSBuffer.perimeter_elems is empty; minor docstring whitespace adjustments.
src/Remapping/distributed_remapping.jl Avoids syncing before MPI reduction unless nprocs > 1.
src/Fields/Fields.jl Adds a needs_sync guard to avoid syncing unless at least one buffer has perimeter communication work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/Spaces/dss.jl Outdated
Comment thread src/Fields/Fields.jl
Comment on lines +467 to +473
needs_sync =
dss_buffer1 isa Topologies.DSSBuffer &&
!isempty(dss_buffer1.perimeter_elems) ||
any(
b isa Topologies.DSSBuffer && !isempty(b.perimeter_elems)
for (_, b) in field_buffer_pairs
)
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs_sync relies on &&/|| precedence across line breaks; adding explicit parentheses around the boolean expression would make the intended logic unambiguous and easier to maintain.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@imreddyTeja imreddyTeja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, other than the one spelling comment left by copilot. Thanks!

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@petebachant
Copy link
Copy Markdown
Member Author

looks good to me, other than the one spelling comment left by copilot. Thanks!

Can you think of any other reasons these syncs might have been necessary that would not be covered by the tests?

@imreddyTeja
Copy link
Copy Markdown
Member

looks good to me, other than the one spelling comment left by copilot. Thanks!

Can you think of any other reasons these syncs might have been necessary that would not be covered by the tests?

I can't, but I am not very familiar with the DSS code.

Comment thread src/Fields/Fields.jl
cuda_synchronize(device; blocking = true)
needs_sync =
dss_buffer1 isa Topologies.DSSBuffer &&
!isempty(dss_buffer1.perimeter_elems) ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is the perimeter_elems array empty? Isn't there always a boundary between elements when we have a DSSBuffer, or do we allocate empty buffers even when they aren't needed? I'm curious which examples showed a speedup when profiling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't synchronize GPUs in dss! if only running on one

4 participants