Reduce lock contention in spilling

As discussed in https://github.com/rapidsai/rapidsmpf/issues/657#issuecomment-3549102684, spill threads can spend a significant proportion of their time waiting for a lock while some other thread does its spilling:

<img width="981" height="596" alt="Image" src="https://github.com/user-attachments/assets/72b597bf-7ea0-405b-acd1-14c6dc790fce" />

At the moment, that lock protects both

- spilling: https://github.com/rapidsai/rapidsmpf/blob/1df992ad5be382eb2580b73bbdf32a1134876d77/cpp/src/shuffler/shuffler.cpp#L818-L819
- extraction: https://github.com/rapidsai/rapidsmpf/blob/1df992ad5be382eb2580b73bbdf32a1134876d77/cpp/src/shuffler/shuffler.cpp#L771

As discussed in https://github.com/rapidsai/rapidsmpf/issues/657, the actual act of spilling buffers (allocating host memory, doing the device-to-host transfer) can take a substantial amount of time. *If* the lock is only there to prevent multiple threads from spilling the *same* buffer, we might be able to refactor the code to split `postbox_spilling` into two distinct phases:

1. A phase to determine which set of buffers to spill, in order to reach some target amount of bytes spilled
2. A phase to actually spill that set of buffers identified in phase one.

That kind of spilling should only require a lock for phase 1.
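The two phases above can be sketched as follows. This is a minimal illustration, not the rapidsmpf implementation: `Buffer`, `TwoPhasePostbox`, `select_for_spilling`, and `spill` are hypothetical names, and the device-to-host copy is replaced by flipping a flag. It assumes that removing a buffer from the postbox under the lock is enough to stop a second thread from spilling it.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffer held in the postbox.
struct Buffer {
    std::size_t size;
    bool on_device;
};

// Hypothetical stand-in for the ready postbox (not the rapidsmpf class).
class TwoPhasePostbox {
  public:
    void insert(Buffer buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_.push_back(buf);
    }

    // Phase 1 (under the lock): remove device buffers from the postbox
    // until at least `target` bytes have been selected.
    std::vector<Buffer> select_for_spilling(std::size_t target) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<Buffer> selected;
        std::vector<Buffer> kept;
        std::size_t total = 0;
        for (auto const& buf : buffers_) {
            if (buf.on_device && total < target) {
                total += buf.size;
                selected.push_back(buf);
            } else {
                kept.push_back(buf);
            }
        }
        buffers_ = std::move(kept);
        return selected;
    }

    // Phase 2 (no lock during the copy): spill the selected buffers,
    // then take the lock briefly to put them back.
    std::size_t spill(std::size_t target) {
        std::vector<Buffer> selected = select_for_spilling(target);
        std::size_t spilled = 0;
        for (auto& buf : selected) {
            assert(buf.on_device);
            // The slow host allocation and device-to-host copy would go here.
            buf.on_device = false;
            spilled += buf.size;
        }
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto const& buf : selected) {
            buffers_.push_back(buf);
        }
        return spilled;
    }

    // Total bytes still resident on the device.
    std::size_t device_bytes() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t total = 0;
        for (auto const& buf : buffers_) {
            if (buf.on_device) total += buf.size;
        }
        return total;
    }

  private:
    std::mutex mutex_;
    std::vector<Buffer> buffers_;
};
```

In this sketch the lock is held only while choosing buffers and while reinserting them, so other spill threads can pick different buffers in the meantime. The cost is that a buffer is absent from the postbox while it is being copied, which is a problem for anything else that expects to find it there.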

However, because the same lock is used for extraction, things might be harder. Threads doing an `extract` (with the lock) might rely on some invariant like a buffer either being in device memory or host memory, but not in the process of being spilled. We might need to introduce a new "spilling" state, but I'm hazy on the details at this point.
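One way to picture the "spilling" state is a small per-buffer state machine. The sketch below is an assumption about how it could look, not the existing code: `BufferState`, `Entry`, `StatefulPostbox`, `begin_spill`, `finish_spill`, and `try_extract` are all hypothetical names, and buffers are keyed by a plain integer id for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <unordered_map>

// Hypothetical buffer states; Spilling marks a copy that is in flight.
enum class BufferState { Device, Spilling, Host };

struct Entry {
    std::size_t size;
    BufferState state;
};

// Hypothetical postbox that tracks a state per buffer id.
class StatefulPostbox {
  public:
    void insert(int id, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[id] = Entry{size, BufferState::Device};
    }

    // Claim a buffer for spilling. Only Device -> Spilling is allowed,
    // so a second spill thread cannot claim the same buffer.
    bool begin_spill(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(id);
        if (it == entries_.end() || it->second.state != BufferState::Device) {
            return false;
        }
        it->second.state = BufferState::Spilling;
        return true;
    }

    // Called after the (unlocked) copy completes: Spilling -> Host.
    void finish_spill(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(id);
        assert(it != entries_.end() && it->second.state == BufferState::Spilling);
        it->second.state = BufferState::Host;
    }

    // Extraction refuses buffers that are mid-spill, so callers only ever
    // observe a buffer that is fully on the device or fully on the host.
    std::optional<Entry> try_extract(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(id);
        if (it == entries_.end() || it->second.state == BufferState::Spilling) {
            return std::nullopt;
        }
        Entry entry = it->second;
        entries_.erase(it);
        return entry;
    }

  private:
    std::mutex mutex_;
    std::unordered_map<int, Entry> entries_;
};
```

Here `try_extract` simply fails for a buffer that is mid-spill. A blocking variant could instead wait on a `std::condition_variable` that `finish_spill` notifies; which behavior is right depends on what `extract` callers can tolerate.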

This is related to the broader themes around spilling performance discussed in https://github.com/rapidsai/rapidsmpf/issues/657. As spilling itself gets faster, lock contention ought to go down even with today's implementation, since each spill thread will spend less time spilling and therefore less time holding the lock.

cc @nirandaperera.

Reduce lock contention in spilling #674

	std::lock_guard<std::mutex> lock(ready_postbox_spilling_mutex_);
	spilled = postbox_spilling(br_, comm_->logger(), ready_postbox_, spill_need);