
Conversation

@davebayer
Contributor

Previously, we had a special overload for cases when the user passed cuda::std::reference_wrapper as the callable without any arguments.

This PR removes this overload and handles that case inside the generic implementation. In addition, functions that return void and take no arguments are now launched using cuLaunchHostFunc, which doesn't require a memory allocation.
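
A minimal sketch of why this path can avoid an allocation, assuming only that cuLaunchHostFunc accepts a `void (*)(void*)` function together with an opaque user-data pointer. The names `mock_launch_host_func`, `invoke_through_pointer`, and `launch_by_reference` are hypothetical, the mock runs the function immediately instead of in stream order, and std::reference_wrapper stands in for cuda::std::reference_wrapper. It illustrates the idea only and is not the CCCL implementation.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the driver call: like cuLaunchHostFunc, it takes a
// plain function pointer plus an opaque void*. A real stream would run the
// function later, in stream order; this mock simply calls it right away.
using host_fn_t = void (*)(void*);

inline void mock_launch_host_func(host_fn_t fn, void* user_data)
{
  fn(user_data);
}

// Trampoline: recovers the callable's type from the opaque pointer and invokes it.
template <class Callable>
void invoke_through_pointer(void* user_data)
{
  (*static_cast<Callable*>(user_data))();
}

// No-allocation path: the caller keeps the callable alive, so its address can
// be passed as the user data directly and nothing has to be copied to the heap.
template <class Callable>
void launch_by_reference(std::reference_wrapper<Callable> callable)
{
  mock_launch_host_func(&invoke_through_pointer<Callable>,
                        static_cast<void*>(std::addressof(callable.get())));
}

// Small stateful callable used to observe the calls.
struct counter
{
  int calls = 0;
  void operator()() { ++calls; }
};

// Usage: launches the same callable twice by reference and reports the count.
inline int run_counter_twice()
{
  counter c;
  launch_by_reference(std::ref(c));
  launch_by_reference(std::ref(c));
  return c.calls;
}
```

Because only the address travels through the user-data pointer, the referenced callable has to outlive the launched work; that lifetime requirement is the trade-off for skipping the allocation.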

@davebayer davebayer requested a review from a team as a code owner November 19, 2025 14:51
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 19, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 19, 2025
@davebayer davebayer self-assigned this Nov 19, 2025
@davebayer davebayer force-pushed the remove_host_launch_ref_wrapper branch from c313a52 to 8b64eb2 Compare November 19, 2025 16:06
@davebayer davebayer requested a review from pciolkosz November 19, 2025 18:35
//! @param __stream Stream to launch the host function on
//! @param __callable A reference to a host function or callable object to call in stream order
template <class _Callable>
_CCCL_HOST_API void host_launch(stream_ref __stream, ::cuda::std::reference_wrapper<_Callable> __callable)
Contributor

I would prefer to keep the separate overload; it's easier to document this mode. Otherwise, with one overload you need to describe the set of conditions that avoid the allocation, whereas here they are expressed in code.

Contributor Author

I don't agree; I actually think it makes everything much easier. We would have a single function. We can simply document the behaviour without talking about memory allocations, pointing out the option to use cuda::std::reference_wrapper when the user wants to pass a reference to a callable or an argument.

Then, we usually have a Performance Considerations section, where we would describe that if no parameters are passed to the function and the function is either a free function or a cuda::std::reference_wrapper, we use cuLaunchHostFunc without memory allocations, and cuStreamAddCallback otherwise.

I think this makes everything much cleaner.
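
The single-function dispatch described in this comment could look roughly like the sketch below. It is written in plain C++17 with std::reference_wrapper standing in for cuda::std::reference_wrapper; `launch_path`, `select_launch_path`, and the helper traits are hypothetical names, and the function only reports which path would be taken instead of calling the driver. It also ignores the return type, which the PR description additionally constrains to void.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Which driver entry point a launch would use (labels are illustrative only).
enum class launch_path
{
  no_allocation,       // would map to cuLaunchHostFunc
  allocating_callback  // would map to cuStreamAddCallback
};

template <class T>
struct is_reference_wrapper : std::false_type {};
template <class T>
struct is_reference_wrapper<std::reference_wrapper<T>> : std::true_type {};

// True for pointers to free functions such as void (*)().
template <class T>
inline constexpr bool is_function_pointer_v =
  std::is_pointer_v<T> && std::is_function_v<std::remove_pointer_t<T>>;

// Parameters are taken by value so that a free function decays to a pointer.
template <class Callable, class... Args>
launch_path select_launch_path(Callable, Args...)
{
  constexpr bool has_args = sizeof...(Args) != 0;
  if constexpr (!has_args && (is_function_pointer_v<Callable> || is_reference_wrapper<Callable>::value))
  {
    return launch_path::no_allocation;
  }
  else
  {
    return launch_path::allocating_callback;
  }
}

inline void free_function() {}

struct functor
{
  void operator()() const {}
};

// Usage: checks the four cases named in the comment above.
inline bool dispatch_matches_description()
{
  functor f;
  return select_launch_path(free_function) == launch_path::no_allocation
      && select_launch_path(std::ref(f)) == launch_path::no_allocation
      && select_launch_path(f) == launch_path::allocating_callback
      && select_launch_path(std::ref(f), 42) == launch_path::allocating_callback;
}
```

With this shape the conditions live in one `if constexpr`, so the documentation has to spell them out in prose, which is the trade-off being debated in this thread.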

Contributor

People usually just glance at the documentation. They will immediately notice two overloads, whereas only a very small subset will read the Performance Considerations section.
I actually think the overload you are removing is more important than the other one and should be used more often; that's why I want it to be as visible as possible.
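
For comparison, the two-overload layout argued for here can be sketched as follows, again in plain C++17 with std::reference_wrapper standing in for cuda::std::reference_wrapper. `host_launch_sketch` and `chosen_overload` are hypothetical names, the stream parameter is omitted, and each overload only reports that it was selected. The dedicated overload takes the wrapper by value, as in the signature quoted earlier in this review.

```cpp
#include <cassert>
#include <functional>

enum class chosen_overload
{
  reference_wrapper,
  generic
};

// Generic overload: accepts any callable together with any arguments.
template <class Callable, class... Args>
chosen_overload host_launch_sketch(Callable&&, Args&&...)
{
  return chosen_overload::generic;
}

// Dedicated overload: a reference_wrapper callable with no arguments.
// It is more specialized than the generic template, so overload resolution
// prefers it whenever both are viable.
template <class Callable>
chosen_overload host_launch_sketch(std::reference_wrapper<Callable>)
{
  return chosen_overload::reference_wrapper;
}

struct noop_callable
{
  void operator()() const {}
};

// Usage: shows which overload each call shape selects.
inline bool overloads_resolve_as_expected()
{
  noop_callable f;
  return host_launch_sketch(std::ref(f)) == chosen_overload::reference_wrapper
      && host_launch_sketch(f) == chosen_overload::generic
      && host_launch_sketch(std::ref(f), 1) == chosen_overload::generic;
}
```

Here the no-allocation condition is visible in the signature itself, at the cost of a second entry in the overload set.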

// We use the callback here to have it execute even on stream error, because it needs to free the above allocation
::cuda::__driver::__streamAddCallback(__stream.get(), __stream_callback_launcher<_CallbackData>, __callback_data_ptr);
}
if constexpr (!__has_args && ::cuda::std::is_function_v<_Callable> && ::cuda::std::is_pointer_v<_Callable>)
Contributor

I like this

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 34m: Pass: 100%/90 | Total: 12h 57m | Max: 53m 25s | Hits: 99%/213937
