
[FEA]: Implement CUDA backend for parallel cuda::std::for_each #6295

@jrhemstad

Description


MVP

  • Implement cuda::std::for_each() with the following call forms (a usage sketch follows this list):
    • cuda::std::for_each( *no execution policy* )
      • Serial implementation; works in both host and device code
    • cuda::std::for_each(cuda::std::execution::seq, ...)
      • Equivalent to the above
      • Serial implementation; works in both host and device code
    • cuda::std::for_each(cuda::__cub_unseq_par, ...)
      • Synchronous. Runs in parallel on the GPU via cub::DeviceForEach on the default stream
      • Because this is synchronous, it will only work in host code.
        • cuda::std::for_each(cuda::__cub_unseq_par, ...) should ideally fail to compile when used in device code
      • cuda::__cub_unseq_par is an internal-only execution policy, chosen to avoid bikeshedding over what cuda::std::execution::par_unseq should mean (CPU vs. GPU)
      • Longer term, we would like to be able to pass environments as execution policy arguments, allowing parallelism and execution place to be specified separately, since they are technically orthogonal
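
A minimal usage sketch of the three call forms. The header <cuda/std/algorithm> and the spellings cuda::std::execution::seq and cuda::__cub_unseq_par are the proposal from this issue, not shipping API:

```cpp
#include <cuda/std/algorithm>

#include <thrust/device_vector.h>

struct twice
{
  __host__ __device__ void operator()(int& x) const { x *= 2; }
};

__global__ void device_side(int* first, int* last)
{
  // The serial forms are usable from device code.
  cuda::std::for_each(first, last, twice{});
  cuda::std::for_each(cuda::std::execution::seq, first, last, twice{});
}

int main()
{
  thrust::device_vector<int> v(1 << 20, 1);
  int* first = thrust::raw_pointer_cast(v.data());

  // Proposed parallel form: synchronous, host-only; runs on the GPU via
  // cub::DeviceForEach on the default stream. In device code this call
  // should ideally fail to compile.
  cuda::std::for_each(cuda::__cub_unseq_par, first, first + v.size(), twice{});
}
```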

Benchmarks

  • Simple benchmarks for __half, int, and double, compared against the existing cub::DeviceForEach benchmarks to ensure performance parity. Since cub::DeviceForEach is already extensively benchmarked, we are not trying to recreate all of those benchmarks for cuda::std::for_each.
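
For illustration, a minimal nvbench sketch of such a benchmark, under the same API assumptions as above (cuda::__cub_unseq_par and the header are proposed, not shipping; the element-count axis is a placeholder):

```cpp
#include <cuda/std/algorithm>

#include <thrust/device_vector.h>

#include <cuda_fp16.h>

#include <nvbench/nvbench.cuh>

template <typename T>
struct overwrite
{
  // Trivial memory-bound operation so the benchmark measures for_each
  // dispatch and memory traffic rather than the operator itself.
  __device__ void operator()(T& x) const { x = T{}; }
};

template <typename T>
void for_each_bench(nvbench::state& state, nvbench::type_list<T>)
{
  const auto n = state.get_int64("Elements");
  thrust::device_vector<T> data(n);
  T* first = thrust::raw_pointer_cast(data.data());

  state.add_element_count(n);
  state.add_global_memory_writes<T>(n);

  // The proposed overload is synchronous, so use nvbench's sync exec tag.
  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch&) {
    cuda::std::for_each(cuda::__cub_unseq_par, first, first + n, overwrite<T>{});
  });
}

using value_types = nvbench::type_list<__half, int, double>;
NVBENCH_BENCH_TYPES(for_each_bench, NVBENCH_TYPE_AXES(value_types))
  .set_name("cuda_std_for_each")
  .add_int64_power_of_two_axis("Elements", nvbench::range(16, 28, 4));
```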

Testing

  • Simple lit-style functional tests
  • Simple Catch2 functional tests (a sketch follows)
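
A sketch of what a Catch2 functional test could look like, again assuming the proposed cuda::__cub_unseq_par overload and header:

```cpp
#include <cuda/std/algorithm>

#include <thrust/count.h>
#include <thrust/device_vector.h>

#include <catch2/catch_test_macros.hpp>

struct add_one
{
  __device__ void operator()(int& x) const { x += 1; }
};

TEST_CASE("for_each with cuda::__cub_unseq_par visits every element once", "[for_each]")
{
  constexpr int n = 1 << 16;
  thrust::device_vector<int> data(n, 41);
  int* first = thrust::raw_pointer_cast(data.data());

  // The proposed overload is synchronous, so results are ready on return.
  cuda::std::for_each(cuda::__cub_unseq_par, first, first + n, add_one{});

  CHECK(thrust::count(data.begin(), data.end(), 42) == n);
}
```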

Non-goals

  • Asynchrony/streams
  • We will postpone benchmarks until we have support for passing memory resources. The benchmark facilities currently rely on caching allocators to keep the memory subsystem from interfering with the measurements; without memory resource passing, that approach fails and would significantly skew the results.
