
[FEA]: Implement CUDA backend for parallel cuda::std::for_each #6295

@jrhemstad

Description


MVP

  • Implement cuda::std::for_each() with the following call forms (a usage sketch follows this list):
    • cuda::std::for_each( *no execution policy* )
      • Serial implementation; works in both host and device code
    • cuda::std::for_each(cuda::std::execution::seq, ...)
      • Equivalent to the above
      • Serial implementation; works in both host and device code
    • cuda::std::for_each(cuda::__cub_unseq_par, ...)
      • Synchronous. Runs in parallel on the GPU via cub::DeviceForEach on the default stream
      • Because this is synchronous, it will only work in host code.
        • cuda::std::for_each(cuda::__cub_unseq_par, ...) should ideally fail to compile when used in device code
      • cuda::__cub_unseq_par is an internal-only execution policy, chosen to avoid bikeshedding over what cuda::std::execution::par_unseq should mean (CPU vs. GPU)
      • Longer term, we would like to be able to pass environments as execution policy arguments, allowing parallelism and execution place to be specified separately, since they are technically orthogonal
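
A minimal usage sketch of the three call forms. The header <cuda/std/algorithm> and the spellings cuda::std::execution::seq and cuda::__cub_unseq_par are the proposal from this issue, not shipping API:

```cpp
#include <cuda/std/algorithm>

#include <thrust/device_vector.h>

struct twice
{
  __host__ __device__ void operator()(int& x) const { x *= 2; }
};

__global__ void device_side(int* first, int* last)
{
  // The serial forms are usable from device code.
  cuda::std::for_each(first, last, twice{});
  cuda::std::for_each(cuda::std::execution::seq, first, last, twice{});
}

int main()
{
  thrust::device_vector<int> v(1 << 20, 1);
  int* first = thrust::raw_pointer_cast(v.data());

  // Proposed parallel form: synchronous, host-only; runs on the GPU via
  // cub::DeviceForEach on the default stream. In device code this call
  // should ideally fail to compile.
  cuda::std::for_each(cuda::__cub_unseq_par, first, first + v.size(), twice{});
}
```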

Benchmarks

  • Simple benchmarks for __half, int, and double, compared against the existing cub::DeviceForEach benchmarks to ensure performance parity. Since cub::DeviceForEach is already extensively benchmarked, we are not trying to recreate all of those benchmarks for cuda::std::for_each.
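
For illustration, a minimal nvbench sketch of such a benchmark, under the same API assumptions as above (cuda::__cub_unseq_par and the header are proposed, not shipping; the element-count axis is a placeholder):

```cpp
#include <cuda/std/algorithm>

#include <thrust/device_vector.h>

#include <cuda_fp16.h>

#include <nvbench/nvbench.cuh>

template <typename T>
struct overwrite
{
  // Trivial memory-bound operation so the benchmark measures for_each
  // dispatch and memory traffic rather than the operator itself.
  __device__ void operator()(T& x) const { x = T{}; }
};

template <typename T>
void for_each_bench(nvbench::state& state, nvbench::type_list<T>)
{
  const auto n = state.get_int64("Elements");
  thrust::device_vector<T> data(n);
  T* first = thrust::raw_pointer_cast(data.data());

  state.add_element_count(n);
  state.add_global_memory_writes<T>(n);

  // The proposed overload is synchronous, so use nvbench's sync exec tag.
  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch&) {
    cuda::std::for_each(cuda::__cub_unseq_par, first, first + n, overwrite<T>{});
  });
}

using value_types = nvbench::type_list<__half, int, double>;
NVBENCH_BENCH_TYPES(for_each_bench, NVBENCH_TYPE_AXES(value_types))
  .set_name("cuda_std_for_each")
  .add_int64_power_of_two_axis("Elements", nvbench::range(16, 28, 4));
```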

Testing

  • Simple lit-style functional tests
  • Simple Catch2 functional tests (a sketch follows)
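
A sketch of what a Catch2 functional test could look like, again assuming the proposed cuda::__cub_unseq_par overload and header:

```cpp
#include <cuda/std/algorithm>

#include <thrust/count.h>
#include <thrust/device_vector.h>

#include <catch2/catch_test_macros.hpp>

struct add_one
{
  __device__ void operator()(int& x) const { x += 1; }
};

TEST_CASE("for_each with cuda::__cub_unseq_par visits every element once", "[for_each]")
{
  constexpr int n = 1 << 16;
  thrust::device_vector<int> data(n, 41);
  int* first = thrust::raw_pointer_cast(data.data());

  // The proposed overload is synchronous, so results are ready on return.
  cuda::std::for_each(cuda::__cub_unseq_par, first, first + n, add_one{});

  CHECK(thrust::count(data.begin(), data.end(), 42) == n);
}
```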

Non-goals

  • Asynchrony/streams
  • We will postpone benchmarks until we have support for passing memory resources. The benchmark facilities currently rely on caching allocators to keep the memory subsystem from interfering with the measurements; without memory resource passing, that approach fails and would significantly skew the results.
