Skip to content

[ENHANCEMENT]: Implement cooperative device bulk operations #674

@sleeepyjack

Description

@sleeepyjack

Pitch

Device bulk operations such as

template <class CG, class InputIt>
__device__ void insert(CG const& group, InputIt first, InputIt last)

, i.e., assigning a cooperative group to process N input elements cooperatively, allow for adaptive and optimized work distribution.

Background

Our current device CG APIs look as follows:

template <class CG, class Key>
__device__ void insert(CG const& group, Key key)

, i.e., they take a single input datum and cooperatively insert it into the container. Each thread in the CG is expected to get the same key as input.

The problem with this approach is that it leads to redundant computation:
Each thread in the CG needs to compute the hash value of the (exact same) input key, before proceeding to the probing/insertion part. This imposes unnecessary pressure on the compute pipelines, which reduces overall throughput mainly in cases where the kernel is compute bound (which likely happens if the data structure fits (partially) into the L2$). However, experiments show that even for large data structures (>2GB) the performance improvement is still visible (>5%).

The bulk device operation eliminates this drawback as follows:
Each thread loads one input key at a time, resulting in a coalesced memory acces pattern, and then computes its respective hash value. From here we go into a cooperative loop, iterating over all ranks in the group. At each iteration, we shuffle the computed hash value from the target rank to all threads in the group (broadcast). From here we can proceed with the standard cooperative insertion/probing approach, as each thread holds the same hash value.

Another benefit is in the API:
The user may give us any CG size as input, which we can then adaptively subdivide into multiple worker groups that insert the given range of input keys as efficient as possible. In that sense, this API becomes adaptive in relation to the given input CG size.

One example of this approach is given in #672

Metadata

Metadata

Assignees

No one assigned

    Labels

    type: improvementImprovement / enhancement to an existing function

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions