Pitch
Device bulk operations such as

template <class CG, class InputIt>
__device__ void insert(CG const& group, InputIt first, InputIt last)

assign a cooperative group to process N input elements cooperatively and thus allow for adaptive and optimized work distribution.
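A minimal caller-side sketch of what this could look like, assuming a hypothetical `map_ref` device view and a hypothetical `bulk_insert_kernel` (neither is the actual cuco API): each thread block hands its whole block group plus a contiguous slice of the input to the proposed bulk insert and lets the container distribute the work internally.

```cpp
#include <cooperative_groups.h>
#include <cstddef>

namespace cg = cooperative_groups;

template <class MapRef, class InputIt>
__global__ void bulk_insert_kernel(MapRef map_ref, InputIt first, std::size_t n)
{
  auto const block = cg::this_thread_block();

  // Contiguous per-block slice of the input range
  auto const per_block = (n + gridDim.x - 1) / gridDim.x;
  auto const begin     = static_cast<std::size_t>(blockIdx.x) * per_block;
  auto const end       = begin + per_block < n ? begin + per_block : n;

  // Hand the whole block group and its slice to the proposed bulk device API
  if (begin < n) { map_ref.insert(block, first + begin, first + end); }
}
```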
Background
Our current device CG APIs look as follows:

template <class CG, class Key>
__device__ void insert(CG const& group, Key key)

That is, they take a single input datum and cooperatively insert it into the container. Each thread in the CG is expected to receive the same key as input.
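For illustration, a sketch of how the current per-key API is typically driven (the kernel and `map_ref` names are assumptions, not the actual cuco code): every thread of a tile passes the identical key, so each thread redundantly hashes it before the cooperative probe.

```cpp
#include <cooperative_groups.h>
#include <cstddef>

namespace cg = cooperative_groups;

template <int TileSize, class MapRef, class InputIt>
__global__ void insert_kernel(MapRef map_ref, InputIt first, std::size_t n)
{
  auto const tile    = cg::tiled_partition<TileSize>(cg::this_thread_block());
  auto const tile_id = (blockIdx.x * blockDim.x + threadIdx.x) / TileSize;  // one key per tile
  auto const stride  = (gridDim.x * blockDim.x) / TileSize;

  // Grid-stride loop over the input: all TileSize threads of a tile receive
  // the same key and call the cooperative single-key insert with it.
  for (std::size_t i = tile_id; i < n; i += stride) {
    map_ref.insert(tile, *(first + i));
  }
}
```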
The problem with this approach is that it leads to redundant computation:
Each thread in the CG computes the hash value of the (exact same) input key before proceeding to the probing/insertion phase. This puts unnecessary pressure on the compute pipelines and reduces overall throughput, mainly when the kernel is compute bound (which is likely if the data structure fits, at least partially, into the L2 cache). However, experiments show that even for large data structures (>2 GB) eliminating this redundancy still yields a visible performance improvement (>5%).
The bulk device operation eliminates this drawback as follows:
Each thread loads one input key at a time, resulting in a coalesced memory access pattern, and then computes the corresponding hash value. The group then enters a cooperative loop over all ranks in the group. At each iteration, the precomputed hash value of the current rank is shuffled (broadcast) to all threads in the group. From there, the standard cooperative probing/insertion approach can proceed, as each thread now holds the same hash value.
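A sketch of this broadcast loop, written as a free function for illustration; `Hash` and `ProbeInsert` are placeholders for the container's hasher and its existing cooperative single-hash probing routine, and the tail handling assumes `Key` is default constructible. This is not the actual implementation, only the technique described above.

```cpp
#include <cstddef>
#include <iterator>

template <class CG, class InputIt, class Hash, class ProbeInsert>
__device__ void bulk_insert(CG const& group,
                            InputIt first,
                            InputIt last,
                            Hash hash,
                            ProbeInsert probe_and_insert)
{
  using Key = typename std::iterator_traits<InputIt>::value_type;

  auto const rank = group.thread_rank();
  auto const size = group.size();
  auto const n    = static_cast<std::size_t>(last - first);

  // Each pass handles `size` consecutive elements: one per thread.
  for (std::size_t base = 0; base < n; base += size) {
    auto const idx     = base + rank;
    bool const has_key = idx < n;

    // Coalesced load: thread i reads element base + i, then hashes it locally.
    auto const key        = has_key ? *(first + idx) : Key{};
    auto const hash_value = has_key ? hash(key) : std::size_t{0};

    // Cooperative phase: visit each rank's element in turn.
    for (unsigned r = 0; r < size; ++r) {
      if (!group.shfl(has_key, r)) { continue; }        // rank r had no element
      auto const cur_key  = group.shfl(key, r);         // broadcast key from rank r
      auto const cur_hash = group.shfl(hash_value, r);  // broadcast precomputed hash
      probe_and_insert(group, cur_key, cur_hash);       // standard cooperative insert
    }
  }
}
```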
Another benefit is in the API:
The user may pass a CG of any size as input, which we can then adaptively subdivide into multiple worker groups that insert the given range of input keys as efficiently as possible. In that sense, the API becomes adaptive with respect to the given input CG size.
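A hedged sketch of this subdivision idea, assuming a compile-time `WorkerSize` and a hypothetical `MapRef`/`bulk_insert` (how the worker size would actually be chosen is left out): the caller's block-sized group is partitioned into fixed-size worker tiles, and the input range is split evenly across those workers, each of which then runs the cooperative bulk insert on its own slice.

```cpp
#include <cooperative_groups.h>
#include <cstddef>

template <int WorkerSize, class MapRef, class InputIt>
__device__ void bulk_insert(cooperative_groups::thread_block const& group,
                            MapRef map_ref,
                            InputIt first,
                            InputIt last)
{
  namespace cg = cooperative_groups;
  auto const worker = cg::tiled_partition<WorkerSize>(group);

  // Split the caller's range evenly across the worker tiles of this group.
  auto const n           = static_cast<std::size_t>(last - first);
  auto const num_workers = worker.meta_group_size();
  auto const per_worker  = (n + num_workers - 1) / num_workers;
  auto const begin       = worker.meta_group_rank() * per_worker;
  auto const end         = begin + per_worker < n ? begin + per_worker : n;

  // Each worker tile cooperatively inserts its own slice of the input.
  if (begin < n) { map_ref.insert(worker, first + begin, first + end); }
}
```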
One example of this approach is given in #672.