Open
Labels: enhancement (New feature or request)
Description
This RFC proposes the following plan for optimizing the dynamic inference step function.
There are several interconnected issues at play:
- The dynamic sampling code is currently very unoptimized. There is a draft PR that reimplements it.
- `async_generate_output_tokens_dynamic_batch` mixes CPU and GPU operations indiscriminately.
- `async_generate_output_tokens_dynamic_batch` may be declared `async`, but it has no good way of yielding the event loop. A lot of CPU time is wasted waiting for the GPU and can be reclaimed.
The ideal solution appears to be:
- Fix dynamic sampling code.
- Clearly separate CPU and GPU operations.
- Provide a place to yield the event loop.
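The shape of that solution can be sketched in pure Python (a minimal illustration, not the actual engine code: `fake_gpu_kernel` and the `threading.Event` stand in for launched CUDA kernels and `torch.cuda.Event.query()` polling):

```python
import asyncio
import threading
import time

def fake_gpu_kernel(done_event):
    # Stand-in for asynchronously launched GPU work (hypothetical).
    time.sleep(0.01)  # pretend the device is busy
    done_event.set()

async def step():
    # 1) Launch "GPU" work without blocking the host.
    done = threading.Event()
    threading.Thread(target=fake_gpu_kernel, args=(done,)).start()

    # 2) Do the CPU-side bookkeeping while the device runs.
    cpu_result = sum(range(1000))

    # 3) Yield the event loop until the device finishes
    #    (real code would poll torch.cuda.Event.query() instead).
    while not done.is_set():
        await asyncio.sleep(0)

    return cpu_result

result = asyncio.run(step())
print(result)  # 499500
```

The key property is that other coroutines can make progress during step 3, instead of the host spinning uselessly while it waits for the GPU.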
The PR series suggested by this RFC is:
- Break `async_generate_output_tokens_dynamic_batch` apart into multiple sub-methods, which are clearly labeled as "CPU compute" vs "GPU compute".
  - Achieved by Clean up dynamic inference step #1992.
- Implement barebones unoptimized dynamic sampling code.
- Tensorize the dynamic sampling bookkeeping.
  - Achieved by Tensorize dynamic inference mixed sampling #2105.
- Reorganize the step function to allow for an async step.
  - Achieved by Break apart dynamic inference step into 2 methods #2192. No new functionality, just moving code around.
- Reorder the sub-methods from point 1) so that CPU and GPU compute form separate contiguous blocks of code, and yield the event loop after the CPU compute via torch polling.
  - Due to all the prep work, this will be a tiny, extremely readable PR.
  - Achieved by Make the dynamic engine step async #2193.
- Optimize dynamic sampling code via graphed FlashInfer sampling.
- Refactor dynamic logprobs computation to follow the same style as the new sampling code.
  - A draft has been written by @tdene.
- Reconcile with main's implementation of `top_n_logprobs`.
- Wait for a torch update, or brainstorm a way to yield the event loop without polling in the current version of PyTorch.
  - Maybe by sampling on a single rank, instead of the current sampling on every rank?
- Will discuss further in comments.
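For step 2, "barebones unoptimized" sampling might look roughly like the following (an illustrative pure-Python sketch, not the actual implementation; `sample_token` and its parameters are hypothetical names):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, rng=random):
    # Barebones per-request sampling: temperature scaling, optional
    # top-k filtering, then a multinomial draw. Deliberately unoptimized:
    # pure Python, one request at a time.
    if temperature == 0.0:  # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    if top_k > 0:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# One Python-level call per request in the batch -- exactly the kind of
# loop that later tensorization removes.
batch_logits = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]
tokens = [sample_token(l, temperature=0.0) for l in batch_logits]
print(tokens)  # greedy picks: [1, 0]
```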
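The "tensorize the bookkeeping" step (step 3) amounts to grouping requests by sampling configuration so each group can be handled with one batched tensor op instead of a Python-level loop. A minimal sketch of that bucketing, with hypothetical per-request fields:

```python
from collections import defaultdict

# Hypothetical per-request sampling params (temperature == 0 means greedy).
requests = [
    {"id": 0, "temperature": 0.0, "top_k": 0},
    {"id": 1, "temperature": 0.7, "top_k": 50},
    {"id": 2, "temperature": 0.0, "top_k": 0},
    {"id": 3, "temperature": 0.7, "top_k": 50},
]

# Bucket request indices by sampling configuration; each bucket can then
# index into the batched logits tensor and be sampled in one shot.
buckets = defaultdict(list)
for i, req in enumerate(requests):
    buckets[(req["temperature"], req["top_k"])].append(i)

print(dict(buckets))  # {(0.0, 0): [0, 2], (0.7, 50): [1, 3]}
```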
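The single-rank-sampling idea could take roughly this shape (a hypothetical simulation without `torch.distributed`; in real code the hand-off would be a `torch.distributed.broadcast` from rank 0):

```python
import random

WORLD_SIZE = 4  # simulated number of ranks

def sample_on_single_rank(logits, seed=1234):
    # Rank 0 does the sampling once...
    rng = random.Random(seed)
    token = rng.choices(range(len(logits)), weights=[1] * len(logits), k=1)[0]
    # ...and the result is "broadcast" so every rank holds the same token,
    # instead of every rank redundantly running the same sampling kernel.
    return [token for _rank in range(WORLD_SIZE)]

tokens = sample_on_single_rank([0.1, 0.2, 0.3])
assert len(set(tokens)) == 1  # all ranks agree
```

The trade-off is a broadcast per step versus duplicated sampling work (and RNG-state synchronization) on every rank; which wins would need to be measured.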