[docs] Attention backends + continuous batching #42329
Refactors the optimization docs, starting with the attention backend and continuous batching docs (more updates to come in subsequent PRs!).
- Reorganizes the `toctree` to create a clearer inference optimization section (adds the `AttentionInterface`, continuous batching, and kernels guides)
- Adds `ContinuousMixin` and `ContinuousBatchingManager` to the API reference
- Expands the `ContinuousBatchingManager` example to better show the different options for using it
- Adds clearer PagedAttention and sliding window attention sections
- Adds a high-level description of how the mixin, manager, and scheduler work together
- Moves the monitoring section to the `transformers serve` docs - maybe you can review this section @McPatate 🙏
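For reviewers unfamiliar with the pattern these docs describe, here is a toy sketch of how a continuous batching scheduler interleaves requests. All names here (`ToyContinuousBatcher`, `Request`, `step`) are hypothetical illustrations, not the actual `ContinuousBatchingManager` API; the point is only the core idea that requests are admitted and retired every decode step rather than in fixed batches:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    # Hypothetical request: an id plus how many decode steps it still needs.
    rid: int
    steps_left: int
    tokens: list = field(default_factory=list)

class ToyContinuousBatcher:
    """Toy scheduler: each step it tops up the running batch from the
    waiting queue, "decodes" one token per running request, and retires
    finished requests immediately so new ones can take their slots."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque = deque()
        self.running: list = []
        self.finished: list = []

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> None:
        # Continuous batching: admission happens on every step,
        # not only once the whole previous batch has finished.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # Decode one token for every running request.
        for req in self.running:
            req.tokens.append(f"tok{len(req.tokens)}")
            req.steps_left -= 1
        # Retire finished requests right away, freeing their slots.
        self.finished.extend(r for r in self.running if r.steps_left <= 0)
        self.running = [r for r in self.running if r.steps_left > 0]

    def run(self) -> None:
        while self.waiting or self.running:
            self.step()
```

In the real implementation the mixin exposes the entry points, the manager owns the request lifecycle on a background thread, and the scheduler decides which requests run each step; this sketch collapses all three into one class purely for illustration.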