Improve turbomind's prefix cache #3835

lvhan028 · 2025-08-13T09:12:01Z

Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, like docstring or example tutorials.

irexyc · 2025-08-22T07:25:12Z

lmdeploy/cli/chat.py

@@ -80,6 +83,7 @@ def main(model_path, backend, **kwargs):
                try:
                    for resp in resps:
                        print(resp.text, end='', flush=True)
+                        pass


seems unnecessary

irexyc · 2025-08-22T09:30:31Z

src/turbomind/models/llama/LlamaBatch.cc

+    const int output_len = state_->h_context_length[index];
+    auto&     seq        = *state_->sequences[index];

-        // Update token IDs
-        seq.tokens.resize(output_len);
+    // Update token IDs
+    seq.tokens.resize(output_len);

-        // output_ids is updated & synced in `Finish`
-        const auto output_ids = state_->requests[index]->output_ids.data();
-        std::copy_n(output_ids, output_len, seq.tokens.data());
+    // output_ids is updated & synced in `Finish`
+    const auto output_ids = state_->requests[index]->output_ids.data();
+    std::copy_n(output_ids, output_len, seq.tokens.data());
+    // Cache the generated tokens of the sequence
+    if (!force_stop) {
+        sequence_manager_->CacheGeneration(seq);
+    }

-        // Save random state in host memory
-        seq.random_state.resize(sizeof(curandState_t));
-        // This async copy must be synchronized by the caller
-        core::Copy((curandState_t*)state_->curand_state.data() + index, 1, (curandState_t*)seq.random_state.data());
+    // Save random state in host memory
+    seq.random_state.resize(sizeof(curandState_t));
+    // This async copy must be synchronized by the caller
+    core::Copy((curandState_t*)state_->curand_state.data() + index, 1, (curandState_t*)seq.random_state.data());

-        // Set unlock flag for corresponding blocks, will be unlocked in the next `Materialize()`
-        sequence_manager_->UpdateAndSetUnlock(seq);
+    // Set unlock flag for corresponding blocks, will be unlocked in the next `Materialize()`
+    sequence_manager_->UpdateAndSetUnlock(seq);
+
+    if (state_->requests[index]->session.end_flag) {
+        // Sequence is ending this round
+        FT_CHECK(sequence_manager_->Erase(state_->requests[index]->id));


I think we should follow the previous code here. When state_->requests[index]->session.end_flag is true, there is no need to copy the output token IDs or save random_state.
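One way to read the comment above is as an early exit: when the session is ending, erase the sequence and skip the token copy and random-state save. The sketch below illustrates only that control flow. `Sequence`, `Request`, `SequenceManager`, `FinishSequence`, and `kRandomStateBytes` are simplified, hypothetical stand-ins rather than the actual TurboMind types, and the sketch leaves open whether `CacheGeneration` would still need to run before `Erase` when prefix caching is enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the types touched by the diff.
struct Sequence {
    std::vector<int>          tokens;
    std::vector<std::uint8_t> random_state;
};

struct Request {
    std::uint64_t    id;
    bool             end_flag;  // stands in for session.end_flag
    std::vector<int> output_ids;
};

struct SequenceManager {
    std::unordered_map<std::uint64_t, Sequence> sequences;
    int                                         cached = 0;

    bool Erase(std::uint64_t id) { return sequences.erase(id) > 0; }
    void CacheGeneration(const Sequence&) { ++cached; }  // placeholder: only counts calls
};

// Placeholder size; the real code uses sizeof(curandState_t).
constexpr std::size_t kRandomStateBytes = 48;

// Returns true if the sequence state was saved, false if the sequence was erased.
bool FinishSequence(SequenceManager& mgr, const Request& req, int output_len, bool force_stop)
{
    if (req.end_flag) {
        // Session is ending: nothing needs to be kept, so erase and return early.
        const bool erased = mgr.Erase(req.id);
        assert(erased);  // stands in for FT_CHECK
        (void)erased;
        return false;
    }

    Sequence& seq = mgr.sequences.at(req.id);

    // Update token IDs from the request's output buffer.
    seq.tokens.resize(output_len);
    std::copy_n(req.output_ids.data(), output_len, seq.tokens.data());

    // Cache the generated tokens unless the request was force-stopped.
    if (!force_stop) {
        mgr.CacheGeneration(seq);
    }

    // Reserve host storage for the random state (the device copy is omitted here).
    seq.random_state.resize(kRandomStateBytes);
    return true;
}
```

With this shape, a request whose `end_flag` is set never touches `seq.tokens` or `seq.random_state`, which matches the behavior the reviewer describes for the previous code.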

irexyc · 2025-08-22T09:40:21Z

src/turbomind/models/llama/LlamaBatch.cc

+            if (param_.enable_prefix_caching && !r->session.start_flag) {
+                // Prefix caching is incompatible with interactive mode
+                TM_LOG_ERROR("Skip inconsistent %s request for ID %lu", type, r->id);
+                r->ec = Request::kInconsistency;
+            }


I think we could check session.step == 0 here, so that we don't need to modify lines 233-236, 262, and maybe 1018.
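A minimal sketch of the suggested check, assuming the intent is to reject any request that resumes from a non-zero step while prefix caching is enabled. `SessionParam`, `ErrorCode`, and `ValidateForPrefixCaching` are hypothetical names for illustration; only `enable_prefix_caching`, `session.step`, and `kInconsistency` come from the diff.

```cpp
// Hypothetical, simplified view of the session fields referenced in the diff.
struct SessionParam {
    bool start_flag;
    bool end_flag;
    int  step;
};

// Hypothetical error codes; only the name kInconsistency appears in the diff.
enum class ErrorCode
{
    kOk,
    kInconsistency
};

// Rejects requests that continue an existing session (step != 0) when
// prefix caching is enabled, instead of testing start_flag.
ErrorCode ValidateForPrefixCaching(bool enable_prefix_caching, const SessionParam& session)
{
    if (enable_prefix_caching && session.step != 0) {
        return ErrorCode::kInconsistency;
    }
    return ErrorCode::kOk;
}
```

Under this reading, the validation depends on a single field, so later code could assume `step == 0` whenever prefix caching is on.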

irexyc · 2025-08-22T09:41:26Z

src/turbomind/models/llama/LlamaBatch.cc

-            const int history_len = state_->sequences[i]->tokens.size();
+            const int cache_len = state_->sequences[i]->cache_len;
+            const int history_len =
+                !param_.enable_prefix_caching ? state_->sequences[i]->tokens.size() : state_->requests[i]->session.step;


I am confused about the setting state_->requests[i]->session.step

lvhan028 added 5 commits July 16, 2025 16:40

compile successfully

83f7a4b

trail whitespaces

1886c85

Merge branch 'main' into tm-prefix-cache-v1

e428dac

Merge branch 'main' into tm-prefix-cache-v1

e0713c1

update

e704057

lvhan028 requested review from irexyc and lzhangzz and removed request for irexyc August 13, 2025 09:12

lvhan028 added the improvement label Aug 13, 2025

lvhan028 mentioned this pull request Aug 13, 2025

Improve turbomind's prefix cache #3332

Closed

8 tasks

lvhan028 requested a review from irexyc August 13, 2025 09:13

lvhan028 added 7 commits August 13, 2025 17:40

fix linting

7f9ae8a

fix linting

a04c4db

fix linting

5cf1902

put kInconsistency to DisableInvalidRequests

81c547a

update according to reviewer comments

039bce9

merge main to resolve conflicts

c37d401

fix according to reviewer's comments

75915ce

lzhangzz approved these changes Aug 19, 2025

lvhan028 added 2 commits August 19, 2025 22:42

interactive chat cannot be used when prefix caching is enabled

338cae9

fix

e21cbd6

irexyc reviewed Aug 22, 2025
