feat(tiered-storage): support multi-tier KV cache migration#209

Open
shaohuaxi wants to merge 8 commits into main from feat/tiered-storage-dev

@shaohuaxi

Collaborator

## Summary

Introduce multi-tier KV cache migration with two complementary paths
(Copy and Mark) for hot → cold tier movement, driven by water-level
strategies and a management API. Split into 6 review-friendly commits
ordered by dependency layers:

1. `bb81fd8` Add `MigrationStrategy` config and admin proto
2. `ce0cdd0` Add Copy primitive in DataStorage and SchedulePlanExecutor
3. `2b23bd6` Add MigrationManager and migration events to orchestrate copy/mark tasks
4. `a20d8f6` Drive cache migration from CacheReclaimer water level with admission filtering
5. `d505363` Consume tiered-write Marks in CacheManager write path, routing per block to cold tier
6. `fd0da23` Expose `MigrateCache` admin API (gRPC/HTTP)

## Highlights

- **Copy path**: when storage usage exceeds `trigger_threshold`, the
  reclaimer picks the coldest batch by LRU and submits cross-storage
  block copies through MigrationManager.
- **Mark path**: blocks are tagged on the write side; subsequent
  `StartWriteCache` routes new writes directly to the target cold
  storage based on the mark.
- **Admission control**: explicit rejection reasons for active
  migration, target already SERVING/WRITING, and source SERVING
  missing — surfaced via metrics and logs for diagnosis.
- **Admin API**: supports both explicit block_keys and sampling-rule
  modes; method may be COPY / MARK / BOTH, where BOTH falls back to
  mark when copy submission fails.
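
The Copy path highlight can be pictured with a small sketch. This is illustrative only and not the code in this PR: `BlockInfo`, `ShouldTriggerMigration`, and `PickColdestBatch` are hypothetical names, and only `trigger_threshold` and the LRU ordering come from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-block bookkeeping; the real metadata in the PR may differ.
struct BlockInfo {
    std::string key;
    int64_t last_access_ms;  // smaller value means a colder block
};

// Water-level gate: migrate only when used / capacity exceeds the threshold.
inline bool ShouldTriggerMigration(int64_t used_bytes, int64_t capacity_bytes,
                                   double trigger_threshold) {
    if (capacity_bytes <= 0) {
        return false;  // no usable capacity, so no meaningful water level
    }
    const double level =
        static_cast<double>(used_bytes) / static_cast<double>(capacity_bytes);
    return level > trigger_threshold;
}

// Returns up to batch_size keys, ordered from coldest to warmest (LRU order).
inline std::vector<std::string> PickColdestBatch(std::vector<BlockInfo> blocks,
                                                 std::size_t batch_size) {
    const std::size_t n = std::min(batch_size, blocks.size());
    std::partial_sort(blocks.begin(),
                      blocks.begin() + static_cast<std::ptrdiff_t>(n),
                      blocks.end(),
                      [](const BlockInfo &a, const BlockInfo &b) {
                          return a.last_access_ms < b.last_access_ms;
                      });
    std::vector<std::string> keys;
    keys.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        keys.push_back(blocks[i].key);
    }
    return keys;
}
```

In this sketch the selected keys would then be handed to whatever submits the cross-storage copies (MigrationManager in the PR); that step is deliberately left out.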

## Test plan

- [x] `//kv_cache_manager/config/test:migration_strategy_test`
- [x] `//kv_cache_manager/manager/test:MigrationManagerTest`
- [x] `//kv_cache_manager/manager/test:CacheReclaimerTest` (water-level gate integration)
- [x] `//kv_cache_manager/manager/test:CacheManagerTest` (tiered Mark write-path)
- [x] `//kv_cache_manager/service/test:AdminServiceImplTest` (BOTH fallback, min_age_ms filtering)

@CLAassistant

CLAassistant commented Jun 15, 2026

CLA assistant check
All committers have signed the CLA.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd0da23160

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kv_cache_manager/manager/migration_manager.cc
Comment thread kv_cache_manager/manager/migration_manager.h Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc
Comment thread kv_cache_manager/manager/cache_reclaimer.cc Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc Outdated
Comment thread kv_cache_manager/service/admin_service_impl.cc Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc Outdated
Comment thread kv_cache_manager/manager/migration_manager.cc Outdated
Comment thread kv_cache_manager/data_storage/data_storage_backend.h
@shaohuaxi shaohuaxi force-pushed the feat/tiered-storage-dev branch from fd0da23 to c17fe5f on June 15, 2026 08:03

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c17fe5f45f

Comment thread kv_cache_manager/manager/migration_manager.cc Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc Outdated
Comment thread kv_cache_manager/manager/cache_reclaimer.cc
Comment thread kv_cache_manager/manager/migration_manager.cc
Comment thread kv_cache_manager/manager/cache_reclaimer.cc
Comment on lines +1223 to +1225
if (capacity <= 0) {
    LOG_WITH_GR(DEBUG, "no positive capacity for source type of [%s]; skip migration", src_name.c_str());
    continue;

P2: Honor group quota when type quota is absent

Automatic migration now requires a positive per-storage-type entry in quota_config; if an instance group only has the regular group quota.capacity (a valid existing configuration, since quota_config is optional), every migration strategy hits this branch and is skipped even when the group water level exceeds trigger_threshold. This makes water-level migration silently never start for groups that rely on the existing group-level quota.
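
One possible shape for the fallback this comment asks for is sketched below. It is an assumption-laden illustration, not the PR's code: `GroupQuota` and `ResolveMigrationCapacity` are hypothetical names, and the real quota types in the repository may look different. Whether the group-level capacity is the right denominator for a per-type water level is a design decision for the author.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified view of an instance group's quota settings.
struct GroupQuota {
    int64_t group_capacity = 0;                        // regular group-level quota.capacity
    std::map<std::string, int64_t> per_type_capacity;  // optional quota_config entries
};

// Returns the capacity to measure the water level against.
// Prefers a positive per-storage-type entry, falls back to the group quota,
// and yields nullopt only when neither is positive.
inline std::optional<int64_t> ResolveMigrationCapacity(const GroupQuota &quota,
                                                       const std::string &src_type) {
    const auto it = quota.per_type_capacity.find(src_type);
    if (it != quota.per_type_capacity.end() && it->second > 0) {
        return it->second;
    }
    if (quota.group_capacity > 0) {
        return quota.group_capacity;  // group-only configuration still migrates
    }
    return std::nullopt;  // caller may skip migration and log, as the branch above does
}
```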

Comment thread kv_cache_manager/service/server.cc Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc
Comment thread kv_cache_manager/service/admin_service_impl.cc Outdated

@qoderai qoderai Bot left a comment

Review Summary

This is a large, well-structured feature that adds a complete hot→cold migration pipeline across 41 files, layered sensibly (config → storage primitive → manager → reclaimer integration → admin API). The commit decomposition and test coverage are good for a first-pass review. The existing Codex review already captured 18 inline issues; the notes below cover distinct problems not yet raised.

New findings

Correctness

  • SubmitNonBlocking(CacheLocationCopyRequest) double-queues every copy task and silently drops the result future, which is unlikely to be the intended behavior. (See inline on schedule_plan_executor.cc:562.)
  • MigrationMarkMethod::mark_timeout_ms_ defaults to 0, which causes every mark to expire instantly on the first cleanup sweep when the field is omitted from JSON. (See inline on migration_strategy.h:70.)
  • In BOTH mode the admin API drops blocks rejected at admission (not just blocks rejected by BatchSubmit) instead of falling back to mark for them. (See inline on admin_service_impl.cc:623.)
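
To make the BOTH-mode finding concrete, here is a rough sketch of the fallback being requested. All names (`Method`, `PlanResult`, `PlanMigration`) are hypothetical, and the two sets merely stand in for the real admission and BatchSubmit outcomes. Some rejection reasons (for example, a target that is already SERVING) may not warrant a mark at all, so a real fix would probably filter by reason.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

enum class Method { kCopy, kMark, kBoth };  // hypothetical mirror of COPY / MARK / BOTH

struct PlanResult {
    std::vector<std::string> copied;
    std::vector<std::string> marked;
    std::vector<std::string> dropped;
};

// admission_rejected and submit_failed are stand-ins for the real outcomes
// of admission filtering and copy submission.
inline PlanResult PlanMigration(const std::vector<std::string> &keys, Method method,
                                const std::unordered_set<std::string> &admission_rejected,
                                const std::unordered_set<std::string> &submit_failed) {
    PlanResult result;
    for (const auto &key : keys) {
        if (method == Method::kMark) {
            result.marked.push_back(key);
            continue;
        }
        const bool copy_failed =
            admission_rejected.count(key) > 0 || submit_failed.count(key) > 0;
        if (!copy_failed) {
            result.copied.push_back(key);
        } else if (method == Method::kBoth) {
            // Fall back to mark for either kind of copy failure.
            result.marked.push_back(key);
        } else {
            result.dropped.push_back(key);
        }
    }
    return result;
}
```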

Defensive / API contract

  • DataStorageManager::Copy does not validate src_uris.size() == dst_uris.size() before delegating to the backend, leaving that contract unenforced for future non-dummy backends. (See inline on data_storage_manager.cc.)
  • MigrationCompletedEvent hardcodes "method": "copy" even in event contexts that will be shared with the mark path. (See inline on migration_event.h:70.)
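
The size check suggested for `DataStorageManager::Copy` could look roughly like the sketch below. `ErrorCode`, `BackendCopyFn`, and `CheckedCopy` are hypothetical names used only for illustration; the repository's own error codes and backend interface should be substituted.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class ErrorCode { kOk, kInvalidArgument };  // hypothetical error codes

using UriList = std::vector<std::string>;
using BackendCopyFn = std::function<ErrorCode(const UriList &, const UriList &)>;

// Enforces the one-to-one src/dst contract before delegating, so that no
// backend ever sees mismatched lists.
inline ErrorCode CheckedCopy(const UriList &src_uris, const UriList &dst_uris,
                             const BackendCopyFn &backend_copy) {
    if (src_uris.size() != dst_uris.size()) {
        return ErrorCode::kInvalidArgument;  // backend is never invoked
    }
    return backend_copy(src_uris, dst_uris);
}
```

Placing the check in the manager rather than in each backend keeps the contract in one place, at the cost of one extra comparison per call.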

Operational risk

  • KEEP_BOTH + Copy without a matching reclaim strategy produces un-reclaimable source replicas with no config-time warning. (See inline on migration_strategy.cc:90.)

🤖 Generated by Qoder

Comment thread kv_cache_manager/manager/schedule_plan_executor.cc Outdated
Comment thread kv_cache_manager/event/spec_events/migration_event.h
Comment thread kv_cache_manager/config/migration_strategy.h
Comment thread kv_cache_manager/service/admin_service_impl.cc Outdated
Comment thread kv_cache_manager/config/migration_strategy.cc Outdated
@github-actions github-actions Bot added the `ai reviewed` label (AI has reviewed this PR) Jun 15, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: deb7b967e9

Comment thread kv_cache_manager/data_storage/data_storage_manager.cc Outdated
Comment thread kv_cache_manager/service/admin_service_impl.cc
Comment on lines +1467 to +1469
const auto select_result = data_storage_selector_->SelectCacheWriteDataStorageBackend(
    request_context, instance_info->instance_group_name());
RETURN_IF_EC_NOT_OK_WITH_LOG(WARN, select_result.ec, "select storage backend failed");

P2: Don't require default storage for tiered writes

When every key being written has a valid tiered target, this still selects the normal/default write backend before looking at tiered_targets and returns on any selection error. If the hot/default candidates are disabled or over quota—the exact condition that can trigger mark-based migration—StartWriteCache fails before it can allocate on the cold target, so marked blocks cannot be drained to the tiered storage even though no default allocation is needed.
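
A minimal sketch of the deferred selection this comment implies follows. `WritePlan`, `PlanWrites`, and the `select_default` callback are hypothetical; only `tiered_targets` is a name taken from the comment. The idea is simply that the default backend is looked up lazily, and only when some key actually needs it.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

struct WritePlan {
    bool ok = true;
    std::map<std::string, std::string> backend_by_key;  // key -> chosen backend name
};

// select_default is called at most once, and only if some key has no tiered
// target. It returns nullopt when no default backend is available.
inline WritePlan PlanWrites(const std::vector<std::string> &keys,
                            const std::map<std::string, std::string> &tiered_targets,
                            const std::function<std::optional<std::string>()> &select_default) {
    WritePlan plan;
    std::optional<std::string> default_backend;
    for (const auto &key : keys) {
        const auto it = tiered_targets.find(key);
        if (it != tiered_targets.end()) {
            plan.backend_by_key[key] = it->second;  // marked block goes to its cold target
            continue;
        }
        if (!default_backend) {
            default_backend = select_default();
            if (!default_backend) {
                plan.ok = false;  // a default allocation was needed but unavailable
                plan.backend_by_key.clear();
                return plan;
            }
        }
        plan.backend_by_key[key] = *default_backend;
    }
    return plan;
}
```

With this ordering, a write whose keys all carry tiered targets succeeds even when the hot tier is disabled or over quota, which is the case the comment describes.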

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b12adea978

Comment thread kv_cache_manager/manager/cache_manager.cc Outdated
Comment thread kv_cache_manager/manager/migration_manager.cc Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc Outdated
Comment thread kv_cache_manager/config/migration_strategy.cc
Comment thread kv_cache_manager/manager/cache_reclaimer.cc Outdated
@shaohuaxi shaohuaxi force-pushed the feat/tiered-storage-dev branch from b12adea to 715f5bc on June 16, 2026 15:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 715f5bc7b8

Comment thread kv_cache_manager/manager/migration_manager.cc
Comment thread kv_cache_manager/manager/cache_reclaimer.cc Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc Outdated
Comment thread kv_cache_manager/manager/migration_manager.cc Outdated
Comment thread kv_cache_manager/manager/cache_manager.cc
@fengfeisha fengfeisha force-pushed the feat/tiered-storage-dev branch from c23d077 to 715f5bc on June 17, 2026 06:19
@shaohuaxi shaohuaxi force-pushed the feat/tiered-storage-dev branch from 715f5bc to d4ab700 on June 17, 2026 06:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4ab700b2b

Comment thread kv_cache_manager/service/admin_service_impl.cc
Comment thread kv_cache_manager/manager/cache_reclaimer.cc
Comment thread kv_cache_manager/manager/cache_manager.cc Outdated
@shaohuaxi shaohuaxi force-pushed the feat/tiered-storage-dev branch from d4ab700 to 90f4881 on June 17, 2026 07:51

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90f4881a6d

Comment thread kv_cache_manager/service/admin_service_impl.cc Outdated
Comment thread kv_cache_manager/config/migration_strategy.cc Outdated
Comment thread kv_cache_manager/manager/migration_manager.cc
Comment thread kv_cache_manager/manager/cache_reclaimer.cc
@shaohuaxi shaohuaxi force-pushed the feat/tiered-storage-dev branch from 90f4881 to 808e649 on June 18, 2026 06:55

Labels

ai reviewed AI has reviewed this PR

3 participants