[manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count by oldsharp · Pull Request #161 · alibaba/tair-kvcache

oldsharp · 2026-05-20T13:02:27Z

No description provided.

…e bytes Context `CacheReclaimer::ReclaimCron()` (`kv_cache_manager/manager/cache_reclaimer.cc:655-709`) wakes up every `sleep_interval_ms` (default 100ms) and triggers `TryReclaimOnGroup()` when usage exceeds the configured water level. When triggered, it samples keys, picks a batch (`batching_size_`, default 100 keys), then submits a `CacheLocationDelRequest` to `SchedulePlanExecutor::Submit()` (`kv_cache_manager/manager/schedule_plan_executor.cc:369`) with `request.delay = delay_before_delete_ms` (default 1000ms — see `package/etc/default_startup_config.json:40`). `SchedulePlanExecutor::SubmitRaw()` enqueues the actual deletion into a priority queue ordered by `execute_time` (`schedule_plan_executor.cc:233-248`). The storage usage counter (`StorageUsageData` in `kv_cache_manager/meta/storage_usage_data.h`) is only decremented when `MetaSearcher::BatchCADLocationStatus` runs inside `DoLocationDelTask` (`schedule_plan_executor.cc:120-231`), which happens **after** the delay expires. Over-eviction bug During the delay window (≥1s), the reclaimer cron runs ≥10 more times. Each iteration still observes `group_used_percentage > threshold` because the underlying counter has not yet been decremented. So it keeps submitting more delete batches. When all queued deletes finally fire, the cache can be emptied far below the water mark — in extreme cases, completely cleared. Goal Account for "bytes already submitted for deletion but not yet executed" when checking the water mark, so subsequent cron iterations do not pile on redundant delete requests. Approach Track per-instance-group + per-storage-type **in-flight delete bytes** inside `CacheReclaimer`. Increment when `SubmitDelReq()` submits a batch, decrement when `HandleDelRes()` reaps the future. Subtract these in-flight bytes from observed usage before comparing against water levels. The byte size for each sampled location is already available in the URI's `size` parameter (extracted via `DataStorageUri::GetParamAs<uint64_t>("size", ...)`). The same extraction is already done on write (`kv_cache_manager/manager/meta_searcher.cc:424-448`) and on delete (`meta_searcher.cc:676-680, 766-769`), so we reuse this pattern. Metrics Add a gauge `in_flight_del_bytes` and `in_flight_del_bytes_by_type` alongside the existing `block_submit_count` / `block_del_count` registered around `cache_reclaimer.cc:501-504`, for observability.

`GetWaterLevelExceed()` at `kv_cache_manager/manager/cache_reclaimer.cc:602-621` still uses the raw `data->grp_used_key_cnt_` (sourced from `MetaIndexer::GetKeyCount()` at `kv_cache_manager/meta/meta_indexer.cc:772`, i.e. `key_count_.load()`). `key_count_` is decremented inside `AdjustKeyCountMeta()` (`meta_indexer.cc:898-907`), which only runs once `MetaIndexer::ReadModifyWriteLocation` fires inside `DoLocationDelTask` — i.e. **after** the `delay_before_delete_ms` delay. The same cron-pile-up race that drove the byte fix therefore still drains the cache when the key-count threshold trips first. Approach Mirror the byte fix, but for predicted key removals. The new metric is a single `std::uint64_t` per instance group (no per-type breakdown is needed because `key_count_` itself is per-instance, not per-type). Critical subtlety: `key_count_` is decremented by `delete_success_count` from `MetaIndexer::ExecuteRmwDelete` (`meta_indexer.cc:298-327`). For the location-delete path the reclaimer uses, `delete_success_count == reclaimed_count` = number of keys whose **block became fully empty** (every location removed). The existing `blk_count` in `SubmitDelReq()` (lines 1119-1126) counts blocks with at least one location to delete and therefore over-counts. The right metric is "predicted keys to be removed" = number of blocks in this batch where every currently-mapped location lands in the to-delete set. This is computable inside `FilterLocID()` at the point where `loc_id_vec` is built per block (around `cache_reclaimer.cc:1059-1100`), via `!loc_id_vec.empty() && loc_id_vec.size() == loc_map.size()`. The existing eligibility filter (CLS_SERVING or orphaned CLS_WRITING) naturally fails this comparison whenever any non-eligible location remains on the block, which is the correct outcome.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 708dda5b2f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T13:05:49Z

+    std::uint64_t total = 0;
+    for (std::size_t i = 0; i != slot.size(); ++i) {
+        slot[i] += by_type[i];
+        total += slot[i];
+    }
+    METRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total);


Aggregate in-flight byte gauge across all groups

AddInFlightDelBytes sets in_flight_del_bytes using only the updated group's subtotal, so in multi-instance-group runs each add can overwrite the gauge with a partial value while other groups still have pending deletes. This makes the gauge drift low until a later subtraction recomputes the global sum, which breaks backlog observability and any alerting based on this metric. Recompute the gauge from in_flight_del_bytes_by_group_ in this add path (like SubInFlightDelBytes already does) to keep the metric accurate.

Useful? React with 👍 / 👎.

qoderai

👋 Review Summary

Nice fix for a real production problem — the "cron piling on redundant delete batches during the delay window" scenario is well-articulated and the in-flight tracking approach is a clean solution. The pairing of Add-at-submit / Sub-at-reap with saturating subtraction is solid design.

🛡️ Key Risks & Issues

Metrics gauge bug in AddInFlightDelBytes: The gauge reports only the current group's byte total rather than the global sum across all groups. All three sibling functions (SubInFlightDelBytes, AddInFlightDelKeys, SubInFlightDelKeys) correctly iterate all groups. This inconsistency will cause dashboards to underreport in multi-group deployments. See inline comment.
Stuck-future starvation risk: If a DeleteHandler future never resolves, its in-flight bytes/keys remain in the tracking maps indefinitely, eventually suppressing eviction entirely for that group. Not a new problem, but the severity is amplified by this PR since in-flight counters now directly suppress the water-level trigger. See inline comment for a potential mitigation.

🧪 Verification Advice

The unit test TestTriggerReclaimingInFlightKeys validates the key-count suppression path well. Consider adding a parallel TestTriggerReclaimingInFlightBytes that exercises the byte-threshold suppression, to ensure the adjusted_grp_used_byte_sz logic has equivalent unit-level coverage.
TestFilterLocIDPredictedKeys constructs locations with specific size= parameters (64, 32) but never asserts out_bytes_by_type values. Adding assertions like ASSERT_EQ(96u, out_bytes[NFS_IDX]) in scenario 1 would catch any bugs in the compute_loc_size lambda.
The existing TestHandleDelRes* tests pass zero bytes_by_type / keys_in_flight. A test that pre-seeds via AddInFlightDelBytes/AddInFlightDelKeys, resolves the future, then verifies the counters drain to zero after HandleDelRes would close the lifecycle loop at the unit level.

💡 Thoughts & Suggestions

The integration tests are well-designed with appropriate tolerance (allowing up to 2 batches). The 5-second sleep after a 3-second delay is adequate.
The predicted-keys heuristic (loc_id_vec.size() == loc_map.size()) is conservatively correct — if the map ever contains null entries that the loop skips, the comparison will under-predict rather than over-predict. This means the fix might occasionally let one extra batch through but won't over-suppress. Good safety direction.
Overall the code is well-commented and the threading invariant (single reclaimer thread) is clearly documented.

🤖 Generated by Qoder • View workflow run

qoderai · 2026-05-20T13:08:49Z

+    std::uint64_t total = 0;
+    for (std::size_t i = 0; i != slot.size(); ++i) {
+        slot[i] += by_type[i];
+        total += slot[i];
+    }
+    METRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total);


The gauge computation here sums only the current group's slot values (total += slot[i]), which reports a single-group total rather than the global in-flight bytes across all instance groups. Compare with SubInFlightDelBytes (line 226-233), AddInFlightDelKeys (line 250-255), and SubInFlightDelKeys (line 271-276), all of which iterate in_flight_del_bytes_by_group_ / in_flight_del_keys_by_group_ across ALL groups before setting the gauge, with the explicit comment "recompute the gauge as the sum across all groups".\n\nIn a multi-group deployment, AddInFlightDelBytes will underreport the gauge (it will jump to the correct value only after the next Sub call forces a full re-sum). This is consistent enough to be confusing on dashboards. Should we apply the same all-groups iteration pattern here for consistency?\n\ncpp\n// recompute the gauge as the sum across all groups\nstd::uint64_t total = 0;\nfor (const auto &[_, arr] : in_flight_del_bytes_by_group_) {\n for (const auto v : arr) {\n total += v;\n }\n}\nMETRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total);\n

🤖 Generated by Qoder • Fix in Qoder

oldsharp added 3 commits May 20, 2026 10:03

wip: add integration test case

aefef19

oldsharp marked this pull request as draft May 20, 2026 13:02

oldsharp changed the title ~~prevent reclaimer over-eviction by tracking in-flight delete bytes and key count~~ [manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count May 20, 2026

oldsharp self-assigned this May 20, 2026

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

qoderai Bot reviewed May 20, 2026

View reviewed changes

github-actions Bot added the ai reviewed AI has reviewed this PR label May 20, 2026

style fixup

4a3de3e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count#161

[manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count#161
oldsharp wants to merge 4 commits into
mainfrom
rc/feat/cache_reclaimer-eviction-overcommit

oldsharp commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Uh oh!

qoderai Bot left a comment

Uh oh!

qoderai Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

oldsharp commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot left a comment

Choose a reason for hiding this comment

👋 Review Summary

🛡️ Key Risks & Issues

🧪 Verification Advice

💡 Thoughts & Suggestions

Uh oh!

qoderai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant