[manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count#161
[manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count#161oldsharp wants to merge 4 commits into
Conversation
…e bytes
Context
`CacheReclaimer::ReclaimCron()`
(`kv_cache_manager/manager/cache_reclaimer.cc:655-709`) wakes up every
`sleep_interval_ms` (default 100ms) and triggers `TryReclaimOnGroup()`
when usage exceeds the configured water level. When triggered, it
samples keys, picks a batch (`batching_size_`, default 100 keys), then
submits a `CacheLocationDelRequest` to `SchedulePlanExecutor::Submit()`
(`kv_cache_manager/manager/schedule_plan_executor.cc:369`) with
`request.delay = delay_before_delete_ms` (default 1000ms — see
`package/etc/default_startup_config.json:40`).
`SchedulePlanExecutor::SubmitRaw()` enqueues the actual deletion into a
priority queue ordered by `execute_time`
(`schedule_plan_executor.cc:233-248`). The storage usage counter
(`StorageUsageData` in `kv_cache_manager/meta/storage_usage_data.h`) is
only decremented when `MetaSearcher::BatchCADLocationStatus` runs inside
`DoLocationDelTask` (`schedule_plan_executor.cc:120-231`), which happens
**after** the delay expires.
Over-eviction bug
During the delay window (≥1s), the reclaimer cron runs ≥10 more times.
Each iteration still observes `group_used_percentage > threshold`
because the underlying counter has not yet been decremented. So it keeps
submitting more delete batches. When all queued deletes finally fire,
the cache can be emptied far below the water mark — in extreme cases,
completely cleared.
Goal
Account for "bytes already submitted for deletion but not yet executed"
when checking the water mark, so subsequent cron iterations do not pile
on redundant delete requests.
Approach
Track per-instance-group + per-storage-type **in-flight delete bytes**
inside `CacheReclaimer`. Increment when `SubmitDelReq()` submits a
batch, decrement when `HandleDelRes()` reaps the future. Subtract these
in-flight bytes from observed usage before comparing against water
levels.
The byte size for each sampled location is already available in the
URI's `size` parameter (extracted via
`DataStorageUri::GetParamAs<uint64_t>("size", ...)`). The same
extraction is already done on write
(`kv_cache_manager/manager/meta_searcher.cc:424-448`) and on delete
(`meta_searcher.cc:676-680, 766-769`), so we reuse this pattern.
Metrics
Add a gauge `in_flight_del_bytes` and `in_flight_del_bytes_by_type`
alongside the existing `block_submit_count` / `block_del_count`
registered around `cache_reclaimer.cc:501-504`, for observability.
`GetWaterLevelExceed()` at `kv_cache_manager/manager/cache_reclaimer.cc:602-621` still uses the raw `data->grp_used_key_cnt_` (sourced from `MetaIndexer::GetKeyCount()` at `kv_cache_manager/meta/meta_indexer.cc:772`, i.e. `key_count_.load()`). `key_count_` is decremented inside `AdjustKeyCountMeta()` (`meta_indexer.cc:898-907`), which only runs once `MetaIndexer::ReadModifyWriteLocation` fires inside `DoLocationDelTask` — i.e. **after** the `delay_before_delete_ms` delay. The same cron-pile-up race that drove the byte fix therefore still drains the cache when the key-count threshold trips first. Approach Mirror the byte fix, but for predicted key removals. The new metric is a single `std::uint64_t` per instance group (no per-type breakdown is needed because `key_count_` itself is per-instance, not per-type). Critical subtlety: `key_count_` is decremented by `delete_success_count` from `MetaIndexer::ExecuteRmwDelete` (`meta_indexer.cc:298-327`). For the location-delete path the reclaimer uses, `delete_success_count == reclaimed_count` = number of keys whose **block became fully empty** (every location removed). The existing `blk_count` in `SubmitDelReq()` (lines 1119-1126) counts blocks with at least one location to delete and therefore over-counts. The right metric is "predicted keys to be removed" = number of blocks in this batch where every currently-mapped location lands in the to-delete set. This is computable inside `FilterLocID()` at the point where `loc_id_vec` is built per block (around `cache_reclaimer.cc:1059-1100`), via `!loc_id_vec.empty() && loc_id_vec.size() == loc_map.size()`. The existing eligibility filter (CLS_SERVING or orphaned CLS_WRITING) naturally fails this comparison whenever any non-eligible location remains on the block, which is the correct outcome.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 708dda5b2f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| std::uint64_t total = 0; | ||
| for (std::size_t i = 0; i != slot.size(); ++i) { | ||
| slot[i] += by_type[i]; | ||
| total += slot[i]; | ||
| } | ||
| METRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total); |
There was a problem hiding this comment.
Aggregate in-flight byte gauge across all groups
AddInFlightDelBytes sets in_flight_del_bytes using only the updated group's subtotal, so in multi-instance-group runs each add can overwrite the gauge with a partial value while other groups still have pending deletes. This makes the gauge drift low until a later subtraction recomputes the global sum, which breaks backlog observability and any alerting based on this metric. Recompute the gauge from in_flight_del_bytes_by_group_ in this add path (like SubInFlightDelBytes already does) to keep the metric accurate.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
👋 Review Summary
Nice fix for a real production problem — the "cron piling on redundant delete batches during the delay window" scenario is well-articulated and the in-flight tracking approach is a clean solution. The pairing of Add-at-submit / Sub-at-reap with saturating subtraction is solid design.
🛡️ Key Risks & Issues
-
Metrics gauge bug in
AddInFlightDelBytes: The gauge reports only the current group's byte total rather than the global sum across all groups. All three sibling functions (SubInFlightDelBytes,AddInFlightDelKeys,SubInFlightDelKeys) correctly iterate all groups. This inconsistency will cause dashboards to underreport in multi-group deployments. See inline comment. -
Stuck-future starvation risk: If a
DeleteHandlerfuture never resolves, its in-flight bytes/keys remain in the tracking maps indefinitely, eventually suppressing eviction entirely for that group. Not a new problem, but the severity is amplified by this PR since in-flight counters now directly suppress the water-level trigger. See inline comment for a potential mitigation.
🧪 Verification Advice
- The unit test
TestTriggerReclaimingInFlightKeysvalidates the key-count suppression path well. Consider adding a parallelTestTriggerReclaimingInFlightBytesthat exercises the byte-threshold suppression, to ensure theadjusted_grp_used_byte_szlogic has equivalent unit-level coverage. TestFilterLocIDPredictedKeysconstructs locations with specificsize=parameters (64, 32) but never assertsout_bytes_by_typevalues. Adding assertions likeASSERT_EQ(96u, out_bytes[NFS_IDX])in scenario 1 would catch any bugs in thecompute_loc_sizelambda.- The existing
TestHandleDelRes*tests pass zerobytes_by_type/keys_in_flight. A test that pre-seeds viaAddInFlightDelBytes/AddInFlightDelKeys, resolves the future, then verifies the counters drain to zero afterHandleDelReswould close the lifecycle loop at the unit level.
💡 Thoughts & Suggestions
- The integration tests are well-designed with appropriate tolerance (allowing up to 2 batches). The 5-second sleep after a 3-second delay is adequate.
- The predicted-keys heuristic (
loc_id_vec.size() == loc_map.size()) is conservatively correct — if the map ever contains null entries that the loop skips, the comparison will under-predict rather than over-predict. This means the fix might occasionally let one extra batch through but won't over-suppress. Good safety direction. - Overall the code is well-commented and the threading invariant (single reclaimer thread) is clearly documented.
🤖 Generated by Qoder • View workflow run
| std::uint64_t total = 0; | ||
| for (std::size_t i = 0; i != slot.size(); ++i) { | ||
| slot[i] += by_type[i]; | ||
| total += slot[i]; | ||
| } | ||
| METRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total); |
There was a problem hiding this comment.
The gauge computation here sums only the current group's slot values (total += slot[i]), which reports a single-group total rather than the global in-flight bytes across all instance groups. Compare with SubInFlightDelBytes (line 226-233), AddInFlightDelKeys (line 250-255), and SubInFlightDelKeys (line 271-276), all of which iterate in_flight_del_bytes_by_group_ / in_flight_del_keys_by_group_ across ALL groups before setting the gauge, with the explicit comment "recompute the gauge as the sum across all groups".\n\nIn a multi-group deployment, AddInFlightDelBytes will underreport the gauge (it will jump to the correct value only after the next Sub call forces a full re-sum). This is consistent enough to be confusing on dashboards. Should we apply the same all-groups iteration pattern here for consistency?\n\ncpp\n// recompute the gauge as the sum across all groups\nstd::uint64_t total = 0;\nfor (const auto &[_, arr] : in_flight_del_bytes_by_group_) {\n for (const auto v : arr) {\n total += v;\n }\n}\nMETRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total);\n
🤖 Generated by Qoder • Fix in Qoder
No description provided.