Skip to content

[manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count#161

Draft
oldsharp wants to merge 4 commits into
mainfrom
rc/feat/cache_reclaimer-eviction-overcommit
Draft

[manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count#161
oldsharp wants to merge 4 commits into
mainfrom
rc/feat/cache_reclaimer-eviction-overcommit

Conversation

@oldsharp

Copy link
Copy Markdown
Collaborator

No description provided.

oldsharp added 3 commits May 20, 2026 10:03
…e bytes

Context

`CacheReclaimer::ReclaimCron()`
(`kv_cache_manager/manager/cache_reclaimer.cc:655-709`) wakes up every
`sleep_interval_ms` (default 100ms) and triggers `TryReclaimOnGroup()`
when usage exceeds the configured water level.  When triggered, it
samples keys, picks a batch (`batching_size_`, default 100 keys), then
submits a `CacheLocationDelRequest` to `SchedulePlanExecutor::Submit()`
(`kv_cache_manager/manager/schedule_plan_executor.cc:369`) with
`request.delay = delay_before_delete_ms` (default 1000ms — see
`package/etc/default_startup_config.json:40`).

`SchedulePlanExecutor::SubmitRaw()` enqueues the actual deletion into a
priority queue ordered by `execute_time`
(`schedule_plan_executor.cc:233-248`). The storage usage counter
(`StorageUsageData` in `kv_cache_manager/meta/storage_usage_data.h`) is
only decremented when `MetaSearcher::BatchCADLocationStatus` runs inside
`DoLocationDelTask` (`schedule_plan_executor.cc:120-231`), which happens
**after** the delay expires.

Over-eviction bug

During the delay window (≥1s), the reclaimer cron runs ≥10 more times.
Each iteration still observes `group_used_percentage > threshold`
because the underlying counter has not yet been decremented. So it keeps
submitting more delete batches.  When all queued deletes finally fire,
the cache can be emptied far below the water mark — in extreme cases,
completely cleared.

Goal

Account for "bytes already submitted for deletion but not yet executed"
when checking the water mark, so subsequent cron iterations do not pile
on redundant delete requests.

Approach

Track per-instance-group + per-storage-type **in-flight delete bytes**
inside `CacheReclaimer`. Increment when `SubmitDelReq()` submits a
batch, decrement when `HandleDelRes()` reaps the future. Subtract these
in-flight bytes from observed usage before comparing against water
levels.

The byte size for each sampled location is already available in the
URI's `size` parameter (extracted via
`DataStorageUri::GetParamAs<uint64_t>("size", ...)`).  The same
extraction is already done on write
(`kv_cache_manager/manager/meta_searcher.cc:424-448`) and on delete
(`meta_searcher.cc:676-680, 766-769`), so we reuse this pattern.

Metrics

Add a gauge `in_flight_del_bytes` and `in_flight_del_bytes_by_type`
alongside the existing `block_submit_count` / `block_del_count`
registered around `cache_reclaimer.cc:501-504`, for observability.
`GetWaterLevelExceed()` at
`kv_cache_manager/manager/cache_reclaimer.cc:602-621` still uses the raw
`data->grp_used_key_cnt_` (sourced from `MetaIndexer::GetKeyCount()` at
`kv_cache_manager/meta/meta_indexer.cc:772`, i.e.  `key_count_.load()`).
`key_count_` is decremented inside `AdjustKeyCountMeta()`
(`meta_indexer.cc:898-907`), which only runs once
`MetaIndexer::ReadModifyWriteLocation` fires inside `DoLocationDelTask`
— i.e. **after** the `delay_before_delete_ms` delay. The same
cron-pile-up race that drove the byte fix therefore still drains the
cache when the key-count threshold trips first.

Approach

Mirror the byte fix, but for predicted key removals. The new metric is a
single `std::uint64_t` per instance group (no per-type breakdown is
needed because `key_count_` itself is per-instance, not per-type).

Critical subtlety: `key_count_` is decremented by `delete_success_count`
from `MetaIndexer::ExecuteRmwDelete` (`meta_indexer.cc:298-327`). For
the location-delete path the reclaimer uses, `delete_success_count ==
reclaimed_count` = number of keys whose **block became fully empty**
(every location removed). The existing `blk_count` in `SubmitDelReq()`
(lines 1119-1126) counts blocks with at least one location to delete and
therefore over-counts.

The right metric is "predicted keys to be removed" = number of blocks in
this batch where every currently-mapped location lands in the to-delete
set. This is computable inside `FilterLocID()` at the point where
`loc_id_vec` is built per block (around `cache_reclaimer.cc:1059-1100`),
via `!loc_id_vec.empty() && loc_id_vec.size() == loc_map.size()`. The
existing eligibility filter (CLS_SERVING or orphaned CLS_WRITING)
naturally fails this comparison whenever any non-eligible location
remains on the block, which is the correct outcome.
@oldsharp oldsharp marked this pull request as draft May 20, 2026 13:02
@oldsharp oldsharp changed the title prevent reclaimer over-eviction by tracking in-flight delete bytes and key count [manager] prevent reclaimer over-eviction by tracking in-flight delete bytes and key count May 20, 2026
@oldsharp oldsharp self-assigned this May 20, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 708dda5b2f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +197 to +202
std::uint64_t total = 0;
for (std::size_t i = 0; i != slot.size(); ++i) {
slot[i] += by_type[i];
total += slot[i];
}
METRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Aggregate in-flight byte gauge across all groups

AddInFlightDelBytes sets in_flight_del_bytes using only the updated group's subtotal, so in multi-instance-group runs each add can overwrite the gauge with a partial value while other groups still have pending deletes. This makes the gauge drift low until a later subtraction recomputes the global sum, which breaks backlog observability and any alerting based on this metric. Recompute the gauge from in_flight_del_bytes_by_group_ in this add path (like SubInFlightDelBytes already does) to keep the metric accurate.

Useful? React with 👍 / 👎.

@qoderai qoderai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Review Summary

Nice fix for a real production problem — the "cron piling on redundant delete batches during the delay window" scenario is well-articulated and the in-flight tracking approach is a clean solution. The pairing of Add-at-submit / Sub-at-reap with saturating subtraction is solid design.

🛡️ Key Risks & Issues

  • Metrics gauge bug in AddInFlightDelBytes: The gauge reports only the current group's byte total rather than the global sum across all groups. All three sibling functions (SubInFlightDelBytes, AddInFlightDelKeys, SubInFlightDelKeys) correctly iterate all groups. This inconsistency will cause dashboards to underreport in multi-group deployments. See inline comment.

  • Stuck-future starvation risk: If a DeleteHandler future never resolves, its in-flight bytes/keys remain in the tracking maps indefinitely, eventually suppressing eviction entirely for that group. Not a new problem, but the severity is amplified by this PR since in-flight counters now directly suppress the water-level trigger. See inline comment for a potential mitigation.

🧪 Verification Advice

  • The unit test TestTriggerReclaimingInFlightKeys validates the key-count suppression path well. Consider adding a parallel TestTriggerReclaimingInFlightBytes that exercises the byte-threshold suppression, to ensure the adjusted_grp_used_byte_sz logic has equivalent unit-level coverage.
  • TestFilterLocIDPredictedKeys constructs locations with specific size= parameters (64, 32) but never asserts out_bytes_by_type values. Adding assertions like ASSERT_EQ(96u, out_bytes[NFS_IDX]) in scenario 1 would catch any bugs in the compute_loc_size lambda.
  • The existing TestHandleDelRes* tests pass zero bytes_by_type / keys_in_flight. A test that pre-seeds via AddInFlightDelBytes/AddInFlightDelKeys, resolves the future, then verifies the counters drain to zero after HandleDelRes would close the lifecycle loop at the unit level.

💡 Thoughts & Suggestions

  • The integration tests are well-designed with appropriate tolerance (allowing up to 2 batches). The 5-second sleep after a 3-second delay is adequate.
  • The predicted-keys heuristic (loc_id_vec.size() == loc_map.size()) is conservatively correct — if the map ever contains null entries that the loop skips, the comparison will under-predict rather than over-predict. This means the fix might occasionally let one extra batch through but won't over-suppress. Good safety direction.
  • Overall the code is well-commented and the threading invariant (single reclaimer thread) is clearly documented.

🤖 Generated by QoderView workflow run

Comment on lines +197 to +202
std::uint64_t total = 0;
for (std::size_t i = 0; i != slot.size(); ++i) {
slot[i] += by_type[i];
total += slot[i];
}
METRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gauge computation here sums only the current group's slot values (total += slot[i]), which reports a single-group total rather than the global in-flight bytes across all instance groups. Compare with SubInFlightDelBytes (line 226-233), AddInFlightDelKeys (line 250-255), and SubInFlightDelKeys (line 271-276), all of which iterate in_flight_del_bytes_by_group_ / in_flight_del_keys_by_group_ across ALL groups before setting the gauge, with the explicit comment "recompute the gauge as the sum across all groups".\n\nIn a multi-group deployment, AddInFlightDelBytes will underreport the gauge (it will jump to the correct value only after the next Sub call forces a full re-sum). This is consistent enough to be confusing on dashboards. Should we apply the same all-groups iteration pattern here for consistency?\n\ncpp\n// recompute the gauge as the sum across all groups\nstd::uint64_t total = 0;\nfor (const auto &[_, arr] : in_flight_del_bytes_by_group_) {\n for (const auto v : arr) {\n total += v;\n }\n}\nMETRICS_(cache_reclaimer, in_flight_del_bytes) = static_cast<double>(total);\n


🤖 Generated by QoderFix in Qoder

@github-actions github-actions Bot added the ai reviewed AI has reviewed this PR label May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai reviewed AI has reviewed this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant