[optimizer] support theoretical hit rate statistical analysis in KVCM… by YoungRX · Pull Request #210 · alibaba/tair-kvcache

YoungRX · 2026-06-17T05:02:53Z

[optimizer] 新增 online optimizer 模块，支持理论命中率在线统计分析

新增 online_optimizer/ 模块（59 文件，+7842 行）
- 独立于 KVCM Manager 运行，不依赖 data_storage / RegistryManager 等重型组件
- 通过 TraceQuery 接口接收推理引擎的 block key 序列，模拟缓存行为，统计理论命中率
CacheIndexer 索引层：3 种索引实现 + TTL 包装器
- FenwickCacheIndexer：基于 Fenwick Tree，O(log n) 查询/更新
- BSTCacheIndexer：基于 std::map 平衡树
- LRUCacheIndexer：基于 LRU 链表
- TTLCacheIndexerWrapper：为任意 indexer 叠加 TTL 过期驱逐
- 支持多容量档位（capacity_gb）前缀命中统计 + hit-age 分桶
OnlineOptimizerManager：核心编排层
- InstanceGroup CRUD + Instance 注册/注销/恢复
- TraceQuery：接收 block_keys，驱动 indexer 计算每容量档位命中率
- 持久化 InstanceGroup / InstanceInfo 到 RegistryBackend（Redis/Local）
服务层：gRPC + HTTP 双协议
- optimizer_service.proto：定义 11 个 RPC（InstanceGroup CRUD、Register/Remove/GetInstance、TraceQuery、ListInstances、ResetStats）
- OptimizerServiceImpl：请求路由与参数校验
- optimizer_service_grpc / optimizer_service_http：协议适配
Metrics：Prometheus 指标上报
- OptimizerMetricsCollector：采集 per-instance 命中率、unique keys、驱逐数、内存用量等
- OptimizerMetricsReporter：周期性聚合并通过 KMonitor/Prometheus 上报
部署与打包
- online_optimizer_server_main：独立可执行入口
- default_optimizer_config.json + start_optimizer_server.sh：默认配置与启动脚本
- Dockerfile.optimizer：独立 Docker 镜像
测试：8 个 UT + 1 个集成测试
- 覆盖 indexer（Fenwick/BST/TTL）、manager、service_impl、metrics_reporter、server_config、optimizer_config
- 集成测试验证完整 RegisterInstance → TraceQuery → ListInstances 流程

… online optimizer

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b752876e5d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T05:12:06Z

+    state->indexer = CacheIndexerFactory::CreateCacheIndexer(instance_group.indexer_type(), instance_group.max_key_count(),
+                                        capacity_gb, size_full_only, size_full_linear, linear_step,
+                                        instance_group.ttl_seconds());


Reject invalid indexers before accepting registrations

When an instance group is created or updated with an unsupported indexer_type (the service path does not call ValidateRequiredFields), CreateCacheIndexer returns nullptr. This assignment stores that null indexer and RegisterInstance still returns OK, so the next TraceQuery, ListInstances, or ResetStats dereferences state->indexer and can crash the optimizer server. Please either validate groups before saving them or fail registration when the factory returns null.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T05:12:06Z

+        // Rollback persistence on failure
+        if (registry_manager_) {
+            registry_manager_->DeleteInstanceInfo(instance_id);


Preserve existing registrations on failed re-register

If an existing instance is re-registered and RegisterInstanceInternal then fails (for example because the new request has empty specs or the group is disabled), this rollback deletes instance_id from the persistent registry after SaveInstanceInfo has already overwritten the previous good record. The old in-memory state continues running until restart, but recovery will lose the instance, so a failed admin call can silently remove an existing registration from persistence; validate/build before overwriting or restore the previous value on rollback.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T05:12:06Z

+    if (logical_time_ >= total_slots_) {
+        DoCompact();


Avoid compacting fenwick state with stale access entries

When an existing key triggers compaction after the logical clock reaches total_slots_, its old timestamp has already been removed from fenwick_/reverse_map_ but last_access_ still contains that old timestamp. DoCompact() therefore re-adds this same key at the stale timestamp and the code then inserts it again at the new timestamp, leaving duplicate active entries and corrupting stack distances/evictions for long-running fenwick_lru instances; remove or update the last_access_ entry before compacting, or compact after installing the new timestamp.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T05:12:06Z

+    if (registry_manager_) {
+        registry_manager_->DeleteInstanceInfo(instance_id);


Propagate failed persistent instance deletes

When a registry backend is configured and DeleteInstanceInfo fails (for example a Redis/local storage error), this path has already erased the in-memory instance but still returns OK and leaves the old instance record in persistent storage. The instance disappears immediately but is recovered after the next restart, so removals are not durable under storage errors; handle the delete error or roll back the in-memory erase.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T05:12:06Z

+        }
+    }
+
+    instance_groups_.erase(group_name);


Reject removing groups that still own instances

If an admin removes a group while instances in that group are still registered, this erases only the group record; the instance records and in-memory instances remain. On restart recovery then skips those persisted instances because their group no longer exists, so this call silently drops active registrations after a restart; reject the removal when ListInstanceInfos(group_name) is non-empty or remove the dependent instances consistently.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T05:12:06Z

+
+    auto registry = manager_->registry_manager();
+    auto group = ConvertProtoToInstanceGroup(request->instance_group());
+    ErrorCode ec = registry ? registry->UpdateInstanceGroup(group) : EC_ERROR;


Apply group updates to active instances

Updating a group only changes the registry entry, while each registered InstanceState keeps its own copied instance_group from registration. For active instances, changes to capacity tiers, TTL, or indexer type therefore do not affect TraceQuery/metrics until the instance is re-registered or the server restarts, and recovery will then rebuild with different behavior; either rebuild affected instances on update or reject updates while instances are registered.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T05:12:06Z

+        }
+    });
+
+    return true;


Surface HTTP startup failures from Start

If the HTTP port is already in use or async_start() fails, CoroHttpService::Start returns false inside this background thread, but InitHttpServer still returns true and OnlineOptimizerServer::Start reports success with only the gRPC side running and no HTTP/metrics endpoint. Propagate the thread's startup result before returning from Start, or start the HTTP server synchronously through the bind phase.

Useful? React with 👍 / 👎.

qoderai

Review Summary

Thanks for this large and well-structured contribution — the online optimizer module is clearly separated from the rest of KVCM and the indexer/manager/service layering is easy to follow.

Key Risks & Issues

Group validation is missing at the service layer. ConvertProtoToInstanceGroup/CreateInstanceGroup/UpdateInstanceGroup persist an OptimizerInstanceGroup without calling ValidateRequiredFields, so empty names, unsupported indexer_type, or enabled=true with no capacities can be saved. This feeds the null-indexer crash already noted for RegisterInstance.
ResetStats can also crash on a bad indexer type. It rebuilds the indexer with CreateCacheIndexer and stores the result unchecked, so the same null-dereference path exists after a stats reset.
LRU indexer charge logic looks incorrect. LruCacheIndexer::ProcessKeys derives charge from the loop index i (position inside the current query batch) rather than from the key's position in the global block sequence. The same key can therefore be charged differently across queries, and a query that does not start on a full-block boundary will simulate the wrong capacity.
Registry recovery has synchronization and staleness bugs. LoadRecoveryData writes instance_groups_/instance_infos_ without taking mutex_, and it never clears the output vectors or the in-memory maps, so stale entries survive a second recovery call.
Metrics cardinality and memory leaks. Per-query Prometheus gauges are tagged with client_ip, which is unbounded in production. Interval gauges keyed by instance_id/capacity_gb/age_bucket are created but never removed when instances go away, so MetricsRegistry grows monotonically.
KMonitor lifecycle is incomplete. InitKmonitor() starts the global KMonitor factory but neither the reporter destructor nor OnlineOptimizerServer::Stop() stops it, risking background-thread leaks. The return value of InitKmonitor() is also ignored.
Admin interfaces are plaintext and unauthenticated. gRPC uses InsecureServerCredentials() on 0.0.0.0, and the HTTP endpoints perform no auth/ACL checks before mutating groups and instances.
Recovery failures are swallowed. OnlineOptimizerServer::Init logs a warning when manager_->Recover() fails and continues; combined with the non-atomic read-modify-write pattern in LoadAndSave/LoadAndDelete, this makes multi-process or storage-error scenarios easy to corrupt.

Verification Advice

Add unit tests that create groups with empty names / invalid indexer_type and assert INVALID_ARGUMENT.
Add a test for ResetStats after changing a group's indexer_type to an invalid value.
Run ReportPerQuery with a large, varying set of client_ips and measure MetricsRegistry memory.
Register then remove an instance and verify that /metrics no longer exposes its series.
Test recovery after manually creating a group and after a storage error.

Thoughts & Suggestions

The module's shape is solid. The highest-value follow-ups would be (1) validating all configuration at the service boundary, (2) fixing the LRU charge model, and (3) bounding or cleaning up metric labels before running this in production.

🤖 Generated by Qoder

qoderai · 2026-06-17T06:53:29Z

+void OptimizerServiceImpl::CreateInstanceGroup(RequestContext *request_context,
+                                                const proto::optimizer::CreateInstanceGroupRequest *request,
+                                                proto::optimizer::CommonResponse *response) {
+    request_context->set_api_name("CreateInstanceGroup");
+    OptimizerCallGuard guard(request_context, metrics_reporter_.get());
+
+    auto registry = manager_->registry_manager();
+    auto group = ConvertProtoToInstanceGroup(request->instance_group());
+    ErrorCode ec = registry ? registry->CreateInstanceGroup(group) : EC_ERROR;


The service layer converts the protobuf group directly into an OptimizerInstanceGroup and persists it without calling ValidateRequiredFields. This allows groups with empty names, unsupported indexer_type, negative max_key_count, or enabled=true with no capacity_gb to be saved. Those invalid groups later cause registration or TraceQuery failures (or, for invalid indexer_type, a null indexer crash). Please validate required fields in CreateInstanceGroup and UpdateInstanceGroup before calling the registry, and surface an invalid-argument error to the caller.

🤖 Generated by Qoder

qoderai · 2026-06-17T06:53:47Z

+    state->indexer = CacheIndexerFactory::CreateCacheIndexer(state->instance_group->indexer_type(),
+                                        state->instance_group->max_key_count(),
+                                        state->instance_group->capacity_gb(),
+                                        state->size_full_only,
+                                        state->size_full_linear,
+                                        state->linear_step,
+                                        state->instance_group->ttl_seconds());


ResetStats rebuilds the indexer with CacheIndexerFactory::CreateCacheIndexer but never checks the result. If the group's indexer_type is invalid or the factory otherwise returns nullptr, the next TraceQuery, ListInstances, or ResetStats dereferences state->indexer and crashes. Please guard the factory result and return an error if the indexer cannot be recreated.

🤖 Generated by Qoder

qoderai · 2026-06-17T06:53:59Z

+        int64_t charge;
+        if (linear_step_ <= 1) {
+            charge = size_full_linear_;
+        } else if (i == total_keys - 1) {
+            charge = size_full_linear_;
+        } else if (i % linear_step_ == 0) {
+            charge = size_full_linear_;
+        } else {
+            charge = size_full_only_;
+        }


The per-key cache charge is decided by the loop index i (position within the current query batch) rather than by the key's position in the global block sequence. Because the same key can appear at different offsets across queries, it may be inserted with different charges, and the capacity simulation for full-only vs full-linear blocks becomes incorrect when a query does not start on a full-block boundary. Consider deriving the charge from the key's global sequence index or the instance's linear_step pattern instead of the local batch index.

🤖 Generated by Qoder

qoderai · 2026-06-17T06:54:28Z

+        data.instance_groups.push_back(group);
+        instance_groups_[name] = group;
+    }


LoadRecoveryData mutates instance_groups_ here (and instance_infos_ below) without holding mutex_, unlike every other mutating method in this class. It also appends to the output vectors and inserts into the member maps without clearing them first, so a second call can leave stale entries that no longer exist in storage. Please take std::unique_lock<std::shared_mutex>, clear data and the member maps at the start, and then load from storage.

🤖 Generated by Qoder

qoderai · 2026-06-17T06:54:45Z

+            double hit_rate = static_cast<double>(info.hit_count) / static_cast<double>(p->total_blocks());
+            std::string cap_str = std::to_string(info.capacity_gb);
+
+            MetricsTags prom_tags = {{"instance_id", instance_id}, {"client_ip", client_ip}, {"capacity_gb", cap_str}};


ReportPerQuery creates Prometheus gauges tagged with client_ip for every request. In production the client IP set is unbounded, so the label cardinality grows without limit and MetricsRegistry memory will leak. Consider dropping client_ip from per-query gauges, aggregating to a bounded set of buckets, or implementing an eviction policy for stale label combinations.

🤖 Generated by Qoder

qoderai · 2026-06-17T06:55:08Z

+            std::string bucket_label = bucket.threshold_seconds > 0
+                ? std::to_string(bucket.threshold_seconds) + "s"
+                : "inf";
+            MetricsTags bucket_tags = {{"instance_id", s.instance_id}, {"age_bucket", bucket_label}};


ReportInterval registers Prometheus gauges keyed by instance_id, capacity_gb, and age_bucket, but there is no cleanup when an instance is removed or its configuration changes. Stale series remain in MetricsRegistry forever, leaking memory and continuing to expose obsolete values on /metrics. Consider tracking previously registered series and deleting gauges for removed instances or labels.

🤖 Generated by Qoder

[optimizer] support theoretical hit rate statistical analysis in KVCM…

b752876

… online optimizer

YoungRX requested review from DWAE86, PerryZZ, Tyndalllll, charpty, oldsharp, wangxiyu191 and xiqi-lq as code owners June 17, 2026 05:02

chatgpt-codex-connector Bot reviewed Jun 17, 2026

View reviewed changes

qoderai Bot reviewed Jun 17, 2026

View reviewed changes

github-actions Bot added the ai reviewed AI has reviewed this PR label Jun 17, 2026

		if (registry_manager_) {
		registry_manager_->DeleteInstanceInfo(instance_id);

Conversation

YoungRX commented Jun 17, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot left a comment

Choose a reason for hiding this comment

Review Summary

Key Risks & Issues

Verification Advice

Thoughts & Suggestions

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant