[optimizer] support theoretical hit rate statistical analysis in KVCM…#210
[optimizer] support theoretical hit rate statistical analysis in KVCM…#210YoungRX wants to merge 1 commit into
Conversation
… online optimizer
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b752876e5d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| state->indexer = CacheIndexerFactory::CreateCacheIndexer(instance_group.indexer_type(), instance_group.max_key_count(), | ||
| capacity_gb, size_full_only, size_full_linear, linear_step, | ||
| instance_group.ttl_seconds()); |
There was a problem hiding this comment.
Reject invalid indexers before accepting registrations
When an instance group is created or updated with an unsupported indexer_type (the service path does not call ValidateRequiredFields), CreateCacheIndexer returns nullptr. This assignment stores that null indexer and RegisterInstance still returns OK, so the next TraceQuery, ListInstances, or ResetStats dereferences state->indexer and can crash the optimizer server. Please either validate groups before saving them or fail registration when the factory returns null.
Useful? React with 👍 / 👎.
| // Rollback persistence on failure | ||
| if (registry_manager_) { | ||
| registry_manager_->DeleteInstanceInfo(instance_id); |
There was a problem hiding this comment.
Preserve existing registrations on failed re-register
If an existing instance is re-registered and RegisterInstanceInternal then fails (for example because the new request has empty specs or the group is disabled), this rollback deletes instance_id from the persistent registry after SaveInstanceInfo has already overwritten the previous good record. The old in-memory state continues running until restart, but recovery will lose the instance, so a failed admin call can silently remove an existing registration from persistence; validate/build before overwriting or restore the previous value on rollback.
Useful? React with 👍 / 👎.
| if (logical_time_ >= total_slots_) { | ||
| DoCompact(); |
There was a problem hiding this comment.
Avoid compacting fenwick state with stale access entries
When an existing key triggers compaction after the logical clock reaches total_slots_, its old timestamp has already been removed from fenwick_/reverse_map_ but last_access_ still contains that old timestamp. DoCompact() therefore re-adds this same key at the stale timestamp and the code then inserts it again at the new timestamp, leaving duplicate active entries and corrupting stack distances/evictions for long-running fenwick_lru instances; remove or update the last_access_ entry before compacting, or compact after installing the new timestamp.
Useful? React with 👍 / 👎.
| if (registry_manager_) { | ||
| registry_manager_->DeleteInstanceInfo(instance_id); |
There was a problem hiding this comment.
Propagate failed persistent instance deletes
When a registry backend is configured and DeleteInstanceInfo fails (for example a Redis/local storage error), this path has already erased the in-memory instance but still returns OK and leaves the old instance record in persistent storage. The instance disappears immediately but is recovered after the next restart, so removals are not durable under storage errors; handle the delete error or roll back the in-memory erase.
Useful? React with 👍 / 👎.
| } | ||
| } | ||
|
|
||
| instance_groups_.erase(group_name); |
There was a problem hiding this comment.
Reject removing groups that still own instances
If an admin removes a group while instances in that group are still registered, this erases only the group record; the instance records and in-memory instances remain. On restart recovery then skips those persisted instances because their group no longer exists, so this call silently drops active registrations after a restart; reject the removal when ListInstanceInfos(group_name) is non-empty or remove the dependent instances consistently.
Useful? React with 👍 / 👎.
|
|
||
| auto registry = manager_->registry_manager(); | ||
| auto group = ConvertProtoToInstanceGroup(request->instance_group()); | ||
| ErrorCode ec = registry ? registry->UpdateInstanceGroup(group) : EC_ERROR; |
There was a problem hiding this comment.
Apply group updates to active instances
Updating a group only changes the registry entry, while each registered InstanceState keeps its own copied instance_group from registration. For active instances, changes to capacity tiers, TTL, or indexer type therefore do not affect TraceQuery/metrics until the instance is re-registered or the server restarts, and recovery will then rebuild with different behavior; either rebuild affected instances on update or reject updates while instances are registered.
Useful? React with 👍 / 👎.
| } | ||
| }); | ||
|
|
||
| return true; |
There was a problem hiding this comment.
Surface HTTP startup failures from Start
If the HTTP port is already in use or async_start() fails, CoroHttpService::Start returns false inside this background thread, but InitHttpServer still returns true and OnlineOptimizerServer::Start reports success with only the gRPC side running and no HTTP/metrics endpoint. Propagate the thread's startup result before returning from Start, or start the HTTP server synchronously through the bind phase.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Review Summary
Thanks for this large and well-structured contribution — the online optimizer module is clearly separated from the rest of KVCM and the indexer/manager/service layering is easy to follow.
Key Risks & Issues
- Group validation is missing at the service layer.
ConvertProtoToInstanceGroup/CreateInstanceGroup/UpdateInstanceGrouppersist anOptimizerInstanceGroupwithout callingValidateRequiredFields, so empty names, unsupportedindexer_type, orenabled=truewith no capacities can be saved. This feeds the null-indexer crash already noted forRegisterInstance. ResetStatscan also crash on a bad indexer type. It rebuilds the indexer withCreateCacheIndexerand stores the result unchecked, so the same null-dereference path exists after a stats reset.- LRU indexer charge logic looks incorrect.
LruCacheIndexer::ProcessKeysderiveschargefrom the loop indexi(position inside the current query batch) rather than from the key's position in the global block sequence. The same key can therefore be charged differently across queries, and a query that does not start on a full-block boundary will simulate the wrong capacity. - Registry recovery has synchronization and staleness bugs.
LoadRecoveryDatawritesinstance_groups_/instance_infos_without takingmutex_, and it never clears the output vectors or the in-memory maps, so stale entries survive a second recovery call. - Metrics cardinality and memory leaks. Per-query Prometheus gauges are tagged with
client_ip, which is unbounded in production. Interval gauges keyed byinstance_id/capacity_gb/age_bucketare created but never removed when instances go away, soMetricsRegistrygrows monotonically. - KMonitor lifecycle is incomplete.
InitKmonitor()starts the global KMonitor factory but neither the reporter destructor norOnlineOptimizerServer::Stop()stops it, risking background-thread leaks. The return value ofInitKmonitor()is also ignored. - Admin interfaces are plaintext and unauthenticated. gRPC uses
InsecureServerCredentials()on0.0.0.0, and the HTTP endpoints perform no auth/ACL checks before mutating groups and instances. - Recovery failures are swallowed.
OnlineOptimizerServer::Initlogs a warning whenmanager_->Recover()fails and continues; combined with the non-atomic read-modify-write pattern inLoadAndSave/LoadAndDelete, this makes multi-process or storage-error scenarios easy to corrupt.
Verification Advice
- Add unit tests that create groups with empty names / invalid
indexer_typeand assertINVALID_ARGUMENT. - Add a test for
ResetStatsafter changing a group'sindexer_typeto an invalid value. - Run
ReportPerQuerywith a large, varying set ofclient_ips and measureMetricsRegistrymemory. - Register then remove an instance and verify that
/metricsno longer exposes its series. - Test recovery after manually creating a group and after a storage error.
Thoughts & Suggestions
The module's shape is solid. The highest-value follow-ups would be (1) validating all configuration at the service boundary, (2) fixing the LRU charge model, and (3) bounding or cleaning up metric labels before running this in production.
🤖 Generated by Qoder
| void OptimizerServiceImpl::CreateInstanceGroup(RequestContext *request_context, | ||
| const proto::optimizer::CreateInstanceGroupRequest *request, | ||
| proto::optimizer::CommonResponse *response) { | ||
| request_context->set_api_name("CreateInstanceGroup"); | ||
| OptimizerCallGuard guard(request_context, metrics_reporter_.get()); | ||
|
|
||
| auto registry = manager_->registry_manager(); | ||
| auto group = ConvertProtoToInstanceGroup(request->instance_group()); | ||
| ErrorCode ec = registry ? registry->CreateInstanceGroup(group) : EC_ERROR; |
There was a problem hiding this comment.
The service layer converts the protobuf group directly into an OptimizerInstanceGroup and persists it without calling ValidateRequiredFields. This allows groups with empty names, unsupported indexer_type, negative max_key_count, or enabled=true with no capacity_gb to be saved. Those invalid groups later cause registration or TraceQuery failures (or, for invalid indexer_type, a null indexer crash). Please validate required fields in CreateInstanceGroup and UpdateInstanceGroup before calling the registry, and surface an invalid-argument error to the caller.
🤖 Generated by Qoder
| state->indexer = CacheIndexerFactory::CreateCacheIndexer(state->instance_group->indexer_type(), | ||
| state->instance_group->max_key_count(), | ||
| state->instance_group->capacity_gb(), | ||
| state->size_full_only, | ||
| state->size_full_linear, | ||
| state->linear_step, | ||
| state->instance_group->ttl_seconds()); |
There was a problem hiding this comment.
ResetStats rebuilds the indexer with CacheIndexerFactory::CreateCacheIndexer but never checks the result. If the group's indexer_type is invalid or the factory otherwise returns nullptr, the next TraceQuery, ListInstances, or ResetStats dereferences state->indexer and crashes. Please guard the factory result and return an error if the indexer cannot be recreated.
🤖 Generated by Qoder
| int64_t charge; | ||
| if (linear_step_ <= 1) { | ||
| charge = size_full_linear_; | ||
| } else if (i == total_keys - 1) { | ||
| charge = size_full_linear_; | ||
| } else if (i % linear_step_ == 0) { | ||
| charge = size_full_linear_; | ||
| } else { | ||
| charge = size_full_only_; | ||
| } |
There was a problem hiding this comment.
The per-key cache charge is decided by the loop index i (position within the current query batch) rather than by the key's position in the global block sequence. Because the same key can appear at different offsets across queries, it may be inserted with different charges, and the capacity simulation for full-only vs full-linear blocks becomes incorrect when a query does not start on a full-block boundary. Consider deriving the charge from the key's global sequence index or the instance's linear_step pattern instead of the local batch index.
🤖 Generated by Qoder
| data.instance_groups.push_back(group); | ||
| instance_groups_[name] = group; | ||
| } |
There was a problem hiding this comment.
LoadRecoveryData mutates instance_groups_ here (and instance_infos_ below) without holding mutex_, unlike every other mutating method in this class. It also appends to the output vectors and inserts into the member maps without clearing them first, so a second call can leave stale entries that no longer exist in storage. Please take std::unique_lock<std::shared_mutex>, clear data and the member maps at the start, and then load from storage.
🤖 Generated by Qoder
| double hit_rate = static_cast<double>(info.hit_count) / static_cast<double>(p->total_blocks()); | ||
| std::string cap_str = std::to_string(info.capacity_gb); | ||
|
|
||
| MetricsTags prom_tags = {{"instance_id", instance_id}, {"client_ip", client_ip}, {"capacity_gb", cap_str}}; |
There was a problem hiding this comment.
ReportPerQuery creates Prometheus gauges tagged with client_ip for every request. In production the client IP set is unbounded, so the label cardinality grows without limit and MetricsRegistry memory will leak. Consider dropping client_ip from per-query gauges, aggregating to a bounded set of buckets, or implementing an eviction policy for stale label combinations.
🤖 Generated by Qoder
| std::string bucket_label = bucket.threshold_seconds > 0 | ||
| ? std::to_string(bucket.threshold_seconds) + "s" | ||
| : "inf"; | ||
| MetricsTags bucket_tags = {{"instance_id", s.instance_id}, {"age_bucket", bucket_label}}; |
There was a problem hiding this comment.
ReportInterval registers Prometheus gauges keyed by instance_id, capacity_gb, and age_bucket, but there is no cleanup when an instance is removed or its configuration changes. Stale series remain in MetricsRegistry forever, leaking memory and continuing to expose obsolete values on /metrics. Consider tracking previously registered series and deleting gauges for removed instances or labels.
🤖 Generated by Qoder
[optimizer] 新增 online optimizer 模块,支持理论命中率在线统计分析
- 独立于 KVCM Manager 运行,不依赖 data_storage / RegistryManager 等重型组件
- 通过 TraceQuery 接口接收推理引擎的 block key 序列,模拟缓存行为,统计理论命中率
- FenwickCacheIndexer:基于 Fenwick Tree,O(log n) 查询/更新
- BSTCacheIndexer:基于 std::map 平衡树
- LRUCacheIndexer:基于 LRU 链表
- TTLCacheIndexerWrapper:为任意 indexer 叠加 TTL 过期驱逐
- 支持多容量档位(capacity_gb)前缀命中统计 + hit-age 分桶
- InstanceGroup CRUD + Instance 注册/注销/恢复
- TraceQuery:接收 block_keys,驱动 indexer 计算每容量档位命中率
- 持久化 InstanceGroup / InstanceInfo 到 RegistryBackend(Redis/Local)
- optimizer_service.proto:定义 11 个 RPC(InstanceGroup CRUD、Register/Remove/GetInstance、TraceQuery、ListInstances、ResetStats)
- OptimizerServiceImpl:请求路由与参数校验
- optimizer_service_grpc / optimizer_service_http:协议适配
- OptimizerMetricsCollector:采集 per-instance 命中率、unique keys、驱逐数、内存用量等
- OptimizerMetricsReporter:周期性聚合并通过 KMonitor/Prometheus 上报
- online_optimizer_server_main:独立可执行入口
- default_optimizer_config.json + start_optimizer_server.sh:默认配置与启动脚本
- Dockerfile.optimizer:独立 Docker 镜像
- 覆盖 indexer(Fenwick/BST/TTL)、manager、service_impl、metrics_reporter、server_config、optimizer_config
- 集成测试验证完整 RegisterInstance → TraceQuery → ListInstances 流程