Skip to content

[optimizer] support theoretical hit rate statistical analysis in KVCM…#210

Open
YoungRX wants to merge 1 commit into
mainfrom
feature/online-optimizer
Open

[optimizer] support theoretical hit rate statistical analysis in KVCM…#210
YoungRX wants to merge 1 commit into
mainfrom
feature/online-optimizer

Conversation

@YoungRX

@YoungRX YoungRX commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

[optimizer] 新增 online optimizer 模块,支持理论命中率在线统计分析

  1. 新增 online_optimizer/ 模块(59 文件,+7842 行)
    - 独立于 KVCM Manager 运行,不依赖 data_storage / RegistryManager 等重型组件
    - 通过 TraceQuery 接口接收推理引擎的 block key 序列,模拟缓存行为,统计理论命中率
  2. CacheIndexer 索引层:3 种索引实现 + TTL 包装器
    - FenwickCacheIndexer:基于 Fenwick Tree,O(log n) 查询/更新
    - BSTCacheIndexer:基于 std::map 平衡树
    - LRUCacheIndexer:基于 LRU 链表
    - TTLCacheIndexerWrapper:为任意 indexer 叠加 TTL 过期驱逐
    - 支持多容量档位(capacity_gb)前缀命中统计 + hit-age 分桶
  3. OnlineOptimizerManager:核心编排层
    - InstanceGroup CRUD + Instance 注册/注销/恢复
    - TraceQuery:接收 block_keys,驱动 indexer 计算每容量档位命中率
    - 持久化 InstanceGroup / InstanceInfo 到 RegistryBackend(Redis/Local)
  4. 服务层:gRPC + HTTP 双协议
    - optimizer_service.proto:定义 11 个 RPC(InstanceGroup CRUD、Register/Remove/GetInstance、TraceQuery、ListInstances、ResetStats)
    - OptimizerServiceImpl:请求路由与参数校验
    - optimizer_service_grpc / optimizer_service_http:协议适配
  5. Metrics:Prometheus 指标上报
    - OptimizerMetricsCollector:采集 per-instance 命中率、unique keys、驱逐数、内存用量等
    - OptimizerMetricsReporter:周期性聚合并通过 KMonitor/Prometheus 上报
  6. 部署与打包
    - online_optimizer_server_main:独立可执行入口
    - default_optimizer_config.json + start_optimizer_server.sh:默认配置与启动脚本
    - Dockerfile.optimizer:独立 Docker 镜像
  7. 测试:8 个 UT + 1 个集成测试
    - 覆盖 indexer(Fenwick/BST/TTL)、manager、service_impl、metrics_reporter、server_config、optimizer_config
    - 集成测试验证完整 RegisterInstance → TraceQuery → ListInstances 流程

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b752876e5d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +138 to +140
state->indexer = CacheIndexerFactory::CreateCacheIndexer(instance_group.indexer_type(), instance_group.max_key_count(),
capacity_gb, size_full_only, size_full_linear, linear_step,
instance_group.ttl_seconds());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject invalid indexers before accepting registrations

When an instance group is created or updated with an unsupported indexer_type (the service path does not call ValidateRequiredFields), CreateCacheIndexer returns nullptr. This assignment stores that null indexer and RegisterInstance still returns OK, so the next TraceQuery, ListInstances, or ResetStats dereferences state->indexer and can crash the optimizer server. Please either validate groups before saving them or fail registration when the factory returns null.

Useful? React with 👍 / 👎.

Comment on lines +65 to +67
// Rollback persistence on failure
if (registry_manager_) {
registry_manager_->DeleteInstanceInfo(instance_id);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve existing registrations on failed re-register

If an existing instance is re-registered and RegisterInstanceInternal then fails (for example because the new request has empty specs or the group is disabled), this rollback deletes instance_id from the persistent registry after SaveInstanceInfo has already overwritten the previous good record. The old in-memory state continues running until restart, but recovery will lose the instance, so a failed admin call can silently remove an existing registration from persistence; validate/build before overwriting or restore the previous value on rollback.

Useful? React with 👍 / 👎.

Comment on lines +104 to +105
if (logical_time_ >= total_slots_) {
DoCompact();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid compacting fenwick state with stale access entries

When an existing key triggers compaction after the logical clock reaches total_slots_, its old timestamp has already been removed from fenwick_/reverse_map_ but last_access_ still contains that old timestamp. DoCompact() therefore re-adds this same key at the stale timestamp and the code then inserts it again at the new timestamp, leaving duplicate active entries and corrupting stack distances/evictions for long-running fenwick_lru instances; remove or update the last_access_ entry before compacting, or compact after installing the new timestamp.

Useful? React with 👍 / 👎.

Comment on lines +171 to +172
if (registry_manager_) {
registry_manager_->DeleteInstanceInfo(instance_id);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate failed persistent instance deletes

When a registry backend is configured and DeleteInstanceInfo fails (for example a Redis/local storage error), this path has already erased the in-memory instance but still returns OK and leaves the old instance record in persistent storage. The instance disappears immediately but is recovered after the next restart, so removals are not durable under storage errors; handle the delete error or roll back the in-memory erase.

Useful? React with 👍 / 👎.

}
}

instance_groups_.erase(group_name);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject removing groups that still own instances

If an admin removes a group while instances in that group are still registered, this erases only the group record; the instance records and in-memory instances remain. On restart recovery then skips those persisted instances because their group no longer exists, so this call silently drops active registrations after a restart; reject the removal when ListInstanceInfos(group_name) is non-empty or remove the dependent instances consistently.

Useful? React with 👍 / 👎.


auto registry = manager_->registry_manager();
auto group = ConvertProtoToInstanceGroup(request->instance_group());
ErrorCode ec = registry ? registry->UpdateInstanceGroup(group) : EC_ERROR;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply group updates to active instances

Updating a group only changes the registry entry, while each registered InstanceState keeps its own copied instance_group from registration. For active instances, changes to capacity tiers, TTL, or indexer type therefore do not affect TraceQuery/metrics until the instance is re-registered or the server restarts, and recovery will then rebuild with different behavior; either rebuild affected instances on update or reject updates while instances are registered.

Useful? React with 👍 / 👎.

}
});

return true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Surface HTTP startup failures from Start

If the HTTP port is already in use or async_start() fails, CoroHttpService::Start returns false inside this background thread, but InitHttpServer still returns true and OnlineOptimizerServer::Start reports success with only the gRPC side running and no HTTP/metrics endpoint. Propagate the thread's startup result before returning from Start, or start the HTTP server synchronously through the bind phase.

Useful? React with 👍 / 👎.

@qoderai qoderai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Thanks for this large and well-structured contribution — the online optimizer module is clearly separated from the rest of KVCM and the indexer/manager/service layering is easy to follow.

Key Risks & Issues

  • Group validation is missing at the service layer. ConvertProtoToInstanceGroup/CreateInstanceGroup/UpdateInstanceGroup persist an OptimizerInstanceGroup without calling ValidateRequiredFields, so empty names, unsupported indexer_type, or enabled=true with no capacities can be saved. This feeds the null-indexer crash already noted for RegisterInstance.
  • ResetStats can also crash on a bad indexer type. It rebuilds the indexer with CreateCacheIndexer and stores the result unchecked, so the same null-dereference path exists after a stats reset.
  • LRU indexer charge logic looks incorrect. LruCacheIndexer::ProcessKeys derives charge from the loop index i (position inside the current query batch) rather than from the key's position in the global block sequence. The same key can therefore be charged differently across queries, and a query that does not start on a full-block boundary will simulate the wrong capacity.
  • Registry recovery has synchronization and staleness bugs. LoadRecoveryData writes instance_groups_/instance_infos_ without taking mutex_, and it never clears the output vectors or the in-memory maps, so stale entries survive a second recovery call.
  • Metrics cardinality and memory leaks. Per-query Prometheus gauges are tagged with client_ip, which is unbounded in production. Interval gauges keyed by instance_id/capacity_gb/age_bucket are created but never removed when instances go away, so MetricsRegistry grows monotonically.
  • KMonitor lifecycle is incomplete. InitKmonitor() starts the global KMonitor factory but neither the reporter destructor nor OnlineOptimizerServer::Stop() stops it, risking background-thread leaks. The return value of InitKmonitor() is also ignored.
  • Admin interfaces are plaintext and unauthenticated. gRPC uses InsecureServerCredentials() on 0.0.0.0, and the HTTP endpoints perform no auth/ACL checks before mutating groups and instances.
  • Recovery failures are swallowed. OnlineOptimizerServer::Init logs a warning when manager_->Recover() fails and continues; combined with the non-atomic read-modify-write pattern in LoadAndSave/LoadAndDelete, this makes multi-process or storage-error scenarios easy to corrupt.

Verification Advice

  • Add unit tests that create groups with empty names / invalid indexer_type and assert INVALID_ARGUMENT.
  • Add a test for ResetStats after changing a group's indexer_type to an invalid value.
  • Run ReportPerQuery with a large, varying set of client_ips and measure MetricsRegistry memory.
  • Register then remove an instance and verify that /metrics no longer exposes its series.
  • Test recovery after manually creating a group and after a storage error.

Thoughts & Suggestions

The module's shape is solid. The highest-value follow-ups would be (1) validating all configuration at the service boundary, (2) fixing the LRU charge model, and (3) bounding or cleaning up metric labels before running this in production.


🤖 Generated by Qoder

Comment on lines +85 to +93
void OptimizerServiceImpl::CreateInstanceGroup(RequestContext *request_context,
const proto::optimizer::CreateInstanceGroupRequest *request,
proto::optimizer::CommonResponse *response) {
request_context->set_api_name("CreateInstanceGroup");
OptimizerCallGuard guard(request_context, metrics_reporter_.get());

auto registry = manager_->registry_manager();
auto group = ConvertProtoToInstanceGroup(request->instance_group());
ErrorCode ec = registry ? registry->CreateInstanceGroup(group) : EC_ERROR;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The service layer converts the protobuf group directly into an OptimizerInstanceGroup and persists it without calling ValidateRequiredFields. This allows groups with empty names, unsupported indexer_type, negative max_key_count, or enabled=true with no capacity_gb to be saved. Those invalid groups later cause registration or TraceQuery failures (or, for invalid indexer_type, a null indexer crash). Please validate required fields in CreateInstanceGroup and UpdateInstanceGroup before calling the registry, and surface an invalid-argument error to the caller.


🤖 Generated by Qoder

Comment on lines +297 to +303
state->indexer = CacheIndexerFactory::CreateCacheIndexer(state->instance_group->indexer_type(),
state->instance_group->max_key_count(),
state->instance_group->capacity_gb(),
state->size_full_only,
state->size_full_linear,
state->linear_step,
state->instance_group->ttl_seconds());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResetStats rebuilds the indexer with CacheIndexerFactory::CreateCacheIndexer but never checks the result. If the group's indexer_type is invalid or the factory otherwise returns nullptr, the next TraceQuery, ListInstances, or ResetStats dereferences state->indexer and crashes. Please guard the factory result and return an error if the indexer cannot be recreated.


🤖 Generated by Qoder

Comment on lines +81 to +90
int64_t charge;
if (linear_step_ <= 1) {
charge = size_full_linear_;
} else if (i == total_keys - 1) {
charge = size_full_linear_;
} else if (i % linear_step_ == 0) {
charge = size_full_linear_;
} else {
charge = size_full_only_;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per-key cache charge is decided by the loop index i (position within the current query batch) rather than by the key's position in the global block sequence. Because the same key can appear at different offsets across queries, it may be inserted with different charges, and the capacity simulation for full-only vs full-linear blocks becomes incorrect when a query does not start on a full-block boundary. Consider deriving the charge from the key's global sequence index or the instance's linear_step pattern instead of the local batch index.


🤖 Generated by Qoder

Comment on lines +244 to +246
data.instance_groups.push_back(group);
instance_groups_[name] = group;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoadRecoveryData mutates instance_groups_ here (and instance_infos_ below) without holding mutex_, unlike every other mutating method in this class. It also appends to the output vectors and inserts into the member maps without clearing them first, so a second call can leave stale entries that no longer exist in storage. Please take std::unique_lock<std::shared_mutex>, clear data and the member maps at the start, and then load from storage.


🤖 Generated by Qoder

double hit_rate = static_cast<double>(info.hit_count) / static_cast<double>(p->total_blocks());
std::string cap_str = std::to_string(info.capacity_gb);

MetricsTags prom_tags = {{"instance_id", instance_id}, {"client_ip", client_ip}, {"capacity_gb", cap_str}};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReportPerQuery creates Prometheus gauges tagged with client_ip for every request. In production the client IP set is unbounded, so the label cardinality grows without limit and MetricsRegistry memory will leak. Consider dropping client_ip from per-query gauges, aggregating to a bounded set of buckets, or implementing an eviction policy for stale label combinations.


🤖 Generated by Qoder

std::string bucket_label = bucket.threshold_seconds > 0
? std::to_string(bucket.threshold_seconds) + "s"
: "inf";
MetricsTags bucket_tags = {{"instance_id", s.instance_id}, {"age_bucket", bucket_label}};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReportInterval registers Prometheus gauges keyed by instance_id, capacity_gb, and age_bucket, but there is no cleanup when an instance is removed or its configuration changes. Stale series remain in MetricsRegistry forever, leaking memory and continuing to expose obsolete values on /metrics. Consider tracking previously registered series and deleting gauges for removed instances or labels.


🤖 Generated by Qoder

@github-actions github-actions Bot added the ai reviewed AI has reviewed this PR label Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai reviewed AI has reviewed this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant