[observer] High-frequency system check runner (1s interval) #48707
scottopell merged 1 commit into q-branch-observer
Go Package Import Differences
Baseline: e5b320d

Files inventory check summary
File checks results against ancestor e5b320de:
Results for datadog-agent_7.79.0~devel.git.408.b4bf1a9.pipeline.105946945-1_amd64.deb:
Detected file changes:

Static quality checks
❌ Please find below the results from static quality gates
Error
Gate failure full details
Static quality gates prevent the PR to merge!
Successful checks
Info
2 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
3d095cf to 1f186c2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f186c2638
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
CelianR left a comment
LGTM, thanks for adding telemetry!
1f186c2 to b4bf1a9
@codex re-review please
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4bf1a92d9
    if h.dropCounter != nil {
        h.dropCounter.Add(1, h.source)
    }
Count dropped observations for all signal types
The new observer.channel.dropped telemetry is only incremented in ObserveMetric; dropped logs, traces, and profiles still update only dropCount. If the observer channel is saturated by non-metric traffic (for example log bursts), this counter will significantly under-report back-pressure by source, which defeats the purpose of this new metric for operational debugging. Please increment dropCounter in the drop paths of ObserveLog, ObserveTrace, and ObserveProfile too.
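One way to address the suggestion above is to route every drop path through a single helper, so the counter cannot drift between signal types. The following is only a sketch: `counter`, `mapCounter`, `observation`, `trySend`, and `recordDrop` are hypothetical names, and the method signatures are simplified relative to the real observer handle.

```go
package main

import "fmt"

// counter is a hypothetical stand-in for the agent's telemetry counter.
type counter interface {
	Add(value float64, tags ...string)
}

// mapCounter is an in-memory counter keyed by its first tag (illustration only).
type mapCounter struct{ counts map[string]float64 }

func (c *mapCounter) Add(value float64, tags ...string) {
	key := ""
	if len(tags) > 0 {
		key = tags[0]
	}
	c.counts[key] += value
}

// observation is a placeholder for a metric, log, trace, or profile payload.
type observation struct{ kind string }

// handle loosely mimics the observer handle discussed in the review.
type handle struct {
	source      string
	ch          chan observation
	dropCount   int
	dropCounter counter // may be nil when telemetry is not wired
}

// recordDrop is the single place where both drop counters are updated.
func (h *handle) recordDrop() {
	h.dropCount++
	if h.dropCounter != nil {
		h.dropCounter.Add(1, h.source)
	}
}

// trySend does a non-blocking send and records a drop when the channel is full.
func (h *handle) trySend(o observation) {
	select {
	case h.ch <- o:
	default:
		h.recordDrop()
	}
}

// All four signal types share the same send and drop path.
func (h *handle) ObserveMetric()  { h.trySend(observation{kind: "metric"}) }
func (h *handle) ObserveLog()     { h.trySend(observation{kind: "log"}) }
func (h *handle) ObserveTrace()   { h.trySend(observation{kind: "trace"}) }
func (h *handle) ObserveProfile() { h.trySend(observation{kind: "profile"}) }

func main() {
	c := &mapCounter{counts: map[string]float64{}}
	h := &handle{source: "logs", ch: make(chan observation, 1), dropCounter: c}
	h.ObserveLog()   // fills the one-slot buffer
	h.ObserveLog()   // dropped
	h.ObserveTrace() // dropped
	fmt.Println("drops:", h.dropCount, "telemetry:", c.counts["logs"])
}
```

With a shared helper, adding a fifth signal type later would pick up the telemetry automatically as long as it sends through `trySend`.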
    // When true, the observer runs system checks (cpu, memory, disk, io, load, network, etc.)
    // at 1-second intervals and feeds them directly into the anomaly detection pipeline.
    // These high-frequency samples are never forwarded to Datadog intake.
    // The normal 15-second system check pipeline is unaffected.
    config.BindEnvAndSetDefault("observer.high_frequency_system_checks.enabled", false)
Document new HF system-check flag in the config template
This change adds observer.high_frequency_system_checks.enabled to runtime defaults, but the high-frequency observer section was removed from config_template.yaml without a replacement entry for the new key. That leaves users without generated config/env documentation for enabling this feature, which is a stale-doc regression for a behavior/config change and makes rollout/support harder.
Introduces a prototype to test whether 1-second system check metrics improve anomaly detection scores in comp/observer compared to the default 15-second cadence.

- Adds `comp/observer/impl/hfrunner`: an observer-owned check runner that instantiates system checks (cpu, memory, disk, io, load, network, uptime, filehandles) using the existing factory catalog, runs them on a 1s tick, and routes output directly into the observer pipeline via a purpose-built `observerSender`. These samples never touch the aggregator or forwarder; intake isolation is guaranteed by design.
- When `observer.high_frequency_system_checks.enabled: true`, a `systemFilteredHandle` is placed on the `"all-metrics"` pipeline handle to suppress 15s system.* samples from the scorer, so only the higher-resolution 1s stream influences detection. Filtering uses `MetricSource` enum comparison via type assertion rather than string prefix matching.
- Adds `observer.channel.dropped` telemetry counter (tagged by source) to surface back-pressure from the observer's internal channel.
- Wires runner shutdown via `fx.Lifecycle.OnStop`; `Runner.Stop()` is guarded with `sync.Once` against double-close.

Removes the `observer.metrics.high_frequency_interval` mechanism that previously hacked all check intervals globally via `CheckBase.Interval()`. That approach is replaced entirely by the runner above.

- `checkbase.go`: `Interval()` simplified to `return c.checkInterval`
- `config.go`: `observer.metrics.high_frequency_interval` and `observer.metrics.enabled` config keys removed
- `sender.go`: `AggregatingSender` instantiation path removed
- `config_template.yaml`: stale `@param` doc block removed

- `TestSystemFilteredHandle`: table-driven unit test covering all 8 system check sources (dropped), non-system sources (pass through), `MetricSourceUnknown` (pass through), and no-sourceProvider samples (pass through).
- `TestAggregatingSender_SourcePopulated`: regression test verifying all 12 metric methods on `AggregatingSender` stamp the correct `MetricSource` on observer-bound samples.
- QA confirmed via proxy-dumper MiTM: `system.cpu.user` reached intake 4 times in 60s (15s cadence), not 60 times.
- Observer debug dump confirmed 475 series under `"system-checks-hf"` with 28-29 points each at strict 1s intervals after 30s runtime.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
b4bf1a9 to b6fefd3
Summary
- Adds `comp/observer/impl/hfrunner`: an observer-owned runner that instantiates system checks (cpu, memory, disk, io, load, network, uptime, filehandles) at 1-second intervals and routes their output directly into the observer anomaly detection pipeline. These samples never touch the aggregator or forwarder; intake isolation is guaranteed by design.
- When `observer.high_frequency_system_checks.enabled: true`, a `systemFilteredHandle` suppresses the normal 15s system.* samples from the scorer so only the higher-resolution stream influences detection. Filtering uses `MetricSource` enum comparison (type assertion to `sourceProvider`) rather than string prefix matching.
- Adds `observer.channel.dropped` telemetry counter (tagged by `source`) to surface back-pressure from the observer's internal channel.
- Removes the `observer.metrics.high_frequency_interval` mechanism that previously hacked all check intervals globally via `CheckBase.Interval()`.

Hypothesis
Does 1s system metric data improve anomaly detection scores vs the default 15s? This PR provides the infrastructure to answer that question — enable the flag, run evals, compare scores.
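To enable
Set `observer.high_frequency_system_checks.enabled: true` in the Agent configuration; the flag defaults to `false`.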
Test plan
- `TestSystemFilteredHandle`: table-driven test; all 8 system sources dropped, non-system sources pass through, `MetricSourceUnknown` passes through, no-`sourceProvider` samples pass through
- `TestAggregatingSender_SourcePopulated`: regression test; all 12 metric methods stamp the correct `MetricSource` on observer-bound samples
- QA confirmed via proxy-dumper MiTM: `system.cpu.user` hit intake 4×/60s (15s cadence), not 60×
- Observer debug dump confirmed 475 series under `"system-checks-hf"` with 28–29 points each at strict 1s intervals after 30s runtime

🤖 Generated with Claude Code