Emit health metrics for ccp without requiring fluent-bit #1414
Open
davidkydd wants to merge 21 commits into Azure:main from
Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>
In CCP mode, fluent-bit is not started so the ME log parsing path that normally increments TimeseriesReceivedTotal/TimeseriesSentTotal/BytesSentTotal is not available. This adds a goroutine that periodically scrapes the otelcollector's own internal metrics endpoint (port 8888) and feeds deltas from otelcol_receiver_accepted_metric_points and otelcol_exporter_sent_metric_points into the shared health metric globals. This makes the health metrics endpoint on port 2234 report non-zero timeseries_received_per_minute and timeseries_sent_per_minute values when the otelcollector pipeline is actively processing data.
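The delta step described above can be sketched as follows. This is a minimal illustration rather than the PR's implementation: `sumCounter` and `counterDelta` are hypothetical names, the parsing covers only simple sample lines in the Prometheus text format, and the `receiver` label in the sample payload is assumed.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumCounter adds up every sample of the named metric in a Prometheus
// text-format payload. Comment lines and other metrics are skipped.
func sumCounter(payload, name string) float64 {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// The name must be followed by a label set or a space; otherwise the
		// line belongs to a longer metric name that only shares the prefix.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		if end := strings.LastIndex(rest, "}"); end >= 0 {
			rest = rest[end+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			total += v
		}
	}
	return total
}

// counterDelta returns the increase between two scrapes. If the counter went
// backwards (for example after a collector restart), the current value is
// treated as the increase.
func counterDelta(prev, curr float64) float64 {
	if curr < prev {
		return curr
	}
	return curr - prev
}

func main() {
	const name = "otelcol_receiver_accepted_metric_points"
	// Two fake scrape payloads; the label is an assumption for the sketch.
	first := name + `{receiver="prometheus"} 100` + "\n"
	second := name + `{receiver="prometheus"} 160` + "\n"

	prev := sumCounter(first, name)
	curr := sumCounter(second, name)
	fmt.Println("accepted since last scrape:", counterDelta(prev, curr))
}
```

Treating a decreasing counter as a restart keeps the fed-in deltas non-negative, which matters when they are added to shared running totals.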
…metric_points

Track otelcol_exporter_send_failed_metric_points counter deltas and feed them into OtelCollectorExportingFailedCount so the exporting_metrics_failed health metric on port 2234 reflects actual export failures in CCP mode.
Primary health metrics (timeseries_received/sent, bytes_sent) are now derived from ME stdout log parsing, matching what fluent-bit did without reintroducing it. The otelcol scraper is repurposed as a diagnostic supplement exposing otelcol_receiver_accepted, otelcol_exporter_sent, and otelcol_exporter_send_failed rates on port 2234. This enables distinguishing otelcol->ME failures from ME->workspace failures, closing the observability gap where otelcol metrics only showed sent-to-ME, not ME-published-to-workspace.

New files:
- me_log_tailer.go: TailMELogs parses ME stdout for ProcessedCount, SentToPublicationCount, EventsProcessedLastPeriod regexes
- me_log_tailer.go: TailOtelCollectorLogFile watches collector-log.txt

Modified:
- process_utilities_linux.go: Route ME stdout through TailMELogs in CCP
- otelcol_health_scraper.go: Feed diagnostic globals instead of primary
- health_metrics.go: 3 new diagnostic gauges + registration + ticker
- main.go: Start otelcol log tailer in CCP mode
ME is started with -Logger File, which writes ProcessedCount and EventsProcessedLastPeriod lines to /MetricsExtensionConsoleDebugLog.log, not stdout. The previous implementation, which read stdout, only got startup messages.

Changes:
- Rename TailMELogs(io.Reader) to TailMELogFile(filePath string)
- Tail /MetricsExtensionConsoleDebugLog.log using file poll (like TailOtelCollectorLogFile)
- Revert process_utilities_linux.go: always copyOutputPipe for ME stdout
- Start TailMELogFile from main.go CCP section
- Update tests to use temp files
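One poll step of such a file tailer might look like the sketch below. It is not the code in me_log_tailer.go: `readNewLines` and `parseCount` are hypothetical helpers, and the `key: 123` log layout is an assumption, since the actual ME line format is not shown in this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"regexp"
	"strconv"
	"strings"
)

// parseCount extracts the integer that follows key in a log line.
// The "key: 123" / "key=123" layout is an assumption for this sketch.
func parseCount(line, key string) (int64, bool) {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(key) + `\s*[:=]\s*(\d+)`)
	m := re.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	n, err := strconv.ParseInt(m[1], 10, 64)
	return n, err == nil
}

// readNewLines returns the complete lines written after offset, plus the
// offset to use on the next poll. A shrinking file is treated as truncated
// and re-read from the start.
func readNewLines(path string, offset int64) ([]string, int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, offset, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return nil, offset, err
	}
	if info.Size() < offset {
		offset = 0
	}
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return nil, offset, err
	}

	var lines []string
	r := bufio.NewReader(f)
	for {
		line, err := r.ReadString('\n')
		if err == io.EOF {
			// A trailing partial line is left for the next poll.
			return lines, offset, nil
		}
		if err != nil {
			return lines, offset, err
		}
		offset += int64(len(line))
		lines = append(lines, strings.TrimRight(line, "\r\n"))
	}
}

func main() {
	f, err := os.CreateTemp("", "me-log-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	fmt.Fprintln(f, "startup complete")
	fmt.Fprintln(f, "ProcessedCount: 42 SentToPublicationCount: 40")
	f.Close()

	lines, offset, err := readNewLines(f.Name(), 0)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		if n, ok := parseCount(l, "ProcessedCount"); ok {
			fmt.Println("processed:", n)
		}
	}
	fmt.Println("next poll starts at byte", offset)
}
```

A real tailer would call the poll step from a ticker loop and keep the returned offset between calls.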
- Rename invalid_custom_prometheus_config to invalid_metrics_settings_config for CCP mode
- Change env var from AZMON_INVALID_CUSTOM_PROMETHEUS_CONFIG to AZMON_INVALID_METRICS_SETTINGS_CONFIG
- Add settings configmap validation: flag parsing errors (v1/v2 schema, file read)
- Set INVALID_SETTINGS_CONFIG_ERROR with error details on parse failure
- Initialize validation env vars at start of CCP parser (default: valid)
- Non-CCP (fluent-bit) code unchanged; keeps original metric name
When the settings configmap has parse errors, continue with default configuration instead of returning early. The early return prevented the downstream collector config setup (config merger, validator) from running, which caused the otelcollector to fail to start and the liveness probe to fail. Now the error is recorded in AZMON_INVALID_METRICS_SETTINGS_CONFIG and the collector starts with defaults, allowing the health metric to properly report the invalid config state.
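The record-and-continue pattern this commit describes can be illustrated with a small sketch. Only the two environment variable names come from this PR; `parseInterval`, `applySettings`, the `interval=` format, and the `30s` default are hypothetical stand-ins, and the sketch returns a map instead of setting process environment variables.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// defaultInterval is a hypothetical default used only in this sketch.
const defaultInterval = "30s"

// parseInterval is a stand-in for the real settings parser: it accepts
// "interval=<value>" and rejects everything else.
func parseInterval(raw string) (string, error) {
	raw = strings.TrimSpace(raw)
	if !strings.HasPrefix(raw, "interval=") {
		return "", errors.New("missing interval key")
	}
	value := strings.TrimPrefix(raw, "interval=")
	if value == "" {
		return "", errors.New("empty interval value")
	}
	return value, nil
}

// applySettings never returns early on a parse error. It records the failure
// in the returned env map and falls back to the default, so the caller can
// always continue with the remaining collector setup.
func applySettings(raw string) (string, map[string]string) {
	env := map[string]string{
		"AZMON_INVALID_METRICS_SETTINGS_CONFIG": "false",
		"INVALID_SETTINGS_CONFIG_ERROR":         "",
	}
	interval := defaultInterval
	if parsed, err := parseInterval(raw); err != nil {
		env["AZMON_INVALID_METRICS_SETTINGS_CONFIG"] = "true"
		env["INVALID_SETTINGS_CONFIG_ERROR"] = err.Error()
	} else {
		interval = parsed
	}
	return interval, env
}

func main() {
	for _, raw := range []string{"interval=15s", "not a setting"} {
		interval, env := applySettings(raw)
		fmt.Printf("input=%q interval=%s invalid=%s error=%q\n",
			raw, interval,
			env["AZMON_INVALID_METRICS_SETTINGS_CONFIG"],
			env["INVALID_SETTINGS_CONFIG_ERROR"])
	}
}
```

Because both branches fall through to the same return, any setup code placed after the call runs regardless of whether parsing succeeded.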
/azp run
No pipelines are associated with this pull request.
bragi92
previously approved these changes
Feb 27, 2026
…nts overall component exporting failures
Implement canonical 3-tier metric naming for CCP pipeline health monitoring:
- overall_* metrics: golden in/out/drop counters (pipeline-wide)
- otelcol_* metrics: OtelCollector stage in/out/drop + export failures
- me_* metrics: Metrics Extension stage in/out/drop
bragi92
previously approved these changes
Mar 2, 2026
PR: CCP Health Metrics — Expose Prometheus Collector Health Metrics in CCP Mode
Summary
In CCP (Control Plane Components) mode, the ama-metrics-ccp pod runs without fluent-bit. Previously, the health metrics pipeline (timeseries received/sent/bytes per minute, settings validation, exporting failures) relied entirely on fluent-bit's Lua plugins to parse ME logs and feed Prometheus gauges on port :2234. This meant CCP pods had zero health observability: the metrics endpoint wasn't exposed at all.

This PR adds native Go implementations that replace the fluent-bit pipeline for CCP mode, exposing all 8 health metrics on :2234 without any fluent-bit dependency.

Problem
ccpMetricsEnabled != "true" in main.go)

Changes
New Files (CCP-only, in otelcollector/shared/)
- health_metrics.go: Serves the health metrics on :2234. Reads ME volume globals + env vars each tick (60s)
- me_log_tailer.go: Tails /MetricsExtensionConsoleDebugLog.log line-by-line, parsing ProcessedCount/EventsProcessedLastPeriod lines → feeds TimeseriesReceivedTotal, TimeseriesSentTotal, BytesSentTotal. Also tails otelcollector log for "Exporting failed" messages
- otelcol_health_scraper.go: Scrapes :8888/metrics every 15s for supplementary diagnostics (otelcol_receiver_accepted, otelcol_exporter_sent, otelcol_exporter_send_failed). Computes per-minute rates and accumulates send-failed totals
- health_metrics_test.go
- me_log_tailer_test.go
- otelcol_health_scraper_test.go

Modified Files
- otelcollector/main/main.go: else branch for CCP mode: starts ExposePrometheusCollectorHealthMetrics, TailMELogFile, TailOtelCollectorLogFile, ScrapeOtelCollectorHealthMetrics as goroutines
- otelcollector/shared/configmap/ccp/configmapparserforccp.go: (1) Initialize AZMON_INVALID_METRICS_SETTINGS_CONFIG=false at startup. (2) Set it to true with descriptive error on any parse failure. (3) Fix early-return bug: replaced return statements with if/else blocks so collector continues with defaults on error. (4) Added cleanSettingsError() helper for concise, actionable error messages in the metric label. (5) Error messages now name the source configmap (ama-metrics-settings-configmap)
- otelcollector/deploy/addon-chart/ccp-metrics-plugin/templates/ama-metrics-deployment.yaml: health-metrics) on CCP deployment
- go.mod / go.sum: prometheus/client_golang and transitive dependencies (see Dependency Changes section)

Health Metrics Exposed (port 2234)
- timeseries_received_per_minute: EventsProcessedLastPeriod)
- timeseries_sent_per_minute: SentToPublicationCount)
- bytes_sent_per_minute: SentToPublicationBytes)
- invalid_metrics_settings_config: error label contains reason
- exporting_metrics_failed
- otelcol_receiver_accepted_metric_points_per_minute: :8888/metrics
- otelcol_exporter_sent_metric_points_per_minute: :8888/metrics
- otelcol_exporter_send_failed_metric_points_total: :8888/metrics

Architecture
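As a rough illustration of how gauges like the ones listed above can be served to a scraper, here is a standard-library-only sketch. The PR itself uses prometheus/client_golang; `formatGauge`, `metricsHandler`, and `receivedPerMinute` are hypothetical names, and the `/metrics` path is assumed.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"sync/atomic"
)

// labelEscaper escapes the characters that the Prometheus text format
// requires to be escaped inside a label value.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatGauge renders one sample line. An empty errLabel omits the label set.
func formatGauge(name, errLabel string, value float64) string {
	v := strconv.FormatFloat(value, 'f', -1, 64)
	if errLabel == "" {
		return name + " " + v + "\n"
	}
	return name + `{error="` + labelEscaper.Replace(errLabel) + `"} ` + v + "\n"
}

// receivedPerMinute stands in for a shared value that a log tailer would
// update; the atomic type keeps reads from the handler race-free.
var receivedPerMinute atomic.Int64

// metricsHandler writes the current gauge values in text exposition format.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, formatGauge("timeseries_received_per_minute", "",
		float64(receivedPerMinute.Load())))
	fmt.Fprint(w, formatGauge("invalid_metrics_settings_config",
		"example parse error", 1))
}

func main() {
	receivedPerMinute.Store(1200)

	// Exercise the handler in-process so the sketch terminates. A long-running
	// process would instead register it and call http.ListenAndServe(":2234", nil).
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Hand-rolling the text format keeps the example dependency-free; a client library additionally handles registration, HELP/TYPE lines, and content negotiation.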
Bug Fix: Early Return on Settings Parse Error
Before this PR, if the CCP settings configmap had a parse error (e.g., schema-version: v2 but missing controlplane-metrics key), Configmapparserforccp() would return early. This skipped all downstream collector setup (otelcollector config merge, scrape config validation), causing the pod to enter CrashLoopBackOff.

Fixed by replacing all 4 return statements with if/else blocks. On parse error, the collector now:
- Sets AZMON_INVALID_METRICS_SETTINGS_CONFIG=true with a descriptive error
- Reports invalid_metrics_settings_config{error="..."} 1

Error Message Improvements
Error messages in the invalid_metrics_settings_config metric error label are now actionable:
- Error reading config version file: <err> → Unable to read config version from ama-metrics-settings-configmap (<path>): <err>. Using default configuration
- Error reading config schema version file: <err> → Unable to read schema version from ama-metrics-settings-configmap (<path>): <err>. Using default configuration
- Error parsing files: <err> → Failed to parse v2 settings from ama-metrics-settings-configmap: settings file not found: <path> (expected from ama-metrics-settings-configmap). Falling back to default configuration
- Error parsing config: <err> → Failed to parse v1 settings from ama-metrics-settings-configmap: <err>. Falling back to default configuration

The cleanSettingsError() helper strips redundant OS-level error wrapping (e.g., duplicate paths from os.Open) for file-not-found errors. These improvements are scoped to configmapparserforccp.go only; helpers.go (shared by MP) is unchanged.

Dependency Changes
All dependency additions are required. The only new direct dependency is github.com/prometheus/client_golang v1.23.2 (imported by health_metrics.go), matching the version used by fluent-bit, otel-allocator, and prometheusreceiver in this repo. All other additions are transitive:

otelcollector/shared/go.mod (direct changes):
- github.com/prometheus/client_golang v1.23.2: health_metrics.go for Prometheus gauge/counter registration and HTTP handler. Matches fluent-bit/otel-allocator/prometheusreceiver version
- golang.org/x/sys v0.31.0: bootstrap_certificates_windows.go; promoted from missing by go mod tidy
- golang.org/x/text v0.23.0: bootstrap_certificates_windows.go; promoted from indirect by go mod tidy
- gopkg.in/yaml.v2 v2.4.0: collector_replicaset_config_helper.go; promoted from missing by go mod tidy
- beorn7/perks, cespare/xxhash/v2, klauspost/compress, munnerz/goautoneg, prometheus/procfs: prometheus/client_golang

Other go.mod files (otelcollector/go.mod, configuration-reader-builder/go.mod, shared/configmap/ccp/go.mod, shared/configmap/mp/go.mod):
// indirect; these modules depend on shared, so they inherit the transitive dependency on prometheus/client_golang.

otelcollector/prometheusreceiver/go.mod:

Testing
- invalid_metrics_settings_config = 0
- controlplane-metrics) → verified metric = 1 with descriptive error text
- otelcollector/test/ccp/ccp-health-metrics-workflow.md for full test procedure

Non-CCP Impact
Zero impact on non-CCP builds. The CCP mode branch in main.go only activates when ccpMetricsEnabled == "true". All new Go files are in the shared package but only called from the CCP code path. The cleanSettingsError helper and improved error messages are scoped to configmapparserforccp.go only; helpers.go (shared by MP) is unchanged.
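For readers unfamiliar with the kind of cleanup cleanSettingsError performs, the sketch below shows one way to collapse a wrapped file-not-found error into a single mention of the path. It is an assumption-laden illustration, not the helper from this PR: the signature, the `readSettingsMessage` wrapper, and the sample path are all hypothetical, and only the output wording follows the message format quoted in the description.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

const configMapName = "ama-metrics-settings-configmap"

// cleanSettingsError shortens file-not-found errors to a single mention of
// the path. Every other error is passed through unchanged.
func cleanSettingsError(err error) string {
	if err == nil {
		return ""
	}
	var pathErr *fs.PathError
	if errors.Is(err, fs.ErrNotExist) && errors.As(err, &pathErr) {
		return fmt.Sprintf("settings file not found: %s (expected from %s)",
			pathErr.Path, configMapName)
	}
	return err.Error()
}

// readSettingsMessage tries to read path and returns the cleaned error
// message, or "" when the file could be read.
func readSettingsMessage(path string) string {
	_, err := os.ReadFile(path)
	return cleanSettingsError(err)
}

func main() {
	// Hypothetical path, chosen so that the read fails with "not exist".
	const path = "/nonexistent-dir-for-sketch/schema-version"

	_, err := os.ReadFile(path)
	if err != nil {
		// Wrapping with the path again reproduces the duplicated-path noise.
		wrapped := fmt.Errorf("reading %s: %w", path, err)
		fmt.Println("raw:    ", wrapped)
		fmt.Println("cleaned:", cleanSettingsError(wrapped))
	}
}
```

Using errors.Is and errors.As rather than string matching keeps the check working even when the OS error has been wrapped several times.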