Emit health metrics for ccp without requiring fluent-bit #1414

Open
davidkydd wants to merge 21 commits into Azure:main from davidkydd:copilot/update-ccp-health-metrics

davidkydd (Contributor) commented Feb 15, 2026

PR: CCP Health Metrics — Expose Prometheus Collector Health Metrics in CCP Mode

Summary

In CCP (Control Plane Components) mode, the ama-metrics-ccp pod runs without fluent-bit. Previously, the health metrics pipeline (timeseries received/sent/bytes per minute, settings validation, exporting failures) relied entirely on fluent-bit's Lua plugins to parse ME logs and feed Prometheus gauges on port :2234. This meant CCP pods had zero health observability — the metrics endpoint wasn't exposed at all.

This PR adds native Go implementations that replace the fluent-bit pipeline for CCP mode, exposing all 8 health metrics on :2234 without any fluent-bit dependency.

Problem

  • CCP pods don't run fluent-bit (it's gated by ccpMetricsEnabled != "true" in main.go)
  • Without fluent-bit, no health metrics were emitted — no way to detect if ME is ingesting timeseries, if the settings configmap is valid, or if otelcollector exports are failing
  • The configmap parser had early-return bugs that caused CrashLoopBackOff when settings parsing failed, instead of falling back to defaults

Changes

New Files (CCP-only, in otelcollector/shared/)

| File | Purpose |
| --- | --- |
| health_metrics.go | Defines all 8 Prometheus gauge/counter metrics, registers them, and exposes on :2234. Reads ME volume globals + env vars each tick (60s) |
| me_log_tailer.go | Tails /MetricsExtensionConsoleDebugLog.log line-by-line parsing ProcessedCount/EventsProcessedLastPeriod lines → feeds TimeseriesReceivedTotal, TimeseriesSentTotal, BytesSentTotal. Also tails otelcollector log for "Exporting failed" messages |
| otelcol_health_scraper.go | Scrapes otelcollector's internal :8888/metrics every 15s for supplementary diagnostics (otelcol_receiver_accepted, otelcol_exporter_sent, otelcol_exporter_send_failed). Computes per-minute rates and accumulates send-failed totals |
| health_metrics_test.go | Unit tests for metric registration, endpoint output, label handling, settings config validation, mutex safety, exporting failed counter |
| me_log_tailer_test.go | Unit tests for ME log line parsing (ProcessedCount, EventsProcessedLastPeriod), multi-line accumulation, file tailing with live writes, partial match handling |
| otelcol_health_scraper_test.go | Unit tests for Prometheus line parsing, HTTP scraping with test servers, error handling (500s, connection refused, empty response), delta accumulation and counter reset logic |
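
To illustrate the kind of line parsing me_log_tailer.go is described as doing, here is a minimal sketch. The `Name: <integer>` log layout, the `Totals` type, and the `capture` helper are assumptions made for illustration; the real ME log format and the PR's actual regexes may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Totals accumulates the counts parsed from ME log lines (hypothetical type).
type Totals struct {
	Received  int64 // sum of EventsProcessedLastPeriod values
	Sent      int64 // sum of SentToPublicationCount values
	BytesSent int64 // sum of SentToPublicationBytes values
}

// The "Name: <integer>" layout is an assumption made for this sketch.
var (
	receivedRe  = regexp.MustCompile(`EventsProcessedLastPeriod:\s*(\d+)`)
	sentRe      = regexp.MustCompile(`SentToPublicationCount:\s*(\d+)`)
	bytesSentRe = regexp.MustCompile(`SentToPublicationBytes:\s*(\d+)`)
)

// capture returns the first integer captured by re in line, if any.
func capture(re *regexp.Regexp, line string) (int64, bool) {
	m := re.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	n, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

// ParseLine adds any counts found in line to the running totals.
// Lines that match none of the patterns are ignored.
func (t *Totals) ParseLine(line string) {
	if n, ok := capture(receivedRe, line); ok {
		t.Received += n
	}
	if n, ok := capture(sentRe, line); ok {
		t.Sent += n
	}
	if n, ok := capture(bytesSentRe, line); ok {
		t.BytesSent += n
	}
}

func main() {
	var t Totals
	for _, line := range []string{
		"ProcessedCount: 10 SentToPublicationCount: 8 SentToPublicationBytes: 2048",
		"EventsProcessedLastPeriod: 12",
		"startup banner with no counters",
	} {
		t.ParseLine(line)
	}
	fmt.Printf("received=%d sent=%d bytes=%d\n", t.Received, t.Sent, t.BytesSent)
}
```

Keeping the parser separate from the file-tailing loop is what makes it easy to unit test with plain strings, which is roughly the split the test file descriptions suggest.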

Modified Files

| File | Change |
| --- | --- |
| otelcollector/main/main.go | Added else branch for CCP mode: starts ExposePrometheusCollectorHealthMetrics, TailMELogFile, TailOtelCollectorLogFile, ScrapeOtelCollectorHealthMetrics as goroutines |
| otelcollector/shared/configmap/ccp/configmapparserforccp.go | (1) Initialize AZMON_INVALID_METRICS_SETTINGS_CONFIG=false at startup. (2) Set it to true with descriptive error on any parse failure. (3) Fix early-return bug: replaced return statements with if/else blocks so collector continues with defaults on error. (4) Added cleanSettingsError() helper for concise, actionable error messages in the metric label. (5) Error messages now name the source configmap (ama-metrics-settings-configmap) |
| otelcollector/deploy/addon-chart/ccp-metrics-plugin/templates/ama-metrics-deployment.yaml | Expose container port 2234 (health-metrics) on CCP deployment |
| Various go.mod/go.sum | Added prometheus/client_golang and transitive dependencies (see Dependency Changes section) |

Health Metrics Exposed (port 2234)

| Metric | Type | Source | Description |
| --- | --- | --- | --- |
| timeseries_received_per_minute | Gauge | ME log (EventsProcessedLastPeriod) | Timeseries received by ME per minute |
| timeseries_sent_per_minute | Gauge | ME log (SentToPublicationCount) | Timeseries sent to workspace per minute |
| bytes_sent_per_minute | Gauge | ME log (SentToPublicationBytes) | Bytes sent to workspace per minute |
| invalid_metrics_settings_config | Gauge | Configmap parser env vars | 0=valid, 1=invalid. error label contains reason |
| exporting_metrics_failed | Counter | Otelcollector log ("Exporting failed") | Count of export failures from otelcol |
| otelcol_receiver_accepted_metric_points_per_minute | Gauge | Otelcol :8888/metrics | Rate of metric points accepted by otelcol receiver |
| otelcol_exporter_sent_metric_points_per_minute | Gauge | Otelcol :8888/metrics | Rate of metric points sent from otelcol to ME |
| otelcol_exporter_send_failed_metric_points_total | Gauge | Otelcol :8888/metrics | Cumulative metric points that failed otelcol→ME export |
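
The otelcol-derived gauges depend on turning cumulative counters scraped from :8888/metrics into deltas and per-minute rates. The following is a rough sketch of one way to do that, not the code in otelcol_health_scraper.go: `parseSample`, `deltaTracker`, and `perMinute` are hypothetical names, the sample line is invented, and treating a decreasing counter as a restart from zero is an assumption borrowed from the usual Prometheus convention.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSample extracts the metric name and value from one line of Prometheus
// text exposition output. Labels are dropped and trailing timestamps are
// assumed to be absent.
func parseSample(line string) (string, float64, bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", 0, false // blank line or HELP/TYPE comment
	}
	idx := strings.LastIndexAny(line, " \t")
	if idx < 0 {
		return "", 0, false
	}
	value, err := strconv.ParseFloat(line[idx+1:], 64)
	if err != nil {
		return "", 0, false
	}
	name := strings.TrimSpace(line[:idx])
	if brace := strings.IndexByte(name, '{'); brace >= 0 {
		name = name[:brace]
	}
	return name, value, name != ""
}

// deltaTracker turns successive readings of a cumulative counter into increases.
type deltaTracker struct {
	last float64
	seen bool
}

// Observe returns the increase since the previous reading. The first reading
// only sets a baseline. If the counter went backwards, the counter is assumed
// to have restarted from zero, so the new value is the increase.
func (d *deltaTracker) Observe(v float64) float64 {
	delta := 0.0
	switch {
	case !d.seen:
		delta = 0
	case v < d.last:
		delta = v
	default:
		delta = v - d.last
	}
	d.last, d.seen = v, true
	return delta
}

// perMinute scales a delta observed over intervalSeconds to a per-minute rate.
func perMinute(delta, intervalSeconds float64) float64 {
	return delta * 60 / intervalSeconds
}

func main() {
	var sent deltaTracker
	for _, line := range []string{
		`otelcol_exporter_sent_metric_points{exporter="otlp"} 100`,
		`otelcol_exporter_sent_metric_points{exporter="otlp"} 150`,
		`otelcol_exporter_sent_metric_points{exporter="otlp"} 20`, // simulated restart
	} {
		if name, v, ok := parseSample(line); ok {
			delta := sent.Observe(v)
			fmt.Printf("%s delta=%v per_minute=%v\n", name, delta, perMinute(delta, 15))
		}
	}
}
```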

Architecture

CCP Pod (ama-metrics-ccp)
├── prometheus-collector container
│   ├── main.go ─── CCP mode branch
│   │   ├── go ExposePrometheusCollectorHealthMetrics()  ← serves :2234/metrics
│   │   ├── go TailMELogFile()                           ← parses ME log → primary metrics
│   │   ├── go TailOtelCollectorLogFile()                ← watches for export failures
│   │   └── go ScrapeOtelCollectorHealthMetrics()        ← scrapes :8888 → diagnostic metrics
│   └── configmapparserforccp.go
│       └── Sets AZMON_INVALID_METRICS_SETTINGS_CONFIG env var → invalid_metrics_settings_config gauge

Bug Fix: Early Return on Settings Parse Error

Before this PR, if the CCP settings configmap had a parse error (e.g., schema-version: v2 but missing controlplane-metrics key), Configmapparserforccp() would return early. This skipped all downstream collector setup (otelcollector config merge, scrape config validation), causing the pod to enter CrashLoopBackOff.

Fixed by replacing all 4 return statements with if/else blocks. On parse error, the collector now:

  1. Sets AZMON_INVALID_METRICS_SETTINGS_CONFIG=true with a descriptive error
  2. Continues with default configuration
  3. Reports invalid_metrics_settings_config{error="..."} 1 through the health metric

Error Message Improvements

Error messages in the invalid_metrics_settings_config metric error label are now actionable:

| Error path | Old message | New message |
| --- | --- | --- |
| Config version read error | Error reading config version file: <err> | Unable to read config version from ama-metrics-settings-configmap (<path>): <err>. Using default configuration |
| Schema version read error | Error reading config schema version file: <err> | Unable to read schema version from ama-metrics-settings-configmap (<path>): <err>. Using default configuration |
| v2 parse error | Error parsing files: <err> | Failed to parse v2 settings from ama-metrics-settings-configmap: settings file not found: <path> (expected from ama-metrics-settings-configmap). Falling back to default configuration |
| v1 parse error | Error parsing config: <err> | Failed to parse v1 settings from ama-metrics-settings-configmap: <err>. Falling back to default configuration |

The cleanSettingsError() helper strips redundant OS-level error wrapping (e.g., duplicate paths from os.Open) for file-not-found errors. These improvements are scoped to configmapparserforccp.go only — helpers.go (shared by MP) is unchanged.

Dependency Changes

All dependency additions are required. The only new direct dependency is github.com/prometheus/client_golang v1.23.2 (imported by health_metrics.go), matching the version used by fluent-bit, otel-allocator, and prometheusreceiver in this repo. All other additions are transitive:

otelcollector/shared/go.mod (direct changes):

| Dependency | Type | Reason |
| --- | --- | --- |
| github.com/prometheus/client_golang v1.23.2 | Direct | Imported by health_metrics.go for Prometheus gauge/counter registration and HTTP handler. Matches fluent-bit/otel-allocator/prometheusreceiver version |
| golang.org/x/sys v0.31.0 | Direct | Pre-existing import in bootstrap_certificates_windows.go, promoted from missing by go mod tidy |
| golang.org/x/text v0.23.0 | Direct | Pre-existing import in bootstrap_certificates_windows.go, promoted from indirect by go mod tidy |
| gopkg.in/yaml.v2 v2.4.0 | Direct | Pre-existing import in collector_replicaset_config_helper.go, promoted from missing by go mod tidy |
| beorn7/perks, cespare/xxhash/v2, klauspost/compress, munnerz/goautoneg, prometheus/procfs | Indirect | Transitive dependencies of prometheus/client_golang |

Other go.mod files (otelcollector/go.mod, configuration-reader-builder/go.mod, shared/configmap/ccp/go.mod, shared/configmap/mp/go.mod):

  • All additions are // indirect — these modules depend on shared, so they inherit the transitive dependency on prometheus/client_golang.

otelcollector/prometheusreceiver/go.mod:

  • Whitespace cleanup only (removed blank lines). No dependency changes.

Testing

  • Unit tests: 6 test files with comprehensive coverage of parsing, scraping, delta accumulation, counter resets, mutex safety, and metric registration
  • End-to-end validation (manual, on standalone cluster):
    • Deployed with valid config → verified all metrics emitting, invalid_metrics_settings_config = 0
    • Applied munged configmap (v2 schema, missing controlplane-metrics) → verified metric = 1 with descriptive error text
    • Deleted munged configmap → pod restarted, verified metric back to 0
    • Confirmed no CrashLoopBackOff with the early-return fix
    • See otelcollector/test/ccp/ccp-health-metrics-workflow.md for full test procedure

Non-CCP Impact

Zero impact on non-CCP builds. The CCP mode branch in main.go only activates when ccpMetricsEnabled == "true". All new Go files are in the shared package but only called from the CCP code path. The cleanSettingsError helper and improved error messages are scoped to configmapparserforccp.go only — helpers.go (shared by MP) is unchanged.

Copilot AI and others added 5 commits February 15, 2026 22:57
Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>
@davidkydd davidkydd requested a review from a team as a code owner February 15, 2026 23:16
Copilot AI and others added 13 commits February 15, 2026 23:23
Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>
In CCP mode, fluent-bit is not started so the ME log parsing path that
normally increments TimeseriesReceivedTotal/TimeseriesSentTotal/BytesSentTotal
is not available. This adds a goroutine that periodically scrapes the
otelcollector's own internal metrics endpoint (port 8888) and feeds
deltas from otelcol_receiver_accepted_metric_points and
otelcol_exporter_sent_metric_points into the shared health metric globals.

This makes the health metrics endpoint on port 2234 report non-zero
timeseries_received_per_minute and timeseries_sent_per_minute values
when the otelcollector pipeline is actively processing data.
…metric_points

Track otelcol_exporter_send_failed_metric_points counter deltas and feed them
into OtelCollectorExportingFailedCount so the exporting_metrics_failed health
metric on port 2234 reflects actual export failures in CCP mode.

Primary health metrics (timeseries_received/sent, bytes_sent) are now
derived from ME stdout log parsing, matching what fluent-bit did without
reintroducing it. The otelcol scraper is repurposed as a diagnostic
supplement exposing otelcol_receiver_accepted, otelcol_exporter_sent,
and otelcol_exporter_send_failed rates on port 2234.

This enables distinguishing otelcol->ME failures from ME->workspace
failures, closing the observability gap where otelcol metrics only
showed sent-to-ME not ME-published-to-workspace.

New files:
- me_log_tailer.go: TailMELogs parses ME stdout for ProcessedCount,
  SentToPublicationCount, EventsProcessedLastPeriod regexes
- me_log_tailer.go: TailOtelCollectorLogFile watches collector-log.txt

Modified:
- process_utilities_linux.go: Route ME stdout through TailMELogs in CCP
- otelcol_health_scraper.go: Feed diagnostic globals instead of primary
- health_metrics.go: 3 new diagnostic gauges + registration + ticker
- main.go: Start otelcol log tailer in CCP mode

ME is started with -Logger File, which writes ProcessedCount and
EventsProcessedLastPeriod lines to /MetricsExtensionConsoleDebugLog.log,
not stdout. The previous implementation reading stdout only got startup
messages.

Changes:
- Rename TailMELogs(io.Reader) to TailMELogFile(filePath string)
- Tail /MetricsExtensionConsoleDebugLog.log using file poll (like TailOtelCollectorLogFile)
- Revert process_utilities_linux.go: always copyOutputPipe for ME stdout
- Start TailMELogFile from main.go CCP section
- Update tests to use temp files
- Rename invalid_custom_prometheus_config -> invalid_metrics_settings_config for CCP mode
- Change env var from AZMON_INVALID_CUSTOM_PROMETHEUS_CONFIG -> AZMON_INVALID_METRICS_SETTINGS_CONFIG
- Add settings configmap validation: flag parsing errors (v1/v2 schema, file read)
- Set INVALID_SETTINGS_CONFIG_ERROR with error details on parse failure
- Initialize validation env vars at start of CCP parser (default: valid)
- Non-CCP (fluent-bit) code unchanged - keeps original metric name

When the settings configmap has parse errors, continue with default
configuration instead of returning early. The early return prevented
the downstream collector config setup (config merger, validator) from
running, which caused the otelcollector to fail to start and the
liveness probe to fail.

Now the error is recorded in AZMON_INVALID_METRICS_SETTINGS_CONFIG
and the collector starts with defaults, allowing the health metric
to properly report the invalid config state.
bragi92 (Member) commented Feb 27, 2026

/azp run

@azure-pipelines
No pipelines are associated with this pull request.

bragi92 previously approved these changes Feb 27, 2026
Implement canonical 3-tier metric naming for CCP pipeline health monitoring:
- overall_* metrics: golden in/out/drop counters (pipeline-wide)
- otelcol_* metrics: OtelCollector stage in/out/drop + export failures
- me_* metrics: Metrics Extension stage in/out/drop
bragi92 previously approved these changes Mar 2, 2026
bragi92 (Member) commented Mar 2, 2026

/azp run

@azure-pipelines
No pipelines are associated with this pull request.

@davidkydd davidkydd enabled auto-merge (squash) March 3, 2026 03:27