Emit health metrics for ccp without requiring fluent-bit by davidkydd · Pull Request #1414 · Azure/prometheus-collector

davidkydd · 2026-02-15T23:16:09Z

PR: CCP Health Metrics — Expose Prometheus Collector Health Metrics in CCP Mode

Summary

In CCP (Control Plane Components) mode, the ama-metrics-ccp pod runs without fluent-bit. Previously, the health metrics pipeline (timeseries received/sent/bytes per minute, settings validation, exporting failures) relied entirely on fluent-bit's Lua plugins to parse Metrics Extension (ME) logs and feed Prometheus gauges on port 2234. As a result, CCP pods had no health observability: the metrics endpoint was not exposed at all.

This PR adds native Go implementations that replace the fluent-bit pipeline for CCP mode, exposing all 8 health metrics on :2234 without any fluent-bit dependency.

Problem

- CCP pods don't run fluent-bit (it's gated by `ccpMetricsEnabled != "true"` in `main.go`)
- Without fluent-bit, no health metrics were emitted: there was no way to detect if ME is ingesting timeseries, if the settings configmap is valid, or if otelcollector exports are failing
- The configmap parser had early-return bugs that caused CrashLoopBackOff when settings parsing failed, instead of falling back to defaults

Changes

New Files (CCP-only, in `otelcollector/shared/`)

| File | Purpose |
| --- | --- |
| `health_metrics.go` | Defines all 8 Prometheus gauge/counter metrics, registers them, and exposes on `:2234`. Reads ME volume globals + env vars each tick (60s) |
| `me_log_tailer.go` | Tails `/MetricsExtensionConsoleDebugLog.log` line-by-line, parsing `ProcessedCount`/`EventsProcessedLastPeriod` lines → feeds `TimeseriesReceivedTotal`, `TimeseriesSentTotal`, `BytesSentTotal`. Also tails otelcollector log for "Exporting failed" messages |
| `otelcol_health_scraper.go` | Scrapes otelcollector's internal `:8888/metrics` every 15s for supplementary diagnostics (`otelcol_receiver_accepted`, `otelcol_exporter_sent`, `otelcol_exporter_send_failed`). Computes per-minute rates and accumulates send-failed totals |
| `health_metrics_test.go` | Unit tests for metric registration, endpoint output, label handling, settings config validation, mutex safety, exporting failed counter |
| `me_log_tailer_test.go` | Unit tests for ME log line parsing (ProcessedCount, EventsProcessedLastPeriod), multi-line accumulation, file tailing with live writes, partial match handling |
| `otelcol_health_scraper_test.go` | Unit tests for Prometheus line parsing, HTTP scraping with test servers, error handling (500s, connection refused, empty response), delta accumulation and counter reset logic |
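The line parsing attributed to `me_log_tailer.go` above can be sketched roughly as follows. The exact ME log line layout is not shown in this PR, so the `Field: <integer>` pattern, the `meCounters` type, and the `parseMELine` name are assumptions made for illustration, not the code in the PR.

```go
package main

import (
	"regexp"
	"strconv"
)

// meFieldPattern matches "<Field>: <integer>" pairs for the ME fields named in
// this PR. The "Field: value" layout is an assumption, not the confirmed format.
var meFieldPattern = regexp.MustCompile(
	`\b(ProcessedCount|EventsProcessedLastPeriod|SentToPublicationCount|SentToPublicationBytes):\s*(\d+)`)

// meCounters accumulates parsed values across log lines (hypothetical type).
type meCounters struct {
	values map[string]int64
}

// parseMELine adds every recognised field on the line to the running totals
// and reports how many fields matched, so partially matching lines still count.
func (c *meCounters) parseMELine(line string) int {
	if c.values == nil {
		c.values = make(map[string]int64)
	}
	matched := 0
	for _, m := range meFieldPattern.FindAllStringSubmatch(line, -1) {
		n, err := strconv.ParseInt(m[2], 10, 64)
		if err != nil {
			continue // value does not fit in int64; skip rather than guess
		}
		c.values[m[1]] += n
		matched++
	}
	return matched
}
```

A tailer would call `parseMELine` once per complete line and ignore lines that return 0, such as startup messages.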

Modified Files

| File | Change |
| --- | --- |
| `otelcollector/main/main.go` | Added `else` branch for CCP mode: starts `ExposePrometheusCollectorHealthMetrics`, `TailMELogFile`, `TailOtelCollectorLogFile`, `ScrapeOtelCollectorHealthMetrics` as goroutines |
| `otelcollector/shared/configmap/ccp/configmapparserforccp.go` | (1) Initialize `AZMON_INVALID_METRICS_SETTINGS_CONFIG=false` at startup. (2) Set it to `true` with descriptive error on any parse failure. (3) Fix early-return bug: replaced `return` statements with `if/else` blocks so collector continues with defaults on error. (4) Added `cleanSettingsError()` helper for concise, actionable error messages in the metric label. (5) Error messages now name the source configmap (`ama-metrics-settings-configmap`) |
| `otelcollector/deploy/addon-chart/ccp-metrics-plugin/templates/ama-metrics-deployment.yaml` | Expose container port 2234 (`health-metrics`) on CCP deployment |
| Various `go.mod`/`go.sum` | Added `prometheus/client_golang` and transitive dependencies (see Dependency Changes section) |

Health Metrics Exposed (port 2234)

| Metric | Type | Source | Description |
| --- | --- | --- | --- |
| `timeseries_received_per_minute` | Gauge | ME log (`EventsProcessedLastPeriod`) | Timeseries received by ME per minute |
| `timeseries_sent_per_minute` | Gauge | ME log (`SentToPublicationCount`) | Timeseries sent to workspace per minute |
| `bytes_sent_per_minute` | Gauge | ME log (`SentToPublicationBytes`) | Bytes sent to workspace per minute |
| `invalid_metrics_settings_config` | Gauge | Configmap parser env vars | 0=valid, 1=invalid. `error` label contains reason |
| `exporting_metrics_failed` | Counter | Otelcollector log ("Exporting failed") | Count of export failures from otelcol |
| `otelcol_receiver_accepted_metric_points_per_minute` | Gauge | Otelcol `:8888/metrics` | Rate of metric points accepted by otelcol receiver |
| `otelcol_exporter_sent_metric_points_per_minute` | Gauge | Otelcol `:8888/metrics` | Rate of metric points sent from otelcol to ME |
| `otelcol_exporter_send_failed_metric_points_total` | Gauge | Otelcol `:8888/metrics` | Cumulative metric points that failed otelcol→ME export |
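One way the `*_per_minute` gauges could be derived from accumulating totals is to publish and reset an accumulator on every tick. This is a minimal sketch under that assumption; the `perMinuteGauge` type and its methods are hypothetical, and the PR's actual globals and locking may differ.

```go
package main

import "sync"

// perMinuteGauge turns an accumulating total into a value that is published
// once per tick. All names here are hypothetical.
type perMinuteGauge struct {
	mu      sync.Mutex
	pending float64 // accumulated since the last tick
	current float64 // value exposed until the next tick
}

// Add records newly observed timeseries or bytes.
func (g *perMinuteGauge) Add(n float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.pending += n
}

// Tick publishes the accumulated amount and resets the accumulator. In the
// design described above it would be driven by a 60-second ticker.
func (g *perMinuteGauge) Tick() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.current = g.pending
	g.pending = 0
}

// Value returns the last published per-minute value.
func (g *perMinuteGauge) Value() float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.current
}
```

With this shape, a quiet minute naturally reports 0 on the next tick instead of repeating the previous value.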

Architecture

CCP Pod (ama-metrics-ccp)
├── prometheus-collector container
│   ├── main.go ─── CCP mode branch
│   │   ├── go ExposePrometheusCollectorHealthMetrics()  ← serves :2234/metrics
│   │   ├── go TailMELogFile()                           ← parses ME log → primary metrics
│   │   ├── go TailOtelCollectorLogFile()                ← watches for export failures
│   │   └── go ScrapeOtelCollectorHealthMetrics()        ← scrapes :8888 → diagnostic metrics
│   └── configmapparserforccp.go
│       └── Sets AZMON_INVALID_METRICS_SETTINGS_CONFIG env var → invalid_metrics_settings_config gauge

Bug Fix: Early Return on Settings Parse Error

Before this PR, if the CCP settings configmap had a parse error (e.g., schema-version: v2 but missing controlplane-metrics key), Configmapparserforccp() would return early. This skipped all downstream collector setup (otelcollector config merge, scrape config validation), causing the pod to enter CrashLoopBackOff.

Fixed by replacing all 4 return statements with if/else blocks. On parse error, the collector now:

1. Sets AZMON_INVALID_METRICS_SETTINGS_CONFIG=true with a descriptive error
2. Continues with default configuration
3. Reports invalid_metrics_settings_config{error="..."} 1 through the health metric

Error Message Improvements

Error messages in the invalid_metrics_settings_config metric error label are now actionable:

| Error path | Old message | New message |
| --- | --- | --- |
| Config version read error | `Error reading config version file: <err>` | `Unable to read config version from ama-metrics-settings-configmap (<path>): <err>. Using default configuration` |
| Schema version read error | `Error reading config schema version file: <err>` | `Unable to read schema version from ama-metrics-settings-configmap (<path>): <err>. Using default configuration` |
| v2 parse error | `Error parsing files: <err>` | `Failed to parse v2 settings from ama-metrics-settings-configmap: settings file not found: <path> (expected from ama-metrics-settings-configmap). Falling back to default configuration` |
| v1 parse error | `Error parsing config: <err>` | `Failed to parse v1 settings from ama-metrics-settings-configmap: <err>. Falling back to default configuration` |

The cleanSettingsError() helper strips redundant OS-level error wrapping (e.g., duplicate paths from os.Open) for file-not-found errors. These improvements are scoped to configmapparserforccp.go only — helpers.go (shared by MP) is unchanged.

Dependency Changes

All dependency additions are required. The only new direct dependency is github.com/prometheus/client_golang v1.23.2 (imported by health_metrics.go), matching the version used by fluent-bit, otel-allocator, and prometheusreceiver in this repo. All other additions are transitive:

otelcollector/shared/go.mod (direct changes):

| Dependency | Type | Reason |
| --- | --- | --- |
| `github.com/prometheus/client_golang v1.23.2` | Direct | Imported by `health_metrics.go` for Prometheus gauge/counter registration and HTTP handler. Matches fluent-bit/otel-allocator/prometheusreceiver version |
| `golang.org/x/sys v0.31.0` | Direct | Pre-existing import in `bootstrap_certificates_windows.go`; promoted from missing by `go mod tidy` |
| `golang.org/x/text v0.23.0` | Direct | Pre-existing import in `bootstrap_certificates_windows.go`; promoted from indirect by `go mod tidy` |
| `gopkg.in/yaml.v2 v2.4.0` | Direct | Pre-existing import in `collector_replicaset_config_helper.go`; promoted from missing by `go mod tidy` |
| `beorn7/perks`, `cespare/xxhash/v2`, `klauspost/compress`, `munnerz/goautoneg`, `prometheus/procfs` | Indirect | Transitive dependencies of `prometheus/client_golang` |

Other go.mod files (otelcollector/go.mod, configuration-reader-builder/go.mod, shared/configmap/ccp/go.mod, shared/configmap/mp/go.mod):

All additions are // indirect — these modules depend on shared, so they inherit the transitive dependency on prometheus/client_golang.

otelcollector/prometheusreceiver/go.mod:

Whitespace cleanup only (removed blank lines). No dependency changes.

Testing

Unit tests: 3 test files (`health_metrics_test.go`, `me_log_tailer_test.go`, `otelcol_health_scraper_test.go`) covering parsing, scraping, delta accumulation, counter resets, mutex safety, and metric registration
End-to-end validation (manual, on standalone cluster):
- Deployed with valid config → verified all metrics emitting, invalid_metrics_settings_config = 0
- Applied munged configmap (v2 schema, missing controlplane-metrics) → verified metric = 1 with descriptive error text
- Deleted munged configmap → pod restarted, verified metric back to 0
- Confirmed no CrashLoopBackOff with the early-return fix
- See otelcollector/test/ccp/ccp-health-metrics-workflow.md for full test procedure

Non-CCP Impact

Zero impact on non-CCP builds. The CCP mode branch in main.go only activates when ccpMetricsEnabled == "true". All new Go files are in the shared package but only called from the CCP code path. The cleanSettingsError helper and improved error messages are scoped to configmapparserforccp.go only — helpers.go (shared by MP) is unchanged.

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

In CCP mode, fluent-bit is not started so the ME log parsing path that normally increments TimeseriesReceivedTotal/TimeseriesSentTotal/BytesSentTotal is not available. This adds a goroutine that periodically scrapes the otelcollector's own internal metrics endpoint (port 8888) and feeds deltas from otelcol_receiver_accepted_metric_points and otelcol_exporter_sent_metric_points into the shared health metric globals. This makes the health metrics endpoint on port 2234 report non-zero timeseries_received_per_minute and timeseries_sent_per_minute values when the otelcollector pipeline is actively processing data.
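The delta feeding described in this commit can be sketched in two parts: pulling a sample value out of a Prometheus text line, and converting successive counter readings into deltas that survive a counter reset. The names `parseSample` and `deltaTracker` are hypothetical, and treating the first scrape as a baseline (delta 0) is an assumption about the design rather than something the PR states.

```go
package main

import (
	"strconv"
	"strings"
)

// parseSample returns the value of a sample line for the given metric name,
// e.g. `otelcol_exporter_sent_metric_points{exporter="x"} 42`.
func parseSample(line, name string) (float64, bool) {
	if !strings.HasPrefix(line, name) {
		return 0, false
	}
	rest := line[len(name):]
	// The name must end here; otherwise this is a longer metric name.
	if rest == "" || (rest[0] != ' ' && rest[0] != '{') {
		return 0, false
	}
	// Skip the label set, if any; the value is the first field after it.
	fields := strings.Fields(rest[strings.LastIndex(rest, "}")+1:])
	if len(fields) == 0 {
		return 0, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	return v, err == nil
}

// deltaTracker converts cumulative counter readings into per-scrape deltas.
type deltaTracker struct {
	last   float64
	primed bool
}

// Observe returns how much the counter grew since the previous scrape.
func (d *deltaTracker) Observe(cur float64) float64 {
	defer func() { d.last, d.primed = cur, true }()
	switch {
	case !d.primed:
		return 0 // first scrape only establishes a baseline
	case cur < d.last:
		return cur // counter reset (process restart): count from zero
	default:
		return cur - d.last
	}
}
```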

…metric_points Track otelcol_exporter_send_failed_metric_points counter deltas and feed them into OtelCollectorExportingFailedCount so the exporting_metrics_failed health metric on port 2234 reflects actual export failures in CCP mode.

Primary health metrics (timeseries_received/sent, bytes_sent) are now derived from ME stdout log parsing, matching what fluent-bit did without reintroducing it. The otelcol scraper is repurposed as a diagnostic supplement exposing otelcol_receiver_accepted, otelcol_exporter_sent, and otelcol_exporter_send_failed rates on port 2234. This enables distinguishing otelcol->ME failures from ME->workspace failures, closing the observability gap where otelcol metrics only showed sent-to-ME, not ME-published-to-workspace.

New files:
- me_log_tailer.go: TailMELogs parses ME stdout for ProcessedCount, SentToPublicationCount, EventsProcessedLastPeriod regexes
- me_log_tailer.go: TailOtelCollectorLogFile watches collector-log.txt

Modified:
- process_utilities_linux.go: Route ME stdout through TailMELogs in CCP
- otelcol_health_scraper.go: Feed diagnostic globals instead of primary
- health_metrics.go: 3 new diagnostic gauges + registration + ticker
- main.go: Start otelcol log tailer in CCP mode

ME is started with -Logger File, which writes ProcessedCount and EventsProcessedLastPeriod lines to /MetricsExtensionConsoleDebugLog.log, not stdout. The previous implementation reading stdout only got startup messages.

Changes:
- Rename TailMELogs(io.Reader) to TailMELogFile(filePath string)
- Tail /MetricsExtensionConsoleDebugLog.log using file poll (like TailOtelCollectorLogFile)
- Revert process_utilities_linux.go: always copyOutputPipe for ME stdout
- Start TailMELogFile from main.go CCP section
- Update tests to use temp files

- Rename invalid_custom_prometheus_config -> invalid_metrics_settings_config for CCP mode
- Change env var from AZMON_INVALID_CUSTOM_PROMETHEUS_CONFIG -> AZMON_INVALID_METRICS_SETTINGS_CONFIG
- Add settings configmap validation: flag parsing errors (v1/v2 schema, file read)
- Set INVALID_SETTINGS_CONFIG_ERROR with error details on parse failure
- Initialize validation env vars at start of CCP parser (default: valid)
- Non-CCP (fluent-bit) code unchanged; keeps original metric name
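To illustrate how the two environment values could end up as a single sample line, here is a dependency-free sketch that renders `invalid_metrics_settings_config` in the Prometheus text format. The PR itself uses `prometheus/client_golang` for registration and serving, so this hand-rolled `renderSettingsGauge` is a hypothetical substitute that only shows the mapping and the label escaping; whether the real metric carries an empty `error` label when the config is valid is not stated in the PR.

```go
package main

import "strings"

// labelEscaper applies the escaping the Prometheus text exposition format
// requires inside label values: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// renderSettingsGauge formats the gauge from the flag and error text that the
// CCP parser is described as storing in AZMON_INVALID_METRICS_SETTINGS_CONFIG
// and INVALID_SETTINGS_CONFIG_ERROR. Hypothetical helper for illustration.
func renderSettingsGauge(invalidFlag, errMsg string) string {
	value := "0"
	if invalidFlag == "true" {
		value = "1"
	}
	return `invalid_metrics_settings_config{error="` +
		labelEscaper.Replace(errMsg) + `"} ` + value + "\n"
}
```

Escaping matters here because the error text is free-form and may contain quotes or file paths with backslashes.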

When the settings configmap has parse errors, continue with default configuration instead of returning early. The early return prevented the downstream collector config setup (config merger, validator) from running, which caused the otelcollector to fail to start and the liveness probe to fail. Now the error is recorded in AZMON_INVALID_METRICS_SETTINGS_CONFIG and the collector starts with defaults, allowing the health metric to properly report the invalid config state.

bragi92 · 2026-02-27T17:52:12Z

/azp run

azure-pipelines · 2026-02-27T17:52:19Z

No pipelines are associated with this pull request.

…nts overall component exporting failures

Implement canonical 3-tier metric naming for CCP pipeline health monitoring:
- overall_* metrics: golden in/out/drop counters (pipeline-wide)
- otelcol_* metrics: OtelCollector stage in/out/drop + export failures
- me_* metrics: Metrics Extension stage in/out/drop

bragi92 · 2026-03-02T18:48:16Z

/azp run

azure-pipelines · 2026-03-02T18:48:23Z

No pipelines are associated with this pull request.

Copilot AI and others added 5 commits February 15, 2026 22:57

Initial plan

9834f49

Add health metrics support for CCP mode without fluent-bit

2d44745

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

Update go.mod and go.sum for health metrics dependencies

2fd36d5

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

Fix comments in health_metrics.go based on code review

7e172b4

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

Implementation complete - CCP mode now emits health metrics

a66f6a8

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

davidkydd requested a review from a team as a code owner February 15, 2026 23:16

Copilot AI and others added 13 commits February 15, 2026 23:23

Add comprehensive unit tests for health metrics

5190210

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

Add integration test proposal and sample test suite

6b00086

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

Update test README with CCP health metrics tests

b444cf4

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

Add comprehensive implementation summary

4b82f49

Co-authored-by: davidkydd <4119627+davidkydd@users.noreply.github.com>

add manual test instructions for ccp collector health metrics

5937ca0

Merge branch 'main' into copilot/update-ccp-health-metrics

9061c53

update message for invalid-metrics-settings-config metric and doc

04135f0

bragi92 previously approved these changes Feb 27, 2026

add separate otelExportingFailedMetric, exportingFailedMetric represents overall component exporting failures

8cf6edf

davidkydd dismissed bragi92’s stale review via 8cf6edf March 1, 2026 23:39

bragi92 previously approved these changes Mar 2, 2026

davidkydd enabled auto-merge (squash) March 3, 2026 03:27

fix CVE, add devcontainer for running on macos

4135648

davidkydd dismissed bragi92’s stale review via 4135648 March 3, 2026 03:43

davidkydd added the size/XXL label Mar 3, 2026

davidkydd wants to merge 21 commits into Azure:main from davidkydd:copilot/update-ccp-health-metrics

3 participants
