Skip to content

Add OTEL config and wrappers#4985

Open
nikola-jokic wants to merge 6 commits into
masterfrom
nikola-jokic/observability-config
Open

Add OTEL config and wrappers#4985
nikola-jokic wants to merge 6 commits into
masterfrom
nikola-jokic/observability-config

Conversation

@nikola-jokic

Copy link
Copy Markdown
Contributor

What type of PR is this?

Enhancement

What this PR does / why we need it

Wire the observability packages to each service

Special notes for your reviewer

Depends on #4975

@nikola-jokic nikola-jokic marked this pull request as ready for review June 29, 2026 07:56
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR wires OpenTelemetry tracing into all eight Armada services (server, scheduler, executor, binoculars, lookout, lookoutingester, eventingester, scheduleringester). It introduces a shared observability package with lifecycle management (InitOTel/ShutdownOTel), an ObservabilityConfig struct with validation, a SpanAttributePolicy processor enforcing allow-lists, deny-lists, and cardinality guardrails, and OTel gRPC/HTTP middleware wired at the transport layer.

  • Lifecycle: InitOTel is fail-open for collector unreachability but hard-fails on invalid sampler names; ShutdownWithDefaultTimeout is used by most services while eventingester uses ShutdownOTel with an inline context — both are functionally equivalent.
  • Attribute policy: SanitizeForSpan applies deny-list before allow-list checks, intentionally redacting any attribute key containing sensitive substrings (password, secret, token, api_key, apikey) regardless of OTel namespace prefix; cardinality is capped at 1,000 unique values per key.
  • Transport: otelgrpc.NewServerHandler() and NewClientHandler() are registered unconditionally on all gRPC servers/clients; the REST gateway wraps its handler with otelhttp.NewHandler so spans see the prefix-stripped path.

Confidence Score: 5/5

The changes are additive and isolated to the observability layer; all services retain their existing behavior when OTel is disabled or misconfigured (fail-open design).

No runtime-affecting bugs were found. The two findings are a non-deterministic map iteration in error-message formatting and a potentially over-broad deny-list substring pattern — neither affects correctness in production flows.

internal/common/observability/attribute_policy.go — the deny-list ordering and "token" substring pattern deserve a second look to confirm it doesn't accidentally redact intentional armada.* attributes.

Important Files Changed

Filename Overview
internal/common/observability/lifecycle.go New OTel lifecycle manager: InitOTel, ShutdownOTel, and shutdown helpers. Double-timeout nesting in shutdownTracerProvider is harmless but confusing; fail-open design for collector unreachability is intentional.
internal/common/observability/attribute_policy.go New span attribute sanitization policy with allow-list, deny-list, and cardinality guardrails. The deny-list's "token" contains-pattern may over-redact legitimate armada.* attributes.
internal/common/observability/config.go New ObservabilityConfig struct with Validate(). maps.Keys() used in two error messages produces non-deterministic output.
internal/common/grpc/grpc.go Adds otelgrpc.NewServerHandler() as gRPC stats handler; switches prometheus.MustRegister to prometheus.DefaultRegisterer.MustRegister — both changes look correct.
internal/common/grpc/gateway.go Wraps REST gateway handler with otelhttp.NewHandler; refactors stripPrefix branching — OTel span sees the stripped path when stripPrefix=true, which is the correct behavior.
internal/common/logging/logger.go Adds WithContext() to Logger, extracting trace_id/span_id from OTel span context and preserving x-request-id. Implementation looks correct.
pkg/client/connection.go Adds otelgrpc.NewClientHandler() to all gRPC client dial options unconditionally — correct approach, noop provider is already installed at package init.
cmd/eventingester/main.go OTel init now calls Fatalf on error; uses ShutdownOTel with an explicit 5-second context in defer rather than ShutdownWithDefaultTimeout — functionally equivalent but slightly different pattern from other services.
internal/common/observability/lifecycle_bootstrap_test.go Bootstrap integration tests for OTel lifecycle; previously flagged import of lookout/version from a common package test file.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Main as Service main()
    participant LC as lifecycle.go (InitOTel)
    participant Exp as OTLP Exporter
    participant TP as TracerProvider
    participant GRPC as gRPC Server/Client
    participant Coll as OTel Collector

    Main->>LC: InitOTel(cfg)
    alt "cfg.Enabled == false"
        LC-->>Main: setNoopOTel(), return nil
    else "cfg.Enabled == true"
        LC->>LC: build resource attributes
        LC->>Exp: newTraceExporter(ctx, cfg)
        alt exporter creation fails
            Exp-->>LC: error
            LC->>LC: setNoopOTel() fail-open
            LC-->>Main: return nil
        else exporter creation succeeds
            Exp-->>LC: SpanExporter
            LC->>TP: sdktrace.NewTracerProvider(sampler, batcher, policy processor)
            LC->>LC: otel.SetTracerProvider(tp)
            LC-->>Main: return nil
        end
    end

    Main->>GRPC: CreateGrpcServer() / CreateApiConnection()
    Note over GRPC: otelgrpc.NewServerHandler / NewClientHandler registered

    GRPC->>Coll: OTLP span export (batched, max 512, queue 2048)

    Main->>LC: defer ShutdownWithDefaultTimeout()
    LC->>TP: tp.Shutdown(ctx 5s)
    TP->>Exp: ForceFlush + Shutdown
    Exp->>Coll: flush remaining spans
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Main as Service main()
    participant LC as lifecycle.go (InitOTel)
    participant Exp as OTLP Exporter
    participant TP as TracerProvider
    participant GRPC as gRPC Server/Client
    participant Coll as OTel Collector

    Main->>LC: InitOTel(cfg)
    alt "cfg.Enabled == false"
        LC-->>Main: setNoopOTel(), return nil
    else "cfg.Enabled == true"
        LC->>LC: build resource attributes
        LC->>Exp: newTraceExporter(ctx, cfg)
        alt exporter creation fails
            Exp-->>LC: error
            LC->>LC: setNoopOTel() fail-open
            LC-->>Main: return nil
        else exporter creation succeeds
            Exp-->>LC: SpanExporter
            LC->>TP: sdktrace.NewTracerProvider(sampler, batcher, policy processor)
            LC->>LC: otel.SetTracerProvider(tp)
            LC-->>Main: return nil
        end
    end

    Main->>GRPC: CreateGrpcServer() / CreateApiConnection()
    Note over GRPC: otelgrpc.NewServerHandler / NewClientHandler registered

    GRPC->>Coll: OTLP span export (batched, max 512, queue 2048)

    Main->>LC: defer ShutdownWithDefaultTimeout()
    LC->>TP: tp.Shutdown(ctx 5s)
    TP->>Exp: ForceFlush + Shutdown
    Exp->>Coll: flush remaining spans
Loading

Reviews (5): Last reviewed commit: "Add resources to config" | Re-trigger Greptile

Comment thread cmd/eventingester/main.go
Comment on lines +11 to +12
"github.com/armadaproject/armada/internal/lookout/version"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Common package test imports service-specific lookout/version

internal/common/observability is a shared foundation package consumed by all services. Importing internal/lookout/version from its test file introduces a dependency from a common/infra package onto a specific service package. This inverts the intended dependency direction, makes the observability package's test suite depend on the lookout service build, and risks creating transitive import cycles. The bootstrap test just needs an arbitrary non-empty version string — using a literal like "v0.0.0-test" or reading from a test constant removes the cross-package dependency entirely.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread internal/testsuite/app.go Outdated
Comment thread _local/server/config.yaml
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/observability-config branch from 1e00005 to 55a2bed Compare June 29, 2026 10:46
@datadog-armadaproject

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

CI | build / prepare   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 55a2bed | Docs | Give us feedback!

@nikola-jokic nikola-jokic force-pushed the nikola-jokic/observability-config branch from 55a2bed to f334bf0 Compare June 29, 2026 10:55
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/observability-config branch from f334bf0 to 83f7530 Compare June 29, 2026 11:21
@nikola-jokic nikola-jokic changed the title Nikola jokic/observability config Add OTEL config and wrappers Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant