Add OTEL config and wrappers #4985
Greptile Summary
This PR wires OpenTelemetry tracing into all eight Armada services (server, scheduler, executor, binoculars, lookout, lookoutingester, eventingester, scheduleringester). It introduces a shared
Confidence Score: 5/5
The changes are additive and isolated to the observability layer; all services retain their existing behavior when OTel is disabled or misconfigured (fail-open design). No runtime-affecting bugs were found. The two findings are a non-deterministic map iteration in error-message formatting and a potentially over-broad deny-list substring pattern; neither affects correctness in production flows.
internal/common/observability/attribute_policy.go: the deny-list ordering and the "token" substring pattern deserve a second look to confirm they don't accidentally redact intentional armada.* attributes.
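The two findings can be illustrated with a minimal, standard-library-only sketch. It is not the code from attribute_policy.go: the deny-list entries, the attribute key `armada.token_bucket.size`, and the helper names `isDeniedBySubstring` and `formatInvalid` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// denySubstrings is a hypothetical deny-list; the real patterns live in
// attribute_policy.go and may differ.
var denySubstrings = []string{"token", "password", "secret"}

// isDeniedBySubstring reports whether key contains any deny-list entry.
// Plain substring matching also catches keys that merely contain "token".
func isDeniedBySubstring(key string) bool {
	lower := strings.ToLower(key)
	for _, s := range denySubstrings {
		if strings.Contains(lower, s) {
			return true
		}
	}
	return false
}

// formatInvalid builds an error message from a map. Go map iteration order
// is unspecified, so the keys are sorted first to keep the message stable.
func formatInvalid(invalid map[string]string) string {
	keys := make([]string, 0, len(invalid))
	for k := range invalid {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s: %s", k, invalid[k]))
	}
	return strings.Join(parts, "; ")
}

func main() {
	// Hypothetical attribute that is not a credential but is still redacted.
	fmt.Println(isDeniedBySubstring("armada.token_bucket.size")) // true
	fmt.Println(formatInvalid(map[string]string{"b": "bad", "a": "empty"}))
}
```

If over-matching turns out to be a problem in practice, one option would be to check an allow-list of intentional armada.* keys before applying the deny-list; whether that fits the PR's policy ordering is for the author to judge.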
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Main as Service main()
participant LC as lifecycle.go (InitOTel)
participant Exp as OTLP Exporter
participant TP as TracerProvider
participant GRPC as gRPC Server/Client
participant Coll as OTel Collector
Main->>LC: InitOTel(cfg)
alt "cfg.Enabled == false"
LC-->>Main: setNoopOTel(), return nil
else "cfg.Enabled == true"
LC->>LC: build resource attributes
LC->>Exp: newTraceExporter(ctx, cfg)
alt exporter creation fails
Exp-->>LC: error
LC->>LC: setNoopOTel() fail-open
LC-->>Main: return nil
else exporter creation succeeds
Exp-->>LC: SpanExporter
LC->>TP: sdktrace.NewTracerProvider(sampler, batcher, policy processor)
LC->>LC: otel.SetTracerProvider(tp)
LC-->>Main: return nil
end
end
Main->>GRPC: CreateGrpcServer() / CreateApiConnection()
Note over GRPC: otelgrpc.NewServerHandler / NewClientHandler registered
GRPC->>Coll: OTLP span export (batched, max 512, queue 2048)
Main->>LC: defer ShutdownWithDefaultTimeout()
LC->>TP: tp.Shutdown(ctx 5s)
TP->>Exp: ForceFlush + Shutdown
Exp->>Coll: flush remaining spans
Reviews (5): Last reviewed commit: "Add resources to config"
"github.com/armadaproject/armada/internal/lookout/version"
)
Common package test imports service-specific lookout/version
internal/common/observability is a shared foundation package consumed by all services. Importing internal/lookout/version from its test file introduces a dependency from a common/infra package onto a specific service package. This inverts the intended dependency direction, makes the observability package's test suite depend on the lookout service build, and risks creating transitive import cycles. The bootstrap test just needs an arbitrary non-empty version string — using a literal like "v0.0.0-test" or reading from a test constant removes the cross-package dependency entirely.
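A minimal sketch of the suggested fix, assuming the bootstrap code only needs a non-empty version string. The `bootstrap` function, the `testVersion` constant, and the service name `armada-test` are hypothetical stand-ins, not names from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// testVersion is an arbitrary non-empty literal, so the common package
// does not need to import internal/lookout/version.
const testVersion = "v0.0.0-test"

// bootstrap is a hypothetical stand-in for the function under test; it
// only requires that the version is non-empty.
func bootstrap(serviceName, version string) (map[string]string, error) {
	if version == "" {
		return nil, errors.New("version must not be empty")
	}
	return map[string]string{
		"service.name":    serviceName,
		"service.version": version,
	}, nil
}

func main() {
	attrs, err := bootstrap("armada-test", testVersion)
	if err != nil {
		panic(err)
	}
	fmt.Println(attrs["service.version"]) // prints v0.0.0-test
}
```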
What type of PR is this?
Enhancement
What this PR does / why we need it
Wire the observability packages to each service
Special notes for your reviewer
Depends on #4975