otel: replace cobraotel with native lifecycle management#3108
otel: replace cobraotel with native lifecycle management#3108Jdepp007004 wants to merge 3 commits into
Conversation
Fixes authzed#712 and authzed#3095. - Remove dependency on github.com/jzelinskie/cobrautil/v2/cobraotel - Replicate OTel provider initialization natively in pkg/cmd/server/otel.go - Wire TracerProvider into serve.go signal handler so Shutdown and ForceFlush are called on SIGTERM/SIGINT, preventing span loss on exit - Fix vendored cobrautil Viper global singleton bug: viper.SetEnvPrefix was mutating global state instead of the local instance (v.SetEnvPrefix) - Touch pkg/cmd/util/util.go only to break import cycle between pkg/cmd/util and pkg/cmd/server; all flag registrations unchanged - Add 20 tests across unit, integration, and system build tags
| ) | ||
|
|
||
| // otelProviderContextKey is the unexported context key for the TracerProvider. | ||
| type otelProviderContextKey struct{} |
There was a problem hiding this comment.
Done, switched to the ctxkey package to match how the rest of the codebase handles context keys.
| // Legacy flags | ||
| f.String("otel-jaeger-endpoint", "", "OpenTelemetry collector endpoint - the endpoint can also be set by using enviroment variables") | ||
| _ = f.MarkHidden("otel-jaeger-endpoint") | ||
| f.String("otel-jaeger-service-name", "spicedb", "service name for trace data") | ||
| _ = f.MarkHidden("otel-jaeger-service-name") |
There was a problem hiding this comment.
Dropped them from both RegisterOTelFlags and RegisterCommonFlags in util.go
| shutCtx, shutCancel := context.WithTimeout(ctx, OTelShutdownTimeout) | ||
| defer shutCancel() | ||
| return provider.Shutdown(shutCtx) |
There was a problem hiding this comment.
Why a separate context?
There was a problem hiding this comment.
there was no real reason for two separate contexts there, it was an oversight on my part ,
fixed to use one shared context for the whole operation
| if parent == nil { | ||
| parent = context.Background() | ||
| } |
There was a problem hiding this comment.
removed these too
| // it as the global OTel provider, and returns it for lifecycle management. | ||
| // | ||
| // Returns (nil, nil) when otel-provider is "none". Callers must handle nil. | ||
| func InitOTelProvider(cmd *cobra.Command) (otelShutdowner, error) { |
There was a problem hiding this comment.
cobraotel tied the provider to the lifecycle of the command because that was what made sense in that package, but I don't think we need or want it in this context.
The general flow for setting up components of SpiceDB is:
- Take cobra flags that are defined in the command and put their values into a config struct
- Take that config struct and construct the required object
- Pass that object down and in as necessary
The telemetry prometheus registry is probably a decent analogue.
There was a problem hiding this comment.
Restructured to follow the pattern you described. Flag values now get copied into an OTelConfig struct, InitOTelProvider takes that struct along with a context and has no dependency on Cobra anymore. The provider is constructed inside Complete and registered with closeables so shutdown is handled automatically as part of the normal server shutdown sequence.
| func RegisterCommonFlags(cmd *cobra.Command) { | ||
| otel := cobraotel.New("spicedb") | ||
| otel.RegisterFlags(cmd.Flags()) | ||
| f := cmd.Flags() |
There was a problem hiding this comment.
Thanks, glad it reads better there
| defer func() { | ||
| // Shutdown OTel provider to ensure all traces are flushed | ||
| if provider := server.OTelProviderFromContext(cmd.Context()); provider != nil { | ||
| if err := server.ShutdownOTelProvider(context.Background(), provider); err != nil { | ||
| log.Warn().Err(err).Msg("failed to cleanly shutdown OpenTelemetry provider") | ||
| } |
There was a problem hiding this comment.
Yeah, this makes sense, but I'd rather that we configure the otel provider here as well, not in the RunE
There was a problem hiding this comment.
why are we doing the flushing here and not putting the shutdown function of the provider inside of the closeables struct that the server already has?
spicedb/pkg/cmd/server/server.go
Lines 177 to 178 in ee7c9a7
There was a problem hiding this comment.
You're right, the defer approach was the wrong place for it. I wasn't fully across the closeables pattern when I first wrote that. Moved into Complete now so OTel shuts down in the same ordered sequence as everything else rather than racing against it.
| cmd.SetContext(context.Background()) | ||
| require.NoError(t, cmd.Flags().Set("otel-provider", "otlpgrpc")) | ||
| require.NoError(t, cmd.Flags().Set("otel-endpoint", "localhost:4317")) | ||
| require.NoError(t, cmd.Flags().Set("otel-insecure", "true")) |
There was a problem hiding this comment.
I generally like the tests. Can we add a test or two that establishes 1. that you can use the otel environment variables to configure flags that aren't set and 2. which value overrides which if you declare both?
There was a problem hiding this comment.
Done, added TestOTelConfig_EnvVarConfiguresUnsetFlag and TestOTelConfig_ExplicitFlagOverridesEnvVar to the integration test file. First one verifies the SDK picks up OTEL_EXPORTER_OTLP_ENDPOINT when the endpoint flag is not explicitly set, second one documents that an explicit flag value takes precedence over the env var.
miparnisari
left a comment
There was a problem hiding this comment.
Would it be possible to add a new struct within
spicedb/pkg/cmd/server/server.go
Line 63 in ee7c9a7
called otelConfig or something like that, so that anyone doing server.NewConfigWithOptionsAndDefaults can set these programmatically?
- use ctxkey package for context key instead of a bare struct type - drop legacy otel-jaeger-* flags - collapse ShutdownOTelProvider to a single context shared across ForceFlush and Shutdown - move OTel initialization out of the Cobra layer into Config.Complete via an OTelConfig struct, matching the pattern used by other components - register provider shutdown with closeables so it participates in the ordered server shutdown sequence - add env var precedence tests
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3108 +/- ##
==========================================
+ Coverage 75.52% 75.61% +0.10%
==========================================
Files 503 503
Lines 61820 62165 +345
==========================================
+ Hits 46683 47001 +318
- Misses 11722 11742 +20
- Partials 3415 3422 +7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Addressing miparnisari's review — added OTelConfig as a struct in the server package with an OTel field on Config so it can be set programmatically without going through Cobra flags. InitOTelProvider now takes that struct directly along with a context and has no Cobra dependency at all. |
commited and pushed the vendor directory by mistake so another commit to fix it