
otel: replace cobraotel with native lifecycle management#3108

Open
Jdepp007004 wants to merge 3 commits into authzed:main from Jdepp007004:fix/otel-lifecycle-native

@Jdepp007004
Contributor

Committed and pushed the vendor directory by mistake, so there is another commit to fix it.

Fixes authzed#712 and authzed#3095.

- Remove dependency on github.com/jzelinskie/cobrautil/v2/cobraotel
- Replicate OTel provider initialization natively in pkg/cmd/server/otel.go
- Wire TracerProvider into serve.go signal handler so Shutdown and
  ForceFlush are called on SIGTERM/SIGINT, preventing span loss on exit
- Fix vendored cobrautil Viper global singleton bug: viper.SetEnvPrefix
  was mutating global state instead of the local instance (v.SetEnvPrefix)
- Touch pkg/cmd/util/util.go only to break import cycle between
  pkg/cmd/util and pkg/cmd/server; all flag registrations unchanged
- Add 20 tests across unit, integration, and system build tags
@Jdepp007004 Jdepp007004 requested a review from a team as a code owner May 9, 2026 14:47
@github-actions github-actions Bot added area/cli Affects the command line area/dependencies Affects dependencies area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) labels May 9, 2026
@Jdepp007004 Jdepp007004 closed this May 9, 2026
@Jdepp007004 Jdepp007004 reopened this May 9, 2026
@Jdepp007004 Jdepp007004 changed the title from "removes vendor directory from tracking" to "otel: replace cobraotel with native lifecycle management" May 9, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators May 9, 2026
@authzed authzed unlocked this conversation May 13, 2026
Contributor

@tstirrat15 tstirrat15 left a comment

See comments

Comment thread pkg/cmd/server/otel.go Outdated
)

// otelProviderContextKey is the unexported context key for the TracerProvider.
type otelProviderContextKey struct{}
Contributor

Use ctxkey generally

Contributor Author

Done, switched to the ctxkey package to match how the rest of the codebase handles context keys.

Comment thread pkg/cmd/server/otel.go Outdated
Comment on lines +67 to +71
// Legacy flags
f.String("otel-jaeger-endpoint", "", "OpenTelemetry collector endpoint - the endpoint can also be set by using enviroment variables")
_ = f.MarkHidden("otel-jaeger-endpoint")
f.String("otel-jaeger-service-name", "spicedb", "service name for trace data")
_ = f.MarkHidden("otel-jaeger-service-name")
Contributor

Drop these

Contributor Author

Dropped them from both RegisterOTelFlags and RegisterCommonFlags in util.go

Comment thread pkg/cmd/server/otel.go Outdated
Comment on lines +207 to +209
shutCtx, shutCancel := context.WithTimeout(ctx, OTelShutdownTimeout)
defer shutCancel()
return provider.Shutdown(shutCtx)
Contributor

Why a separate context?

Contributor Author

There was no real reason for two separate contexts there; it was an oversight on my part. Fixed to use one shared context for the whole operation.

Comment thread pkg/cmd/server/otel.go Outdated
Comment on lines +220 to +222
if parent == nil {
parent = context.Background()
}
Contributor

Same q here

Contributor Author

removed these too

Comment thread pkg/cmd/server/otel.go Outdated
// it as the global OTel provider, and returns it for lifecycle management.
//
// Returns (nil, nil) when otel-provider is "none". Callers must handle nil.
func InitOTelProvider(cmd *cobra.Command) (otelShutdowner, error) {
Contributor

cobraotel tied the provider to the lifecycle of the command because that was what made sense in that package, but I don't think we need or want it in this context.

The general flow for setting up components of SpiceDB is:

  1. Take cobra flags that are defined in the command and put their values into a config struct
  2. Take that config struct and construct the required object
  3. Pass that object down and in as necessary

The telemetry prometheus registry is probably a decent analogue.

Contributor Author

Restructured to follow the pattern you described. Flag values now get copied into an OTelConfig struct, InitOTelProvider takes that struct along with a context and has no dependency on Cobra anymore. The provider is constructed inside Complete and registered with closeables so shutdown is handled automatically as part of the normal server shutdown sequence.

Comment thread pkg/cmd/util/util.go
Comment on lines 441 to +442
 func RegisterCommonFlags(cmd *cobra.Command) {
-	otel := cobraotel.New("spicedb")
-	otel.RegisterFlags(cmd.Flags())
+	f := cmd.Flags()
Contributor

I like this

Contributor Author

Thanks, glad it reads better there

Comment thread pkg/cmd/serve.go Outdated
Comment on lines +272 to +277
defer func() {
// Shutdown OTel provider to ensure all traces are flushed
if provider := server.OTelProviderFromContext(cmd.Context()); provider != nil {
if err := server.ShutdownOTelProvider(context.Background(), provider); err != nil {
log.Warn().Err(err).Msg("failed to cleanly shutdown OpenTelemetry provider")
}
Contributor

Yeah, this makes sense, but I'd rather that we configure the otel provider here as well, not in the RunE

Contributor

@miparnisari miparnisari May 13, 2026

why are we doing the flushing here and not putting the shutdown function of the provider inside of the closeables struct that the server already has?

func (c *Config) Complete(ctx context.Context) (RunnableServer, error) {
closeables := util.CloseableStack{}

Contributor Author

You're right, the defer approach was the wrong place for it. I wasn't fully across the closeables pattern when I first wrote that. Moved into Complete now so OTel shuts down in the same ordered sequence as everything else rather than racing against it.

Comment thread pkg/cmd/server/otel_integration_test.go Outdated
Comment on lines +22 to +25
cmd.SetContext(context.Background())
require.NoError(t, cmd.Flags().Set("otel-provider", "otlpgrpc"))
require.NoError(t, cmd.Flags().Set("otel-endpoint", "localhost:4317"))
require.NoError(t, cmd.Flags().Set("otel-insecure", "true"))
Contributor

I generally like the tests. Can we add a test or two that establishes 1. that you can use the otel environment variables to configure flags that aren't set and 2. which value overrides which if you declare both?

Contributor Author

Done, added TestOTelConfig_EnvVarConfiguresUnsetFlag and TestOTelConfig_ExplicitFlagOverridesEnvVar to the integration test file. First one verifies the SDK picks up OTEL_EXPORTER_OTLP_ENDPOINT when the endpoint flag is not explicitly set, second one documents that an explicit flag value takes precedence over the env var.

Contributor

@miparnisari miparnisari left a comment

Would it be possible to add a new struct within `Config` (`type Config struct {`), called `otelConfig` or something like that, so that anyone calling `server.NewConfigWithOptionsAndDefaults` can set these programmatically?
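Roughly what that request could look like, as a hypothetical sketch: an `OTel` field of type `OTelConfig` on `Config`, plus a functional option so library users can set tracing without Cobra. `Config`, `ConfigOption`, `WithOTel`, the `OTelConfig` fields, the defaults, and this `NewConfigWithOptionsAndDefaults` are simplified stand-ins; SpiceDB's real `Config` has many more fields, and its actual option names may differ.

```go
package main

import "fmt"

// OTelConfig holds tracing settings with no CLI dependency (fields assumed).
type OTelConfig struct {
	Provider    string
	Endpoint    string
	ServiceName string
}

// Config is a trimmed-down stand-in for the server config.
type Config struct {
	OTel OTelConfig
}

// ConfigOption mutates a Config.
type ConfigOption func(*Config)

// WithOTel replaces the whole OTel sub-config.
func WithOTel(o OTelConfig) ConfigOption {
	return func(c *Config) { c.OTel = o }
}

// NewConfigWithOptionsAndDefaults applies defaults first, then options in order.
func NewConfigWithOptionsAndDefaults(opts ...ConfigOption) *Config {
	c := &Config{OTel: OTelConfig{Provider: "none", ServiceName: "spicedb"}}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	c := NewConfigWithOptionsAndDefaults(WithOTel(OTelConfig{
		Provider:    "otlpgrpc",
		Endpoint:    "localhost:4317",
		ServiceName: "spicedb",
	}))
	fmt.Println(c.OTel.Provider, c.OTel.Endpoint) // otlpgrpc localhost:4317
}
```

Because `WithOTel` replaces the whole sub-struct, a caller that omits `ServiceName` gets an empty string rather than the default; per-field options would avoid that at the cost of more API surface.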

- use ctxkey package for context key instead of a bare struct type
- drop legacy otel-jaeger-* flags
- collapse ShutdownOTelProvider to a single context shared across
  ForceFlush and Shutdown
- move OTel initialization out of the Cobra layer into Config.Complete
  via an OTelConfig struct, matching the pattern used by other components
- register provider shutdown with closeables so it participates in the
  ordered server shutdown sequence
- add env var precedence tests
@codecov

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 56.35359% with 79 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.61%. Comparing base (ee7c9a7) to head (edc506d).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/cmd/server/otel.go | 59.87% | 54 Missing and 5 partials ⚠️ |
| pkg/cmd/server/server.go | 26.67% | 10 Missing and 1 partial ⚠️ |
| pkg/cmd/serve.go | 10.00% | 9 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3108      +/-   ##
==========================================
+ Coverage   75.52%   75.61%   +0.10%     
==========================================
  Files         503      503              
  Lines       61820    62165     +345     
==========================================
+ Hits        46683    47001     +318     
- Misses      11722    11742      +20     
- Partials     3415     3422       +7     


@Jdepp007004
Contributor Author

Addressing miparnisari's review — added OTelConfig as a struct in the server package with an OTel field on Config so it can be set programmatically without going through Cobra flags. InitOTelProvider now takes that struct directly along with a context and has no Cobra dependency at all.


3 participants