Controller shutdown doesn't wait for reconciles to finish #491

@rsafonseca

Whenever the image-reflector-controller pod shuts down, for whatever reason, it does not wait for running reconciliations to finish. Instead, it immediately cancels the contexts of all in-flight reconciliations.

This causes running reconciliations to fail and return an error, which in turn is propagated to all channels defined in the notification controller, generating a lot of noise.

Ideally, running reconciliations should be allowed to finish. At the very least, these expected context-cancellation errors during shutdown should not be propagated to the notification channels.

Here's an example log of the issue:

{"level":"info","ts":"2024-01-11T11:41:01.305Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-01-11T11:41:01.312Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2024-01-11T11:41:01.324Z","msg":"All workers finished","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy"}
{"level":"info","ts":"2024-01-11T11:41:01.312Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy"}
{"level":"info","ts":"2024-01-11T11:41:01.312Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository"}
{"level":"error","ts":"2024-01-11T11:41:01.332Z","msg":"failed to configure authentication options: operation error ECR: GetAuthorizationToken, get identity: get credentials: request canceled, context canceled","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"64c8c4f2-e200-4466-bb1a-0059375fc0d6","error":"AuthenticationFailed"}
{"level":"info","ts":"2024-01-11T11:41:01.338Z","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"9f627893-53d1-4bdc-acac-a0b3099216e5"}
{"level":"error","ts":"2024-01-11T11:41:01.339Z","msg":"Reconciler error","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"9f627893-53d1-4bdc-acac-a0b3099216e5","error":"[Patch \"https://10.16.0.1:443/apis/image.toolkit.fluxcd.io/v1beta2/namespaces/**REDACTED**/imagerepositories/**REDACTED**/status?fieldManager=image-reflector-controller\": context canceled, context canceled]","errorCauses":[{"error":"[Patch \"https://10.16.0.1:443/apis/image.toolkit.fluxcd.io/v1beta2/namespaces/**REDACTED**/imagerepositories/**REDACTED**/status?fieldManager=image-reflector-controller\": context canceled, context canceled]","errorCauses":[{"error":"Patch \"https://10.16.0.1:443/apis/image.toolkit.fluxcd.io/v1beta2/namespaces/**REDACTED**/imagerepositories/**REDACTED**/status?fieldManager=image-reflector-controller\": context canceled"},{"error":"context canceled"}]}]}
{"level":"error","ts":"2024-01-11T11:41:01.373Z","msg":"Reconciler error","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"64c8c4f2-e200-4466-bb1a-0059375fc0d6","error":"[failed to configure authentication options: operation error ECR: GetAuthorizationToken, get identity: get credentials: request canceled, context canceled, context canceled]","errorCauses":[{"error":"failed to configure authentication options: operation error ECR: GetAuthorizationToken, get identity: get credentials: request canceled, context canceled"},{"error":"context canceled","errorCauses":[{"error":"context canceled"},{"error":"context canceled"}]}]}
{"level":"info","ts":"2024-01-11T11:41:01.375Z","msg":"All workers finished","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository"}
{"level":"info","ts":"2024-01-11T11:41:01.375Z","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2024-01-11T11:41:01.384Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2024-01-11T11:41:01.384Z","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2024-01-11T11:41:01.384Z","msg":"shutting down server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2024-01-11T11:41:01.389Z","logger":"controller-runtime.metrics","msg":"Shutting down metrics server with timeout of 1 minute"}
{"level":"info","ts":"2024-01-11T11:41:01.396Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2024-01-11T11:41:01.408Z","msg":"error received after stop sequence was engaged","error":"leader election lost"}

Labels: area/ux (In pursuit of a delightful user experience), enhancement (New feature or request)
