Controller, bugfix: Call EventRecorder.AnnotatedEventf in a go-routine to make it non-blocking. #1479
Quick notes:
Failure Scenario summary
Following my issue described in fluxcd/flux2#5403, I did a bit of digging. I enabled the debug log of the Kustomize controller, and found the following scenario happening:
1. There are two Kustomizations, Ka and Kb. Kb depends on Ka.
2. Ka and Kb try to reconcile every 2 seconds; they fail with "Dependencies do not meet ready condition, retrying in 2s". This is expected.
3. Ka comes into place, and Ka starts to reconcile.
4. Kb keeps trying to reconcile every 2 seconds for some time! Then it stops / dies.
5. Ka is done reconciling, and Kb should be ready to go, but nothing happens!
6. [...] Kb starts to reconcile.

See details below.
Apparent cause – and fix
It seems that this scenario is the combination of two bugs:
Bug 1: Communication to Notification-controller
For some reason, Kustomize-controller cannot communicate with Notification-controller in our cluster. It fails to "[...] to record event" and it gives up ~5 minutes after the start of Notification-controller.
This may very well be a problem in our setup that I will have to investigate.
Bug 2: Blocking of other communication
It seems that the communication loop used to update the perceived reconciliation status of Ka and Kb is the same loop used for the Notification controller. Hence: while the thread is blocked trying (and failing) to communicate with Notification Controller, Kustomize controller cannot update its perceived status of Kustomizations.
=> This is why Kb fails to start reconciling when Ka is done! Kustomize controller simply does not get the info that Ka is done.
My quick fix (read: this may very well be solved in a better way) is to push the "send the message to Notification Controller" step into its own goroutine, effectively unblocking other communication.
The effect? Now, when Ka is done reconciling, Kb starts shortly thereafter. 🥳
Would you have a look and see if this fix is good, or if something else can be done?
Thank you 🙏
Details
Sequence diagram
Here are 2 sequence diagrams of a) what I expect to happen, and b) what I see is happening.
Note: I am a bit uncertain about the actors... That is why I wrote question marks next to "API-server".
Expected behavior
Observed behavior
Log output
Notes:
- t_seconds is seconds since the first log statement from Kustomize Controller.
- Ka in the example above is kyverno-system-controllers in the log files.
- Kb in the example above is kyverno-system in the log files.
- kyverno-system depends on kyverno-system-controllers
- [...] ReconciliationSucceeded messages each time... but there is...

Log output, running Kustomize Controller from main branch
Notable timestamps:
Log output, running Kustomize Controller from this branch
Notable timestamps: