feat(sync-service): Keep consumers on loss of replication client #3238
base: main
Conversation
alco
left a comment
Amazing work!
> Replication.Supervisor (or perhaps it should be called the ShapeSystemSupervisor)
>
> Connection.Manager.Supervisor (or perhaps it should be called the ConnectionSystemSupervisor)
Sounds good to me, but I would keep it consistent with the code comments and call them ShapeSubsystemSupervisor and ConnectionSubsystemSupervisor.
The only change that bugs me is that the whole replication supervisor has to be started before the connection manager starts initializing database connections. Since the two subsystems can now fail independently of each other, it would only be fair for them to start in parallel and reduce the time until the stack becomes ready.
The consumers_ready message needs to be transformed into a StatusMonitor entry which the connection manager can look up and wait on.
If the connection manager detects a timeline change, there's no other recourse than terminating the shape subsystem (even if it's still starting up) and starting it with a clean state. I don't particularly see the reason to treat the case of count_shapes==0 differently: this case seems like a rare outlier.
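For concreteness, here's a minimal sketch of that "terminate and restart with a clean state" step, assuming the shape subsystem is a named child of a core supervisor. The supervisor name, child id and purge step are placeholders for illustration, not the actual modules in this PR:

```elixir
defmodule ShapeSubsystemRestartSketch do
  # Illustrative only: the supervisor name, child id and purge step are assumptions,
  # not the real names used in this PR.
  def restart_with_clean_state(supervisor \\ Electric.MonitoredCoreSupervisor) do
    # Stop the whole shape subsystem, even if it is still starting up.
    :ok = Supervisor.terminate_child(supervisor, :shape_subsystem)

    # Hypothetical purge of persisted shape data before bringing the subsystem back.
    :ok = purge_all_shape_storage()

    # Start the subsystem again from a clean slate.
    {:ok, _pid} = Supervisor.restart_child(supervisor, :shape_subsystem)
    :ok
  end

  defp purge_all_shape_storage, do: :ok
end
```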
Regarding the coordination with my PR, I can rebase yours on top of mine myself, if mine is merged first. I don't think it makes sense for you to switch context to the new way of config passing here, you can just keep working in context of main and bring the PR to completion.
It will be easier for me to rebase it later, since that's basically what I've already done in my PR (a number of times, actually, as I've had to resolve conflicts with multiple other PRs that got merged to main).
# `Electric.Replication.Supervisor` which is responsible for starting the shape log collector
# and individual shape consumer process trees.
#
# See the moduledoc in `Electric.Connection.Supervisor` for more info.
This whole comment is outdated and badly misplaced. I would suggest removing it and adding the following paragraph to the comment block just above the `def children_application` line:
# The root application supervisor starts the core service processes, such as the HTTP
# server for the HTTP API and telemetry exporters, and a single StackSupervisor, basically
# making the application run in a single-tenant mode where all API requests are forwarded
# to that sole tenant.
It's not rare at all; it happens when the system first starts up because the slot is new. It would seem odd to start the replication client twice on first start-up. I specifically added that condition for this case.
There's no way, as far as I know, to do that in the supervision tree, so I assume you mean for ShapeCache to load the shapes as part of a …
Ah, I see. Because the new slot creation sets …, I don't think we should reset storage in this case. When the system first starts up, it doesn't have any timeline stored, so the timeline check returns … So what should be happening is this:

If we do have a previous timeline stored, then `state.purge_all_shapes` is checked together with the timeline check to decide whether to reset the storage.
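To make the branches explicit, here is a rough sketch of that decision, assuming a `purge_all_shapes` flag that gets set when the slot is (re)created; the function and argument names are illustrative, not the real ones in the codebase:

```elixir
defmodule ResetStorageSketch do
  # Sketch only: `purge_all_shapes?` stands for the flag set when the replication slot
  # had to be (re)created; `stored_timeline` is nil on first start-up.
  def reset_storage?(stored_timeline, current_timeline, purge_all_shapes?) do
    cond do
      # First start-up: no previous timeline recorded, so there is nothing worth purging
      # even though the slot is brand new.
      is_nil(stored_timeline) -> false

      # The database is on a different timeline than the one we stored: purge shape data.
      stored_timeline != current_timeline -> true

      # Same timeline, but the slot was recreated, so the WAL we relied on is gone.
      purge_all_shapes? -> true

      true -> false
    end
  end
end
```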
Yes, that seems to be the way to do it. But we can no longer just send a message to the connection manager because it may not be up by that time. So the coordination about consumers being ready needs to happen via StatusMonitor, IMO.
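Something along these lines, as a from-scratch stand-in (the real Electric.StatusMonitor API will differ, so treat every name here as an assumption):

```elixir
defmodule StatusMonitorSketch do
  # Stand-in written purely for illustration; not the actual Electric.StatusMonitor API.
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> MapSet.new() end, name: __MODULE__)

  # Called by the shape subsystem once all shape consumers have been started.
  def mark_ready(key), do: Agent.update(__MODULE__, &MapSet.put(&1, key))

  # Called by the connection manager: poll until the entry appears, instead of relying
  # on a `consumers_ready` message that it may not be alive to receive.
  def wait_until_ready(key, timeout \\ 5_000) do
    wait(key, System.monotonic_time(:millisecond) + timeout)
  end

  defp wait(key, deadline) do
    cond do
      Agent.get(__MODULE__, &MapSet.member?(&1, key)) ->
        :ok

      System.monotonic_time(:millisecond) >= deadline ->
        {:error, :timeout}

      true ->
        Process.sleep(50)
        wait(key, deadline)
    end
  end
end
```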
I like the idea of skipping the reset, but it's not the timeline that triggers it, it's …

(edit) Oh, actually you're saying I could use the …
Where we called …
@robacourt oh yes, this way it's even better: when Connection Manager is ready to start streaming, it checks whether the ShapeLogCollector itself is ready. If not, it waits for it to become ready and then instructs the replication client to start streaming. You've hit the nail on the head!
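Building on the stand-in sketched above, the connection-manager side could then look roughly like this; `start_streaming/0` is a hypothetical placeholder for instructing the replication client to begin streaming:

```elixir
# Usage sketch only, reusing StatusMonitorSketch from above.
case StatusMonitorSketch.wait_until_ready(:shape_log_collector) do
  :ok -> start_streaming()
  {:error, :timeout} -> {:error, :shape_subsystem_not_ready}
end
```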
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

@@ Coverage Diff @@
## main #3238 +/- ##
==========================================
- Coverage 76.98% 76.96% -0.03%
==========================================
Files 180 181 +1
Lines 9639 9652 +13
Branches 334 333 -1
==========================================
+ Hits 7421 7429 +8
- Misses 2216 2221 +5
Partials 2 2
Currently, connection errors restart the ReplicationClient, the ShapeLogCollector and all shape Consumers, of which there can be many thousands. This can be slow and can cause errors because the processes are not available to the API.
This PR defines a connection subsystem and a shape subsystem. These two subsystems can now go down independently. Timeline changes or new slots now trigger the shape subsystem to be restarted.
The supervision tree now looks like this:

StackSupervisor
- utility processes such as the EtsInspector, which can be restarted independently
- MonitoredCoreSupervisor
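As a rough sketch of that layout (the child modules, specs and restart strategy below are illustrative guesses, not the code in this PR):

```elixir
defmodule StackSupervisorSketch do
  # Illustrative only: child module paths and the strategy are assumptions based on the
  # description above, not this PR's actual implementation.
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      # Utility processes (e.g. the EtsInspector) that can crash and restart on their own
      # without touching either subsystem.
      Electric.EtsInspector,
      # Hosts the connection subsystem and the shape subsystem, so a replication client
      # failure no longer tears down thousands of shape consumers.
      Electric.MonitoredCoreSupervisor
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```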
This PR does not attempt to make the ShapeLogCollector resilient to connection failure; that will be done in a separate PR.
This PR is currently a WIP due to merge issues. It would be good to get some feedback on the approach, and then I suspect I'll need to make quite a few changes to rebase over @alco's PRs (#3198 and #3230).