
Conversation

@robacourt (Contributor) commented Oct 6, 2025

Currently, connection errors restart the ReplicationClient, the ShapeLogCollector and all shape Consumers, of which there can be many thousands. This can be slow and can cause errors because those processes are unavailable to the API while they restart.

This PR defines a connection subsystem and a shape subsystem:

  1. The connection subsystem (processes that may exit on a connection failure), started with the Connection.Manager.Supervisor
  2. The shape subsystem (processes that are resilient to connection failures), started with Electric.Replication.Supervisor

These two subsystems can now go down independently. Timeline changes or new slots now trigger a restart of the shape subsystem.

The supervision tree now looks like this:

StackSupervisor
  • utility processes such as the EtsInspector that can be restarted independently
  • MonitoredCoreSupervisor
    • StatusMonitor
    • CoreSupervisor
      • Replication.Supervisor (or perhaps it should be called the ShapeSystemSupervisor)
      • Connection.Manager.Supervisor (or perhaps it should be called the ConnectionSystemSupervisor)
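
A minimal sketch of what the CoreSupervisor layer could look like under this split, assuming it is a plain Supervisor and that both subsystem supervisors accept a keyword list of options; the module names follow the tree above, but the actual child specs, options and strategy in the PR may differ:

    defmodule Electric.CoreSupervisor do
      # Sketch only: the real child specs and start options may differ.
      use Supervisor

      def start_link(opts) do
        Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
      end

      @impl true
      def init(opts) do
        children = [
          # Shape subsystem: processes that are resilient to connection failures
          {Electric.Replication.Supervisor, opts},
          # Connection subsystem: processes that may exit on a connection failure
          {Electric.Connection.Manager.Supervisor, opts}
        ]

        # :one_for_one so that a crash in one subsystem does not automatically
        # take the other down with it.
        Supervisor.init(children, strategy: :one_for_one)
      end
    end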

This PR does not look to address making the ShapeLogCollector resilient to connection failure; that will be done in a separate PR.

This PR is currently a WIP due to merge issues. It would be good to get some feedback on the approach, and then I suspect I'll need to make quite a few changes to rebase over @alco's PRs (#3198 and #3230).

@netlify netlify bot commented Oct 6, 2025

Deploy Preview for electric-next ready!

  • Latest commit: 47f588a
  • Latest deploy log: https://app.netlify.com/projects/electric-next/deploys/68e6553cfd58ef0008e5b89e
  • Deploy Preview: https://deploy-preview-3238--electric-next.netlify.app

@alco (Member) left a comment

Amazing work!

Replication.Supervisor (or perhaps it should be called the ShapeSystemSupervisor)
Connection.Manager.Supervisor (or perhaps it should be called the ConnectionSystemSupervisor)

Sounds good to me, but I would keep it consistent with the code comments and call them ShapeSubsystemSupervisor and ConnectionSubsystemSupervisor.


The only change that bugs me is that the whole replication supervisor has to be started before the connection manager starts initializing database connections. Since the two subsystems can now fail independently of each other, it would only be fair for them to start in parallel and reduce the time until the stack becomes ready.

The consumers_ready message needs to be transformed into a StatusMonitor entry which the connection manager can look up and wait on.

If the connection manager detects a timeline change, there's no other recourse than terminating the shape subsystem (even if it's still starting up) and starting it with a clean state. I don't particularly see the reason to treat the case of count_shapes==0 differently: this case seems like a rare outlier.
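
For reference, a rough sketch of the "terminate the shape subsystem and start it with a clean state" recourse, assuming the shape subsystem is a child of a plain CoreSupervisor registered under the Electric.Replication.Supervisor child id (names taken from the PR description; the actual wiring, and the storage purge itself, would live elsewhere in the real code):

    # Sketch: bounce the shape subsystem after a timeline change is detected.
    def restart_shape_subsystem(core_supervisor) do
      :ok = Supervisor.terminate_child(core_supervisor, Electric.Replication.Supervisor)
      {:ok, _pid} = Supervisor.restart_child(core_supervisor, Electric.Replication.Supervisor)
      :ok
    end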


Regarding the coordination with my PR, I can rebase yours on top of mine myself, if mine is merged first. I don't think it makes sense for you to switch context to the new way of config passing here, you can just keep working in context of main and bring the PR to completion.

It will be easier for me to rebase it later since that's basically what I've already done in my PR, a number of times actually as I've had to resolve conflicts with multiple other PRs that got merged to main.

# `Electric.Replication.Supervisor` which is responsible for starting the shape log collector
# and individual shape consumer process trees.
#
# See the moduledoc in `Electric.Connection.Supervisor` for more info.

This whole comment is outdated and badly misplaced; I would suggest removing it and adding the following paragraph to the comment block just above the def children_application line:

    # The root application supervisor starts the core service processes, such as the HTTP
    # server for the HTTP API and telemetry exporters, and a single StackSupervisor, basically
    # making the application run in a single-tenant mode where all API requests are forwarded
    # to that sole tenant.

@robacourt (Contributor, Author)

I don't particularly see the reason to treat the case of count_shapes==0 differently: this case seems like a rare outlier.

It's not rare at all, it happens when the system first starts up because the slot is new. It would seem odd to start the replication client twice on first start-up. I specifically added that condition for this case.

@robacourt (Contributor, Author)

The only change that bugs me is that the whole replication supervisor has to be started before the connection manager starts initializing database connections. Since the two subsystems can now fail independently of each other, it would only be fair for them to start in parallel and reduce the time until the stack becomes ready.

As far as I know there's no way to do that in the supervision tree, so I assume you mean for the ShapeCache to load the shapes as part of a :continue rather than as part of init?

@alco (Member) commented Oct 7, 2025

I don't particularly see the reason to treat the case of count_shapes==0 differently: this case seems like a rare outlier.

It's not rare at all, it happens when the system first starts up because the slot is new. It would seem odd to start the replication client twice on first start-up. I specifically added that condition for this case.

Ah, I see. Because the new slot creation sets purge_all_shapes to true.

I don't think we should reset storage in this case. When the system first starts up, it doesn't have any timeline stored, so the timeline check returns :ok. We should be able to identify this state and skip resetting the storage.

So what should be happening is this:

  • timeline check returns :no_previous_timeline or similar
  • conn man skips checking state.purge_all_shapes and just moves on

If we do have a previous timeline stored, then state.purge_all_shapes is checked together with the timeline check to decide whether to reset the storage.
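
A sketch of that branching as it might sit inside the connection manager, assuming the timeline check returns :ok, :no_previous_timeline or an error tuple, and with purge_all_shapes/1 as a hypothetical stand-in for whatever actually resets the shape storage (the atoms and field names come from this discussion, not necessarily the real code):

    defp maybe_purge_shapes(timeline_check_result, state) do
      case timeline_check_result do
        :no_previous_timeline ->
          # First start-up: no timeline stored yet, so skip checking
          # state.purge_all_shapes and leave the shape storage alone.
          :ok

        :ok ->
          # Known, unchanged timeline: purge only if the replication client
          # created a new slot and therefore set purge_all_shapes to true.
          if state.purge_all_shapes, do: purge_all_shapes(state), else: :ok

        {:error, :timeline_changed} ->
          # Timeline mismatch: the stored shape data is stale, reset it.
          purge_all_shapes(state)
      end
    end

    # Hypothetical helper standing in for the real storage reset.
    defp purge_all_shapes(_state), do: :ok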

@alco (Member) commented Oct 7, 2025

The only change that bugs me is that the whole replication supervisor has to be started before the connection manager starts initializing database connections. Since the two subsystems can now fail independently of each other, it would only be fair for them to start in parallel and reduce the time until the stack becomes ready.

So there's no way as far as I know to do that in the supervision tree, so I assume you mean for ShapeCache to load the shapes as part of a :continue rather than as part of init?

Yes, that seems to be the way to do it. But we can no longer just send a message to the connection manager because it may not be up by that time. So the coordination about consumers being ready needs to happen via StatusMonitor, IMO.
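
For illustration, a minimal sketch of the :continue approach, with ShapeCacheSketch and load_shapes/1 as stand-ins for the real ShapeCache and its shape-loading code (this is the general GenServer pattern, not the PR's actual implementation):

    defmodule ShapeCacheSketch do
      use GenServer

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

      @impl true
      def init(opts) do
        # Return immediately so the rest of the supervision tree (including
        # the connection subsystem) can carry on starting in parallel...
        {:ok, %{opts: opts, shapes: nil}, {:continue, :load_shapes}}
      end

      @impl true
      def handle_continue(:load_shapes, state) do
        # ...and do the potentially slow shape loading here, after init/1
        # has returned.
        {:noreply, %{state | shapes: load_shapes(state.opts)}}
      end

      # Hypothetical stand-in for the real shape-loading code.
      defp load_shapes(_opts), do: %{}
    end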

@robacourt (Contributor, Author) commented Oct 7, 2025

I don't think we should reset storage in this case. When the system first starts up, it doesn't have any timeline stored, so the timeline check returns :ok. We should be able to identify this state and skip resetting the storage.

I like the idea of skipping the reset, but it's not the timeline that triggers it, it's replication_client_created_new_slot being sent by the ReplicationClient. I'm not quite sure at this stage how I would stop that.

(edit) Oh, actually you're saying I could use the :no_previous_timeline to not check purge_all_shapes. Ok, that could work. Thanks!

@robacourt (Contributor, Author)

The consumers_ready message needs to be transformed into a StatusMonitor entry which the connection manager can look up and wait on.

Where we used to call consumers_ready we now call ShapeLogCollector.set_last_processed_lsn, which in turn calls StatusMonitor.mark_shape_log_collector_ready, so I assume that should be enough unless I'm missing something?

@alco (Member) commented Oct 8, 2025

@robacourt oh yes, this way it's even better: when Connection Manager is ready to start streaming, it checks whether the ShapeLogCollector itself is ready. If not, it waits for it to become ready and then instructs the replication client to start streaming. You've hit the nail on the head!
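
A hypothetical sketch of that coordination inside the connection manager; StatusMonitor.wait_for_shape_log_collector/1 and ReplicationClient.start_streaming/1 are illustrative names inferred from this discussion, not confirmed APIs:

    # Sketch only: function names and message shapes are assumptions.
    def handle_continue(:start_streaming, state) do
      case StatusMonitor.wait_for_shape_log_collector(state.stack_id) do
        :ok ->
          # The shape log collector has marked itself ready, so it is safe to
          # start streaming replication messages into it.
          ReplicationClient.start_streaming(state.replication_client_pid)
          {:noreply, state}

        {:error, :timeout} ->
          # Not ready yet: retry shortly rather than crashing the connection
          # subsystem.
          Process.send_after(self(), :retry_start_streaming, 500)
          {:noreply, state}
      end
    end

    def handle_info(:retry_start_streaming, state),
      do: {:noreply, state, {:continue, :start_streaming}}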

@robacourt force-pushed the rob/define-shape-subsystem branch from b20359f to f19a25f on October 8, 2025 at 11:50
@robacourt force-pushed the rob/define-shape-subsystem branch from f19a25f to 47f588a on October 8, 2025 at 12:12
@codecov codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 70.93023% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.96%. Comparing base (603d5ef) to head (005cab3).
⚠️ Report is 110 commits behind head on main.

Files with missing lines (patch coverage / lines missing):
  • ...es/sync-service/lib/electric/connection/manager.ex: 0.00%, 10 missing ⚠️
  • ...kages/sync-service/lib/electric/core_supervisor.ex: 85.18%, 8 missing ⚠️
  • ...vice/lib/electric/connection/manager/supervisor.ex: 16.66%, 5 missing ⚠️
  • ...ckages/sync-service/lib/electric/status_monitor.ex: 0.00%, 2 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3238      +/-   ##
==========================================
- Coverage   76.98%   76.96%   -0.03%     
==========================================
  Files         180      181       +1     
  Lines        9639     9652      +13     
  Branches      334      333       -1     
==========================================
+ Hits         7421     7429       +8     
- Misses       2216     2221       +5     
  Partials        2        2              
Flag Coverage Δ
elixir 75.26% <70.93%> (-0.03%) ⬇️
elixir-client 73.94% <ø> (-0.53%) ⬇️
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/typescript-client 94.37% <ø> (ø)
packages/y-electric 55.12% <ø> (ø)
postgres-140000 75.05% <70.93%> (-0.20%) ⬇️
postgres-150000 75.04% <70.93%> (-0.04%) ⬇️
postgres-170000 ?
postgres-180000 75.34% <70.93%> (+0.25%) ⬆️
sync-service 75.40% <70.93%> (+0.02%) ⬆️
typescript 87.27% <ø> (ø)
unit-tests 76.96% <70.93%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

