
Conversation

@robacourt (Contributor) commented Oct 6, 2025

Currently, connection errors restart the ReplicationClient, the ShapeLogCollector and all shape Consumers, of which there can be many thousands. This can be slow and can cause errors because those processes are unavailable to the API while they restart.

This PR defines a connection subsystem and a shape subsystem:

  1. The connection subsystem (processes that may exit on a connection failure), started with the Connection.Manager.Supervisor
  2. The shape subsystem (processes that are resilient to connection failures), started with Electric.Replication.Supervisor

These two subsystems can now go down independently. Timeline changes or new slots now trigger a restart of the shape subsystem.

The supervision tree now looks like this:

StackSupervisor
  • utility processes such as the EtsInspector that can be restarted independently
  • MonitoredCoreSupervisor
    • StatusMonitor
    • CoreSupervisor
      • Replication.Supervisor (or perhaps it should be called the ShapeSystemSupervisor)
      • Connection.Manager.Supervisor (or perhaps it should be called the ConnectionSystemSupervisor)
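
A minimal sketch of what the CoreSupervisor layer could look like under this split, assuming it is a plain Supervisor and that both subsystem supervisors accept a keyword list of options; the module names follow the tree above, but the actual child specs, options and strategy in the PR may differ:

    defmodule Electric.CoreSupervisor do
      # Sketch only: the real child specs and start options may differ.
      use Supervisor

      def start_link(opts) do
        Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
      end

      @impl true
      def init(opts) do
        children = [
          # Shape subsystem: processes that are resilient to connection failures
          {Electric.Replication.Supervisor, opts},
          # Connection subsystem: processes that may exit on a connection failure
          {Electric.Connection.Manager.Supervisor, opts}
        ]

        # :one_for_one so that a crash in one subsystem does not automatically
        # take the other down with it.
        Supervisor.init(children, strategy: :one_for_one)
      end
    end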

This PR does not look to address making the ShapeLogCollector resilient to connection failure; that will be done in a separate PR.

This PR is currently a WIP due to merge issues. It would be good to get some feedback on the approach, and then I suspect I'll need to make quite a few changes to rebase over @alco's PRs (#3198 and #3230).

@netlify netlify bot commented Oct 6, 2025

Deploy Preview for electric-next ready!

  • Latest commit: 47f588a
  • Latest deploy log: https://app.netlify.com/projects/electric-next/deploys/68e6553cfd58ef0008e5b89e
  • Deploy Preview: https://deploy-preview-3238--electric-next.netlify.app

@alco (Member) left a comment

Amazing work!

Replication.Supervisor (or perhaps it should be called the ShapeSystemSupervisor)
Connection.Manager.Supervisor (or perhaps it should be called the ConnectionSystemSupervisor)

Sounds good to me, but I would keep it consistent with the code comments and call them ShapeSubsystemSupervisor and ConnectionSubsystemSupervisor.


The only change that bugs me is that the whole replication supervisor has to be started before the connection manager starts initializing database connections. Since the two subsystems can now fail independently of each other, it would only be fair for them to start in parallel and reduce the time until the stack becomes ready.

The consumers_ready message needs to be transformed into a StatusMonitor entry which the connection manager can look up and wait on.

If the connection manager detects a timeline change, there's no other recourse than terminating the shape subsystem (even if it's still starting up) and starting it with a clean state. I don't particularly see the reason to treat the case of count_shapes==0 differently: this case seems like a rare outlier.
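
For reference, a rough sketch of the "terminate the shape subsystem and start it with a clean state" recourse, assuming the shape subsystem is a child of a plain CoreSupervisor registered under the Electric.Replication.Supervisor child id (names taken from the PR description; the actual wiring, and the storage purge itself, would live elsewhere in the real code):

    # Sketch: bounce the shape subsystem after a timeline change is detected.
    def restart_shape_subsystem(core_supervisor) do
      :ok = Supervisor.terminate_child(core_supervisor, Electric.Replication.Supervisor)
      {:ok, _pid} = Supervisor.restart_child(core_supervisor, Electric.Replication.Supervisor)
      :ok
    end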


Regarding the coordination with my PR, I can rebase yours on top of mine myself, if mine is merged first. I don't think it makes sense for you to switch context to the new way of config passing here, you can just keep working in context of main and bring the PR to completion.

It will be easier for me to rebase it later since that's basically what I've already done in my PR, a number of times actually as I've had to resolve conflicts with multiple other PRs that got merged to main.

# `Electric.Replication.Supervisor` which is responsible for starting the shape log collector
# and individual shape consumer process trees.
#
# See the moduledoc in `Electric.Connection.Supervisor` for more info.

This whole comment is outdated and badly misplaced; I would suggest removing it and adding the following paragraph to the comment block just above the def children_application line:

    # The root application supervisor starts the core service processes, such as the HTTP
    # server for the HTTP API and telemetry exporters, and a single StackSupervisor, basically
    # making the application run in a single-tenant mode where all API requests are forwarded
    # to that sole tenant.

@robacourt (Contributor, Author)

I don't particularly see the reason to treat the case of count_shapes==0 differently: this case seems like a rare outlier.

It's not rare at all, it happens when the system first starts up because the slot is new. It would seem odd to start the replication client twice on first start-up. I specifically added that condition for this case.

@robacourt (Contributor, Author)

The only change that bugs me is that the whole replication supervisor has to be started before the connection manager starts initializing database connections. Since the two subsystems can now fail independently of each other, it would only be fair for them to start in parallel and reduce the time until the stack becomes ready.

As far as I know there's no way to do that in the supervision tree, so I assume you mean for the ShapeCache to load the shapes as part of a :continue rather than as part of init?

@alco (Member) commented Oct 7, 2025

I don't particularly see the reason to treat the case of count_shapes==0 differently: this case seems like a rare outlier.

It's not rare at all, it happens when the system first starts up because the slot is new. It would seem odd to start the replication client twice on first start-up. I specifically added that condition for this case.

Ah, I see. Because the new slot creation sets purge_all_shapes to true.

I don't think we should reset storage in this case. When the system first starts up, it doesn't have any timeline stored, so the timeline check returns :ok. We should be able to identify this state and skip resetting the storage.

So what should be happening is this:

  • timeline check returns :no_previous_timeline or similar
  • conn man skips checking state.purge_all_shapes and just moves on

If we do have a previous timeline stored, then state.purge_all_shapes is checked together with the timeline check to decide whether to reset the storage.
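
A sketch of that branching as it might sit inside the connection manager, assuming the timeline check returns :ok, :no_previous_timeline or an error tuple, and with purge_all_shapes/1 as a hypothetical stand-in for whatever actually resets the shape storage (the atoms and field names come from this discussion, not necessarily the real code):

    defp maybe_purge_shapes(timeline_check_result, state) do
      case timeline_check_result do
        :no_previous_timeline ->
          # First start-up: no timeline stored yet, so skip checking
          # state.purge_all_shapes and leave the shape storage alone.
          :ok

        :ok ->
          # Known, unchanged timeline: purge only if the replication client
          # created a new slot and therefore set purge_all_shapes to true.
          if state.purge_all_shapes, do: purge_all_shapes(state), else: :ok

        {:error, :timeline_changed} ->
          # Timeline mismatch: the stored shape data is stale, reset it.
          purge_all_shapes(state)
      end
    end

    # Hypothetical helper standing in for the real storage reset.
    defp purge_all_shapes(_state), do: :ok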

@alco (Member) commented Oct 7, 2025

The only change that bugs me is that the whole replication supervisor has to be started before the connection manager starts initializing database connections. Since the two subsystems can now fail independently of each other, it would only be fair for them to start in parallel and reduce the time until the stack becomes ready.

So there's no way as far as I know to do that in the supervision tree, so I assume you mean for ShapeCache to load the shapes as part of a :continue rather than as part of init?

Yes, that seems to be the way to do it. But we can no longer just send a message to the connection manager because it may not be up by that time. So the coordination about consumers being ready needs to happen via StatusMonitor, IMO.
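
For illustration, a minimal sketch of the :continue approach, with ShapeCacheSketch and load_shapes/1 as stand-ins for the real ShapeCache and its shape-loading code (this is the general GenServer pattern, not the PR's actual implementation):

    defmodule ShapeCacheSketch do
      use GenServer

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

      @impl true
      def init(opts) do
        # Return immediately so the rest of the supervision tree (including
        # the connection subsystem) can carry on starting in parallel...
        {:ok, %{opts: opts, shapes: nil}, {:continue, :load_shapes}}
      end

      @impl true
      def handle_continue(:load_shapes, state) do
        # ...and do the potentially slow shape loading here, after init/1
        # has returned.
        {:noreply, %{state | shapes: load_shapes(state.opts)}}
      end

      # Hypothetical stand-in for the real shape-loading code.
      defp load_shapes(_opts), do: %{}
    end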

@robacourt (Contributor, Author) commented Oct 7, 2025

I don't think we should reset storage in this case. When the system first starts up, it doesn't have any timeline stored, so the timeline check returns :ok. We should be able to identify this state and skip resetting the storage.

I like the idea of skipping the reset, but it's not the timeline that triggers it, it's replication_client_created_new_slot being sent by the ReplicationClient. I'm not quite sure at this stage how I would stop that.

(edit) Oh, actually you're saying I could use the :no_previous_timeline to not check purge_all_shapes. Ok, that could work. Thanks!

@robacourt (Contributor, Author)

The consumers_ready message needs to be transformed into a StatusMonitor entry which the connection manager can look up and wait on.

Where we used to call consumers_ready we now call ShapeLogCollector.set_last_processed_lsn, which in turn calls StatusMonitor.mark_shape_log_collector_ready, so I assume that should be enough unless I'm missing something?

@alco (Member) commented Oct 8, 2025

@robacourt oh yes, this way it's even better: when Connection Manager is ready to start streaming, it checks whether the ShapeLogCollector itself is ready. If not, it waits for it to become ready and then instructs the replication client to start streaming. You've hit the nail on the head!
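
A hypothetical sketch of that coordination inside the connection manager; StatusMonitor.wait_for_shape_log_collector/1 and ReplicationClient.start_streaming/1 are illustrative names inferred from this discussion, not confirmed APIs:

    # Sketch only: function names and message shapes are assumptions.
    def handle_continue(:start_streaming, state) do
      case StatusMonitor.wait_for_shape_log_collector(state.stack_id) do
        :ok ->
          # The shape log collector has marked itself ready, so it is safe to
          # start streaming replication messages into it.
          ReplicationClient.start_streaming(state.replication_client_pid)
          {:noreply, state}

        {:error, :timeout} ->
          # Not ready yet: retry shortly rather than crashing the connection
          # subsystem.
          Process.send_after(self(), :retry_start_streaming, 500)
          {:noreply, state}
      end
    end

    def handle_info(:retry_start_streaming, state),
      do: {:noreply, state, {:continue, :start_streaming}}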

@robacourt force-pushed the rob/define-shape-subsystem branch from b20359f to f19a25f on October 8, 2025 at 11:50
@robacourt force-pushed the rob/define-shape-subsystem branch from f19a25f to 47f588a on October 8, 2025 at 12:12
@codecov codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 70.93023% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.96%. Comparing base (603d5ef) to head (005cab3).
⚠️ Report is 110 commits behind head on main.

Files with missing lines (patch coverage / lines missing):
  • ...es/sync-service/lib/electric/connection/manager.ex: 0.00%, 10 missing ⚠️
  • ...kages/sync-service/lib/electric/core_supervisor.ex: 85.18%, 8 missing ⚠️
  • ...vice/lib/electric/connection/manager/supervisor.ex: 16.66%, 5 missing ⚠️
  • ...ckages/sync-service/lib/electric/status_monitor.ex: 0.00%, 2 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3238      +/-   ##
==========================================
- Coverage   76.98%   76.96%   -0.03%     
==========================================
  Files         180      181       +1     
  Lines        9639     9652      +13     
  Branches      334      333       -1     
==========================================
+ Hits         7421     7429       +8     
- Misses       2216     2221       +5     
  Partials        2        2              
Flag Coverage Δ
elixir 75.26% <70.93%> (-0.03%) ⬇️
elixir-client 73.94% <ø> (-0.53%) ⬇️
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/typescript-client 94.37% <ø> (ø)
packages/y-electric 55.12% <ø> (ø)
postgres-140000 75.05% <70.93%> (-0.20%) ⬇️
postgres-150000 75.04% <70.93%> (-0.04%) ⬇️
postgres-170000 ?
postgres-180000 75.34% <70.93%> (+0.25%) ⬆️
sync-service 75.40% <70.93%> (+0.02%) ⬆️
typescript 87.27% <ø> (ø)
unit-tests 76.96% <70.93%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

