feat: WAL-based RocksDB replication with HTTP streaming and failover#366

Open
JackGuslerGit wants to merge 11 commits into matrix-construct:main from JackGuslerGit:failover

Conversation

@JackGuslerGit
Contributor

This relates to #35.

Summary:

  • Adds a primary/secondary replication system using RocksDB's WAL (Write-Ahead Log) streamed over HTTP
  • Secondary bootstraps from a full checkpoint on startup, then streams incremental WAL frames
  • Failover is triggered via POST /_tuwunel/replication/promote — no process restart needed
  • All replication endpoints are protected by a shared secret token
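The PR description doesn't show the wire format used for the streamed WAL frames. As a rough illustration of the kind of framing such a stream needs, here is a minimal length-prefixed frame codec in Rust; the `[seq][len][payload]` layout is hypothetical and not necessarily what this PR implements:

```rust
// Hypothetical WAL frame layout: [seq: u64 BE][len: u32 BE][payload bytes].
// Illustrative sketch only; the PR's actual wire format may differ.
fn encode_frame(seq: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(12 + payload.len());
    buf.extend_from_slice(&seq.to_be_bytes());
    buf.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    buf.extend_from_slice(payload);
    buf
}

fn decode_frame(buf: &[u8]) -> Option<(u64, &[u8])> {
    if buf.len() < 12 {
        return None; // incomplete header
    }
    let seq = u64::from_be_bytes(buf[0..8].try_into().ok()?);
    let len = u32::from_be_bytes(buf[8..12].try_into().ok()?) as usize;
    if buf.len() < 12 + len {
        return None; // incomplete payload
    }
    Some((seq, &buf[12..12 + len]))
}

fn main() {
    let frame = encode_frame(281, b"write-batch-bytes");
    let (seq, payload) = decode_frame(&frame).expect("valid frame");
    assert_eq!(seq, 281);
    assert_eq!(payload, b"write-batch-bytes");
    println!("decoded seq {seq} with {} payload bytes", payload.len());
}
```

A sequence number per frame lets the secondary resume the stream from its last applied position after a reconnect, which is why most WAL-shipping protocols carry one.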

Test plan:

  • Ran two Docker containers (primary on :8008, secondary on :8009)
  • Secondary bootstrapped from primary checkpoint at seq 281 and began streaming
  • Stopped primary with docker stop (graceful SIGTERM)
  • Promoted secondary via curl — responded {"status":"promoted"}
  • All messages from before the failover were present on the promoted instance
  • Measured RPO ~0 on planned failover, RTO = seconds

Relevant config options added:

  • rocksdb_primary_url — URL of primary for WAL streaming
  • rocksdb_replication_token — shared secret for endpoint auth
  • rocksdb_replication_interval_ms — heartbeat interval (default 250ms)
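Putting the options above together, a secondary's configuration might look like the following sketch (values and the exact section placement are placeholders, not taken from the PR):

```toml
# Hypothetical secondary-side configuration using the options listed above.
# Hostnames, port, and token value are placeholders.
rocksdb_primary_url = "http://core1:8008"
rocksdb_replication_token = "change-me"
rocksdb_replication_interval_ms = 250
```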

@JackGuslerGit JackGuslerGit marked this pull request as ready for review March 12, 2026 14:54
@JackGuslerGit JackGuslerGit marked this pull request as draft March 12, 2026 14:59
@JackGuslerGit JackGuslerGit marked this pull request as ready for review March 13, 2026 15:04
@pschichtel

this implements async replication, so some data loss is to be expected after an unexpected failover (node failure, disk failure, process crash, ...), right?

@JackGuslerGit
Contributor Author

@pschichtel yes, that is correct. Under normal write load, RPO is determined just by network RTT.

@JackGuslerGit
Contributor Author

Hey @x86pup. It seems the CI failures on this PR are all runner-side cache issues. I am seeing two issues:

1: failed to create locked file '/opt/rust/cargo/debian/x86_64-linux-gnu/git/db/rust-rocksdb-eed8465c83fc7d81/config.lock': File exists; class=Os (2); code=Locked (-14)

2: could not open '/opt/rust/cargo/debian/x86_64-linux-gnu/git/checkouts/ruma-8006605ea0e2ea25/30d063c/crates/ruma-events/src/poll/unstable_start/unstable_poll_answers_serde.rs' for writing: No such file or directory; class=Os (2)

Can you clean these up and re-run?

@jevolk
Member

jevolk commented Mar 17, 2026

> Hey @x86pup. It seems the CI failures on this PR are all runner-side cache issues. I am seeing two issues:
>
> 1: failed to create locked file '/opt/rust/cargo/debian/x86_64-linux-gnu/git/db/rust-rocksdb-eed8465c83fc7d81/config.lock': File exists; class=Os (2); code=Locked (-14)
>
> 2: could not open '/opt/rust/cargo/debian/x86_64-linux-gnu/git/checkouts/ruma-8006605ea0e2ea25/30d063c/crates/ruma-events/src/poll/unstable_start/unstable_poll_answers_serde.rs' for writing: No such file or directory; class=Os (2)
>
> Can you clean these up and re-run?

These docker flakes sometimes occur when CI is really busy, apologies! We'll be happy to rerun as necessary.

@jevolk
Member

jevolk commented Mar 17, 2026

I haven't had a chance to thoroughly review this yet since I'm currently away, but a few things stand out as suspicious.

Foremost, it's not clear why WAL streaming is necessary. RocksDB already has internal mechanisms to synchronize primary and secondary; all that's missing is the promotion signalling. What is the basis for concerning ourselves with binary framing of RocksDB's inner workings at the user level? Is the RocksDB synchronization API being invoked here? Perhaps I missed it...

@JackGuslerGit
Contributor Author

@jevolk Yes, you are correct, TryCatchUpWithPrimary() handles sync internally. Correct me if I'm wrong, but it requires the secondary instance to open the primary's database directory directly. This works when both instances share the same filesystem (same machine or NFS mount).

In our case, we have a cluster of physical servers where each server has its own local disk. We can't use NFS/shared storage in our infrastructure. So core2 has no direct filesystem access to core1's RocksDB directory.

That's the gap we're trying to fill by replicating the WAL and SST files over the network, so core2 can stay in sync with core1 without shared storage. Once core2 has a local copy of the data, TryCatchUpWithPrimary() could potentially still be used if we mirror the primary's files locally, or we apply the WAL batches ourselves.

Is there a mechanism in RocksDB you'd recommend for this case, or is shared storage assumed in your deployment model?

@jevolk
Member

jevolk commented Mar 18, 2026

> @jevolk Yes, you are correct, TryCatchUpWithPrimary() handles sync internally. Correct me if I'm wrong, but it requires the secondary instance to open the primary's database directory directly. This works when both instances share the same filesystem (same machine or NFS mount).
>
> In our case, we have a cluster of physical servers where each server has its own local disk. We can't use NFS/shared storage in our infrastructure. So core2 has no direct filesystem access to core1's RocksDB directory.
>
> That's the gap we're trying to fill by replicating the WAL and SST files over the network, so core2 can stay in sync with core1 without shared storage. Once core2 has a local copy of the data, TryCatchUpWithPrimary() could potentially still be used if we mirror the primary's files locally, or we apply the WAL batches ourselves.
>
> Is there a mechanism in RocksDB you'd recommend for this case, or is shared storage assumed in your deployment model?

Alright so this is not limited to shared filesystem mounts, that's rather exciting actually. Keep up the good work 👍

@JackGuslerGit
Contributor Author

@jevolk Thanks! I see it has passed all checks, what's the next step here?

@x86pup
Member

x86pup commented Mar 18, 2026

It needs to be thoroughly reviewed, especially since the usage of AI is apparent. Jason is on vacation and will get to it soon. Thank you for ensuring CI passes to help this along.

@JackGuslerGit
Contributor Author

Okay sounds good, thanks for letting me know!

@jevolk jevolk self-assigned this Mar 20, 2026
@jevolk
Member

jevolk commented Apr 2, 2026

Thank you for your patience 🙏 I'm right around the corner now...

@jevolk jevolk linked an issue Apr 4, 2026 that may be closed by this pull request
jevolk pushed a commit that referenced this pull request Apr 5, 2026
Add query and stream features; enhance replication routes and logic
jevolk added a commit that referenced this pull request Apr 5, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 5, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
@JackGuslerGit
Contributor Author

Hey @jevolk. I ran into an issue with the checkpoint logic while testing. The original code swapped the RocksDB database directory while RocksDB was already running, causing file corruption because open file descriptors still pointed to the old files while new writes went to the checkpoint copy. The fix moves the checkpoint download and filesystem swap to before RocksDB opens, so the database always starts fresh from a clean checkpoint with no live files being touched.
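The corrected ordering described above (finish the checkpoint swap, then open the database) can be sketched with plain `std::fs` operations. The paths, directory names, and the idea of staging the downloaded checkpoint in a sibling directory are illustrative assumptions, not the PR's actual code; the point is only that the rename completes before any RocksDB file handle exists:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Sketch of the fixed ordering: stage the downloaded checkpoint next to the
// database directory, atomically rename it into place, and only open RocksDB
// after this returns. All names here are placeholders for illustration.
fn swap_in_checkpoint(db_dir: &Path, staged_checkpoint: &Path) -> io::Result<()> {
    let old = db_dir.with_extension("old");
    if db_dir.exists() {
        fs::rename(db_dir, &old)?; // move the stale database aside
    }
    fs::rename(staged_checkpoint, db_dir)?; // promote the checkpoint copy
    if old.exists() {
        fs::remove_dir_all(&old)?; // discard the stale files
    }
    Ok(()) // safe to open RocksDB on db_dir from here on
}

fn main() -> io::Result<()> {
    let base = std::env::temp_dir().join("ckpt-swap-demo");
    let _ = fs::remove_dir_all(&base);
    let db = base.join("db");
    let staged = base.join("staged");
    fs::create_dir_all(&db)?;
    fs::create_dir_all(&staged)?;
    fs::write(staged.join("CURRENT"), "MANIFEST-000005\n")?;
    swap_in_checkpoint(&db, &staged)?;
    assert!(db.join("CURRENT").exists());
    println!("checkpoint swapped into {}", db.display());
    Ok(())
}
```

Doing the swap before open sidesteps the corruption mode described in the comment: no live file descriptors can point at the replaced files, because none have been created yet.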

@jevolk
Member

jevolk commented Apr 8, 2026

> Hey @jevolk. I ran into an issue with the checkpoint logic while testing. The original code swapped the RocksDB database directory while RocksDB was already running, causing file corruption because open file descriptors still pointed to the old files while new writes went to the checkpoint copy. The fix moves the checkpoint download and filesystem swap to before RocksDB opens, so the database always starts fresh from a clean checkpoint with no live files being touched.

Thanks for finding this. I made an attempt at merging this but ran out of time before the 1.6 release, with a few loose ends still. The main loose end was switching to CBOR for the wire format, which makes more sense for several reasons.

@jevolk
Member

jevolk commented Apr 15, 2026

I'll be revisiting this again at the top of the 1.6.1 dev cycle (start of next week). I only have a small number of re-organizations left, plus applying CBOR (which is hugely simplifying), so this should go in pretty early on. Thank you again for your patience 🙏🏻



Development

Successfully merging this pull request may close these issues.

Hot-failover with load-balanced spare

4 participants