
Conversation

joostjager
Contributor

@joostjager joostjager commented Jul 15, 2025

Async filesystem store with eventually consistent writes. It just uses tokio's spawn_blocking, because that is what tokio::fs would do internally anyway. Using tokio::fs directly would make it complicated to reuse the sync code.

ldk-node try out: lightningdevkit/ldk-node@main...joostjager:ldk-node:async-fsstore
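
In rough terms, the async methods just hand the existing sync write off to a blocking task. A minimal sketch of that shape (type and method names here are illustrative, not the actual FilesystemStore internals):

```rust
use std::future::Future;
use std::io;
use std::pin::Pin;
use std::sync::Arc;

// Illustrative stand-in for the existing synchronous store internals.
struct SyncStoreInner;

impl SyncStoreInner {
    fn write(&self, key: &str, value: Vec<u8>) -> io::Result<()> {
        // ... the existing synchronous, atomic file write ...
        let _ = (key, value);
        Ok(())
    }
}

struct AsyncStore {
    inner: Arc<SyncStoreInner>,
}

impl AsyncStore {
    // The async write simply offloads the sync path to a blocking task.
    fn write(
        &self, key: String, value: Vec<u8>,
    ) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send>> {
        let this = Arc::clone(&self.inner);
        Box::pin(async move {
            tokio::task::spawn_blocking(move || this.write(&key, value))
                .await
                .expect("blocking task panicked")
        })
    }
}
```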

@ldk-reviews-bot

ldk-reviews-bot commented Jul 15, 2025

👋 Thanks for assigning @TheBlueMatt as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@joostjager joostjager changed the title from Async fsstore to Async FilesystemStore on Jul 15, 2025
@joostjager joostjager force-pushed the async-fsstore branch 4 times, most recently from 29b8bcf to 81ad668 on July 15, 2025 13:40
let this = Arc::clone(&self.inner);

Box::pin(async move {
tokio::task::spawn_blocking(move || {
Contributor

Mhh, so I'm not sure if spawning blocking tasks for every IO call is the way to go (see for example https://docs.rs/tokio/latest/tokio/fs/index.html#tuning-your-file-io: "To get good performance with file IO on Tokio, it is recommended to batch your operations into as few spawn_blocking calls as possible."). Maybe there are other designs that we should at least consider before moving forward with this approach. For example, we could create a dedicated pool of longer-lived worker task(s) that process a queue?

If we use spawn_blocking, can we give the user control over exactly which runtime this will be spawned on? Also, rather than just wrapping, should we be using tokio::fs?
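
For the worker-queue idea, very roughly something along these lines (a hypothetical sketch, not a concrete proposal; `WriteJob` and `spawn_write_worker` are made-up names):

```rust
use tokio::sync::{mpsc, oneshot};

// Hypothetical work item: the data to write plus a channel to report the result on.
struct WriteJob {
    key: String,
    value: Vec<u8>,
    done: oneshot::Sender<std::io::Result<()>>,
}

// A single long-lived blocking worker draining a queue, instead of one
// spawn_blocking call per IO operation.
fn spawn_write_worker<F>(write_fn: F) -> mpsc::UnboundedSender<WriteJob>
where
    F: Fn(&str, Vec<u8>) -> std::io::Result<()> + Send + 'static,
{
    let (tx, mut rx) = mpsc::unbounded_channel::<WriteJob>();
    tokio::task::spawn_blocking(move || {
        while let Some(job) = rx.blocking_recv() {
            let res = write_fn(&job.key, job.value);
            // The caller awaits this oneshot to learn when its write has landed.
            let _ = job.done.send(res);
        }
    });
    tx
}
```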

Contributor Author

> Mhh, so I'm not sure if spawning blocking tasks for every IO call is the way to go (see for example https://docs.rs/tokio/latest/tokio/fs/index.html#tuning-your-file-io: "To get good performance with file IO on Tokio, it is recommended to batch your operations into as few spawn_blocking calls as possible.").

If we should batch operations, I think the current approach is better than using tokio::fs, because it already batches the various operations inside kvstoresync::write.

Further batching probably needs to happen at a higher level in LDK, and might be a bigger change. Not sure if that is worth it just for FilesystemStore, especially when that store is not the preferred store for real-world usage?

> For example, we could create a dedicated pool of longer-lived worker task(s) that process a queue?

Isn't Tokio doing that already when a task is spawned?

> If we use spawn_blocking, can we give the user control over exactly which runtime this will be spawned on? Also, rather than just wrapping, should we be using tokio::fs?

With tokio::fs, the current runtime is used. I'd think that is then also sufficient if we spawn ourselves, without a need to specify which runtime exactly?

More generally, I think the main purpose of this PR is to show how an async kvstore could be implemented, and to have something for testing potentially. Additionally, if there are users that really want to use this type of store in production, they could. But I don't think it is something to spend too much time on. A remote database is probably the more important target to design for.

Contributor

> With tokio::fs, the current runtime is used. I'd think that is then also sufficient if we spawn ourselves, without a need to specify which runtime exactly?

Hmm, I'm not entirely sure, especially for users that have multiple runtime contexts floating around, it might be important to make sure the store uses a particular one (cc @domZippilli ?). I'll also have to think through this for LDK Node when we make the switch to async KVStore there, but happy to leave as-is for now.
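
For instance, the store could hold an explicit tokio::runtime::Handle and run the blocking work on that, rather than on whatever runtime context the caller happens to be in. A rough sketch with made-up names:

```rust
use tokio::runtime::Handle;

// Hypothetical store that pins blocking work to a caller-provided runtime.
struct StoreWithRuntime {
    runtime: Handle,
}

impl StoreWithRuntime {
    async fn write(&self, key: String, value: Vec<u8>) -> std::io::Result<()> {
        // spawn_blocking on the handle runs the work on the chosen runtime's
        // blocking pool, regardless of the current runtime context.
        self.runtime
            .spawn_blocking(move || {
                // ... synchronous file write for `key` / `value` ...
                let _ = (key, value);
                Ok(())
            })
            .await
            .expect("blocking task panicked")
    }
}
```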

}

/// Provides additional interface methods that are required for [`KVStore`]-to-[`KVStore`]
/// data migration.
pub trait MigratableKVStore: KVStore {
pub trait MigratableKVStore: KVStoreSync {
Contributor

How will we solve this for an async KVStore?

Contributor Author

I think this comment belongs in #3905?

We might not need to solve it now, as long as we still require a sync implementation alongside an async one? If we support async-only kvstores, then we can create an async version of this trait?
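
Roughly, such an async version of the trait could look like this (just a sketch with a made-up name, mirroring the existing list_all_keys):

```rust
use std::future::Future;
use std::io;
use std::pin::Pin;

/// Hypothetical async counterpart of `MigratableKVStore`, only needed once
/// async-only `KVStore` implementations are supported.
pub trait MigratableKVStoreAsync {
    /// Returns all (primary_namespace, secondary_namespace, key) triplets
    /// currently present in the store.
    fn list_all_keys<'a>(
        &'a self,
    ) -> Pin<Box<dyn Future<Output = Result<Vec<(String, String, String)>, io::Error>> + Send + 'a>>;
}
```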

@joostjager
Contributor Author

Removed garbage collector, because we need to keep the last written version.

@joostjager joostjager self-assigned this Jul 17, 2025
@joostjager joostjager mentioned this pull request Jul 17, 2025
@joostjager joostjager force-pushed the async-fsstore branch 2 times, most recently from 97d6b3f to 02dce94 on July 23, 2025 18:11

codecov bot commented Jul 23, 2025

Codecov Report

❌ Patch coverage is 91.09312% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.77%. Comparing base (c2d9b97) to head (2dbf59c).
⚠️ Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
lightning-persister/src/fs_store.rs 91.09% 10 Missing and 12 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3931      +/-   ##
==========================================
- Coverage   88.77%   88.77%   -0.01%     
==========================================
  Files         175      175              
  Lines      127760   128086     +326     
  Branches   127760   128086     +326     
==========================================
+ Hits       113425   113714     +289     
- Misses      11780    11807      +27     
- Partials     2555     2565      +10     
Flag Coverage Δ
fuzzing 22.21% <44.07%> (+0.10%) ⬆️
tests 88.61% <91.09%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@joostjager joostjager force-pushed the async-fsstore branch 2 times, most recently from c061fcd to 2492508 on July 24, 2025 08:31
@joostjager joostjager marked this pull request as ready for review July 24, 2025 08:32
@ldk-reviews-bot ldk-reviews-bot requested a review from tankyleo July 24, 2025 08:32
@joostjager joostjager force-pushed the async-fsstore branch 2 times, most recently from 9938dfe to 7d98528 on July 24, 2025 09:39
@joostjager joostjager force-pushed the async-fsstore branch 5 times, most recently from 38ab949 to dd9e1b5 on July 25, 2025 13:39
@joostjager
Contributor Author

joostjager commented Jul 25, 2025

Updated the code to not use an async wrapper, but to conditionally expose the async KVStore trait on FilesystemStore.

I haven't updated the ldk-node branch that uses this PR yet, because it seems many other things broke in main again.

@joostjager joostjager requested a review from tnull July 25, 2025 13:51
@tnull
Contributor

tnull commented Aug 20, 2025

> Fuzz sanity caught something. Interesting.

Are you referring to the current fuzz breakage? That is likely just breakage post-#3897, which should be fixed by #4022, so a rebase should fix it for you.

@joostjager
Contributor Author

Rebased to see if fuzz error disappears

let inner_lock_ref: Arc<RwLock<AsyncState>> = self.get_inner_lock_ref(dest_file_path);

let new_version = {
let mut async_state = inner_lock_ref.write().unwrap();
Collaborator

Bleh, this means that if there's a write happening for a key and another write starts for the same key, the task spawning the second write asynchronously will end up blocking until the first write completes. This should be easy to remedy by moving the lock onto just the latest_written_version field and making the latest_version field an atomic.

Contributor Author

I don't know about also adding an atomic. I see new edge cases coming towards me then, especially because we now also use the latest version to determine if there are writes in flight.

Isn't it acceptable to block in case the same file is written again, and is that even likely to happen? The big win is parallel writes to different files, and we got that.

Collaborator

Hmm, I don't see using an atomic for this as that complicated. Specifically, I don't think we're actually relying on the lock here at all.

Contributor Author

I was thinking of the line

`let more_writes_pending = async_state.latest_written_version < async_state.latest_version;`

I think there are gaps again when using an atomic?

Collaborator

No because we also check the Arc reference count. Basically, the steps are (a) take the top-level lock and with that lock get a reference, then (b) get a version number then do the write, and finally (c) take the top-level lock and with that lock check if there's other references not yet complete. In fact, for the purpose of cleaning the map, I don't think we need to be looking at the version at all. The version really only needs to order the writes.
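
As a minimal sketch of those steps (illustrative names only, not the actual FilesystemStore fields, and glossing over the stale-write check):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct KeyState {
    latest_version: u64,
    latest_written_version: u64,
}

#[derive(Default)]
struct Store {
    // Top-level lock mapping each destination path to its per-key state.
    locks: Mutex<HashMap<String, Arc<Mutex<KeyState>>>>,
}

impl Store {
    fn write(&self, path: &str, do_write: impl FnOnce()) {
        // (a) Take the top-level lock and grab a reference for this key.
        let key_state =
            Arc::clone(self.locks.lock().unwrap().entry(path.to_string()).or_default());

        // (b) Get a version number to order concurrent writes to the same key,
        // then perform the write outside the top-level lock. A real
        // implementation would skip the disk write if a newer version has
        // already landed (the is_stale_version check).
        let my_version = {
            let mut state = key_state.lock().unwrap();
            state.latest_version += 1;
            state.latest_version
        };
        do_write();
        {
            let mut state = key_state.lock().unwrap();
            if my_version > state.latest_written_version {
                state.latest_written_version = my_version;
            }
        }

        // (c) Re-take the top-level lock; if nobody else holds a reference
        // (only the map entry and ours remain), the entry can be removed.
        let mut locks = self.locks.lock().unwrap();
        if Arc::strong_count(&key_state) <= 2 {
            locks.remove(path);
        }
    }
}
```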

Collaborator

In fact, it might be more readable to explicitly disentangle the version numbers from the map cleanup, separating the concepts.

Contributor Author

I thought it was a nice simplification to remove the inflight_write counter and instead derive that value from latest_version and latest_written_version. But if I understand you correctly, you are suggesting that the Arc is already the inflight_write counter. Because of that earlier refactor where the inner lock ref is only obtained once and passed into the future, it indeed seems to work exactly like that. Cool. Added fixup.

@joostjager
Contributor Author

Fuzzer found an issue, fixup commit "f: fix remove clean up"

Contributor

@tnull tnull left a comment

Took a first look at the fuzzer parts. I wonder if we would get any notable performance benefit from running the FilesystemStore fuzzer on a ramdisk? Or would we even lose some coverage going this way as it's exactly the IO latency that increases the chances of running into race conditions etc?

fuzz/Cargo.toml Outdated
bech32 = "0.11.0"
bitcoin = { version = "0.32.2", features = ["secp-lowmemory"] }
tokio = { version = "1.35.*", default-features = false, features = ["rt-multi-thread"] }
Contributor

nit: This is more common (note not 100% equivalent, but probably preferable):

Suggested change
tokio = { version = "1.35.*", default-features = false, features = ["rt-multi-thread"] }
tokio = { version = "1.35", default-features = false, features = ["rt-multi-thread"] }

Or is there any reason we don't want any API-compatible version 1.36 and above?

Contributor Author

Yes, it doesn't work with rust 1.63

Contributor

> Yes, it doesn't work with rust 1.63

Huh, but why can we get away with 1.35 below in the actual lightning-persister dependency then? Also, while the * works, you'd usually rather see ~1.35 used.

Contributor Author

@joostjager joostjager Aug 22, 2025

For some reason, cargo decided that 1.35 could safely be bumped to 1.47. Also happened in CI.

```
error: package `tokio v1.47.1` cannot be built because it requires rustc 1.70 or newer, while the currently active rustc version is 1.63.0

~/repo/rust-lightning/fuzz (async-fsstore ✗) cargo tree -i tokio
tokio v1.47.1
├── lightning-fuzz v0.0.1 (/Users/joost/repo/rust-lightning/fuzz)
└── lightning-persister v0.2.0+git (/Users/joost/repo/rust-lightning/lightning-persister)
    └── lightning-fuzz v0.0.1 (/Users/joost/repo/rust-lightning/fuzz)
```

Contributor Author

Made it ~1.35, reads nicer indeed.

use lightning_fuzz::utils::test_logger::StringBuffer;

use std::sync::{atomic, Arc};
// {
Contributor

nit: Remove commented-out code.

Contributor Author

Ah yes. I was still wondering what that code was for. Some default fuzz string sanity check?

let secondary_namespace = "secondary";
let key = "key";

// Remove the key in case something was left over from a previous run.
Contributor

Hmm, rather than doing this, do we want to add a random suffix to temp_path above, so that we're sure to start with a clean directory every time? Also, do we want to clean up the filesystem store directory at the end of the run, similar to what we do in lightning-persister tests?

Contributor Author

@joostjager joostjager Aug 22, 2025

Added random suffixes. It is also necessary because fuzzing runs in parallel. I used uuid for simplicity, but can also generate names differently if preferred.

Also added cleanup. I couldn't just copy the Drop implementation, because FilesystemStore isn't in the same crate. So I created a wrapper for it. Maybe there is a better way to do it.
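
The wrapper is essentially a newtype with a Drop impl, something like this (sketch; names differ in the actual fuzz target):

```rust
use std::path::PathBuf;

// Newtype owning the per-run directory, so the fuzz target can hook Drop
// without touching FilesystemStore itself.
struct TempStoreDir {
    path: PathBuf,
}

impl Drop for TempStoreDir {
    fn drop(&mut self) {
        // Best-effort cleanup of the per-run directory when the case ends.
        let _ = std::fs::remove_dir_all(&self.path);
    }
}
```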

let fut = futures.remove(fut_idx);

fut.await.unwrap();
},
Contributor

It shouldn't change anything, but do we want to throw in some coverage for KVStore::list for good measure?

Contributor Author

@joostjager joostjager Aug 22, 2025

Added. Only, I don't think we can assert anything because things may be in flight. It does add some extra variation to the test to also call list during async ops.

Contributor Author

Also added read. Same story, nothing to assert, but we do cover read execution during writes.

@joostjager
Contributor Author

Considered the RAM disk, but it is platform-specific. @TheBlueMatt suggested an alternative option, which is to allow injection of the actual disk write handler into FilesystemStore and to supply an in-memory implementation for fuzzing. But perhaps we are stretching the scope of this PR too much then, so I wanted to see if we can keep it to what it currently is?
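
For reference, that alternative would look roughly like this (hypothetical names, not implemented in this PR):

```rust
use std::collections::HashMap;
use std::io;
use std::sync::Mutex;

// Abstract the raw disk writes behind a trait so fuzzing can inject an
// in-memory backend instead of hitting the filesystem.
trait DiskWriter: Send + Sync {
    fn write_file(&self, path: &str, data: &[u8]) -> io::Result<()>;
}

struct RealDisk;

impl DiskWriter for RealDisk {
    fn write_file(&self, path: &str, data: &[u8]) -> io::Result<()> {
        std::fs::write(path, data)
    }
}

#[derive(Default)]
struct MemDisk {
    files: Mutex<HashMap<String, Vec<u8>>>,
}

impl DiskWriter for MemDisk {
    fn write_file(&self, path: &str, data: &[u8]) -> io::Result<()> {
        self.files.lock().unwrap().insert(path.to_string(), data.to_vec());
        Ok(())
    }
}
```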

@joostjager
Contributor Author

Fuzz passes, but some deviating log lines show up:

```
Sz:16 Tm:78,882us (i/b/h/e/p/c) New:0/0/0/0/0/9, Cur:0/0/0/1108/49/48567
Sz:5 Tm:17,602us (i/b/h/e/p/c) New:0/0/0/0/0/16, Cur:0/0/0/1108/49/48583
Sz:1 Tm:17,008us (i/b/h/e/p/c) New:0/0/0/0/0/5, Cur:0/0/0/1108/49/48588
Sz:22 Tm:75,985us (i/b/h/e/p/c) New:0/0/0/1/0/27, Cur:0/0/0/1109/49/48615
[2025-08-25T13:56:42+0000][W][28702] subproc_checkTimeLimit():532 pid=28711 took too much time (limit 1 s). Killing it with SIGKILL
[2025-08-25T13:56:42+0000][W][28703] subproc_checkTimeLimit():532 pid=28715 took too much time (limit 1 s). Killing it with SIGKILL
Sz:44 Tm:137,892us (i/b/h/e/p/c) New:0/0/0/1/0/32, Cur:0/0/0/1110/49/48647
Sz:7 Tm:71,188us (i/b/h/e/p/c) New:0/0/0/0/0/5, Cur:0/0/0/1110/49/48652
[2025-08-25T13:56:42+0000][W][28703] arch_checkWait():237 Persistent mode: pid=28715 exited with status: SIGNALED, signal: 9 (Killed)
Sz:3138 Tm:1,007,070us (i/b/h/e/p/c) New:0/0/0/263/3/13664, Cur:0/0/0/1373/52/62316
Sz:7 Tm:38,881us (i/b/h/e/p/c) New:0/0/0/3/0/55, Cur:0/0/0/1376/52/62371
[2025-08-25T13:56:42+0000][W][28702] arch_checkWait():237 Persistent mode: pid=28711 exited with status: SIGNALED, signal: 9 (Killed)
Sz:5662 Tm:1,013,168us (i/b/h/e/p/c) New:0/0/0/2/0/41, Cur:0/0/0/1378/52/62412
Persistent mode: Launched new persistent pid=30858
Persistent mode: Launched new persistent pid=30878
```

@joostjager
Contributor Author

Using /dev/shm as a ramdisk, if present, fixed the timeouts.
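
Roughly, the directory selection boils down to something like this (sketch with an illustrative function name):

```rust
use std::path::PathBuf;

// Prefer the tmpfs-backed /dev/shm when it exists (e.g. Linux CI), otherwise
// fall back to the regular temp dir.
fn fuzz_base_dir() -> PathBuf {
    let shm = PathBuf::from("/dev/shm");
    if shm.is_dir() {
        shm
    } else {
        std::env::temp_dir()
    }
}
```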

@joostjager
Contributor Author

Tested with a RAM disk on macOS using the tool https://github.com/conorarmstrong/macOS-ramdisk, to see whether it is now too fast to catch problems. I think it is ok. On my machine the RAM disk is about 10x faster than disk. Also, when removing the is_stale_version check, it is caught by the fuzzer.
