feat: introduce pluggable SpillFile trait and TempFileFactory for custom spill backends #21882
@alamb I opened this draft PR to get early feedback on the architecture.
I might be missing something here, so would really appreciate your guidance. |
|
Thanks -- will try and look at this shortly |
Kind of, though it seems like we would be accumulating technical debt, as we'll have APIs that will not be needed once we complete the work for SortMergeJoin. What do you think about making a first PR to migrate SortMergeJoin to use the spill abstraction?
Makes sense to me |
alamb
left a comment
Thanks @pantShrey - I reviewed this and the basic idea looks good to me. I do think it would be nice to have a unified (async) IO abstraction rather than leaving some hook around for sync IO and making this API more complicated
    used_disk_space: Arc<AtomicU64>,
    /// Number of active temporary files created by this disk manager
    active_files_count: Arc<AtomicUsize>,
    /// Custom Backend
A small nit: I think "custom" is a somewhat unnecessary term here. Perhaps
factory: Option<Arc<dyn TempFileFactory>>,
or
temp_file_factory: Option<Arc<dyn TempFileFactory>>,
would be more consistent with the rest of the codebase
        .collect()
}

pub struct OsSpillWriter {

/// Writer for spill file backends.
/// Receives zero-copy `Bytes` payloads from the IPCStreamWriter adapter.
pub trait SpillWriter: Send {
    fn write(&mut self, data: Bytes) -> Result<()>;
This is pretty similar to https://doc.rust-lang.org/std/io/trait.Write.html 🤔
Yes, you are right. The reason I didn't use the Write trait, which takes &[u8], is ownership. Some backends might queue chunks to a background task (e.g., S3 multipart via a channel) and need to hold the data past the return of the write() call. &[u8] can't express that, and it would force a second copy between the SpillWriteAdapter and the SpillWriter.
Also, the custom SpillWriter trait contains finish(), which maps perfectly to complete_multipart_upload for S3 and resource owner cleanup for Postgres.
This is all true -- however, since the underlying IPC writer takes a std::io::Write, forcing all backends to use Bytes will likely require an extra unnecessary copy anyway (see comments below on SpillWriteAdapter).
If you use a std::io::Write-like interface here, backends that want to queue chunks can still do so (by copying into Bytes buffers themselves).
Thus what I suggest is:
- Change this to look more like std::io::Write:
fn write(&mut self, data: &[u8]) -> Result<()>;
which will allow you to get rid of the write adapter
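A minimal sketch of the trade-off discussed in this thread, using only the standard library: a writer that accepts borrowed `&[u8]` through `std::io::Write` and copies each chunk into an owned buffer before queuing it to a background worker. `QueuedWriter`, its `finish` method, and the worker thread are hypothetical stand-ins (a real backend might hand the chunks to an S3 multipart upload), and `Vec<u8>` is used in place of `Bytes` to keep the example dependency-free.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical writer that queues owned chunks to a background thread.
struct QueuedWriter {
    tx: Sender<Vec<u8>>,
    worker: JoinHandle<Vec<u8>>,
}

impl QueuedWriter {
    fn new() -> Self {
        let (tx, rx) = channel::<Vec<u8>>();
        // Stand-in for an uploader: it simply concatenates the chunks it receives.
        let worker = thread::spawn(move || {
            let mut stored = Vec::new();
            for chunk in rx {
                stored.extend_from_slice(&chunk);
            }
            stored
        });
        Self { tx, worker }
    }

    /// Closes the queue and waits for the worker to drain it.
    /// Returns everything the worker stored (for illustration only).
    fn finish(self) -> io::Result<Vec<u8>> {
        let QueuedWriter { tx, worker } = self;
        drop(tx); // closing the channel ends the worker's receive loop
        worker
            .join()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "worker panicked"))
    }
}

impl Write for QueuedWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        // The borrowed slice cannot outlive this call, so the backend makes
        // its own owned copy before queuing it.
        self.tx
            .send(data.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "worker gone"))?;
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

In this shape the copy happens only in backends that need to retain the data; a backend that writes straight through to a file can use the slice directly.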
|
@alamb Thank you so much for the review! I scoped out the SortMergeJoin migration today, specifically looking at bitwise_stream.rs and process_key_match_with_filter, to see what it would take. Because SortMergeJoin currently reads from the spill file via a synchronous for loop inside a hand-rolled poll state machine, making the read path truly async requires a major rewrite. We can't just .await the stream, so we may need to store the SendableRecordBatchStream in the execution state and manually persist variables like matched_count across Poll::Pending yields. Because ParadeDB is hoping to unblock their Postgres integration next week, I'm worried a state machine rewrite of this scale will stall them. Would you be open to merging this core abstraction first (with open_sync_reader marked as #[deprecated])? I can open a dedicated tracking issue for the SortMergeJoin async migration and tackle it as a fast follow-up PR. I am happy to defer to your judgment if you feel the tech debt must be addressed first! |
How about we try it in parallel? |
@alamb Sure, I have already started working on that locally while waiting for the response. I am also still stuck on the failing test. The issue stems from the fact that RepartitionMerge now requires more memory than a RepartitionExec node: memory is greedily allocated to RepartitionExec, which could have spilled, instead of to RepartitionMerge, which cannot spill. I would really appreciate any guidance on this. Am I missing something obvious here? |
Sadly I am not familiar with this test, so I don't have a lot to offer you. Maybe you can look at the git history to see who introduced the test; they might have some ideas |
|
Hey @adriangb, Andrew suggested I reach out to you since you originally authored The test is currently stuck in a memory-accounting deadlock. Here’s what is happening:
I was able to trigger a spill once by setting the test memory limit to 608 B, but even that was not sufficient for the test to pass reliably. Is there a correct or idiomatic way to configure this test (batch sizes, data volume, memory pool limits, etc.) to reliably force a I would really appreciate any guidance you could provide. |
|
IIRC that test was added when we added spilling to RepartitionExec. Conceptually the test is simple: if RepartitionExec is configured to preserve order and it spills, we need to make sure that spilling did not shuffle the data. The orchestration, however, is difficult: forcing a RepartitionExec to spill usually requires skewed upstream partition consumption rates. You could try to change the test to, e.g., use a GroupBy, or maybe we can use a RepartitionExec in isolation if we pull from the streams in the right way. I think the structure can be changed quite a bit as long as we preserve the semantic meaning of the test. I am not surprised that it is pretty fragile to changes. |
force-pushed from 2971e41 to de6697f
|
@alamb I’ve addressed the nits and force-pushed the updates. Could you please trigger the CI and take another look when you have a moment? In the meantime, I am working on migrating |
|
@adriangb Thank you so much for the guidance! I updated the test to simply assert that a spill does occur |
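A rough sketch of the relaxed check described here, combined with the sortedness check mentioned in the PR description: the test passes when at least one spill occurred and the output is still in order. `RunSummary` and `spilled_and_sorted` are hypothetical names for illustration; the actual test works on record batches and operator metrics rather than plain integers.

```rust
/// Hypothetical summary of one test run: the number of spills observed and
/// the sort-key values of each output batch, in output order.
struct RunSummary {
    spill_count: usize,
    batches: Vec<Vec<i64>>,
}

/// True when the run spilled at least once and the concatenated output
/// is in non-decreasing order.
fn spilled_and_sorted(run: &RunSummary) -> bool {
    let keys: Vec<i64> = run.batches.iter().flatten().copied().collect();
    run.spill_count > 0 && keys.windows(2).all(|pair| pair[0] <= pair[1])
}
```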
That makes sense to me. |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
|
force-pushed from e31bff4 to 086632a
|
Hey @alamb, quick update! While working on the I believe the PR is now ready for review, so I've marked it as such. I'd appreciate another look whenever you have the time. Thank you! |
Great -- can you please make a PR for just the SMJ refactor and then stack this PR on it?
|
That will make it easier / faster to review (I am not a SMJ expert so I can't really review that part efficiently) |
force-pushed from 915532b to 6954b55
|
Hey @alamb, quick update! I've reworked both PRs to make them easier to review independently:
The plan is for #22230 to merge first, then I'll rebase this on top of it. Would be grateful if you could take a look whenever you get the chance! |
|
🤖 Benchmark completed (GKE): external_aggr, base (merge-base) vs. branch |
}

/// Writer for spill file backends.
pub trait SpillWriter: std::io::Write + Send {
It was also strange that the SpillWriter is a sync API, but the read stream API is async:
fn read_stream(&self) -> Result<Pin<Box<dyn Stream<Item = Result<Bytes>> + Send>>>;
I found this while working on an example showing how to write to a remote object store
|
@alamb I spent some time looking into the From what I can tell, increasing the My current understanding is that the previous
tokio_util::io::ReaderStream::with_capacity(file, 128 * 1024)
I've pushed the change mainly so the CI benchmarks can run and to get your thoughts. Does this direction make sense, or should I make it configurable instead? |
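To make the buffer-size reasoning concrete, here is a small standard-library sketch that counts how many reads are needed to drain a payload at a given buffer capacity. It uses an in-memory `Cursor` as an assumed stand-in for the spill file, and `count_reads` / `reads_for_spill` are hypothetical helpers; a real async file stream may return short reads, so treat the counts as a lower bound rather than a measurement.

```rust
use std::io::{Cursor, Read};

/// Counts the non-empty reads needed to drain `reader` when each read
/// is given a buffer of `capacity` bytes.
fn count_reads<R: Read>(mut reader: R, capacity: usize) -> std::io::Result<usize> {
    let mut buf = vec![0u8; capacity];
    let mut reads = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            return Ok(reads); // end of input
        }
        reads += 1;
    }
}

/// Convenience wrapper: drains an in-memory payload of `total` bytes.
fn reads_for_spill(total: usize, capacity: usize) -> usize {
    count_reads(Cursor::new(vec![0u8; total]), capacity)
        .expect("in-memory reads cannot fail")
}
```

For a 1 MiB payload, an 8 KiB buffer needs 128 reads while a 128 KiB buffer needs 8, which is the kind of reduction in per-chunk overhead a larger stream capacity is meant to buy.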
|
run benchmarks spill_io |
alamb
left a comment
I took a look at the new code and it looks good to me. I'll plan to merge this PR subject to the benchmarks looking good
Thanks @pantShrey
Ok(file) => Box::pin(
    tokio_util::io::ReaderStream::new(file)
        .map(|r| r.map_err(DataFusionError::IoError)),
// Use a 1MB read buffer. The default 8KB causes excessive async
|
run benchmark external_aggr sort_tpch spill_io |
|
🤖 Benchmark running (GKE): comparing abstract-spill-file (4b93040) to 367f08e (merge-base), diff using: spill_io
|
🤖 Benchmark running (GKE): comparing abstract-spill-file (4b93040) to 367f08e (merge-base), diff using: external_aggr
|
🤖 Benchmark running (GKE): comparing abstract-spill-file (4b93040) to 367f08e (merge-base), diff using: sort_tpch
|
🤖 Benchmark completed (GKE): spill_io, base (merge-base) vs. branch
|
🤖 Benchmark completed (GKE): sort_tpch, base (merge-base) vs. branch
|
🤖 Benchmark completed (GKE): external_aggr, base (merge-base) vs. branch
|
All right -- I think this is looking good and I put it in the merge queue |
|
@alamb Thank you so much for all your time and guidance on this one! I learned a ton working through the architecture with you. I really appreciate your help throughout the whole process and for getting this merged! |
|
Thank you to the both of you!! 🙏 |
Which issue does this PR close?

Rationale for this change

DataFusion's spill infrastructure is tightly coupled to OS-level files, with no extension points for alternative storage backends. DiskManager cannot be customized for file creation, and IPCStreamWriter depends on OS file paths.

This prevents integration in environments where temporary storage must be managed by the host system. For example, Postgres extensions (e.g., ParadeDB) require spill files to go through BufFile APIs to respect temp_tablespaces, enforce temp_file_limit, and integrate with transaction-scoped cleanup. Since BufFile has no OS-visible path, it cannot work with the current design.

A secondary motivation raised by @alamb is supporting object storage backends (S3, GCS) for spilling, which require async IO and cannot use std::io::Write or std::io::Read.

What changes are included in this PR?

- SpillFile, SpillWriter, and TempFileFactory traits to abstract spill file handling
- DiskManagerMode::Custom to allow pluggable backends
- DiskManager to return Arc<dyn SpillFile> instead of OS-bound types
- SpillWriteAdapter to bridge sync Arrow writers with backend-agnostic writers
- Stream<Item = Result<Bytes>>) instead of blocking state machines
- Arc<dyn SpillFile>

Are these changes tested?

Yes. Existing spill tests cover the full read/write flow.

- test_disk_usage_decreases_as_files_consumed by correcting a pre-existing off-by-one assumption in file rotation
- test_preserve_order_with_spilling by just asserting spilling occurs (spill_count > 0) and output batches are sorted

Are there any user-facing changes?

Yes, this introduces API changes:

- Arc<dyn SpillFile> instead of RefCountedTempFile
- SpillFile, SpillWriter, TempFileFactory
- DiskManagerMode::Custom for custom backends

Custom spill backends can now be implemented and plugged in via DiskManager.
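
As a rough illustration of the pluggable-backend idea, the sketch below defines simplified stand-ins for the three traits and an in-memory backend behind them. The trait names come from this PR, but every method signature here is an assumption made for the example: the thread only shows `SpillWriter: std::io::Write + Send` and an async `read_stream`, which is replaced by a blocking `read_all` to keep the code standard-library only. `MemSpillFile`, `MemSpillWriter`, `MemFactory`, and `spill_roundtrip` are hypothetical.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};

// Simplified stand-ins, NOT DataFusion's actual trait definitions.
trait SpillWriter: Write + Send {
    /// Assumed finalization hook (e.g. completing an upload).
    fn finish(self: Box<Self>) -> io::Result<()>;
}

trait SpillFile: Send + Sync {
    fn writer(&self) -> io::Result<Box<dyn SpillWriter>>;
    /// Blocking stand-in for the async `read_stream` shown in the thread.
    fn read_all(&self) -> io::Result<Vec<u8>>;
}

trait TempFileFactory: Send + Sync {
    fn create_spill_file(&self) -> io::Result<Arc<dyn SpillFile>>;
}

type SharedBuf = Arc<Mutex<Vec<u8>>>;

/// Hypothetical in-memory backend: the "file" is a shared byte buffer.
struct MemSpillFile {
    buf: SharedBuf,
}

struct MemSpillWriter {
    buf: SharedBuf,
}

impl Write for MemSpillWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        self.buf.lock().unwrap().extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl SpillWriter for MemSpillWriter {
    fn finish(self: Box<Self>) -> io::Result<()> {
        Ok(()) // nothing to finalize for an in-memory buffer
    }
}

impl SpillFile for MemSpillFile {
    fn writer(&self) -> io::Result<Box<dyn SpillWriter>> {
        Ok(Box::new(MemSpillWriter { buf: Arc::clone(&self.buf) }))
    }

    fn read_all(&self) -> io::Result<Vec<u8>> {
        Ok(self.buf.lock().unwrap().clone())
    }
}

struct MemFactory;

impl TempFileFactory for MemFactory {
    fn create_spill_file(&self) -> io::Result<Arc<dyn SpillFile>> {
        Ok(Arc::new(MemSpillFile { buf: SharedBuf::default() }))
    }
}

/// Writes `payload` through whatever backend the factory provides,
/// then reads it back.
fn spill_roundtrip(factory: &dyn TempFileFactory, payload: &[u8]) -> io::Result<Vec<u8>> {
    let file = factory.create_spill_file()?;
    let mut writer = file.writer()?;
    writer.write_all(payload)?;
    writer.finish()?;
    file.read_all()
}
```

The caller in `spill_roundtrip` only sees trait objects, which is the property that lets a host such as a Postgres extension substitute its own storage without touching the operators that spill.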