engine: have sync() actually fsync the active log (#395)#400
engine: have sync() actually fsync the active log (#395)#4001fanwang wants to merge 2 commits intotikv:masterfrom
Conversation
Engine::sync() previously delegated to Engine::write(LogBatch::default(),
true). Engine::write begins with
if log_batch.is_empty() {
return Ok(0);
}
so the empty default batch short-circuited before reaching the sync
branch on line 179. The net effect: every Engine::sync() call returned
Ok(()) without issuing fdatasync on the Append log queue at all, which
breaks the durability contract the public API documents.
Fix:
- Engine::sync() now calls self.pipe_log.sync(LogQueue::Append) directly,
bypassing the empty-batch early return entirely. As a side benefit it
surfaces sync errors to the caller instead of relying on the panic in
Engine::write (which only fires when sync() goes through the write
group path with non-empty data).
- Engine::write() also honours sync=true on empty batches by funnelling
through Engine::sync(), so any other call site that passes an empty
LogBatch with sync=true also gets the fsync it asked for rather than
silently no-opping.
Test:
A failpoint regression test in tests/failpoints/test_engine.rs uses the
existing log_fd::sync::err failpoint as a witness for the sync code path
being reached. With the bug present, Engine::sync() returns Ok(()) even
with the failpoint armed because pipe_log.sync is never called. With the
fix, the injected fsync error propagates out of Engine::sync(), proving
the durability path is exercised end-to-end.
Fixes tikv#395
Signed-off-by: 1fanwang <1fannnw@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe PR ensures durability when requesting sync with an empty batch: Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/failpoints/test_engine.rs (1)
1238-1257: LGTM on the regression test scaffolding.Good first assertion that
engine.sync()succeeds without the failpoint on a fresh engine (covers the no-op-on-fresh-engine case mentioned in the PR description) and the failpoint comment is informative.One small gap: the test only exercises the
Engine::sync()direct call path. The other new path introduced in this PR —Engine::write(&mut empty_batch, true)delegating toEngine::sync()— isn't directly covered here.test_empty_batchinsrc/engine.rswrites empty batches withsync=true, but it doesn't validate that an fsync is actually issued. Consider adding a parallel assertion in this test that callsengine.write(&mut LogBatch::default(), true)under the same failpoint to lock in that behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/failpoints/test_engine.rs` around lines 1238 - 1257, Add a second assertion in test_sync_empty_batch_actually_syncs to exercise the write->sync delegation path: enable the same failpoint (log_fd::sync::err) and call engine.write(&mut LogBatch::default(), true), asserting that it returns the injected fsync error just like engine.sync(); this ensures Engine::write (when passed an empty batch with sync=true) actually delegates to Engine::sync and triggers the fsync code path. Locate the test function test_sync_empty_batch_actually_syncs and the existing engine.sync() assertion, reuse the failpoint setup/comments, and add the parallel call to Engine::write with LogBatch::default() and the same error expectation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/engine.rs`:
- Around line 244-249: Update the Engine::sync doc comment to note the current
asymmetry: Engine::sync() returns Err from pipe_log.sync(LogQueue::Append) but
Engine::write(..., sync=true) still calls
self.pipe_log.sync(LogQueue::Append).expect("pipe::sync()") and will panic on
the same error path; mention both Engine::sync and Engine::write by name and
reference the pipe_log.sync(LogQueue::Append) call so callers know Err is only
reachable via Engine::sync() (follow-up could change Engine::write to propagate
the error).
In `@tests/failpoints/test_engine.rs`:
- Around line 1262-1272: The test fails because LogFileWriter::sync calls
self.handle.sync().unwrap() and panics on the injected failpoint, so change the
test to catch the unwind like the other tests: wrap the engine.sync() invocation
in catch_unwind_silent (or std::panic::catch_unwind) instead of calling .err();
assert that a panic occurred and that the panic payload or formatted message
contains "sync"; keep the FailGuard::new("log_fd::sync::err", "return") setup
and use the same message-check logic on the caught panic payload to verify the
fsync error surfaced.
---
Nitpick comments:
In `@tests/failpoints/test_engine.rs`:
- Around line 1238-1257: Add a second assertion in
test_sync_empty_batch_actually_syncs to exercise the write->sync delegation
path: enable the same failpoint (log_fd::sync::err) and call engine.write(&mut
LogBatch::default(), true), asserting that it returns the injected fsync error
just like engine.sync(); this ensures Engine::write (when passed an empty batch
with sync=true) actually delegates to Engine::sync and triggers the fsync code
path. Locate the test function test_sync_empty_batch_actually_syncs and the
existing engine.sync() assertion, reuse the failpoint setup/comments, and add
the parallel call to Engine::write with LogBatch::default() and the same error
expectation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 91dc939a-cb26-4b36-b613-53ff16eb0b70
📒 Files selected for processing (2)
src/engine.rstests/failpoints/test_engine.rs
| // first and skipped pipe_log.sync entirely. | ||
| let _guard = FailGuard::new("log_fd::sync::err", "return"); | ||
| let err = engine | ||
| .sync() | ||
| .err() | ||
| .expect("engine.sync() must reach pipe_log.sync() and surface the fsync error"); | ||
| let msg = format!("{err}"); | ||
| assert!( | ||
| msg.to_lowercase().contains("sync"), | ||
| "unexpected error: {msg}", | ||
| ); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect the current implementation of LogFileWriter::sync and the call chain
# from FilePipeLog::sync down to LogFd::sync to confirm whether the
# `log_fd::sync::err` failpoint propagates as Err or panics via .unwrap().
set -e
echo "=== LogFileWriter::sync (expected to propagate Err, not unwrap) ==="
fd -t f 'log_file.rs' | xargs -I{} sh -c 'echo "--- {} ---"; cat {}' | sed -n '/impl[[:space:]]*LogFileWriter/,/^}$/p' | sed -n '/fn[[:space:]]\+sync/,/^[[:space:]]*}/p'
echo
echo "=== Any .unwrap()/.expect() on .sync() in the file_pipe_log layer ==="
rg -nP '\.sync\(\)\s*\.\s*(unwrap|expect)\b' --type=rust
echo
echo "=== FilePipeLog::sync call chain ==="
ast-grep --pattern $'fn sync(&self, queue: LogQueue) -> Result<()> {
$$$
}'
echo
echo "=== LogFd::sync failpoint definition (sanity check) ==="
rg -nP -C2 'log_fd::sync::err' --type=rustRepository: tikv/raft-engine
Length of output: 3618
🏁 Script executed:
# Get the full context of LogFileWriter::sync
sed -n '115,135p' src/file_pipe_log/log_file.rsRepository: tikv/raft-engine
Length of output: 526
🏁 Script executed:
# Get the LogFileWriter struct definition to see what self.handle is
grep -A 10 'struct LogFileWriter' src/file_pipe_log/log_file.rs | head -20Repository: tikv/raft-engine
Length of output: 395
🏁 Script executed:
# Check what type self.handle has and verify the call chain
rg -B 5 -A 8 'pub fn sync\(&mut self\)' src/file_pipe_log/log_file.rsRepository: tikv/raft-engine
Length of output: 403
🏁 Script executed:
# Check test_engine.rs around the test to see full context and any test attributes
sed -n '1235,1275p' tests/failpoints/test_engine.rsRepository: tikv/raft-engine
Length of output: 1555
🏁 Script executed:
# Check if there are any other tests for log_fd::sync::err to understand expected behavior
rg -B 3 -A 10 'log_fd::sync::err' tests/failpoints/test_io_error.rs | head -50Repository: tikv/raft-engine
Length of output: 1559
The test cannot pass as written because LogFileWriter::sync panics on the injected error.
The failpoint log_fd::sync::err injects an Err into LogFd::sync(), but the call chain passes through LogFileWriter::sync (line 121–124), which calls self.handle.sync().unwrap(). This .unwrap() will panic when the failpoint triggers, preventing the error from propagating up to engine.sync().err().
The test expects engine.sync().err() to return Some(err), but instead the panic from LogFileWriter::sync will unwind the stack. Other tests using the same failpoint (e.g., test_io_error.rs lines 104–106, 159–161, 211–213) correctly use catch_unwind_silent() to handle the panic.
Either LogFileWriter::sync must be changed to propagate the error instead of unwrapping, or this test must catch the panic using catch_unwind rather than asserting on .err().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/failpoints/test_engine.rs` around lines 1262 - 1272, The test fails
because LogFileWriter::sync calls self.handle.sync().unwrap() and panics on the
injected failpoint, so change the test to catch the unwind like the other tests:
wrap the engine.sync() invocation in catch_unwind_silent (or
std::panic::catch_unwind) instead of calling .err(); assert that a panic
occurred and that the panic payload or formatted message contains "sync"; keep
the FailGuard::new("log_fd::sync::err", "return") setup and use the same
message-check logic on the caught panic payload to verify the fsync error
surfaced.
There was a problem hiding this comment.
Good catch — verified, you're right. LogFileWriter::sync does self.handle.sync().unwrap() (src/file_pipe_log/log_file.rs:124), so the failpoint surfaces as a panic, not an Err. Pushed 3f6c35e which switches the assertion to catch_unwind_silent matching the pattern in test_io_error.rs (lines 109, 161, 213, 386). The witness is unchanged: with the fix engine.sync() reaches LogFileWriter::sync and the .unwrap() panics on the injected error; without the fix the empty-batch early return means LogFileWriter::sync is never invoked and no panic occurs.
There was a problem hiding this comment.
Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.
Address coderabbitai review on tikv#400: LogFileWriter::sync calls self.handle.sync().unwrap() (src/file_pipe_log/log_file.rs:124), so the log_fd::sync::err failpoint surfaces as a panic during stack unwind, not as an Err from Engine::sync(). The previous test asserted on engine.sync().err(), which would never observe the panic and would fail the unwrap on it. Switch to catch_unwind_silent matching the existing pattern in test_io_error.rs (lines 109, 161, 213, 386). The witness is now: with the fix, engine.sync() reaches LogFileWriter::sync where the .unwrap() panics on the injected error; without the fix, the empty-batch early return in Engine::write means LogFileWriter::sync is never invoked and no panic occurs, so catch_unwind would observe Ok and the assertion would fail. Also document the sync error-propagation asymmetry in Engine::sync's doc comment so callers know Err is reachable through Engine::sync but not through the in-write sync hop (which still .expect()s). Signed-off-by: 1fanwang <1fannnw@gmail.com>
|
Hi @1fanwang. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Welcome @1fanwang! |
[LGTM Timeline notifier]Timeline:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tabokie The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Problem
Engine::sync()is documented as "Synchronizes the Raft engine" — i.e. flush previously-appended writes to durable storage. But it never actually callsfdatasync:The empty default batch short-circuits via the
is_empty()early return on line 143, sopipe_log.sync(LogQueue::Append)on line 179 is never invoked. EveryEngine::sync()call returnsOk(())without fsync — silent durability loss.This was reported by @AustenSchunk in #395 with full repro analysis.
Fix
Engine::sync()now callsself.pipe_log.sync(LogQueue::Append)directly. As a side benefit, sync errors propagate to the caller instead of relying on the panic guard insideEngine::write's write-group path.Engine::write()also honourssync=trueon empty batches by funnelling throughEngine::sync(), so any other call site that passes an emptyLogBatchwithsync=true(now or in the future) gets the fsync it asked for rather than silently no-opping.Test
tests/failpoints/test_engine.rs::test_sync_empty_batch_actually_syncsuses the existinglog_fd::sync::errfailpoint as a witness for the sync code path being reached:Engine::sync()returnsOk(())even with the failpoint armed becausepipe_log.syncis never called.Engine::sync(), proving the durability path is exercised end-to-end.The test also asserts the no-failpoint case still returns
Ok(())against a freshly-opened engine, so the fix doesn't accidentally start surfacing benign fsync activity as errors.Closes
Fixes #395.
Notes
I couldn't run the failpoints test locally —
cargo test --test failpoints --features failpointspulls ingrpcio-sys, which fails to build on a fresh macOS arm64 toolchain (CMake policy mismatch with the bundled gRPC). The non-failpoint test suite (cargo test) passes; happy to add anything else if CI surfaces a gap.Summary by CodeRabbit
Bug Fixes
Documentation
Tests