engine: have sync() actually fsync the active log (#395) by 1fanwang · Pull Request #400 · tikv/raft-engine

1fanwang · 2026-04-27T20:31:31Z

Problem

Engine::sync() is documented as "Synchronizes the Raft engine" — i.e. flush previously-appended writes to durable storage. But it never actually calls fdatasync:

// src/engine.rs (master, line 235)
pub fn sync(&self) -> Result<()> {
    self.write(&mut LogBatch::default(), true)?;
    Ok(())
}

// src/engine.rs (master, line 142-145)
pub fn write(&self, log_batch: &mut LogBatch, mut sync: bool) -> Result<usize> {
    if log_batch.is_empty() {
        return Ok(0);
    }
    // …sync branch on line 179 is never reached for the empty default batch
}

The empty default batch short-circuits via the is_empty() early return on line 143, so pipe_log.sync(LogQueue::Append) on line 179 is never invoked. Every Engine::sync() call returns Ok(()) without fsync — silent durability loss.

This was reported by @AustenSchunk in #395 with full repro analysis.

Fix

Engine::sync() now calls self.pipe_log.sync(LogQueue::Append) directly. As a side benefit, sync errors propagate to the caller instead of relying on the panic guard inside Engine::write's write-group path.
Engine::write() also honours sync=true on empty batches by funnelling through Engine::sync(), so any other call site that passes an empty LogBatch with sync=true (now or in the future) gets the fsync it asked for rather than silently no-opping.

Test

tests/failpoints/test_engine.rs::test_sync_empty_batch_actually_syncs uses the existing log_fd::sync::err failpoint as a witness for the sync code path being reached:

With the bug present, Engine::sync() returns Ok(()) even with the failpoint armed because pipe_log.sync is never called.
With the fix, the injected fsync error propagates out of Engine::sync(), proving the durability path is exercised end-to-end.

The test also asserts the no-failpoint case still returns Ok(()) against a freshly-opened engine, so the fix doesn't accidentally start surfacing benign fsync activity as errors.

Closes

Fixes #395.

Notes

I couldn't run the failpoints test locally — cargo test --test failpoints --features failpoints pulls in grpcio-sys, which fails to build on a fresh macOS arm64 toolchain (CMake policy mismatch with the bundled gRPC). The non-failpoint test suite (cargo test) passes; happy to add anything else if CI surfaces a gap.

Summary by CodeRabbit

Bug Fixes
- Durability improved: explicit flush now occurs even when no new data is appended, ensuring requested syncs are honored.
Documentation
- Clarified behavior around durability and when flushes are performed (including empty-write scenarios).
Tests
- Added a regression test that verifies the sync path is exercised in the empty-batch + sync case.

Engine::sync() previously delegated to Engine::write(LogBatch::default(), true). Engine::write begins with if log_batch.is_empty() { return Ok(0); } so the empty default batch short-circuited before reaching the sync branch on line 179. The net effect: every Engine::sync() call returned Ok(()) without issuing fdatasync on the Append log queue at all, which breaks the durability contract the public API documents. Fix: - Engine::sync() now calls self.pipe_log.sync(LogQueue::Append) directly, bypassing the empty-batch early return entirely. As a side benefit it surfaces sync errors to the caller instead of relying on the panic in Engine::write (which only fires when sync() goes through the write group path with non-empty data). - Engine::write() also honours sync=true on empty batches by funnelling through Engine::sync(), so any other call site that passes an empty LogBatch with sync=true also gets the fsync it asked for rather than silently no-opping. Test: A failpoint regression test in tests/failpoints/test_engine.rs uses the existing log_fd::sync::err failpoint as a witness for the sync code path being reached. With the bug present, Engine::sync() returns Ok(()) even with the failpoint armed because pipe_log.sync is never called. With the fix, the injected fsync error propagates out of Engine::sync(), proving the durability path is exercised end-to-end. Fixes tikv#395 Signed-off-by: 1fanwang <1fannnw@gmail.com>

coderabbitai · 2026-04-27T20:31:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 50771310-9545-406a-b557-91ebd47cd46a

📥 Commits

Reviewing files that changed from the base of the PR and between 5cfc38a and 3f6c35e.

📒 Files selected for processing (2)

src/engine.rs
tests/failpoints/test_engine.rs

🚧 Files skipped from review as they are similar to previous changes (1)

tests/failpoints/test_engine.rs

📝 Walkthrough

Walkthrough

The PR ensures durability when requesting sync with an empty batch: Engine::write now invokes Engine::sync when sync=true even if log_batch is empty, and Engine::sync is reimplemented to call pipe_log.sync(LogQueue::Append) directly.

Changes

Cohort / File(s)	Summary
Engine Implementation `src/engine.rs`	`Engine::write(&self, log_batch: &mut LogBatch, sync: bool)` now honors `sync=true` for empty `log_batch` by calling `Engine::sync()` before returning. `Engine::sync(&self)` is reimplemented to call `self.pipe_log.sync(LogQueue::Append)` directly; docs updated to clarify durability behavior.
Failpoint Regression Test `tests/failpoints/test_engine.rs`	Added `test_sync_empty_batch_actually_syncs()` which asserts `engine.sync()` exercises the underlying fsync path (verifies success, then injects a failpoint to ensure sync triggers the underlying error/panic).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped in code both swift and spry,
An empty batch no longer sly.
I nudged the pipe to call the sync,
Now fsync hears my tiny clink.
Joyful thump — persistence is nigh! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: making Engine::sync() actually perform fsync on the active log, which is the core correctness bug being fixed.
Linked Issues check	✅ Passed	All coding requirements from issue `#395` are met: Engine::sync() now directly calls pipe_log.sync(), Engine::write() honors sync=true for empty batches, and a regression test verifies the fsync path is exercised.
Out of Scope Changes check	✅ Passed	All changes are narrowly scoped to fixing the Engine::sync() durability bug and adding a regression test; no unrelated modifications are present.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/failpoints/test_engine.rs (1)
1238-1257: LGTM on the regression test scaffolding.

Good first assertion that engine.sync() succeeds without the failpoint on a fresh engine (covers the no-op-on-fresh-engine case mentioned in the PR description) and the failpoint comment is informative.

One small gap: the test only exercises the Engine::sync() direct call path. The other new path introduced in this PR — Engine::write(&mut empty_batch, true) delegating to Engine::sync() — isn't directly covered here. test_empty_batch in src/engine.rs writes empty batches with sync=true, but it doesn't validate that an fsync is actually issued. Consider adding a parallel assertion in this test that calls engine.write(&mut LogBatch::default(), true) under the same failpoint to lock in that behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/failpoints/test_engine.rs` around lines 1238 - 1257, Add a second
assertion in test_sync_empty_batch_actually_syncs to exercise the write->sync
delegation path: enable the same failpoint (log_fd::sync::err) and call
engine.write(&mut LogBatch::default(), true), asserting that it returns the
injected fsync error just like engine.sync(); this ensures Engine::write (when
passed an empty batch with sync=true) actually delegates to Engine::sync and
triggers the fsync code path. Locate the test function
test_sync_empty_batch_actually_syncs and the existing engine.sync() assertion,
reuse the failpoint setup/comments, and add the parallel call to Engine::write
with LogBatch::default() and the same error expectation.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/engine.rs`:
- Around line 244-249: Update the Engine::sync doc comment to note the current
asymmetry: Engine::sync() returns Err from pipe_log.sync(LogQueue::Append) but
Engine::write(..., sync=true) still calls
self.pipe_log.sync(LogQueue::Append).expect("pipe::sync()") and will panic on
the same error path; mention both Engine::sync and Engine::write by name and
reference the pipe_log.sync(LogQueue::Append) call so callers know Err is only
reachable via Engine::sync() (follow-up could change Engine::write to propagate
the error).

In `@tests/failpoints/test_engine.rs`:
- Around line 1262-1272: The test fails because LogFileWriter::sync calls
self.handle.sync().unwrap() and panics on the injected failpoint, so change the
test to catch the unwind like the other tests: wrap the engine.sync() invocation
in catch_unwind_silent (or std::panic::catch_unwind) instead of calling .err();
assert that a panic occurred and that the panic payload or formatted message
contains "sync"; keep the FailGuard::new("log_fd::sync::err", "return") setup
and use the same message-check logic on the caught panic payload to verify the
fsync error surfaced.

---

Nitpick comments:
In `@tests/failpoints/test_engine.rs`:
- Around line 1238-1257: Add a second assertion in
test_sync_empty_batch_actually_syncs to exercise the write->sync delegation
path: enable the same failpoint (log_fd::sync::err) and call engine.write(&mut
LogBatch::default(), true), asserting that it returns the injected fsync error
just like engine.sync(); this ensures Engine::write (when passed an empty batch
with sync=true) actually delegates to Engine::sync and triggers the fsync code
path. Locate the test function test_sync_empty_batch_actually_syncs and the
existing engine.sync() assertion, reuse the failpoint setup/comments, and add
the parallel call to Engine::write with LogBatch::default() and the same error
expectation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 91dc939a-cb26-4b36-b613-53ff16eb0b70

📥 Commits

Reviewing files that changed from the base of the PR and between f0fcebe and 5cfc38a.

📒 Files selected for processing (2)

src/engine.rs
tests/failpoints/test_engine.rs

coderabbitai · 2026-04-27T20:36:02Z

+    // first and skipped pipe_log.sync entirely.
+    let _guard = FailGuard::new("log_fd::sync::err", "return");
+    let err = engine
+        .sync()
+        .err()
+        .expect("engine.sync() must reach pipe_log.sync() and surface the fsync error");
+    let msg = format!("{err}");
+    assert!(
+        msg.to_lowercase().contains("sync"),
+        "unexpected error: {msg}",
+    );


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Inspect the current implementation of LogFileWriter::sync and the call chain # from FilePipeLog::sync down to LogFd::sync to confirm whether the # `log_fd::sync::err` failpoint propagates as Err or panics via .unwrap(). set -e echo "=== LogFileWriter::sync (expected to propagate Err, not unwrap) ===" fd -t f 'log_file.rs' | xargs -I{} sh -c 'echo "--- {} ---"; cat {}' | sed -n '/impl[[:space:]]*LogFileWriter/,/^}$/p' | sed -n '/fn[[:space:]]\+sync/,/^[[:space:]]*}/p' echo echo "=== Any .unwrap()/.expect() on .sync() in the file_pipe_log layer ===" rg -nP '\.sync\s*\.\s*(unwrap|expect)\b' --type=rust echo echo "=== FilePipeLog::sync call chain ===" ast-grep --pattern $'fn sync(&self, queue: LogQueue) -> Result<()> { $$$ }' echo echo "=== LogFd::sync failpoint definition (sanity check) ===" rg -nP -C2 'log_fd::sync::err' --type=rust

Repository: tikv/raft-engine

Length of output: 3618

🏁 Script executed:

# Get the full context of LogFileWriter::sync sed -n '115,135p' src/file_pipe_log/log_file.rs

Repository: tikv/raft-engine

Length of output: 526

🏁 Script executed:

# Get the LogFileWriter struct definition to see what self.handle is grep -A 10 'struct LogFileWriter' src/file_pipe_log/log_file.rs | head -20

Repository: tikv/raft-engine

Length of output: 395

🏁 Script executed:

# Check what type self.handle has and verify the call chain rg -B 5 -A 8 'pub fn sync$&mut self$' src/file_pipe_log/log_file.rs

Repository: tikv/raft-engine

Length of output: 403

🏁 Script executed:

# Check test_engine.rs around the test to see full context and any test attributes sed -n '1235,1275p' tests/failpoints/test_engine.rs

Repository: tikv/raft-engine

Length of output: 1555

🏁 Script executed:

# Check if there are any other tests for log_fd::sync::err to understand expected behavior rg -B 3 -A 10 'log_fd::sync::err' tests/failpoints/test_io_error.rs | head -50

Repository: tikv/raft-engine

Length of output: 1559

The test cannot pass as written because LogFileWriter::sync panics on the injected error.

The failpoint log_fd::sync::err injects an Err into LogFd::sync(), but the call chain passes through LogFileWriter::sync (line 121–124), which calls self.handle.sync().unwrap(). This .unwrap() will panic when the failpoint triggers, preventing the error from propagating up to engine.sync().err().

The test expects engine.sync().err() to return Some(err), but instead the panic from LogFileWriter::sync will unwind the stack. Other tests using the same failpoint (e.g., test_io_error.rs lines 104–106, 159–161, 211–213) correctly use catch_unwind_silent() to handle the panic.

Either LogFileWriter::sync must be changed to propagate the error instead of unwrapping, or this test must catch the panic using catch_unwind rather than asserting on .err().

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/failpoints/test_engine.rs` around lines 1262 - 1272, The test fails because LogFileWriter::sync calls self.handle.sync().unwrap() and panics on the injected failpoint, so change the test to catch the unwind like the other tests: wrap the engine.sync() invocation in catch_unwind_silent (or std::panic::catch_unwind) instead of calling .err(); assert that a panic occurred and that the panic payload or formatted message contains "sync"; keep the FailGuard::new("log_fd::sync::err", "return") setup and use the same message-check logic on the caught panic payload to verify the fsync error surfaced.

Good catch — verified, you're right. LogFileWriter::sync does self.handle.sync().unwrap() (src/file_pipe_log/log_file.rs:124), so the failpoint surfaces as a panic, not an Err. Pushed 3f6c35e which switches the assertion to catch_unwind_silent matching the pattern in test_io_error.rs (lines 109, 161, 213, 386). The witness is unchanged: with the fix engine.sync() reaches LogFileWriter::sync and the .unwrap() panics on the injected error; without the fix the empty-batch early return means LogFileWriter::sync is never invoked and no panic occurs.

Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.

Address coderabbitai review on tikv#400: LogFileWriter::sync calls self.handle.sync().unwrap() (src/file_pipe_log/log_file.rs:124), so the log_fd::sync::err failpoint surfaces as a panic during stack unwind, not as an Err from Engine::sync(). The previous test asserted on engine.sync().err(), which would never observe the panic and would fail the unwrap on it. Switch to catch_unwind_silent matching the existing pattern in test_io_error.rs (lines 109, 161, 213, 386). The witness is now: with the fix, engine.sync() reaches LogFileWriter::sync where the .unwrap() panics on the injected error; without the fix, the empty-batch early return in Engine::write means LogFileWriter::sync is never invoked and no panic occurs, so catch_unwind would observe Ok and the assertion would fail. Also document the sync error-propagation asymmetry in Engine::sync's doc comment so callers know Err is reachable through Engine::sync but not through the in-write sync hop (which still .expect()s). Signed-off-by: 1fanwang <1fannnw@gmail.com>

ti-chi-bot · 2026-04-27T21:12:37Z

Hi @1fanwang. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot · 2026-04-27T21:12:43Z

Welcome @1fanwang!

It looks like this is your first PR to tikv/raft-engine 🎉.

I'm the bot to help you request reviewers, add labels and more, See available commands.

We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to tikv/raft-engine. 😃

tabokie

Thanks a lot!

ti-chi-bot · 2026-04-28T07:56:42Z

[LGTM Timeline notifier]

Timeline:

2026-04-28 07:56:41.751382209 +0000 UTC m=+2671006.956742256: ☑️ agreed by tabokie.

ti-chi-bot · 2026-04-28T07:56:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tabokie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [tabokie]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 27, 2026

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. contribution This PR is from a community contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 27, 2026

ti-chi-bot Bot added the first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. label Apr 27, 2026

tabokie approved these changes Apr 28, 2026

View reviewed changes

ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 28, 2026

ti-chi-bot Bot added the approved label Apr 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

engine: have sync() actually fsync the active log (#395)#400

engine: have sync() actually fsync the active log (#395)#400
1fanwang wants to merge 2 commits intotikv:masterfrom
1fanwang:fix/sync-empty-batch-no-flush-issue-395

1fanwang commented Apr 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 27, 2026 •

edited

Loading

Uh oh!

1fanwang Apr 27, 2026

Uh oh!

coderabbitai Bot Apr 27, 2026

Uh oh!

ti-chi-bot Bot commented Apr 27, 2026

Uh oh!

ti-chi-bot Bot commented Apr 27, 2026

Uh oh!

tabokie left a comment

Uh oh!

ti-chi-bot Bot commented Apr 28, 2026

Uh oh!

ti-chi-bot Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

1fanwang commented Apr 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Test

Closes

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

1fanwang Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot Bot commented Apr 27, 2026

Uh oh!

ti-chi-bot Bot commented Apr 27, 2026

Uh oh!

tabokie left a comment

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot Bot commented Apr 28, 2026

[LGTM Timeline notifier]

Uh oh!

ti-chi-bot Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1fanwang commented Apr 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

coderabbitai Bot Apr 27, 2026 •

edited

Loading