pagecache/vfs: ZFS page-cache bridge, readahead, writeback, fsync flush #1398
Open
gburd wants to merge 3 commits into
Three pagecache changes that work together to make the OSv VFS
page cache cooperate with OpenZFS:
- Expose C-linkage helpers (osv_pagecache_map_page, osv_pagecache_*)
so the OpenZFS vop_cache implementation in zfs_vnops_os.c can
register and look up cached pages without dragging the C++
pagecache headers into kernel-module sources. Also fixes a
GCC 14 ambiguity error on a templated helper.
- Remove the original ARC-bridge code path that tried to share
pages between the ZFS ARC and the OSv read_cache. It was never
reachable: IS_ZFS() always returned false on OSv (m_fsid
distinct), and the bridge data structures were only initialised
on the unreachable branch. Document the design decision in a
comment block above the (now removed) site so future readers
don't try the same approach again.
- Add sequential readahead and a periodic writeback worker. The
readahead window grows on consecutive cache hits and resets on
seek; the writeback worker flushes dirty pages every 5 s with
a global cap so dirty pages can't accumulate without bound.
Verified by tst-mmap-file, tst-zfs-direct-io, and tst-fs-bench.
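The readahead policy described in the list above (the window grows on consecutive hits and resets on seek) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `readahead_state` type, the doubling growth rule, the 32-page cap, and the use of "next expected page index" to detect sequential access are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-file readahead state. Names and constants are
// illustrative and are not taken from the OSv sources.
struct readahead_state {
    static constexpr std::size_t min_window = 1;   // pages
    static constexpr std::size_t max_window = 32;  // pages (assumed cap)

    std::uint64_t next_expected = 0;  // page index a sequential reader touches next
    std::size_t window = min_window;  // pages to prefetch after the current access

    // Record an access to page `idx` and return how many pages to
    // prefetch after it.
    std::size_t on_access(std::uint64_t idx) {
        if (idx == next_expected) {
            // Sequential access: double the window, bounded by the cap.
            window = (window * 2 < max_window) ? window * 2 : max_window;
        } else {
            // Seek: drop back to the minimum window.
            window = min_window;
        }
        next_expected = idx + 1;
        assert(window >= min_window && window <= max_window);
        return window;
    }
};
```

Under these assumptions a reader touching pages 0, 1, 2 sees the window go 2, 4, 8, and a jump to an unrelated page drops it back to 1.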
sys_fsync() called VOP_FSYNC directly without first flushing the OSv page cache. Dirty pages held in the cache were never seen by the underlying filesystem's fsync hook, so a process could fsync() a file and still have the data resident only in volatile memory. Reproducer: write 64 KiB, fsync, kill the VM, restart, see zeros.

Walk the file's dirty pages, write them back via the filesystem's write op, then call VOP_FSYNC. Hold the file's f_lock across the flush so concurrent writes can't slip in between the writeback and the VOP_FSYNC call.

Verified by tst-zfs-direct-io and tst-zfs-multirec, which write, fsync, re-open, read, and memcmp. Without this fix the multi-record ZFS test produces zero-filled tail records on uncached read.
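The ordering this commit message describes (write back dirty cached pages, then call the filesystem's fsync hook, all under one lock) can be illustrated with a toy model. Everything here is hypothetical: `toy_file`, `fs_write_page`, `fs_fsync`, and `toy_fsync` merely stand in for the file object, the filesystem write op, VOP_FSYNC, and sys_fsync(), and are not the real OSv interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Toy model of a file with a page cache in front of a backing store.
// All names are hypothetical stand-ins, not OSv APIs.
struct toy_file {
    std::mutex lock;                                   // stands in for f_lock
    std::map<std::size_t, std::vector<char>> dirty;    // page index -> cached dirty data
    std::map<std::size_t, std::vector<char>> backing;  // data the filesystem has seen
    bool synced = false;                               // set by the fsync hook
};

// Stand-in for the filesystem's write op.
inline void fs_write_page(toy_file& f, std::size_t idx, const std::vector<char>& data) {
    f.backing[idx] = data;
}

// Stand-in for VOP_FSYNC: it can only make durable what the filesystem
// has already been handed.
inline void fs_fsync(toy_file& f) { f.synced = true; }

// Flush-then-sync: the lock is held across both steps so no new dirty
// page can appear between the writeback and the sync.
inline void toy_fsync(toy_file& f) {
    std::lock_guard<std::mutex> guard(f.lock);
    for (const auto& entry : f.dirty) {
        fs_write_page(f, entry.first, entry.second);
    }
    f.dirty.clear();
    assert(f.dirty.empty());
    fs_fsync(f);
}
```

Calling `fs_fsync` alone in this model would leave `backing` empty, which is the toy equivalent of the zero-filled read in the reproducer.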
bdev_read/bdev_write looped one 512-byte block at a time through the buffer cache, issuing exactly one synchronous bio per BSIZE under the global bio_lock. A 128K device-node transfer became 256 serialized round-trips, pinning throughput at QD1 regardless of caller concurrency (measured ~1600 IOPS on EBS gp3, invariant across block size).

For whole-sector transfers (offset and every iov_len a multiple of BSIZE) dispatch one bio per iovec through dev->driver->devops->strategy (multiplex_strategy), which splits by max_io_size and issues all children before waiting. N concurrent callers now keep N requests in flight. strategy() adds dev->offset exactly once, matching the prior rw_buf()->strategy() path, so partition addressing is unchanged.

The unaligned/sub-sector case keeps the buffer-cache fallback but fixes two latent bugs the removed debug-only asserts had masked under NDEBUG: reads now copy from bp->b_data + (offset % BSIZE) instead of the block start, and writes read-modify-write via bread instead of getblk so the untouched remainder of the sector is preserved.

ZFS and rofs call devops->strategy directly and never enter bdev_read, so they are unaffected; the only remaining buffer-cache consumer is the one-shot partition-table read at device attach.
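The arithmetic and the alignment test in this commit message can be sketched as below. The helper names and the simplified `io_vec` type are hypothetical, `bsize` mirrors the 512-byte BSIZE mentioned above, and any `max_io_size` passed in is an assumed, driver-dependent value; none of this is the actual bdev code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t bsize = 512;  // sector size (BSIZE in the description)

// Simplified, hypothetical iovec: only the length matters here.
struct io_vec {
    std::size_t len;
};

// Fast-path eligibility: the offset and every iovec length must be a
// multiple of the sector size.
inline bool whole_sector_transfer(std::uint64_t offset, const std::vector<io_vec>& iov) {
    if (offset % bsize != 0) return false;
    for (const auto& v : iov) {
        if (v.len % bsize != 0) return false;
    }
    return true;
}

// Number of bios the old one-block-at-a-time loop issued for `bytes`.
inline std::size_t per_block_bios(std::size_t bytes) {
    return (bytes + bsize - 1) / bsize;
}

// Number of child requests when one transfer is split by max_io_size.
inline std::size_t split_children(std::size_t bytes, std::size_t max_io_size) {
    assert(max_io_size > 0);
    return (bytes + max_io_size - 1) / max_io_size;
}

// Byte offset inside the sector for an unaligned read; the fixed read
// path copies from the block data plus this offset.
inline std::size_t in_sector_offset(std::uint64_t offset) {
    return static_cast<std::size_t>(offset % bsize);
}
```

For example, `per_block_bios(128 * 1024)` gives the 256 round-trips quoted above, and an unaligned read at byte offset 1000 starts 488 bytes into its sector.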
Make the OSv VFS page cache cooperate with the OpenZFS port landing later in this series, and fix a data-durability bug in sys_fsync.

This branch bases directly on current master (3df7df7, the just-merged mmu-shm work). Two commits.

Changes

pagecache: ZFS bridging, sequential readahead, periodic writeback (54430d8).
- Expose C-linkage helpers (osv_pagecache_map_page, etc.) so the OpenZFS vop_cache in zfs_vnops_os.c can register/look up cached pages without pulling C++ pagecache headers into module sources. Also fixes a GCC 14 ambiguity on a templated helper.
- Remove the original ARC-bridge code path, which was never reachable: IS_ZFS() always returned false on OSv and the bridge structures were only initialised on the dead branch. A comment documents the decision so it isn't re-attempted.

vfs: flush page cache before VOP_FSYNC in sys_fsync (2ef4642).
- sys_fsync() called VOP_FSYNC without first flushing the OSv page cache, so dirty cached pages were never seen by the filesystem's fsync hook; a process could fsync() and still lose data on crash (reproducer: write 64 KiB, fsync, kill VM, restart, read zeros). Walk the file's dirty pages, write them back, then VOP_FSYNC, holding f_lock across the flush so concurrent writes can't slip in.

Verification

Kernel compiles and links clean on GCC 14.3 / Boost 1.87 (./scripts/build image=empty, fresh loader.elf, RC=0). The page-cache helpers are consumed by the later OpenZFS PR; this PR adds only the kernel-side surface and the fsync fix.
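For readers unfamiliar with the pattern, the "kernel-side surface" idea (a C++ cache exposed through C-linkage functions so that C module sources never include C++ headers) might look roughly like this. The `demo` namespace and the `demo_pagecache_*` functions are hypothetical and deliberately do not reproduce the names or signatures of the real osv_pagecache_* helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// C++ side: a hypothetical cache keyed by an opaque 64-bit value.
namespace demo {
class page_cache {
public:
    void insert(std::uint64_t key, void* page) { pages_[key] = page; }

    // Returns the cached page, or nullptr when the key is not present.
    void* lookup(std::uint64_t key) const {
        auto it = pages_.find(key);
        return it == pages_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<std::uint64_t, void*> pages_;
};

inline page_cache& global_cache() {
    static page_cache cache;
    return cache;
}
}  // namespace demo

// C-linkage surface: only plain C types cross the boundary, so a C
// caller needs just these two prototypes, not the class above.
extern "C" {
void demo_pagecache_insert(std::uint64_t key, void* page) {
    assert(page != nullptr);
    demo::global_cache().insert(key, page);
}

void* demo_pagecache_lookup(std::uint64_t key) {
    return demo::global_cache().lookup(key);
}
}
```

A C translation unit would declare the two functions with matching prototypes (using `uint64_t` from `<stdint.h>`) and link against the object file that defines them.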