pagecache/vfs: ZFS page-cache bridge, readahead, writeback, fsync flush by gburd · Pull Request #1398 · cloudius-systems/osv

gburd · 2026-06-24T11:19:28Z

Make the OSv VFS page cache cooperate with the OpenZFS port landing later in this series, and fix a data-durability bug in sys_fsync.

This branch bases directly on current master (3df7df7, the just-merged mmu-shm work). Two commits.

Changes

pagecache: ZFS bridging, sequential readahead, periodic writeback (54430d8).
- Expose C-linkage helpers (osv_pagecache_map_page, etc.) so the OpenZFS vop_cache in zfs_vnops_os.c can register/look up cached pages without pulling C++ pagecache headers into module sources. Also fixes a GCC 14 ambiguity on a templated helper.
- Remove the original ARC-bridge code path (shared ARC<->read_cache pages). It was never reachable: IS_ZFS() always returned false on OSv and the bridge structures were only initialised on the dead branch. A comment documents the decision so it isn't re-attempted.
- Sequential readahead (window grows on consecutive hits, resets on seek) plus a 5 s periodic writeback worker with a global dirty-page cap.
vfs: flush page cache before VOP_FSYNC in sys_fsync (2ef4642).
sys_fsync() called VOP_FSYNC without first flushing the OSv page cache, so dirty cached pages were never seen by the filesystem's fsync hook. A process could fsync() and still lose data on crash (reproducer: write 64 KiB, fsync, kill VM, restart, read zeros). The fix walks the file's dirty pages, writes them back, and then calls VOP_FSYNC, holding f_lock across the flush so concurrent writes can't slip in.

Verification

Kernel compiles and links clean on GCC 14.3 / Boost 1.87 (./scripts/build image=empty, fresh loader.elf, RC=0). The page-cache helpers are consumed by the later OpenZFS PR; this PR adds only the kernel-side surface and the fsync fix.

Three pagecache changes that work together to make the OSv VFS page cache cooperate with OpenZFS:
- Expose C-linkage helpers (osv_pagecache_map_page, osv_pagecache_*) so the OpenZFS vop_cache implementation in zfs_vnops_os.c can register and look up cached pages without dragging the C++ pagecache headers into kernel-module sources. Also fixes a GCC 14 ambiguity error on a templated helper.
- Remove the original ARC-bridge code path that tried to share pages between the ZFS ARC and the OSv read_cache. It was never reachable: IS_ZFS() always returned false on OSv (m_fsid distinct), and the bridge data structures were only initialised on the unreachable branch. Document the design decision in a comment block above the (now removed) site so future readers don't try the same approach again.
- Sequential readahead and a periodic writeback worker. The readahead window grows on consecutive cache hits and resets on seek; the writeback worker flushes dirty pages every 5 s with a global cap so dirty pages can't accumulate without bound.

Verified by tst-mmap-file, tst-zfs-direct-io, and tst-fs-bench.
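The readahead policy described above (grow on consecutive hits, reset on seek) can be sketched as a small state machine. This is an illustration only: the struct name, the doubling growth rule, and the 32-page cap are assumptions made for the example and are not taken from the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed bounds, in pages. The commit does not state the real values.
constexpr std::size_t kMinWindow = 1;
constexpr std::size_t kMaxWindow = 32;

// Hypothetical per-file readahead state (not OSv's actual structure).
struct readahead_state {
    std::uint64_t next_expected = 0;   // page index that would continue the run
    std::size_t window = kMinWindow;   // pages to prefetch past the current one

    // Record an access to `page` and return the number of pages to prefetch.
    std::size_t on_access(std::uint64_t page) {
        if (page == next_expected) {
            // Sequential access: grow the window (doubling is an assumption).
            window = std::min(window * 2, kMaxWindow);
        } else {
            // Seek: the access pattern changed, so start over.
            window = kMinWindow;
        }
        next_expected = page + 1;
        return window;
    }
};
```

In this sketch a run of accesses to pages 0, 1, 2 widens the window step by step, and a jump to an unrelated page drops it back to the minimum, so random I/O does not pay for speculative reads.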

sys_fsync() called VOP_FSYNC directly without first flushing the OSv page cache. Dirty pages held in the cache were never seen by the underlying filesystem's fsync hook, so a process could fsync() a file and have the data still resident in volatile memory. Reproducer: write 64 KiB, fsync, kill the VM, restart, see zeros.

Walk the file's dirty pages, write them back via the filesystem's write op, then call VOP_FSYNC. Holds the file's f_lock across the flush so concurrent writes can't slip in between the writeback and the VOP_FSYNC call.

Verified by tst-zfs-direct-io and tst-zfs-multirec, which write, fsync, re-open, read, and memcmp. Without this fix the multi-record ZFS test produces zero-filled tail records on uncached read.
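The ordering the fix relies on (write back dirty cached pages, then call the filesystem's fsync hook, all under one lock) can be shown with a toy model. Every type and function below is a hypothetical stand-in written for this example; none of it is OSv's real sys_fsync, VOP_FSYNC, or page-cache code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Toy file object. `lock` plays the role of f_lock in the text.
struct toy_file {
    std::mutex lock;
    std::map<std::size_t, std::string> dirty;     // dirty pages held in the cache
    std::map<std::size_t, std::string> fs_pages;  // pages the filesystem has seen
    std::map<std::size_t, std::string> durable;   // what would survive a crash
    std::vector<std::string> trace;               // order of operations
};

// Stand-in for the filesystem's write op.
void toy_fs_write(toy_file& f, std::size_t page, const std::string& data) {
    f.fs_pages[page] = data;
    f.trace.push_back("write");
}

// Stand-in for VOP_FSYNC: it can only persist what the filesystem has seen.
void toy_vop_fsync(toy_file& f) {
    f.durable = f.fs_pages;
    f.trace.push_back("vop_fsync");
}

// Flush first, then fsync, holding the lock across both steps so that no
// new dirty page can appear between the writeback and the fsync call.
int toy_sys_fsync(toy_file& f) {
    std::lock_guard<std::mutex> guard(f.lock);
    for (const auto& entry : f.dirty) {
        toy_fs_write(f, entry.first, entry.second);
    }
    f.dirty.clear();
    toy_vop_fsync(f);
    return 0;
}
```

If toy_sys_fsync skipped the loop, toy_vop_fsync would copy an empty fs_pages map and nothing would be durable, which mirrors the "read zeros after restart" reproducer.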

bdev_read/bdev_write looped one 512-byte block at a time through the buffer cache, issuing exactly one synchronous bio per BSIZE under the global bio_lock. A 128K device-node transfer became 256 serialized round-trips, pinning throughput at QD1 regardless of caller concurrency (measured ~1600 IOPS on EBS gp3, invariant across block size).

For whole-sector transfers (offset and every iov_len a multiple of BSIZE) dispatch one bio per iovec through dev->driver->devops->strategy (multiplex_strategy), which splits by max_io_size and issues all children before waiting. N concurrent callers now keep N requests in flight. strategy() adds dev->offset exactly once, matching the prior rw_buf()->strategy() path, so partition addressing is unchanged.

The unaligned/sub-sector case keeps the buffer-cache fallback but fixes two latent bugs the removed debug-only asserts had masked under NDEBUG: reads now copy from bp->b_data + (offset % BSIZE) instead of the block start, and writes read-modify-write via bread instead of getblk so the untouched remainder of the sector is preserved.

ZFS and rofs call devops->strategy directly and never enter bdev_read, so they are unaffected; the only remaining buffer-cache consumer is the one-shot partition-table read at device attach.
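A minimal sketch of the two rules in this commit message: the whole-sector test that selects the fast path, and the in-sector offset handling on the fallback path. The function names are hypothetical and a plain byte array stands in for bp->b_data; this does not reproduce the actual bdev_read/bdev_write code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBsize = 512;  // BSIZE in the text

// Fast-path test: the offset and every segment length must be a multiple of
// the sector size. `iov_lens` stands in for the iov_len fields of the uio.
bool is_whole_sector(std::uint64_t offset, const std::vector<std::size_t>& iov_lens) {
    if (offset % kBsize != 0) return false;
    for (std::size_t len : iov_lens) {
        if (len % kBsize != 0) return false;
    }
    return true;
}

// Sub-sector read: copy from the in-sector offset, not from the sector start.
// The caller must keep the range inside one sector.
void read_within_sector(const std::uint8_t* sector, std::uint64_t offset,
                        void* dst, std::size_t len) {
    const std::size_t in_sector = offset % kBsize;
    assert(in_sector + len <= kBsize);
    std::memcpy(dst, sector + in_sector, len);
}

// Sub-sector write as read-modify-write: `sector` is assumed to already hold
// the current on-disk contents (the role bread plays), so patching only the
// requested range leaves the rest of the sector intact.
void write_within_sector(std::uint8_t* sector, std::uint64_t offset,
                         const void* src, std::size_t len) {
    const std::size_t in_sector = offset % kBsize;
    assert(in_sector + len <= kBsize);
    std::memcpy(sector + in_sector, src, len);
}
```

The same constant gives the round-trip count quoted above: a 128 KiB transfer divided into 512-byte blocks is 131072 / 512 = 256 separate requests on the old path.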

gburd added 3 commits June 24, 2026 06:27

gburd wants to merge 3 commits into cloudius-systems:master from gburd:pr/pagecache
