pagecache/vfs: ZFS page-cache bridge, readahead, writeback, fsync flush #1398

Open
gburd wants to merge 3 commits into cloudius-systems:master from gburd:pr/pagecache
@gburd gburd commented Jun 24, 2026

Contributor

Make the OSv VFS page cache cooperate with the OpenZFS port landing later in this series, and fix a data-durability bug in sys_fsync.

This branch bases directly on current master (3df7df7, the just-merged mmu-shm work). Two commits.

Changes

  • pagecache: ZFS bridging, sequential readahead, periodic writeback (54430d8).

    • Expose C-linkage helpers (osv_pagecache_map_page, etc.) so the OpenZFS vop_cache in zfs_vnops_os.c can register/look up cached pages without pulling C++ pagecache headers into module sources. Also fixes a GCC 14 ambiguity on a templated helper.
    • Remove the original ARC-bridge code path (shared ARC<->read_cache pages). It was never reachable: IS_ZFS() always returned false on OSv and the bridge structures were only initialised on the dead branch. A comment documents the decision so it isn't re-attempted.
    • Sequential readahead (window grows on consecutive hits, resets on seek) plus a 5 s periodic writeback worker with a global dirty-page cap.
  • vfs: flush page cache before VOP_FSYNC in sys_fsync (2ef4642).
    sys_fsync() called VOP_FSYNC without first flushing the OSv page cache, so dirty cached pages were never seen by the filesystem's fsync hook -- a process could fsync() and still lose data on crash (reproducer: write 64 KiB, fsync, kill VM, restart, read zeros). Walk the file's dirty pages, write them back, then VOP_FSYNC, holding f_lock across the flush so concurrent writes can't slip in.

Verification

Kernel compiles and links clean on GCC 14.3 / Boost 1.87 (./scripts/build image=empty, fresh loader.elf, RC=0). The page-cache helpers are consumed by the later OpenZFS PR; this PR adds only the kernel-side surface and the fsync fix.

gburd added 3 commits June 24, 2026 06:27

Three pagecache changes that work together to make the OSv VFS
page cache cooperate with OpenZFS:

  - Expose C-linkage helpers (osv_pagecache_map_page, osv_pagecache_*)
    so the OpenZFS vop_cache implementation in zfs_vnops_os.c can
    register and look up cached pages without dragging the C++
    pagecache headers into kernel-module sources.  Also fixes a
    GCC 14 ambiguity error on a templated helper.
  - Remove the original ARC-bridge code path that tried to share
    pages between the ZFS ARC and the OSv read_cache.  It was never
    reachable: IS_ZFS() always returned false on OSv (m_fsid
    distinct), and the bridge data structures were only initialised
    on the unreachable branch.  Document the design decision in a
    comment block above the (now removed) site so future readers
    don't try the same approach again.
  - Sequential readahead and a periodic writeback worker.  The
    readahead window grows on consecutive cache hits and resets on
    seek; the writeback worker flushes dirty pages every 5 s with
    a global cap so dirty pages can't accumulate without bound.
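
The readahead policy in the last item (grow on consecutive hits, reset on seek) can be sketched as below. This is only an illustration under stated assumptions: `ra_state`, `ra_on_access`, the doubling growth rule, and the 1..32 page bounds are hypothetical and are not taken from the OSv sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file readahead state; names are illustrative, not OSv's. */
struct ra_state {
    size_t next_page; /* page index expected if the access pattern is sequential */
    size_t window;    /* number of pages to prefetch after the current access */
};

/* Assumed bounds for the window, in pages. */
enum { RA_MIN_PAGES = 1, RA_MAX_PAGES = 32 };

static void ra_init(struct ra_state *ra)
{
    assert(ra != NULL);
    ra->next_page = 0;
    ra->window = RA_MIN_PAGES;
}

/* Record an access to `page` and return how many pages to prefetch after it. */
static size_t ra_on_access(struct ra_state *ra, size_t page)
{
    assert(ra != NULL);
    if (page == ra->next_page) {
        /* Sequential access: grow the window (doubling is an assumption), capped. */
        ra->window *= 2;
        if (ra->window > RA_MAX_PAGES) {
            ra->window = RA_MAX_PAGES;
        }
    } else {
        /* Seek: fall back to the minimum window. */
        ra->window = RA_MIN_PAGES;
    }
    ra->next_page = page + 1;
    return ra->window;
}
```

Capping the window keeps a long sequential scan from prefetching without bound, which is the same concern the global dirty-page cap addresses on the write side.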

Verified by tst-mmap-file, tst-zfs-direct-io, and tst-fs-bench.

sys_fsync() called VOP_FSYNC directly without first flushing the
OSv page cache.  Dirty pages held in the cache were never seen by
the underlying filesystem's fsync hook, so a process could fsync()
a file and have the data still resident in volatile memory.
Reproducer: write 64 KiB, fsync, kill the VM, restart, see zeros.

Walk the file's dirty pages, write them back via the filesystem's
write op, then call VOP_FSYNC.  Holds the file's f_lock across the
flush so concurrent writes can't slip in between the writeback and
the VOP_FSYNC call.
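
To make the ordering concrete, here is a toy in-memory model of the bug and the fix. It is a sketch, not the kernel code: `toy_file`, `toy_fsync`, and the three buffers (page cache, data the filesystem has seen, stable storage) are hypothetical stand-ins, and the `locked` flag only marks where f_lock would be held.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { PAGE_SZ = 4096, NPAGES = 16 }; /* 16 pages = 64 KiB, as in the reproducer */

/* Toy model, all names hypothetical: page cache -> filesystem -> stable storage. */
struct toy_file {
    bool locked;                           /* stands in for f_lock */
    unsigned char cache[NPAGES][PAGE_SZ];  /* OSv page cache */
    bool dirty[NPAGES];                    /* per-page dirty flag */
    unsigned char fs_buf[NPAGES][PAGE_SZ]; /* data the filesystem has been given */
    unsigned char stable[NPAGES][PAGE_SZ]; /* what survives a crash */
};

/* A cached write only touches the page cache and marks the page dirty. */
static void toy_write(struct toy_file *f, int page, unsigned char byte)
{
    memset(f->cache[page], byte, PAGE_SZ);
    f->dirty[page] = true;
}

/* Write every dirty page back through the filesystem's write path. */
static int toy_flush_dirty(struct toy_file *f)
{
    int flushed = 0;
    for (int i = 0; i < NPAGES; i++) {
        if (f->dirty[i]) {
            memcpy(f->fs_buf[i], f->cache[i], PAGE_SZ);
            f->dirty[i] = false;
            flushed++;
        }
    }
    return flushed;
}

/* The filesystem's fsync hook can only persist what it has already been given. */
static void toy_vop_fsync(struct toy_file *f)
{
    memcpy(f->stable, f->fs_buf, sizeof f->stable);
}

/* flush_cache_first = false models the old behaviour, true models the fix. */
static void toy_fsync(struct toy_file *f, bool flush_cache_first)
{
    assert(!f->locked);
    f->locked = true; /* held across both steps so no write slips in between */
    if (flush_cache_first) {
        toy_flush_dirty(f);
    }
    toy_vop_fsync(f);
    f->locked = false;
}
```

With `flush_cache_first` set to false, `stable` stays zero-filled after a write and fsync, which mirrors the "see zeros" symptom in the reproducer.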

Verified by tst-zfs-direct-io and tst-zfs-multirec, which write,
fsync, re-open, read, and memcmp.  Without this fix the multi-record
ZFS test produces zero-filled tail records on uncached read.

bdev_read/bdev_write looped one 512-byte block at a time through the
buffer cache, issuing exactly one synchronous bio per BSIZE under the
global bio_lock. A 128K device-node transfer became 256 serialized
round-trips, pinning throughput at QD1 regardless of caller concurrency
(measured ~1600 IOPS on EBS gp3, invariant across block size).

For whole-sector transfers (offset and every iov_len a multiple of
BSIZE) dispatch one bio per iovec through dev->driver->devops->strategy
(multiplex_strategy), which splits by max_io_size and issues all
children before waiting. N concurrent callers now keep N requests in
flight. strategy() adds dev->offset exactly once, matching the prior
rw_buf()->strategy() path, so partition addressing is unchanged.
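
The fast-path condition and the request-count arithmetic can be sketched as follows. The helper names `whole_sector_transfer` and `child_count` are hypothetical, and the 64 KiB value used for max_io_size in the usage note is an assumption chosen for illustration; the real limit is device-specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h> /* struct iovec */

enum { BSIZE = 512 };

/* True when the fast path applies: offset and every iov_len are multiples of BSIZE. */
static bool whole_sector_transfer(long long offset, const struct iovec *iov, int iovcnt)
{
    if (offset < 0 || offset % BSIZE != 0) {
        return false;
    }
    for (int i = 0; i < iovcnt; i++) {
        if (iov[i].iov_len % BSIZE != 0) {
            return false;
        }
    }
    return true;
}

/* Number of child requests when `len` bytes are split into chunks of at most max_io_size. */
static size_t child_count(size_t len, size_t max_io_size)
{
    assert(max_io_size > 0);
    return (len + max_io_size - 1) / max_io_size;
}
```

For a 128 KiB transfer, `child_count(128 * 1024, BSIZE)` gives the 256 round-trips of the old per-block loop, while an assumed 64 KiB max_io_size gives 2 children that are all issued before waiting.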

The unaligned/sub-sector case keeps the buffer-cache fallback but fixes
two latent bugs the removed debug-only asserts had masked under NDEBUG:
reads now copy from bp->b_data + (offset % BSIZE) instead of the block
start, and writes read-modify-write via bread instead of getblk so the
untouched remainder of the sector is preserved.
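
The two sub-sector fixes reduce to the pointer arithmetic below. This is a minimal sketch on a plain byte array: `subsector_read` and `subsector_write` are hypothetical helpers, and `sector` stands in for bp->b_data after the block has been read from the device (the bread case, as opposed to getblk, which returns a buffer without reading it).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BSIZE = 512 };

/* Sub-sector read: copy from the offset within the sector, not from the sector start. */
static void subsector_read(void *dst, const unsigned char *sector, size_t offset, size_t len)
{
    size_t in_block = offset % BSIZE;
    assert(in_block + len <= BSIZE); /* request must not cross the sector boundary */
    memcpy(dst, sector + in_block, len);
}

/*
 * Sub-sector write: `sector` must already hold the current device contents,
 * so patching only [in_block, in_block + len) preserves the rest of the sector.
 */
static void subsector_write(unsigned char *sector, const void *src, size_t offset, size_t len)
{
    size_t in_block = offset % BSIZE;
    assert(in_block + len <= BSIZE);
    memcpy(sector + in_block, src, len);
}
```

If the buffer were not filled from the device first, the bytes outside the patched range would be whatever the buffer happened to contain, and writing the sector back would corrupt them.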

ZFS and rofs call devops->strategy directly and never enter bdev_read,
so they are unaffected; the only remaining buffer-cache consumer is the
one-shot partition-table read at device attach.