Skip to content

2.3.4 staging prep #17595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 61 commits into from
Aug 15, 2025
Merged

Conversation

amotin
Copy link
Member

@amotin amotin commented Aug 5, 2025

Most of these commits we've already merged into TrueNAS, few others we'd like to, if there are no objections.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

amotin and others added 21 commits August 5, 2025 12:14
ZIL introduced dependencies between its write ZIOs to permit flush
defer, when we flush vdev caches only once all the write ZIOs has
completed.  But it was recently spotted that it serializes not only
ZIO completions handling, but also their ready stage.  It means ZIO
pipeline can't calculate checksums for the following ZIOs until all
the previous are checksumed, even though it is not required.  On a
systems where memory throughput of a single CPU core is limited,
it creates single-core CPU bottleneck, which is difficult to see
due to ZIO pipeline design with many taskqueue threads.

While it would be great to bypass the ready stage waits, it would
require changes to ZIO code, and I haven't found a clean way to do
it.  But I've noticed that we don't need any dependency between
the write ZIOs if the previous one has some waiters, which means
it won't defer any flushes and work as a barrier for the earlier
ones.

Bypassing it won't help large single-thread writes, since all the
write ZIOs except the last in that case won't have waiters, and
so will be dependent.  But in that case the ZIO processing might
not be a bottleneck, since there will be only one thread populating
the write buffers, that will likely be the bottleneck.

But bypassing the ZIO dependency on multi-threaded write workloads
really allows them to scale beyond the checksuming throughput of
one CPU core.

My tests with writing 12 files on a same dataset on a pool with
4 striped NVMes as SLOGs from 12 threads with 1MB blocks on a
system with Xeon Silver 4114 CPU show total throughput increase
from 4.3GB/s to 8.5GB/s, increasing the SLOGs busy from ~30% to
~70%.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17458
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Authored-by: Don Brady <[email protected]>
Authored-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17038
- Don't drop L2ARC header if we have more buffers in this header.
Since we leave them the header, leave them the L2ARC header also.
Honestly we are not required to drop it even if there are no other
buffers, but then we'd need to allocate it a separate header, which
we might drop soon if the old block is really deleted.  Multiple
buffers in a header likely mean active snapshots or dedup, so we
know that the block in L2ARC will remain valid.  It might be rare,
but why not?
 - Remove some impossible assertions and conditions.

Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17126
The `scn_min_txg` can now be used not only with resilver. Instead
of checking `scn_min_txg` to determine whether it’s a resilver or
a scrub, simply check which function is defined. Thanks to this
change, a scrub_finish event is generated when performing a scrub
from the saved txg.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Mariusz Zaborski <[email protected]>
Closes openzfs#17432
dbuf_verify(): Don't need the lock, since we only compare pointers.

dbuf_findbp(): Don't need the lock, since aside of unneeded assert
we only produce the pointer, but don't de-reference it.

dnode_next_offset_level(): When working on top level indirection
should lock dnode buffer's db_rwlock, since it is our parent.  If
dnode has no buffer, then it is meta-dnode or one of quotas and we
should lock the dataset's ds_bp_rwlock instead.

Reviewed-by: Alan Somers <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17441
There are still a variety of bugs involving the vdev_nonrot property
that will cause problems if you try to run the test suite with
segment-based weighting disabled, and with other things in the weighting
code. Parents' nonrot property need to be updated when children are
added. When vdevs are expanded and more metaslabs are added, the weights
have to be recalculated (since the number of metaslabs is an input to
the lba bias function). When opening, faulted or unopenable children
should not be considered for whether a vdev is nonrot or not (since the
nonrot property is determined during a successful open, this can cause
false negatives). And draid spares need to have the nonrot property set
correctly.

Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17469
Use statx to verify that path-based unmounts proceed only if the
mountpoint reported by statx matches the MNTTAB entry reported by
libzfs, aborting the operation if they differ. Align
`zfs umount /path` behavior with `zfs umount dataset`.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17481
Currently, after a failed allocation, the metaslab code recalculates the
weight for a metaslab. However, for space-based metaslabs, it uses the
maximum free segment size instead of the normal weighting
algorithm. This is presumably because the normal metaslab weight is
(roughly) intended to estimate the size of the largest free segment, but
it doesn't do that reliably at most fragmentation levels. This means
that recalculated metaslabs are forced to a weight that isn't really
using the same units as the rest of them, resulting in undesirable
behaviors. We switch this to use the normal space-weighting function.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Closes openzfs#17531
Under parallel workloads ZIL may delay writes of open LWBs that
are not full enough.  On suspend we do not expect anything new to
appear since zil_get_commit_list() will not let it pass, only
returning TXG number to wait for.  But I suspect that waiting for
the TXG commit without having the last LWB issued may not wait for
its completion, resulting in panic described in openzfs#17509.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17521
On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can
be dereferenced without refcounts and locks. For this reason, dentry and
inode must only be freed after RCU grace period.

However, zfs currently frees inode in zfs_inode_destroy synchronously
and we can't use GPL-only call_rcu() in zfs directly. Fortunately, on
Linux 5.2 and after, if we define sops->free_inode(), the kernel will do
call_rcu() for us.

This issue may be triggered more easily with init_on_free=1 boot
parameter:

BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:selinux_inode_permission+0x10e/0x1c0
Call Trace:
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? security_inode_permission+0x37/0x60
 ? __die_body.cold+0x8/0xd
 ? no_context+0x113/0x220
 ? exc_page_fault+0x6d/0x130
 ? asm_exc_page_fault+0x1e/0x30
 ? selinux_inode_permission+0x10e/0x1c0
 security_inode_permission+0x37/0x60
 link_path_walk.part.0.constprop.0+0xb5/0x360
 ? path_init+0x27d/0x3c0
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xc0/0x1d0
 ? __check_object_size.part.0+0x123/0x150
 ? strncpy_from_user+0x4e/0x130
 ? getname_flags.part.0+0x4b/0x1c0
 vfs_statx+0x72/0x120
 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
 __do_sys_newlstat+0x39/0x70
 ? __x64_sys_ioctl+0x8d/0xd0
 do_syscall_64+0x30/0x40
 entry_SYSCALL_64_after_hwframe+0x62/0xc7

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Co-authored-by: Chunwei Chen <[email protected]>
Closes openzfs#17546
During hotplug REMOVED events, devid matching fails for partition-based
spares because devid information is not stored in pool config for
partitioned devices. However, when devid is populated by the hotplug
event, the original code skipped the search logic entirely, skipping
vdev_guid matching and resulting in wrong device type detection that
caused spares to be incorrectly identified as l2arc devices.
Additionally, fix zfs_agent_iter_pool() to use the return value from
zfs_agent_iter_vdev() instead of relying on search parameters, which
was previously ignored. Also add pool_guid optimization to enable
targeted pool searching when pool_guid is available.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17545
Currently, when reading compressed blocks with -R and decompressing
them with :d option and specifying lsize, which is normally bigger
than psize for compressed blocks, the checksum is calculated on
decompressed data. But it makes no sense since zfs always calculates
checksum on physical, i.e. compressed data. So reading the same block
produces different checksum results depending on how we read it,
whether we decompress it or not, which, again, makes no sense.

Fix: use psize instead of lsize when calculating the checksum so that
it is always calculated on the physical block size, no matter was it
compressed or not.

Signed-off-by: Andriy Tkachuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Closes openzfs#17547
Update the default FICLONE and FICLONERANGE ioctl behavior to wait
on dirty blocks.  While this does remove some control from the
application, in practice ZFS is better positioned to the optimial
thing and immediately force a TXG sync.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#17455
When we're passivating a metaslab group we start by passivating the 
metaslabs that have been activated for each of the allocators.  To do 
that, we need to provide a weight. However, currently this erroneously 
always uses a segment-based weight, even if segment-based weighting is 
disabled.

Use the normal weight function, which will decide which type of weight 
to use.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17566
While booting, only the needed 256KiB benchmarks are done now.

The delay for checking all checksums occurs when requested via:
- Linux: cat /proc/spl/kstat/zfs/chksum_bench
- FreeBSD: sysctl kstat.zfs.misc.chksum_bench

Reported by: Lahiru Gunathilake <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Co-authored-by: Colin Percival <[email protected]>
Closes openzfs#17563
Closes openzfs#17560
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Igor Ostapenko <[email protected]>
Closes openzfs#17581
…17073)

The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings. This applies to all data and metadata
blocks in zfs, with one exception: gang blocks. Gang blocks currently
just take the copies property of the IO being ganged and, if it's 1,
sets it to 2. This means that we always make at least two copies of a
gang header, which is good for resilience. However, if the users care
more about performance than resilience, their gang blocks will be even
more of a penalty than usual.

We add logic to calculate the number of gang headers copies directly,
and store it as a separate IO property. This is stored in the IO
properties and not calculated when we decide to gang because by that
point we may not have easy access to the relevant information about what
kind of block is being stored. We also check the redundant_metadata
property when doing so, and use that to decide whether to store an extra
copy of the gang headers, compared to the underlying blocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Paul Dagnelie <[email protected]>
Co-authored-by: Paul Dagnelie <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Missed in openzfs#17073, probably because that PR was branched before openzfs#17001
was landed and never rebased.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
As discussed in the comments of PR openzfs#17004, you can theoretically run
into a case where a gang child has more copies than the gang header,
which can lead to some odd accounting behavior (and even trip a
VERIFY). While the accounting code could be changed to handle this, it
fundamentally doesn't seem to make a lot of sense to allow this to
happen. If the data is supposed to have a certain level of reliability,
that isn't actually achieved unless the gang_copies property is set to
match it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17484
Loss of one indirect block of the meta dnode likely means loss of
the whole dataset.  It is worse than one file that the man page
promises, and in my opinion is not much better than "none" mode.

This change restores redundancy of the meta-dnode indirect blocks,
while same time still corrects expectations in the man page.

Reviewed-by: Akash B <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17339
@amotin amotin mentioned this pull request Aug 5, 2025
14 tasks
@shodanshok
Copy link
Contributor

Maybe #17542 also?

@robn
Copy link
Member

robn commented Aug 7, 2025

I'm fine with also these going in. I don't think its the whole list of course, but we usually do a few rounds of these PRs. I'll start daily driving this set and will put together my own list of wants soon (but for sure #17542 is critical imo).

@satmandu
Copy link
Contributor

satmandu commented Aug 7, 2025

Is #17561 too new? Getting as many of the PRS that handle memory pressure issues as are deemed safe into a point version update would be nice.

robn and others added 12 commits August 12, 2025 17:23
All this machinery is there to try to understand when there an async
writeback waiting to complete because the intent log callbacks are still
outstanding, and force them with a timely zil_commit(). The next commit
fixes this properly, so there's no need for all this extra housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17584
For async page writeback, we do not need to wait for the page to be on
disk before returning to the caller; it's enough that the data from the
dirty page be on the DMU and in the in-memory ZIL, just like any other
write.

So, if this is not a syncing write, don't add a callback to the itx, and
instead just unlock the page immediately.

(This is effectively the same concept used for FreeBSD in d323fbf).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17584
Closes openzfs#14290
The structure of zfs_putpage() and its callers is tricky to follow.
There's a lot more we could do to improve it, but at least now we have
some description of one of the trickier bits.

Writing this exposed a very subtle bug: most async pages pushed out
through zpl_putpages() would go to the ZIL with commit=false, which can
yield a less-efficient write policy. So this commit updates that too.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17584
Currently we fail the compilation via the #error directive if
`HAVE_XSAVE` isn't defined. This breaks i586 builds since we check
the toolchains SIMD support only on i686 and onward.

Remove the requirement to fix the build on i586.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Attila Fülöp <[email protected]>
Closes openzfs#13303
Closes openzfs#17590
Be standard-compliant by using `int main()`.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Attila Fülöp <[email protected]>
Closes openzfs#13303
Closes openzfs#17590
The location of zgenhostid was changed in 0ae733c (Install zgenhostid
to sbindir, 2021-01-21).  We include all files within sbindir two lines
earlier, which causes rpmbuild to report:

    File listed twice: /sbin/zgenhostid

Drop the redundant entry from the %files section.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Todd Zullinger <[email protected]>
Closes openzfs#17601
03987f7 (openzfs#16069) added a workaround to get the blk-mq hardware
context for older kernels that don't cache it in the struct request.
However, this workaround appears to be incomplete.

In 4.19, the rq data context is optional. If its not initialised, then
the cached rq->cpu will be -1, and so using it to index into mq_map
causes a crash.

Given that the upstream 4.19 is now in extended LTS and rarely seen,
RHEL8 4.18+ has long carried "modern" blk-mq support, and the cached
hardware context has been available since 5.1, I'm not going to huge
lengths to get queue selection correct for the very few people that are
likely to feel it. To that end, we simply call raw_smp_processor_id() to
get a valid CPU id and use that instead.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#17597
This converts the body of a ZED slack notification from
plain text to code block style to help with readability.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: René Wirnata <[email protected]>
Closes openzfs#17610
Update the META file to reflect compatibility with the 6.16
kernel.

Tested with 6.16.0-0-stable of Alpine Linux edge, see
<https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/87929>.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Achill Gilgenast <[email protected]>
Closes openzfs#17578
Chase URL change from the FreeBSD project.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Colin Percival <[email protected]>
Closes openzfs#17617
Allow zstd_mempool_init() to allocate using vmem_alloc() instead
of kmem_alloc() to silence the large allocation warning on Linux
during module load when the system has a large number of CPUs.

It's not at all clear to me that scaling the allocation size with
the number of CPUs is beneficial and that should be evaluated.
But for the moment this should resolve the warning without
introducing any unexpected side effects.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#17620
Closes openzfs#11557
Systems with a large number of CPU cores (192+) may trigger the large
allocation warning in multilist_create() on Linux.  Silence the warning
by converting the allocation to vmem_alloc().

On Linux this results in a call to kvalloc() which will alloc vmem
for large allocations and kmem for small allocations.

On FreeBSD both vmem_alloc and kmem_alloc internally use the same
allocator so there is no functional change.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#17616
@satmandu
Copy link
Contributor

For what it is worth, I use dkms to build the zfs modules on ubuntu 25.04. Trying to build from this patchset I couldn't get the modules to build with dkms on kernel 6.16 unless I reverted the objtool patches #17541 and #17456 . (I did not try earlier kernels.)

With those reverted, and with the addition of #17621, I am able to build for both 6.16.0 and 6.17-rc1.

@behlendorf
Copy link
Contributor

@amotin looks good. Regarding "zfs rewrite", normally I'd avoid adding a new sub-command but in this case since it doesn't modify any of the library ABIs and the code is so well isolated I'm okay with pulling it in. It's definitely a nice thing to make available before 2.4 is released.

Can you also add to this PR these commits from my zfs-2.3.4-staging-part2 branch which is rebased on top of your branch. Here's a break down of the additional commits. The #17584's changes are still pretty fresh, but as long as you and @robn don't have any concerns I think they're ready to pull in.

CI and test suite fixes:

46de04d FreeBSD 15.0 is now "PRERELEASE"
df5e02d CI: match and trim out internal timestamp for test prefix
245adb6 ZTS: include microsecond timestamps on all output
82a0868 CI: Remove Debian backports
8c4f625 CI: Add CentOS Stream 9/10 to the FULL_OS runner list
7882e85 Delete unused .cirrus.yml
6b38d0f ZTS: Fix FreeBSD 15.0 ksh errors
80b6457 CI: Switch from FreeBSD 13.4 to 13.5
2518f4b Revert "Fix incorrect expected error in ztest"
90d2c44 ztest: Fix false positive of ENOSPC handling
f7698f4 CI: run ztest on compressed zpool
024e60b Missing tests in make pkg
094305c Fix TestGroup warning due to missing tags
a826f7a ZTS: Use FreeBSD cloudinit images

Build and packaging fixes:

41ca229 Linux 6.16 compat: META
a49c957 linux/zvol_os: fix crash with blk-mq on Linux 4.19
d1d7063 rpm: don't list /sbin/zgenhostid twice in %files
11f8441 config: Avoid void main() in toolchain-simd.m4
57b614e SIMD: Don't require definition of HAVE_XSAVE
e7e0bb3 linux: Fix out-of-src builds
6c1130a pyzfs: Adapt python lib directory evaluation from ax_python_devel.m4
74b539d pyzfs: Update ax_python_devel.m4 to serial 37

Minor bug fixes:

0fe1036 Allow vmem_alloc backed multilists
3e78905 Silence zstd large allocation warning
6af1f61 Fix zdb pool/ with -k
5289f6f spa: ZIO_TASKQ_ISSUE: Use symbolic priority
9651668 zed: prettify slack notification message

PR #17584 performance fix for Linux async page writeback.

0c7d6e2 Linux: zfs_putpage: document (and fix!) confusing sync/commit modes
b9c45fe Linux: zfs_putpage: complete async page writeback immediately
f72226a Linux: sync: remove async/sync accounting
97fe868 ZTS: mmap_ftruncate test to confirm async writeback behaviour

@robn
Copy link
Member

robn commented Aug 13, 2025

I'm fine with #17584. I've been running it locally for about a week.

Also should include #17533, same areas at #17584, but we already shipped the "bad" version of that in 2.3.3, this one is better.

I'll start running this new branch later today, and I still need to check if there's anything else I want to ship.

markjdb and others added 2 commits August 12, 2025 22:41
…completes"

This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency.  Revert for now until we have a better
approach.

This reverts commit 238eab7.

Reported by:    Ihor Antonov <[email protected]>
Discussed with: Rob Norris <[email protected]>

References: freebsd/freebsd-src@738a9a7
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Mark Johnston <[email protected]>
Ported-by: Rob Norris <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17533
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.

So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.

The original version of this (238eab7) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.

For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Mark Johnston <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17533
@amotin
Copy link
Member Author

amotin commented Aug 13, 2025

Can you also add to this PR these commits from my zfs-2.3.4-staging-part2 branch which is rebased on top of your branch.

Added.

Also should include #17533

Included.

@AttilaFueloep
Copy link
Contributor

@satmandu Can you open an issue and share the make.log?

@mabod
Copy link

mabod commented Aug 13, 2025

When I try to compile zfs-dkms 2.3.3 with this patch (on endeavouros) the patch does not apply cleanly.

I am retrieving the patch in the PKGBUILD with this source:
https://patch-diff.githubusercontent.com/raw/openzfs/zfs/pull/17595.patch

Output of patch -p1 -i ../17595.patch:

patching file module/zfs/zil.c
patching file include/sys/ddt.h
patching file include/sys/vdev.h
...
patching file config/zfs-build.m4
patching file configure.ac
patching file module/Makefile.in
can't find file to patch at input line 5705
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/scripts/.gitignore b/scripts/.gitignore
|index 5621a6e147a0..443cb7b8484e 100644
|--- a/scripts/.gitignore
|+++ b/scripts/.gitignore
--------------------------
File to patch:

If I just hit return more missing file errors appear.

@AttilaFueloep
Copy link
Contributor

Just deleted my last two comments, was on a wrong branch. Sorry for the confusion.

@behlendorf
Copy link
Contributor

@mabod some files which are present in the git repository aren't included in the make dist tarball. For example, the .gitignore and CI files under .github. You can safely manually remove those hunks from the patch file if you're patching a release tarball. Otherwise you'll need to apply these changes to a clone of the repository.

@AttilaFueloep
Copy link
Contributor

Adding to what Brian wrote, the patch you've downloaded is against the 2.3.4 staging branch and won't apply to 2.3.3. You'll need to change the sources entry in the PKGBUILD to git clone the 2.3.4 staging branch and apply the patch on top of that. See PKGBUILD(5).

@tonyhutter tonyhutter merged commit 3b64a96 into openzfs:zfs-2.3.4-staging Aug 15, 2025
37 of 41 checks passed
@tonyhutter
Copy link
Contributor

Merged as:
3b64a96 FreeBSD: zfs_putpages: don't undirty pages until after write completes
a072611 Revert "FreeBSD: zfs_putpages: don't undirty pages until after write completes"
0fe1036 Allow vmem_alloc backed multilists
3e78905 Silence zstd large allocation warning
46de04d FreeBSD 15.0 is now "PRERELEASE"
41ca229 Linux 6.16 compat: META
9651668 zed: prettify slack notification message
a49c957 linux/zvol_os: fix crash with blk-mq on Linux 4.19
d1d7063 rpm: don't list /sbin/zgenhostid twice in %files
11f8441 config: Avoid void main() in toolchain-simd.m4
57b614e SIMD: Don't require definition of HAVE_XSAVE
0c7d6e2 Linux: zfs_putpage: document (and fix!) confusing sync/commit modes
b9c45fe Linux: zfs_putpage: complete async page writeback immediately
f72226a Linux: sync: remove async/sync accounting
97fe868 ZTS: mmap_ftruncate test to confirm async writeback behaviour
df5e02d CI: match and trim out internal timestamp for test prefix
245adb6 ZTS: include microsecond timestamps on all output
82a0868 CI: Remove Debian backports
e7e0bb3 linux: Fix out-of-src builds
6af1f61 Fix zdb pool/ with -k
8c4f625 CI: Add CentOS Stream 9/10 to the FULL_OS runner list
7882e85 Delete unused .cirrus.yml
6b38d0f ZTS: Fix FreeBSD 15.0 ksh errors
80b6457 CI: Switch from FreeBSD 13.4 to 13.5
2518f4b Revert "Fix incorrect expected error in ztest"
90d2c44 ztest: Fix false positive of ENOSPC handling
f7698f4 CI: run ztest on compressed zpool
6c1130a pyzfs: Adapt python lib directory evaluation from ax_python_devel.m4
74b539d pyzfs: Update ax_python_devel.m4 to serial 37
024e60b Missing tests in make pkg
5289f6f spa: ZIO_TASKQ_ISSUE: Use symbolic priority
094305c Fix TestGroup warning due to missing tags
a826f7a ZTS: Use FreeBSD cloudinit images
86bf73c objtool wrapper: use absolute path to call the wrapper
1d293b3 Linux build: handle CONFIG_OBJTOOL_WERROR=y
22eb2bd Make TX abort after assign safer
809b553 Introduce zfs rewrite subcommand (#17246)
abb6211 Linux 6.16: remove writepage and readahead_page
c405a7a Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread
4808641 enforce arc_dnode_limit
30fa92b Increase meta-dnode redundancy in "some" mode
fd5a27c Ensure that gang_copies is always at least as large as copies
3ad3f43 zts: add spdx license tags to gang_blocks tests (#17160)
a46ce73 Make ganging redundancy respect redundant_metadata property (#17073)
9079095 SPDX: Add missing CDDL-1.0 license
95abbc7 range_tree: Provide more debug details upon unexpected add/remove
fc658b9 Faster checksum benchmark on system boot
271b979 Don't use wrong weight when passivating group
582e784 Default to zfs_bclone_wait_dirty=1
6d37856 zdb: fix checksum calculation for decompressed blocks
0c928f7 ZED: Fix device type detection and pool iteration logic
c79d5e4 Define sops->free_inode() to prevent use-after-free during lookup
347d680 ZIL: Force writing of open LWB on suspend
acf3871 Correct weight recalculation of space-based metaslabs
21d5f25 Validate mountpoint on path-based unmount using statx
7e945a5 Fix other nonrot bugs
85ce6b8 Polish db_rwlock scope
954894e scrub: generate scrub_finish event
a4e775d Some arc_release() cleanup
661310f FDT dedup log sync -- remove incremental
f9d59b5 ZIL: Relax parallel write ZIOs processing

@amotin amotin deleted the zfs-2.3.4-staging branch August 15, 2025 02:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.