2.3.4 staging prep #17595
Conversation
ZIL introduced dependencies between its write ZIOs to permit flush deferral, where we flush vdev caches only once all the write ZIOs have completed. But it was recently spotted that this serializes not only the handling of ZIO completions, but also their ready stage. It means the ZIO pipeline can't calculate checksums for the following ZIOs until all the previous ones are checksummed, even though that is not required. On systems where the memory throughput of a single CPU core is limited, this creates a single-core CPU bottleneck, which is difficult to see due to the ZIO pipeline design with many taskqueue threads. While it would be great to bypass the ready-stage waits, that would require changes to ZIO code, and I haven't found a clean way to do it. But I've noticed that we don't need any dependency between the write ZIOs if the previous one has some waiters, which means it won't defer any flushes and works as a barrier for the earlier ones. Bypassing it won't help large single-threaded writes, since in that case all the write ZIOs except the last won't have waiters and so will be dependent. But in that case ZIO processing might not be a bottleneck, since there will be only one thread populating the write buffers, and that will likely be the bottleneck. Bypassing the ZIO dependency on multi-threaded write workloads, however, really allows them to scale beyond the checksumming throughput of one CPU core. My tests writing 12 files on the same dataset from 12 threads with 1MB blocks, on a pool with 4 striped NVMes as SLOGs and a Xeon Silver 4114 CPU, show total throughput increasing from 4.3GB/s to 8.5GB/s, with SLOG busy rising from ~30% to ~70%. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#17458
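The dependency-skip rule described above can be pictured with a small user-space model (the struct and function names here are simplified stand-ins for the real ZIL lwb machinery, not OpenZFS's actual code):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a ZIL log write block (LWB). */
typedef struct lwb {
	int nwaiters;      /* threads waiting on this LWB's completion */
	struct lwb *prev;  /* previously issued write ZIO's LWB */
} lwb_t;

/*
 * A new write ZIO needs a dependency on the previous one only when the
 * previous LWB has no waiters: an LWB with waiters won't defer any
 * flushes and already acts as a barrier for the earlier ones.
 */
int
lwb_needs_dependency(const lwb_t *lwb)
{
	return (lwb->prev != NULL && lwb->prev->nwaiters == 0);
}
```

With this rule, multi-threaded workloads (where most LWBs have waiters) skip the dependency and can checksum in parallel, while a single-threaded stream (no waiters until the last LWB) keeps the dependencies, matching the behavior described in the commit message.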
This PR condenses the FDT dedup log syncing into a single sync pass. This reduces the overhead of modifying indirect blocks for the dedup table multiple times per txg. In addition, changes were made to the formula for how much to sync per txg. We now also consider the backlog we have to clear, to prevent it from growing too large, or remaining large on an idle system. Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Authored-by: Don Brady <[email protected]> Authored-by: Paul Dagnelie <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17038
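A backlog-aware sync budget of the kind described could look roughly like the following sketch (the actual formula in the PR differs; the function name, parameters, and the drain horizon here are assumptions for illustration only):

```c
#include <assert.h>

/*
 * Sync enough to keep up with newly logged entries, plus a slice of
 * the backlog so it drains over roughly drain_txgs txgs instead of
 * growing without bound or lingering on an idle system.
 */
unsigned long
ddt_log_sync_budget(unsigned long ingest_per_txg, unsigned long backlog,
    unsigned long drain_txgs)
{
	if (drain_txgs == 0)
		drain_txgs = 1;
	/* Round the backlog slice up so small backlogs still drain. */
	return (ingest_per_txg + (backlog + drain_txgs - 1) / drain_txgs);
}
```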
- Don't drop the L2ARC header if there are more buffers in this header. Since we keep those buffers in the header, keep the L2ARC header too. Strictly speaking we aren't required to drop it even if there are no other buffers, but then we'd need to allocate a separate header for it, which we might drop soon if the old block is really deleted. Multiple buffers in a header likely mean active snapshots or dedup, so we know the block in L2ARC will remain valid. It might be rare, but why not? - Remove some impossible assertions and conditions. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#17126
The `scn_min_txg` can now be used not only with resilver. Instead of checking `scn_min_txg` to determine whether it’s a resilver or a scrub, simply check which function is defined. Thanks to this change, a scrub_finish event is generated when performing a scrub from the saved txg. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]> Closes openzfs#17432
dbuf_verify(): The lock is not needed, since we only compare pointers. dbuf_findbp(): The lock is not needed, since aside from an unneeded assert we only produce the pointer, but don't dereference it. dnode_next_offset_level(): When working on the top level of indirection, we should lock the dnode buffer's db_rwlock, since it is our parent. If the dnode has no buffer, then it is the meta-dnode or one of the quotas and we should lock the dataset's ds_bp_rwlock instead. Reviewed-by: Alan Somers <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#17441
There are still a variety of bugs involving the vdev_nonrot property that will cause problems if you try to run the test suite with segment-based weighting disabled, as well as other problems in the weighting code. A parent's nonrot property needs to be updated when children are added. When vdevs are expanded and more metaslabs are added, the weights have to be recalculated (since the number of metaslabs is an input to the lba bias function). When opening, faulted or unopenable children should not be considered when deciding whether a vdev is nonrot (since the nonrot property is determined during a successful open, this can cause false negatives). And draid spares need to have the nonrot property set correctly. Sponsored-by: Eshtek, creators of HexOS Sponsored-by: Klara, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17469
Use statx to verify that path-based unmounts proceed only if the mountpoint reported by statx matches the MNTTAB entry reported by libzfs, aborting the operation if they differ. Align `zfs umount /path` behavior with `zfs umount dataset`. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes openzfs#17481
Currently, after a failed allocation, the metaslab code recalculates the weight for a metaslab. However, for space-based metaslabs, it uses the maximum free segment size instead of the normal weighting algorithm. This is presumably because the normal metaslab weight is (roughly) intended to estimate the size of the largest free segment, but it doesn't do that reliably at most fragmentation levels. This means that recalculated metaslabs are forced to a weight that isn't really using the same units as the rest of them, resulting in undesirable behaviors. We switch this to use the normal space-weighting function. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Closes openzfs#17531
Under parallel workloads ZIL may delay writes of open LWBs that are not full enough. On suspend we do not expect anything new to appear since zil_get_commit_list() will not let it pass, only returning TXG number to wait for. But I suspect that waiting for the TXG commit without having the last LWB issued may not wait for its completion, resulting in panic described in openzfs#17509. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#17521
On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can be dereferenced without refcounts and locks. For this reason, dentry and inode must only be freed after an RCU grace period. However, zfs currently frees the inode synchronously in zfs_inode_destroy, and we can't use the GPL-only call_rcu() in zfs directly. Fortunately, on Linux 5.2 and later, if we define sops->free_inode(), the kernel will do the call_rcu() for us. This issue may be triggered more easily with the init_on_free=1 boot parameter:
BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:selinux_inode_permission+0x10e/0x1c0
Call Trace:
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? security_inode_permission+0x37/0x60
 ? __die_body.cold+0x8/0xd
 ? no_context+0x113/0x220
 ? exc_page_fault+0x6d/0x130
 ? asm_exc_page_fault+0x1e/0x30
 ? selinux_inode_permission+0x10e/0x1c0
 security_inode_permission+0x37/0x60
 link_path_walk.part.0.constprop.0+0xb5/0x360
 ? path_init+0x27d/0x3c0
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xc0/0x1d0
 ? __check_object_size.part.0+0x123/0x150
 ? strncpy_from_user+0x4e/0x130
 ? getname_flags.part.0+0x4b/0x1c0
 vfs_statx+0x72/0x120
 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
 __do_sys_newlstat+0x39/0x70
 ? __x64_sys_ioctl+0x8d/0xd0
 do_syscall_64+0x30/0x40
 entry_SYSCALL_64_after_hwframe+0x62/0xc7
Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Co-authored-by: Chunwei Chen <[email protected]> Closes openzfs#17546
During hotplug REMOVED events, devid matching fails for partition-based spares because devid information is not stored in the pool config for partitioned devices. However, when the devid is populated by the hotplug event, the original code skipped the search logic entirely, bypassing vdev_guid matching and resulting in wrong device type detection that caused spares to be incorrectly identified as l2arc devices. Additionally, fix zfs_agent_iter_pool() to use the return value from zfs_agent_iter_vdev(), which was previously ignored, instead of relying on search parameters. Also add a pool_guid optimization to enable targeted pool searching when pool_guid is available. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes openzfs#17545
Currently, when reading compressed blocks with -R, decompressing them with the :d option, and specifying lsize (which is normally bigger than psize for compressed blocks), the checksum is calculated on the decompressed data. But this makes no sense, since zfs always calculates the checksum on the physical, i.e. compressed, data. So reading the same block produces different checksum results depending on how we read it, whether we decompress it or not, which, again, makes no sense. Fix: use psize instead of lsize when calculating the checksum so that it is always calculated on the physical block, regardless of whether it was compressed or not. Signed-off-by: Andriy Tkachuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Closes openzfs#17547
Update the default FICLONE and FICLONERANGE ioctl behavior to wait on dirty blocks. While this does remove some control from the application, in practice ZFS is better positioned to do the optimal thing and immediately force a TXG sync. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#17455
When we're passivating a metaslab group we start by passivating the metaslabs that have been activated for each of the allocators. To do that, we need to provide a weight. However, currently this erroneously always uses a segment-based weight, even if segment-based weighting is disabled. Use the normal weight function, which will decide which type of weight to use. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17566
While booting, only the needed 256KiB benchmarks are done now. The delay for checking all checksums occurs when requested via: - Linux: cat /proc/spl/kstat/zfs/chksum_bench - FreeBSD: sysctl kstat.zfs.misc.chksum_bench Reported by: Lahiru Gunathilake <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Co-authored-by: Colin Percival <[email protected]> Closes openzfs#17563 Closes openzfs#17560
Sponsored-by: Klara, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Igor Ostapenko <[email protected]> Closes openzfs#17581
Signed-off-by: Ameer Hamza <[email protected]>
…17073) The redundant_metadata setting in ZFS allows users to trade resilience for performance and space savings. This applies to all data and metadata blocks in zfs, with one exception: gang blocks. Gang blocks currently just take the copies property of the IO being ganged and, if it's 1, set it to 2. This means that we always make at least two copies of a gang header, which is good for resilience. However, if users care more about performance than resilience, their gang blocks will be even more of a penalty than usual. We add logic to calculate the number of gang header copies directly, and store it as a separate IO property. This is stored in the IO properties and not calculated when we decide to gang, because by that point we may not have easy access to the relevant information about what kind of block is being stored. We also check the redundant_metadata property when doing so, and use that to decide whether to store an extra copy of the gang headers, compared to the underlying blocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Paul Dagnelie <[email protected]> Co-authored-by: Paul Dagnelie <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
Missed in openzfs#17073, probably because that PR was branched before openzfs#17001 was landed and never rebased. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
As discussed in the comments of PR openzfs#17004, you can theoretically run into a case where a gang child has more copies than the gang header, which can lead to some odd accounting behavior (and even trip a VERIFY). While the accounting code could be changed to handle this, it fundamentally doesn't seem to make a lot of sense to allow this to happen. If the data is supposed to have a certain level of reliability, that isn't actually achieved unless the gang_copies property is set to match it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17484
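The constraint argued for above can be illustrated with a one-line clamp (a sketch of the idea only; the property handling in the actual patch is more involved):

```c
#include <assert.h>

/*
 * A gang header must be at least as redundant as its children:
 * raise an effective gang_copies below copies up to match, so the
 * stated reliability of the data is actually achieved.
 */
int
effective_gang_copies(int copies, int gang_copies)
{
	return (gang_copies < copies ? copies : gang_copies);
}
```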
Loss of one indirect block of the meta-dnode likely means loss of the whole dataset. That is worse than the loss of one file that the man page promises, and in my opinion is not much better than the "none" mode. This change restores redundancy of the meta-dnode indirect blocks, while at the same time still correcting the expectations set in the man page. Reviewed-by: Akash B <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#17339
Maybe #17542 also?
I'm fine with these also going in. I don't think it's the whole list, of course, but we usually do a few rounds of these PRs. I'll start daily-driving this set and will put together my own list of wants soon (but for sure #17542 is critical imo).
Is #17561 too new? Getting as many of the PRs that handle memory pressure issues as are deemed safe into a point version update would be nice.
All this machinery is there to try to understand when there is an async writeback waiting to complete because the intent log callbacks are still outstanding, and to force them with a timely zil_commit(). The next commit fixes this properly, so there's no need for all this extra housekeeping. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17584
For async page writeback, we do not need to wait for the page to be on disk before returning to the caller; it's enough that the data from the dirty page be on the DMU and in the in-memory ZIL, just like any other write. So, if this is not a syncing write, don't add a callback to the itx, and instead just unlock the page immediately. (This is effectively the same concept used for FreeBSD in d323fbf). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17584 Closes openzfs#14290
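The policy reads roughly like this user-space sketch (an illustrative model only; in the real code the callback is attached to the itx, and the names here are stand-ins):

```c
#include <assert.h>
#include <stddef.h>

typedef void (*itx_cb_t)(void *arg);

/* Placeholder for "unlock the page once the ZIL write is on disk". */
void
noop_unlock(void *arg)
{
	(void) arg;
}

/*
 * Only a syncing write attaches a completion callback that holds the
 * page until the ZIL write is on disk; async writeback is complete
 * once the data is in the DMU and the in-memory ZIL, so no callback
 * is attached and the page can be unlocked immediately.
 */
itx_cb_t
putpage_itx_callback(int syncing, itx_cb_t unlock_cb)
{
	return (syncing ? unlock_cb : NULL);
}
```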
The structure of zfs_putpage() and its callers is tricky to follow. There's a lot more we could do to improve it, but at least now we have some description of one of the trickier bits. Writing this exposed a very subtle bug: most async pages pushed out through zpl_putpages() would go to the ZIL with commit=false, which can yield a less-efficient write policy. So this commit updates that too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17584
Currently we fail the compilation via the #error directive if `HAVE_XSAVE` isn't defined. This breaks i586 builds since we check the toolchain's SIMD support only on i686 and onward. Remove the requirement to fix the build on i586. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Attila Fülöp <[email protected]> Closes openzfs#13303 Closes openzfs#17590
Be standard-compliant by using `int main()`. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Attila Fülöp <[email protected]> Closes openzfs#13303 Closes openzfs#17590
The location of zgenhostid was changed in 0ae733c (Install zgenhostid to sbindir, 2021-01-21). We include all files within sbindir two lines earlier, which causes rpmbuild to report: File listed twice: /sbin/zgenhostid Drop the redundant entry from the %files section. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Todd Zullinger <[email protected]> Closes openzfs#17601
03987f7 (openzfs#16069) added a workaround to get the blk-mq hardware context for older kernels that don't cache it in the struct request. However, this workaround appears to be incomplete. In 4.19, the rq data context is optional. If it's not initialised, then the cached rq->cpu will be -1, and so using it to index into mq_map causes a crash. Given that upstream 4.19 is now in extended LTS and rarely seen, RHEL8's 4.18+ has long carried "modern" blk-mq support, and the cached hardware context has been available since 5.1, I'm not going to go to huge lengths to get queue selection correct for the very few people who are likely to feel it. To that end, we simply call raw_smp_processor_id() to get a valid CPU id and use that instead. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes openzfs#17597
This converts the body of a ZED slack notification from plain text to code block style to help with readability. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: René Wirnata <[email protected]> Closes openzfs#17610
Update the META file to reflect compatibility with the 6.16 kernel. Tested with 6.16.0-0-stable of Alpine Linux edge, see <https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/87929>. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Achill Gilgenast <[email protected]> Closes openzfs#17578
Chase URL change from the FreeBSD project. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Colin Percival <[email protected]> Closes openzfs#17617
Allow zstd_mempool_init() to allocate using vmem_alloc() instead of kmem_alloc() to silence the large allocation warning on Linux during module load when the system has a large number of CPUs. It's not at all clear to me that scaling the allocation size with the number of CPUs is beneficial and that should be evaluated. But for the moment this should resolve the warning without introducing any unexpected side effects. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#17620 Closes openzfs#11557
Systems with a large number of CPU cores (192+) may trigger the large allocation warning in multilist_create() on Linux. Silence the warning by converting the allocation to vmem_alloc(). On Linux this results in a call to kvmalloc(), which will allocate vmem for large allocations and kmem for small allocations. On FreeBSD both vmem_alloc and kmem_alloc internally use the same allocator, so there is no functional change. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#17616
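The size-based dispatch described in the last two commits can be pictured like this (the cutoff value here is an assumption for illustration; Linux's kvmalloc() applies its own internal policy):

```c
#include <assert.h>
#include <string.h>

#define SMALL_ALLOC_MAX (64UL * 1024)	/* illustrative cutoff only */

/*
 * Model of vmem_alloc on Linux backed by kvmalloc(): small requests
 * are served from the slab (kmem), large ones from vmalloc space
 * (vmem), avoiding the large-allocation warning for big buffers.
 */
const char *
alloc_backend(unsigned long size)
{
	return (size <= SMALL_ALLOC_MAX ? "kmem" : "vmem");
}
```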
For what it is worth, I use dkms to build the zfs modules on Ubuntu 25.04. Trying to build from this patchset, I couldn't get the modules to build with dkms on kernel 6.16 unless I reverted the objtool patches #17541 and #17456. (I did not try earlier kernels.) With those reverted, and with the addition of #17621, I am able to build for both 6.16.0 and 6.17-rc1.
@amotin looks good. Regarding "zfs rewrite", normally I'd avoid adding a new sub-command, but in this case, since it doesn't modify any of the library ABIs and the code is so well isolated, I'm okay with pulling it in. It's definitely a nice thing to make available before 2.4 is released. Can you also add to this PR these commits from my zfs-2.3.4-staging-part2 branch, which is rebased on top of your branch. Here's a breakdown of the additional commits. The #17584 changes are still pretty fresh, but as long as you and @robn don't have any concerns I think they're ready to pull in.
CI and test suite fixes:
46de04d FreeBSD 15.0 is now "PRERELEASE"
Build and packaging fixes:
41ca229 Linux 6.16 compat: META
Minor bug fixes:
0fe1036 Allow vmem_alloc backed multilists
PR #17584 performance fix for Linux async page writeback:
0c7d6e2 Linux: zfs_putpage: document (and fix!) confusing sync/commit modes
I'm fine with #17584. I've been running it locally for about a week. We should also include #17533, same areas as #17584, but we already shipped the "bad" version of that in 2.3.3, and this one is better. I'll start running this new branch later today, and I still need to check if there's anything else I want to ship.
…completes" This causes async putpages to leave the pages sbusied for a long time, which hurts concurrency. Revert for now until we have a better approach. This reverts commit 238eab7. Reported by: Ihor Antonov <[email protected]> Discussed with: Rob Norris <[email protected]> References: freebsd/freebsd-src@738a9a7 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Ported-by: Rob Norris <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17533
In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (238eab7) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17533
Added.
Included.
@satmandu Can you open an issue and share the |
When I try to compile zfs-dkms 2.3.3 with this patch (on EndeavourOS), the patch does not apply cleanly. I am retrieving the patch in the PKGBUILD with this source: Output of
If I just hit return, more missing file errors appear.
Just deleted my last two comments; I was on the wrong branch. Sorry for the confusion.
@mabod some files which are present in the git repository aren't included in the
Adding to what Brian wrote, the patch you've downloaded is against the 2.3.4 staging branch and won't apply to 2.3.3. You'll need to change the source entry in the PKGBUILD to git clone the 2.3.4 staging branch and apply the patch on top of that. See PKGBUILD(5).
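For example, the PKGBUILD source entry would change along these lines (the branch name shown is an assumption; check the actual staging branch name in the openzfs/zfs repository before using it):

```sh
# Clone the 2.3.4 staging branch instead of fetching the 2.3.3 tarball,
# then apply the PR patch on top of the checkout.
source=("zfs::git+https://github.com/openzfs/zfs.git#branch=zfs-2.3.4-staging"
        "17595.patch")
```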
Merged as: |
Most of these commits we've already merged into TrueNAS; a few others we'd like to, if there are no objections.