@ihoro ihoro commented Sep 26, 2025

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Motivation and Context

The deadlock appears to be specific to mirror vdevs. It happens when one code path takes the SCL_ZIO reader lock, another path then raises scl_write_wanted by waiting for the SCL_ZIO writer lock, and the first path asks for the SCL_ZIO reader lock again.

Currently, the pruning code path has only two entry points: the zpool ddtprune CLI and the ztest_ddt_prune() test. Both call ddt_prune_unique_entries(), which begins and ends the pruning process by setting and clearing the boolean flag spa->spa_active_ddt_prune.
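The begin/end bracket described above can be pictured with a minimal sketch. It is a model only, not the code in ddt.c: `struct spa_model`, `prune_work_fn`, `prune_bracket_model()` and `observe_flag()` are hypothetical names introduced for illustration, and the only behavior taken from the description is that the flag is set while pruning runs and cleared afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one spa_t field this sketch needs. */
struct spa_model {
	bool active_ddt_prune;	/* models spa->spa_active_ddt_prune */
};

/* Hypothetical callback standing in for the pruning work itself. */
typedef int (*prune_work_fn)(struct spa_model *spa);

/*
 * Model of the begin/end bracket: the flag is true only while the
 * pruning work runs and is cleared again on both success and failure.
 */
static int
prune_bracket_model(struct spa_model *spa, prune_work_fn work)
{
	assert(!spa->active_ddt_prune);	/* the model assumes no nesting */
	spa->active_ddt_prune = true;
	int err = work(spa);
	spa->active_ddt_prune = false;
	return (err);
}

/* Example work function: returns 0 only if it saw the flag set. */
static int
observe_flag(struct spa_model *spa)
{
	return (spa->active_ddt_prune ? 0 : -1);
}
```

Any code that runs inside the bracket, however deep in the call chain, can test the flag to learn that it is executing on behalf of a prune.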

The following is an actual example of the case as it happened with ztest. The paragraph numbers give the sequence of events.

1 The ztest_ddt_prune() test, running in its own ztest thread, enqueues dsl_sync_task(prune_candidates_sync) and waits for the txg_sync_thread to complete it.

#0  0x0000e54ce24e1e9c in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0xaab8e762e5a4) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0xaab8e762e5a4) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xaab8e762e5a4, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x0000e54ce24e4b20 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xaab8e762e4d0, cond=0xaab8e762e578) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0xaab8e762e578, mutex=mutex@entry=0xaab8e762e4d0) at ./nptl/pthread_cond_wait.c:627
#5  0x0000e54ce2765df8 in cv_wait (cv=cv@entry=0xaab8e762e578, mp=mp@entry=0xaab8e762e4d0) at lib/libzpool/kernel.c:349
#6  0x0000e54ce2889750 in txg_wait_synced_flags (dp=dp@entry=0xaab8e762e080, txg=68687, flags=flags@entry=TXG_WAIT_NONE) at module/zfs/txg.c:761
#7  0x0000e54ce2889900 in txg_wait_synced (dp=dp@entry=0xaab8e762e080, txg=<optimized out>) at module/zfs/txg.c:772
#8  0x0000e54ce283be68 in dsl_sync_task_common (pool=<optimized out>, checkfunc=checkfunc@entry=0x0, syncfunc=syncfunc@entry=0xe54ce27cabf0 <prune_candidates_sync>, sigfunc=sigfunc@entry=0x0, arg=arg@entry=0xe54c0599b0c0, blocks_modified=blocks_modified@entry=0,
    space_check=space_check@entry=ZFS_SPACE_CHECK_NONE, early=early@entry=B_FALSE) at module/zfs/dsl_synctask.c:101
#9  0x0000e54ce283bec0 in dsl_sync_task (pool=<optimized out>, checkfunc=checkfunc@entry=0x0, syncfunc=syncfunc@entry=0xe54ce27cabf0 <prune_candidates_sync>, arg=arg@entry=0xe54c0599b0c0, blocks_modified=blocks_modified@entry=0, space_check=space_check@entry=ZFS_SPACE_CHECK_NONE)
    at module/zfs/dsl_synctask.c:140
#10 0x0000e54ce27cbb48 in ddt_prune_walk (spa=spa@entry=0xaab8e7507d60, cutoff=cutoff@entry=1755185730, histogram=histogram@entry=0x0) at module/zfs/ddt.c:2865
#11 0x0000e54ce27cbe00 in ddt_prune_unique_entries (spa=0xaab8e7507d60, unit=ZPOOL_DDT_PRUNE_PERCENTAGE, amount=<optimized out>) at module/zfs/ddt.c:2942
#12 0x0000aab8aa68ae14 in ztest_execute (test=test@entry=33, zi=zi@entry=0xaab8aa6c0578 <ztest_info+1056>, id=id@entry=5) at cmd/ztest.c:8345
#13 0x0000aab8aa68e018 in ztest_thread (arg=arg@entry=0x5) at cmd/ztest.c:8530
#14 0x0000e54ce276514c in zk_thread_wrapper (arg=<optimized out>) at lib/libzpool/kernel.c:91
#15 0x0000e54ce24e595c in start_thread (arg=0xe54ce270c760) at ./nptl/pthread_create.c:447
#16 0x0000e54ce254ba4c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76

2 The txg_sync_thread() runs dsl_pool_sync(), which invokes the prune_candidates_sync sync task. prune_candidates_sync() takes the SCL_ZIO reader lock before doing the actual work.

3 Another thread, in this case ztest_scrub, requests the SCL_ZIO writer lock via spa_vdev_state_enter() and waits for it. While it waits, the lock's scl_write_wanted counter is incremented.

#0  0x0000e54ce24e1e9c in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0xaab8e7509ab0) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0xaab8e7509ab0) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xaab8e7509ab0, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x0000e54ce24e4b20 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xaab8e7509a40, cond=0xaab8e7509a88) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0xaab8e7509a88, mutex=mutex@entry=0xaab8e7509a40) at ./nptl/pthread_cond_wait.c:627
#5  0x0000e54ce2765df8 in cv_wait (cv=cv@entry=0xaab8e7509a88, mp=mp@entry=0xaab8e7509a40) at lib/libzpool/kernel.c:349
#6  0x0000e54ce287dd4c in spa_config_enter_impl (spa=spa@entry=0xaab8e7507d60, locks=locks@entry=22, tag=tag@entry=0xaab8e7507d60, rw=rw@entry=1, mmp_flag=mmp_flag@entry=0) at module/zfs/spa_misc.c:538
#7  0x0000e54ce287e1e8 in spa_config_enter (spa=spa@entry=0xaab8e7507d60, locks=locks@entry=22, tag=tag@entry=0xaab8e7507d60, rw=rw@entry=1) at module/zfs/spa_misc.c:552
#8  0x0000e54ce2882748 in spa_vdev_state_enter (spa=spa@entry=0xaab8e7507d60, oplocks=oplocks@entry=0) at module/zfs/spa_misc.c:1424
#9  0x0000e54ce2838964 in dsl_scan (dp=0xaab8e762e080, func=<optimized out>, func@entry=POOL_SCAN_SCRUB, txgstart=txgstart@entry=0, txgend=txgend@entry=0) at module/zfs/dsl_scan.c:1018
#10 0x0000e54ce286a7a4 in spa_scan_range (spa=spa@entry=0xaab8e7507d60, func=func@entry=POOL_SCAN_SCRUB, txgstart=txgstart@entry=0, txgend=txgend@entry=0) at module/zfs/spa.c:9151
#11 0x0000e54ce286a8c0 in spa_scan (spa=spa@entry=0xaab8e7507d60, func=func@entry=POOL_SCAN_SCRUB) at module/zfs/spa.c:9118
#12 0x0000aab8aa68a480 in ztest_scrub_impl (spa=spa@entry=0xaab8e7507d60) at cmd/ztest.c:7412
#13 0x0000aab8aa68a560 in ztest_scrub (zd=<optimized out>, id=<optimized out>) at cmd/ztest.c:7449
#14 0x0000aab8aa68ae14 in ztest_execute (test=test@entry=19, zi=zi@entry=0xaab8aa6c03b8 <ztest_info+608>, id=id@entry=11) at cmd/ztest.c:8345
#15 0x0000aab8aa68e018 in ztest_thread (arg=arg@entry=0xb) at cmd/ztest.c:8530
#16 0x0000e54ce276514c in zk_thread_wrapper (arg=<optimized out>) at lib/libzpool/kernel.c:91
#17 0x0000e54ce24e595c in start_thread (arg=0xe54ce270c760) at ./nptl/pthread_create.c:447
#18 0x0000e54ce254ba4c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76

4 The txg_sync_thread() continues running prune_candidates_sync() and eventually reaches zio_vdev_io_start(), which tries to take the SCL_ZIO reader lock. With scl_write_wanted > 0 that request is never granted, because the ztest_scrub thread is itself waiting for the pruning process to finish and release its SCL_ZIO reader lock.

#0  0x0000e54ce24e1e9c in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0xaab8e7509ab0) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0xaab8e7509ab0) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xaab8e7509ab0, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x0000e54ce24e4b20 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xaab8e7509a40, cond=0xaab8e7509a88) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0xaab8e7509a88, mutex=mutex@entry=0xaab8e7509a40) at ./nptl/pthread_cond_wait.c:627
#5  0x0000e54ce2765df8 in cv_wait (cv=cv@entry=0xaab8e7509a88, mp=mp@entry=0xaab8e7509a40) at lib/libzpool/kernel.c:349
#6  0x0000e54ce287ddc4 in spa_config_enter_impl (spa=spa@entry=0xaab8e7507d60, locks=locks@entry=16, tag=tag@entry=0xe54c200c8010, rw=rw@entry=0, mmp_flag=mmp_flag@entry=0) at module/zfs/spa_misc.c:532
#7  0x0000e54ce287e1e8 in spa_config_enter (spa=spa@entry=0xaab8e7507d60, locks=locks@entry=16, tag=tag@entry=0xe54c200c8010, rw=rw@entry=0) at module/zfs/spa_misc.c:552
#8  0x0000e54ce2900bbc in zio_vdev_io_start (zio=0xe54c200c8010) at module/zfs/zio.c:4575
#9  0x0000e54ce28fd638 in __zio_execute (zio=0xe54c200c8010) at module/zfs/zio.c:2497
#10 zio_nowait (zio=zio@entry=0xe54c200c8010) at module/zfs/zio.c:2604
#11 0x0000e54ce27a242c in arc_read (pio=pio@entry=0xe54c201ab810, spa=0xaab8e7507d60, bp=bp@entry=0xe54c052d8a48, done=done@entry=0xe54ce27bd840 <dbuf_read_done>, private=private@entry=0xe54c201263d0, priority=priority@entry=ZIO_PRIORITY_SYNC_READ, zio_flags=<optimized out>,
    zio_flags@entry=128, arc_flags=arc_flags@entry=0xe54c052d8a24, zb=zb@entry=0xe54c052d8a28) at module/zfs/arc.c:6422
#12 0x0000e54ce27bc1ec in dbuf_read_impl (db=db@entry=0xe54c201263d0, dn=dn@entry=0xaab8e7617010, zio=zio@entry=0xe54c201ab810, flags=flags@entry=(DB_RF_CANFAIL | DMU_READ_NO_PREFETCH), dblt=dblt@entry=DLT_PARENT, bp=<optimized out>, tag=0xe54ce29b56f8 <__func__.8>)
    at module/zfs/dbuf.c:1653
#13 0x0000e54ce27bc720 in dbuf_read (db=db@entry=0xe54c201263d0, pio=0xe54c201ab810, pio@entry=0x0, flags=flags@entry=(DB_RF_CANFAIL | DMU_READ_NO_PREFETCH)) at module/zfs/dbuf.c:1850
#14 0x0000e54ce27d12d4 in dmu_buf_hold_by_dnode (dn=<optimized out>, offset=offset@entry=32768, tag=tag@entry=0x0, dbp=dbp@entry=0xe54c052d8c20, flags=flags@entry=DMU_READ_NO_PREFETCH) at module/zfs/dmu.c:232
#15 0x0000e54ce28d4848 in zap_get_leaf_byblk (zap=zap@entry=0xaab8e76176e0, blkid=1, tx=tx@entry=0x0, lt=lt@entry=0, lp=lp@entry=0xe54c052d8ce8) at module/zfs/zap.c:555
#16 0x0000e54ce28d4d88 in zap_deref_leaf (zap=0xaab8e76176e0, h=16524418667674664960, tx=tx@entry=0x0, lt=lt@entry=0, lp=lp@entry=0xe54c052d8ce8) at module/zfs/zap.c:712
#17 0x0000e54ce28d68d8 in fzap_length (zn=zn@entry=0xe54c20019360, integer_size=integer_size@entry=0xe54c052d8dc8, num_integers=num_integers@entry=0xe54c052d8dd0) at module/zfs/zap.c:1041
#18 0x0000e54ce28dee88 in zap_length_uint64 (os=os@entry=0xaab8e762ac30, zapobj=zapobj@entry=263, key=key@entry=0xe54c201ab390, key_numints=key_numints@entry=5, integer_size=integer_size@entry=0xe54c052d8dc8, num_integers=num_integers@entry=0xe54c052d8dd0)
    at module/zfs/zap_micro.c:1444
#19 0x0000e54ce27cf424 in ddt_zap_lookup (os=0xaab8e762ac30, object=263, ddk=0xe54c201ab390, phys=0xe54c201ab420, psize=72) at module/zfs/ddt_zap.c:130
#20 0x0000e54ce27ca8f4 in ddt_lookup (ddt=ddt@entry=0xaab8e77119d0, bp=bp@entry=0xe54c052d9018, verify=verify@entry=B_TRUE) at module/zfs/ddt.c:1333
#21 0x0000e54ce27cacbc in prune_candidates_sync (arg=0xe54c0599b0c0, tx=<optimized out>) at module/zfs/ddt.c:2739
#22 0x0000e54ce283bfd4 in dsl_sync_task_sync (dst=0xe54c0599afb8, tx=tx@entry=0xe54c20002960) at module/zfs/dsl_synctask.c:256
#23 0x0000e54ce282a9e0 in dsl_pool_sync (dp=dp@entry=0xaab8e762e080, txg=txg@entry=68687) at module/zfs/dsl_pool.c:878
#24 0x0000e54ce2866004 in spa_sync_iterate_to_convergence (spa=spa@entry=0xaab8e7507d60, tx=tx@entry=0xe54c20002e90) at module/zfs/spa.c:10235
#25 0x0000e54ce286ab4c in spa_sync (spa=spa@entry=0xaab8e7507d60, txg=txg@entry=68687) at module/zfs/spa.c:10548
#26 0x0000e54ce2888f60 in txg_sync_thread (arg=arg@entry=0xaab8e762e080) at module/zfs/txg.c:604
#27 0x0000e54ce276514c in zk_thread_wrapper (arg=<optimized out>) at lib/libzpool/kernel.c:91
#28 0x0000e54ce24e595c in start_thread (arg=0xe54ce270c760) at ./nptl/pthread_create.c:447
#29 0x0000e54ce254ba4c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76
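
Steps 2 to 4 can be replayed against a simplified, single-threaded model of one config lock. This is a sketch under assumptions, not the code in spa_misc.c: `struct scl_model` and the `*_try_enter()` helpers are hypothetical, and they return false where the real code would block in cv_wait(). The admission rules assumed here are the ones the description relies on: a reader is refused while a writer holds or wants the lock, and a writer is refused while any hold exists.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of one config lock (hypothetical type). */
struct scl_model {
	int readers;		/* reader holds currently granted */
	bool writer;		/* true while a writer holds the lock */
	int write_wanted;	/* writers currently waiting */
};

/* Reader admission: refused while a writer holds or waits for the lock. */
static bool
reader_try_enter(struct scl_model *scl)
{
	if (scl->writer || scl->write_wanted > 0)
		return (false);
	scl->readers++;
	return (true);
}

/* Writer admission: refused while anyone holds the lock; records the wait. */
static bool
writer_try_enter(struct scl_model *scl)
{
	if (scl->writer || scl->readers > 0) {
		scl->write_wanted++;
		return (false);
	}
	scl->writer = true;
	return (true);
}

/* Replays steps 2-4; returns true if both threads end up blocked. */
static bool
replay_deadlock(void)
{
	struct scl_model scl = { 0 };

	/* Step 2: prune_candidates_sync() takes the reader lock. */
	bool sync_has_reader = reader_try_enter(&scl);
	/* Step 3: the scrub thread wants the writer lock and must wait. */
	bool scrub_has_writer = writer_try_enter(&scl);
	/* Step 4: the sync thread asks for the reader lock a second time. */
	bool sync_reentered = reader_try_enter(&scl);

	/*
	 * The scrub thread waits for readers to reach zero, but the only
	 * reader is the sync thread, which waits for write_wanted to drop.
	 */
	return (sync_has_reader && !scrub_has_writer && !sync_reentered &&
	    scl.readers == 1 && scl.write_wanted == 1);
}
```

In the model, replay_deadlock() returns true: neither waiter can make progress without the other releasing first.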

Description

The mmp_flag argument of spa_config_enter_impl() makes a reader ignore pending writers. The fix uses it whenever the spa->spa_active_ddt_prune flag is set.
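
A rough sketch of that idea, assuming a simplified lock model rather than the real spa_config_enter_impl(): the types and function names below are hypothetical, `priority` plays the role of mmp_flag, and a false return stands for the point where the real code would block.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of one config lock (hypothetical type). */
struct scl_model {
	int readers;		/* reader holds currently granted */
	bool writer;		/* true while a writer holds the lock */
	int write_wanted;	/* writers currently waiting */
};

/* Hypothetical stand-in for the spa_t fields this sketch needs. */
struct spa_model {
	struct scl_model scl_zio;	/* models the SCL_ZIO lock */
	bool active_ddt_prune;		/* models spa->spa_active_ddt_prune */
};

/* Reader admission; 'priority' skips the pending-writer check. */
static bool
reader_try_enter(struct scl_model *scl, bool priority)
{
	if (scl->writer || (!priority && scl->write_wanted > 0))
		return (false);
	scl->readers++;
	return (true);
}

/* Model of the I/O path: ask for priority only while pruning is active. */
static bool
io_start_reader_enter(struct spa_model *spa)
{
	return (reader_try_enter(&spa->scl_zio, spa->active_ddt_prune));
}
```

With one reader already held and one writer waiting, io_start_reader_enter() is refused when the flag is clear and admitted when it is set. The trade-off in the model is that the waiting writer is delayed for as long as the flag stays set.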

Since this change gives spa_config_enter_mmp() a general-purpose use, it is proposed to rename it to spa_config_enter_priority().

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

ihoro commented Sep 26, 2025

The context is:

  • it was identified by ztest
  • according to the code it should happen only with a mirror VDEV
  • the actual cases seem to arise when a prune is running and, in parallel, something else calls spa_vdev_state_enter()
  • we have the upcoming 2.4 release and the intuition is that it should include a solution for this
  • at the same time, the proposed change is a workaround, and a proper fix presumably requires revising the existing mechanisms; the intuition here is to avoid making a non-trivial modification as a "last minute change" for the release

What could be the best fit in this context?

@behlendorf behlendorf left a comment

Thanks for the minimal targeted fix. We'll definitely want to revisit this and replace it with something cleaner, but I agree at least for now we can make this small change to resolve the deadlock.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Sep 26, 2025
Originally this was created for MMP, but now new cases are emerging
where the same mechanism is required. Hence the name's generalization.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Igor Ostapenko <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Igor Ostapenko <[email protected]>
@ihoro ihoro force-pushed the ddt-prune-scl-zio-deadlock branch from 54073e4 to ccdbf1c Compare September 29, 2025 11:00
ihoro commented Sep 29, 2025

FYI: I've split it into two commits:

  • spa_config: Rename spa_config_enter_mmp() to spa_config_enter_priority()
  • ddt prune: Add SCL_ZIO deadlock workaround

It feels like this will make it easier to get back to this in the future.

@behlendorf behlendorf requested a review from amotin September 29, 2025 23:46