Be more careful with locking db.db_mtx #17418

Open: wants to merge 1 commit into base: master


asomers
Contributor

@asomers asomers commented Jun 3, 2025

Lock db->db_mtx in some places that access db->db_data. But don't lock it in free_children, even though it does access db->db_data, because that leads to a recurse-on-non-recursive panic.

Lock db->db_rwlock in some places that access db->db.db_data's contents.

Closes #16626
Sponsored by: ConnectWise

Motivation and Context

Fixes occasional in-memory corruption which is usually manifested as a panic with a message like "blkptr XXX has invalid XXX" or "blkptr XXX has no valid DVAs". I suspect that some on-disk corruption bugs have been caused by this same root cause, too.

Description

Always lock dmu_buf_impl_t.db_mtx in places that access the value of dmu_buf_impl_t.db.db_data. And always lock dmu_buf_impl_t.db_rwlock in places that access the contents of dmu_buf_impl_t.db.db_data.
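
The rule stated in this description can be illustrated with a minimal user-space sketch. It assumes a simplified stand-in struct (`toy_dbuf_t`, a hypothetical name) and uses POSIX `pthread_mutex_t` and `pthread_rwlock_t` in place of the kernel's lock types. It is not code from the patch, and the helper names are invented for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <string.h>

/* Hypothetical, simplified stand-in for dmu_buf_impl_t. */
typedef struct toy_dbuf {
    pthread_mutex_t  db_mtx;     /* protects the value of db_data */
    pthread_rwlock_t db_rwlock;  /* protects what db_data points to */
    void            *db_data;
    size_t           db_size;
} toy_dbuf_t;

static void
toy_dbuf_init(toy_dbuf_t *db, void *buf, size_t size)
{
    pthread_mutex_init(&db->db_mtx, NULL);
    pthread_rwlock_init(&db->db_rwlock, NULL);
    db->db_data = buf;
    db->db_size = size;
}

/* Copy up to len bytes out of the buffer while holding both locks. */
static size_t
toy_dbuf_copyout(toy_dbuf_t *db, void *dst, size_t len)
{
    size_t n;

    pthread_mutex_lock(&db->db_mtx);        /* reading the pointer value */
    pthread_rwlock_rdlock(&db->db_rwlock);  /* reading the contents */
    n = len < db->db_size ? len : db->db_size;
    memcpy(dst, db->db_data, n);
    pthread_rwlock_unlock(&db->db_rwlock);
    pthread_mutex_unlock(&db->db_mtx);
    return (n);
}
```

Here `toy_dbuf_copyout` takes the mutex before the rwlock and releases them in reverse order, so the pointer cannot change while its target is being read.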

Note that free_children still violates these rules. It can't easily be fixed without causing other problems. A proper fix is left for the future.
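
The hazard behind the free_children exception, presumably a path where the calling thread already holds the mutex, can be reproduced in user space. This sketch assumes POSIX threads and a `PTHREAD_MUTEX_ERRORCHECK` mutex, which reports a relock by the owning thread as `EDEADLK` instead of panicking as the kernel does. `relock_result` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/*
 * Lock an error-checking mutex twice from the same thread and return
 * the result of the second attempt.
 */
static int
relock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mtx;
    int err;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mtx);        /* first acquisition succeeds */
    err = pthread_mutex_lock(&mtx);  /* same thread already owns mtx */

    pthread_mutex_unlock(&mtx);      /* release the one lock we hold */
    pthread_mutex_destroy(&mtx);
    return (err);
}
```

A non-recursive kernel mutex has no equivalent soft failure, which is why simply adding the lock to a function that may be entered with it already held is not an option.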

How Has This Been Tested?

I cannot reproduce the bug on command, so I had to rely on statistics to validate the patch.

  • Since the beginning of 2025, servers running the vulnerable workload on FreeBSD 14.1 without this patch have crashed with a probability of 0.34% per server per day. The distribution of crashes fits a Poisson distribution, suggesting that each crash is random and independent. That is, a server that's already crashed once is no more likely to crash in the future than one which hasn't crashed yet.
  • Servers running the vulnerable workload on FreeBSD 14.2 with this patch have accumulated a total of 1301 days of uptime with no crashes. So I conclude with 98.8% confidence that the 14.2 upgrade combined with the patch is effective.
  • Servers running the vulnerable workload on FreeBSD 14.2 without the patch are too few to draw conclusions about. But I don't see any related changes in the diff between 14.1 and 14.2. So I think that the patch is responsible for the cessation of crashes, not the upgrade.
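
One way to arrive at a figure like 98.8%, assuming crashes follow a Poisson process at the pre-patch rate and treating the 1301 days as server-days (the symbols λ, T, and N are introduced here for illustration and are not from the original analysis):

```latex
\lambda = 0.0034\ \text{crashes per server-day}, \qquad T = 1301\ \text{server-days}

P(N = 0) = e^{-\lambda T} = e^{-0.0034 \times 1301} = e^{-4.4234} \approx 0.012

1 - P(N = 0) \approx 0.988
```

That is, if the patched servers still crashed at the old rate, there would be only about a 1.2% chance of observing zero crashes over that much uptime.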


Contributor

@alek-p alek-p left a comment

I've already reviewed this internally, and as the PR description states, we've had a good experience running with this patch for the last couple of months

@amotin
Member

amotin commented Jun 4, 2025

As I see, in most of cases (I've spotted only one different) when you are taking db_rwlock, you also take db_mtx. It makes no sense to me, unless the only few exceptions are enormously expensive or otherwise don't allow db_mtx to be taken. I feel like we need some better understanding of locking strategy. At least I do.

@snajpa
Copy link
Contributor

snajpa commented Jun 4, 2025

FWIW, as we're discussing here, I even think - after all the staring at the code - that the locking itself is actually fine, it seems to be a result of optimizations exactly because things don't need to be overlocked if it's guaranteed to be OK via other logical dependencies.

I think I have actually nailed where the problem is, but @asomers says he can't try it :)

@asomers
Contributor Author

asomers commented Jun 4, 2025

As I see, in most of cases (I've spotted only one different) when you are taking db_rwlock, you also take db_mtx. It makes no sense to me, unless the only few exceptions are enormously expensive or otherwise don't allow db_mtx to be taken. I feel like we need some better understanding of locking strategy. At least I do.

That's because of this comment from @pcd1193182: "So the subtlety here is that the value of the db.db_data and db_buf fields are, I believe, still protected by the db_mtx plus the db_holds refcount. The contents of the buffers are protected by the db_rwlock." So many places need both db_mtx and db_rwlock. Some need only the former. I don't know of any cases where code would only need the latter.

@snajpa
Contributor

snajpa commented Jun 4, 2025

I'm sorry, I mixed it up. This is definitely needed and then there's a bug with dbuf resize. Two different things.

Lock db_mtx in some places that access db->db_data.  But don't lock
it in free_children, even though it does access db->db_data, because
that leads to a recurse-on-non-recursive panic.

Lock db_rwlock in some places that access db->db.db_data's contents.

Closes	openzfs#16626
Sponsored by:	ConnectWise
Signed-off-by: Alan Somers <[email protected]>
@satmandu
Contributor

satmandu commented Aug 12, 2025

@asomers Are you still awaiting reviewers on this? I've been running with the changes from this PR without any issues for a while now. It would be nice to get in all the "prevents corruption" PRs before 2.4.0.

@satmandu satmandu mentioned this pull request Aug 12, 2025
14 tasks
@clhedrick

Does this apply to 2.2.8 also?

Member

@amotin amotin left a comment

OK. I went through all this, and I believe most of the locking is not needed -- see below. I've left only a few without comment.

@@ -1193,7 +1193,8 @@ dbuf_verify(dmu_buf_impl_t *db)
 if ((db->db_blkptr == NULL || BP_IS_HOLE(db->db_blkptr)) &&
 (db->db_buf == NULL || db->db_buf->b_data) &&
 db->db.db_data && db->db_blkid != DMU_BONUS_BLKID &&
-db->db_state != DB_FILL && (dn == NULL || !dn->dn_free_txg)) {
+db->db_state != DB_FILL && (dn == NULL || !dn->dn_free_txg) &&
+RW_LOCK_HELD(&db->db_rwlock)) {
Member

I don't believe this condition is needed (or even correct). If the db_dirtycnt below is zero (and it should be protected by db_mtx), then the buffer must be empty.

memcpy(dr->dt.dl.dr_data, db->db.db_data, bonuslen);
rw_exit(&db->db_rwlock);
Member

This is not about indirects or meta-dnode. db_rwlock didn't promise to protect every dbuf.

memcpy(dr->dt.dl.dr_data->b_data, db->db.db_data, size);
rw_exit(&db->db_rwlock);
Member

Again, this is not about indirects or meta-dnode, since those are never "fixed". db_rwlock didn't promise to protect every dbuf.

memset(db->db.db_data, 0, db->db.db_size);
rw_exit(&db->db_rwlock);
Member

Again, not indirect or meta-dnode.

*bpp = ((blkptr_t *)(*parentp)->db.db_data) +
(blkid & ((1ULL << epbs) - 1));
mutex_exit(&(*parentp)->db_mtx);
Member

Indirect dbufs are never relocated. These pointers are constant.

Comment on lines +1623 to +1626
rw_enter(&db->db_rwlock, RW_READER);
dn = dnode_create(os, dn_block + idx, db,
object, dnh);
rw_exit(&db->db_rwlock);
Member

Not that I object to this, but who may dereference this before we return?

Comment on lines +1703 to +1706
rw_enter(&db->db_rwlock, RW_READER);
dn = dnode_create(os, dn_block + idx, db,
object, dnh);
rw_exit(&db->db_rwlock);
Member

Not that I object to this, but who may dereference this before we return?

@@ -2588,6 +2614,7 @@ dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
dbuf_rele(db, FTAG);
return (error);
}
mutex_enter(&db->db_mtx);
Member

meta-dnode and indirect dbufs are not relocatable.

@@ -79,6 +79,7 @@ dnode_increase_indirection(dnode_t *dn, dmu_tx_t *tx)
(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED|DB_RF_HAVESTRUCT);
if (dn->dn_dbuf != NULL)
rw_enter(&dn->dn_dbuf->db_rwlock, RW_WRITER);
mutex_enter(&db->db_mtx);
Member

meta-dnode and indirect dbufs are not relocatable. Do we need this here?

@@ -233,6 +235,7 @@ free_verify(dmu_buf_impl_t *db, uint64_t start, uint64_t end, dmu_tx_t *tx)
* future txg.
*/
mutex_enter(&child->db_mtx);
rw_enter(&child->db_rwlock, RW_READER);
Member

As I understand, children are L0 here, so db_rwlock didn't promise to protect them.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 13, 2025
@asomers
Contributor Author

asomers commented Aug 18, 2025

Though I see your comments, @amotin , I still struggle to understand the right thing to do, generally, because the locking requirements aren't well documented, nor are they enforced either by the compiler or at runtime. Here are the different descriptions I've seen:

From dbuf.h:

db.db_data, which is protected by db_mtx
...
[db_rwlock] Protects db_buf's contents if they contain an indirect block or data block of the meta-dnode

And here's what @pcd1193182 said in #17118:

The value of the `db.db_data` and `db_buf` fields
are protected by `db_mtx` plus the `db_holds` refcount.  The contents are
protected by `db_rwlock`.  `db_mtx` is also responsible for protecting some of
the other parts of the dbuf state.

And later

dbufs have different states, and when they are in these different states, they can only be accessed in
certain ways.

But I don't see any list of what the various states are, nor how to tell which state a dbuf is in.

@amotin added the following in that same discussion thread:

db_rwlock protect content of buffers that are parent (indirect or dnode) of
some other buffer, and we need to either write or read the block pointer of the
buffer, either directly or via de-referencing the pointer of db_blkptr pointing
inside it. All the parent buffers permanently referenced so can not be evicted,
and have only one copy, so their memory should never be reallocated, so db_mtx
protection is not required in this case.

And @amotin added some more detail in this PR:

  • "If the db_dirtycnt below is zero (and it should be protected by db_mtx), then the buffer must be empty."
  • "Indirects don't relocate."
  • "meta-dnode dbufs are not relocatable"
  • "db_rwlock didn't promise to protect [L0 blocks]"

I can't confidently make any changes here without a complete and accurate description of the locking requirements. What I need are:

  • Complete and accurate documentation in dbuf.h
  • A way to enforce those requirements at runtime. Perhaps a macro that asserts that a db_buf is locked, or else doesn't need to be locked based on other data in the dmu_buf_impl, and can be called everywhere that db_buf is accessed. And a similar macro for db.db_data.
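
A minimal user-space sketch of what such a checked accessor might look like, under the rules quoted above. Everything here is hypothetical: `toy_dbuf_t`, its fields, and `TOY_DB_DATA` are invented for illustration, and a plain `db_mtx_held` flag stands in for a real held-lock check such as `MUTEX_HELD()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for dmu_buf_impl_t. */
typedef struct toy_dbuf {
    int   db_level;          /* 0 for L0 blocks, >0 for indirect blocks */
    bool  db_is_meta_dnode;  /* buffer belongs to the meta-dnode */
    bool  db_mtx_held;       /* stand-in for a held-lock check on db_mtx */
    void *db_data;
} toy_dbuf_t;

/* True if the buffer is assumed never to be relocated in memory. */
static bool
toy_dbuf_is_stable(const toy_dbuf_t *db)
{
    return (db->db_level > 0 || db->db_is_meta_dnode);
}

/* True if reading the db_data pointer is safe under the assumed rules. */
static bool
toy_dbuf_data_access_ok(const toy_dbuf_t *db)
{
    return (db->db_mtx_held || toy_dbuf_is_stable(db));
}

/* Hypothetical checked accessor: asserts the rule, then yields db_data. */
#define TOY_DB_DATA(db) \
    (assert(toy_dbuf_data_access_ok(db)), (db)->db_data)
```

Routing every read of the pointer through one accessor like this would put the rule in a single place, so a change to the documented requirements would need only one predicate to be updated.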

@amotin can you please help with that? At least with the first part?

@amotin
Member

amotin commented Aug 19, 2025

@asomers Let me rephrase the key points:

  • Indirects and L0 dnode dbufs are special in having only one data copy ever. They are always decompressed in memory, and if they need to be decrypted (only bonus parts of dnode L0 can be encrypted, indirects are only signed), then it is done in place. It means they are never relocated in memory, so we don't need db_mtx to protect their db.db_data. And as long as we hold a reference on those dbufs, they cannot be evicted, and so cannot change their state. This removes most of the db_mtx acquisitions you've added.
  • db_rwlock is designed to protect specifically indirects and L0 dnode blocks from torn writes when they are modified by sync context, but read by anything else. db_rwlock is not intended to protect any user data dbufs, modified only in open context. For those we have range locks, etc. This removes most of db_rwlock acquisitions you've added.
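
The second point can be sketched in user space as follows. This assumes POSIX `pthread_rwlock_t` in place of the kernel rwlock; `toy_blkptr_t`, `toy_indirect_t`, and the helper functions are hypothetical names, and the two-word block pointer is a deliberately tiny stand-in for the real `blkptr_t`.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a block pointer: two words that must agree. */
typedef struct toy_blkptr {
    uint64_t bp_birth;
    uint64_t bp_check;  /* always kept equal to ~bp_birth */
} toy_blkptr_t;

/* Hypothetical stand-in for an indirect dbuf; bps never moves in memory. */
typedef struct toy_indirect {
    pthread_rwlock_t db_rwlock;
    toy_blkptr_t     bps[4];
} toy_indirect_t;

static void
toy_indirect_init(toy_indirect_t *db)
{
    pthread_rwlock_init(&db->db_rwlock, NULL);
    for (int i = 0; i < 4; i++) {
        db->bps[i].bp_birth = 0;
        db->bps[i].bp_check = ~(uint64_t)0;
    }
}

/* Sync-context style update: the writer lock keeps readers out mid-update. */
static void
toy_bp_write(toy_indirect_t *db, int idx, uint64_t birth)
{
    pthread_rwlock_wrlock(&db->db_rwlock);
    db->bps[idx].bp_birth = birth;
    db->bps[idx].bp_check = ~birth;
    pthread_rwlock_unlock(&db->db_rwlock);
}

/* Reader: copies the whole block pointer while holding the reader lock. */
static toy_blkptr_t
toy_bp_read(toy_indirect_t *db, int idx)
{
    toy_blkptr_t bp;

    pthread_rwlock_rdlock(&db->db_rwlock);
    bp = db->bps[idx];
    pthread_rwlock_unlock(&db->db_rwlock);
    return (bp);
}
```

No mutex guards the array itself in this sketch, mirroring the claim that such buffers are never relocated. The rwlock only ensures that a reader never sees `bp_birth` from one update paired with `bp_check` from another.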

Labels
Status: Code Review Needed Ready for review and testing
Development

Successfully merging this pull request may close these issues.

Occasional panics with "blkptr at XXX has invalid YYY"
7 participants