Be more careful with locking db.db_mtx #17418
I've already reviewed this internally, and as the PR description states, we've had a good experience running with this patch for the last couple of months.
As I see it, in most cases (I've spotted only one that differs) when you are taking
FWIW, as we're discussing here, I even think - after all the staring at the code - that the locking itself is actually fine; it seems to be a result of optimizations, precisely because things don't need to be overlocked if correctness is guaranteed via other logical dependencies. I think I have actually nailed where the problem is, but @asomers says he can't try it :)
That's because of this comment from @pcd1193182: "So the subtlety here is that the value of the db.db_data and db_buf fields are, I believe, still protected by the db_mtx plus the db_holds refcount. The contents of the buffers are protected by the db_rwlock." So many places need both
I'm sorry, I mixed it up. This is definitely needed and then there's a bug with dbuf resize. Two different things.
Lock db_mtx in some places that access db->db_data. But don't lock it in free_children, even though it does access db->db_data, because that leads to a recurse-on-non-recursive panic. Lock db_rwlock in some places that access db->db.db_data's contents. Closes openzfs#16626 Sponsored by: ConnectWise Signed-off-by: Alan Somers <[email protected]>
@asomers Are you still awaiting reviewers on this? I've been running with the changes from this PR without any issues for a while now. It would be nice to get in all the "prevents corruption" PRs before 2.4.0.
Does this apply to 2.2.8 also?
OK. I went through all this, and I believe most of the locking is not needed; see below. I've left only a few uncommented.
@@ -1193,7 +1193,8 @@ dbuf_verify(dmu_buf_impl_t *db)
if ((db->db_blkptr == NULL || BP_IS_HOLE(db->db_blkptr)) &&
(db->db_buf == NULL || db->db_buf->b_data) &&
db->db.db_data && db->db_blkid != DMU_BONUS_BLKID &&
db->db_state != DB_FILL && (dn == NULL || !dn->dn_free_txg)) {
db->db_state != DB_FILL && (dn == NULL || !dn->dn_free_txg) &&
RW_LOCK_HELD(&db->db_rwlock)) {
I don't believe this condition is needed (or even correct). If the db_dirtycnt below is zero (and it should be protected by db_mtx), then the buffer must be empty.
memcpy(dr->dt.dl.dr_data, db->db.db_data, bonuslen);
rw_exit(&db->db_rwlock);
This is not about indirects or meta-dnode. db_rwlock didn't promise to protect every dbuf.
memcpy(dr->dt.dl.dr_data->b_data, db->db.db_data, size);
rw_exit(&db->db_rwlock);
Again, this is not about indirects or meta-dnode, since those are never "fixed". db_rwlock didn't promise to protect every dbuf.
memset(db->db.db_data, 0, db->db.db_size);
rw_exit(&db->db_rwlock);
Again, not indirect or meta-dnode.
*bpp = ((blkptr_t *)(*parentp)->db.db_data) +
(blkid & ((1ULL << epbs) - 1));
mutex_exit(&(*parentp)->db_mtx);
Indirect dbufs are never relocated. These pointers are constant.
rw_enter(&db->db_rwlock, RW_READER);
dn = dnode_create(os, dn_block + idx, db,
object, dnh);
rw_exit(&db->db_rwlock);
Not that I object to this, but who may dereference this before we return?
rw_enter(&db->db_rwlock, RW_READER);
dn = dnode_create(os, dn_block + idx, db,
object, dnh);
rw_exit(&db->db_rwlock);
Not that I object to this, but who may dereference this before we return?
@@ -2588,6 +2614,7 @@ dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
dbuf_rele(db, FTAG);
return (error);
}
mutex_enter(&db->db_mtx);
meta-dnode and indirect dbufs are not relocatable.
@@ -79,6 +79,7 @@ dnode_increase_indirection(dnode_t *dn, dmu_tx_t *tx)
(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED|DB_RF_HAVESTRUCT);
if (dn->dn_dbuf != NULL)
rw_enter(&dn->dn_dbuf->db_rwlock, RW_WRITER);
mutex_enter(&db->db_mtx);
meta-dnode and indirect dbufs are not relocatable. Do we need this here?
@@ -233,6 +235,7 @@ free_verify(dmu_buf_impl_t *db, uint64_t start, uint64_t end, dmu_tx_t *tx)
* future txg.
*/
mutex_enter(&child->db_mtx);
rw_enter(&child->db_rwlock, RW_READER);
As I understand it, the children are L0 here, so db_rwlock didn't promise to protect them.
Though I see your comments, @amotin, I still struggle to understand the right thing to do in general, because the locking requirements aren't well documented, nor are they enforced by either the compiler or the runtime. Here are the different descriptions I've seen. From dbuf.h:
And here's what @pcd1193182 said in #17118 👍
And later
But I don't see any list of what the various states are, nor how to tell which state a dbuf is in. @amotin added the following in that same discussion thread:
And @amotin added some more detail in this PR:
I can't confidently make any changes here without a complete and accurate description of the locking comments. What I need are:
@amotin can you please help with that? At least with the first part?
@asomers Let me rephrase the key points:
Lock db->db_mtx in some places that access db->db_data. But don't lock it in free_children, even though it does access db->db_data, because that leads to a recurse-on-non-recursive panic.
Lock db->db_rwlock in some places that access db->db.db_data's contents.
Closes #16626
Sponsored by: ConnectWise
Motivation and Context
Fixes occasional in-memory corruption which is usually manifested as a panic with a message like "blkptr XXX has invalid XXX" or "blkptr XXX has no valid DVAs". I suspect that some on-disk corruption bugs have been caused by this same root cause, too.
Description
Always lock dmu_buf_impl_t.db_mtx in places that access the value of dmu_buf_impl_t.db.db_data. And always lock dmu_buf_impl_t.db_rwlock in places that access the contents of dmu_buf_impl_t.db.db_data.
Note that free_children still violates these rules. It can't easily be fixed without causing other problems. A proper fix is left for the future.
How Has This Been Tested?
I cannot reproduce the bug on command, so I had to rely on statistics to validate the patch.