
Commit 0b4ff13

boryas authored and kdave committed
btrfs: try to search for data csums in commit root
If you run a workload with:
- a cgroup that does tons of parallel data reading, with a working set much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites, with no memory limit
(see full code listing at the bottom for a reproducer)

then what quickly occurs is:
- a large number of threads trying to read the csum tree
- a decent number of threads deleting csums while running delayed refs
- a large number of threads in direct reclaim, and thus high memory pressure

The result is that we write back the csum tree repeatedly mid-transaction, to get the extent_buffer folios back for reclaim. As a result, we repeatedly COW the csum tree for the delayed refs that are deleting csums, which means repeatedly write-locking the higher levels of the tree.

We therefore end up with an unpleasant priority inversion:
- a high degree of contention on the csum root node's (and other upper nodes') extent_buffer rwsem
- a memory-starved cgroup doing tons of reclaim on CPU
- many reader threads in the memory-starved cgroup "holding" the sem as readers, but not scheduling promptly (i.e., task __state == 0, but not running on a CPU)
- btrfs_commit_transaction stuck trying to acquire the sem as a writer (running delayed refs, deleting csums for unreferenced data extents)

This results in arbitrarily long transactions, and in turn seriously degraded performance for any cgroup using the filesystem (the victim cgroup in the script). It isn't an academic problem: we see it in production at Meta, where one cgroup over its memory limit ruins btrfs performance for the whole system, stalling critical system services that depend on btrfs syncs.

The underlying scheduling "problem" with global rwsems is thorny and apparently well known; it was discussed at LPC 2024, for example. So our main lever in the short term is to reduce contention on our various rwsems, with an eye to reducing the frequency of write locking, to avoid disabling the read-lock fast acquisition path.

Luckily, it seems likely that many reads are for old extents written many transactions ago, and for those we *can* in fact search the commit root. The commit_root_sem is only write-locked once, near the end of transaction commit, no matter how much memory pressure there is, so there is much less contention between its readers and writers.

This change detects when we are trying to read an old extent (according to the extent map generation) and wires that through bio_ctrl to the btrfs_bio, which unfortunately isn't allocated yet when we have this information. Luckily, we don't need the flag in the bio after submitting, so we can save space by setting it on bbio->bio.bi_flags and clearing it before submitting, leaving the block layer unaffected. When we look up the csums in btrfs_lookup_bio_sums, we check this condition on the btrfs_bio and do the commit root lookup accordingly.

Note that a single bio_ctrl might collect a few extent_maps into a single bio, so it is important to track the maximum generation across all the extent_maps used for each bio to make an accurate decision on whether it is valid to look in the commit root. If any extent_map was updated in the current generation, we can't use the commit root.
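Condensed into one place, the decision wired through bio_ctrl looks roughly like this (a sketch distilled from the extent_io.c hunks below, not standalone code):

    /* While adding each extent_map to a read bio, remember the newest generation seen. */
    bio_ctrl->generation = max(bio_ctrl->generation, em->generation);

    /*
     * At submit time, only a data read bio whose extents all predate the
     * running transaction may search the commit root for csums.
     */
    if (bio_ctrl->generation &&
        bio_ctrl->generation < btrfs_get_fs_generation(fs_info))
            btrfs_bio_set_csum_search_commit_root(bbio);
    else
            btrfs_bio_clear_csum_search_commit_root(bbio);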
To test and reproduce this issue, I used the following script and accompanying C program (to avoid bottlenecks in constantly forking thousands of dd processes):

====== big-read.c ======
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

#define BUF_SZ (128 * (1 << 10UL))

int read_once(int fd, size_t sz)
{
    char buf[BUF_SZ];
    size_t rd = 0;
    int ret = 0;

    while (rd < sz) {
        ret = read(fd, buf, BUF_SZ);
        if (ret < 0) {
            if (errno == EINTR)
                continue;
            fprintf(stderr, "read failed: %d\n", errno);
            return -errno;
        } else if (ret == 0) {
            break;
        } else {
            rd += ret;
        }
    }
    return rd;
}

int read_loop(char *fname)
{
    int fd;
    struct stat st;
    size_t sz = 0;
    int ret;

    while (1) {
        fd = open(fname, O_RDONLY);
        if (fd == -1) {
            perror("open");
            return 1;
        }
        if (!sz) {
            if (!fstat(fd, &st)) {
                sz = st.st_size;
            } else {
                perror("stat");
                return 1;
            }
        }
        ret = read_once(fd, sz);
        close(fd);
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    return read_loop(argv[1]);
}

====== repro.sh ======
#!/usr/bin/env bash

SCRIPT=$(readlink -f "$0")
DIR=$(dirname "$SCRIPT")

dev=$1
mnt=$2
shift
shift

CG_ROOT=/sys/fs/cgroup
BAD_CG=$CG_ROOT/bad-nbr
GOOD_CG=$CG_ROOT/good-nbr
NR_BIGGOS=1
NR_LITTLE=10
NR_VICTIMS=32
NR_VILLAINS=512

START_SEC=$(date +%s)

_elapsed() {
    echo "elapsed: $(($(date +%s) - $START_SEC))"
}

_stats() {
    local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

    echo "================"
    date
    _elapsed
    cat $sysfs/commit_stats
    cat $BAD_CG/memory.pressure
}

_setup_cgs() {
    echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
    mkdir -p $GOOD_CG
    mkdir -p $BAD_CG
    echo max > $BAD_CG/memory.max
    # memory.high much less than the working set will cause heavy reclaim
    echo $((1 << 30)) > $BAD_CG/memory.high
    # victims get a subset of villain CPUs
    echo 0 > $GOOD_CG/cpuset.cpus
    echo 0,1,2,3 > $BAD_CG/cpuset.cpus
}

_kill_cg() {
    local cg=$1
    local attempts=0

    echo "kill cgroup $cg"
    [ -f $cg/cgroup.procs ] || return
    while true; do
        attempts=$((attempts + 1))
        echo 1 > $cg/cgroup.kill
        sleep 1
        procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
        [ $procs -eq 0 ] && break
    done
    rmdir $cg
    echo "killed cgroup $cg in $attempts attempts"
}

_biggo_vol() {
    echo $mnt/biggo_vol.$1
}

_biggo_file() {
    echo $(_biggo_vol $1)/biggo
}

_subvoled_biggos() {
    total_sz=$((10 << 30))
    per_sz=$((total_sz / $NR_VILLAINS))
    dd_count=$((per_sz >> 20))
    echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
    for i in $(seq $NR_VILLAINS)
    do
        btrfs subvol create $(_biggo_vol $i) &>/dev/null
        dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
    done
    echo "done creating subvols."
}

_setup() {
    [ -f .done ] && rm .done
    findmnt -n $dev && exit 1
    if [ -f .re-mkfs ]; then
        mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
    else
        echo "touch .re-mkfs to populate the test fs"
    fi
    mount -o noatime $dev $mnt || exit 3
    [ -f .re-mkfs ] && _subvoled_biggos
    _setup_cgs
}

_my_cleanup() {
    echo "CLEANUP!"
    _kill_cg $BAD_CG
    _kill_cg $GOOD_CG
    sleep 1
    umount $mnt
}

_bad_exit() {
    _err "Unexpected Exit! $?"
    _stats
    exit $?
}

trap _my_cleanup EXIT
trap _bad_exit INT TERM

_setup

# Use a lot of page cache reading the big file
_villain() {
    local i=$1

    echo $BASHPID > $BAD_CG/cgroup.procs
    $DIR/big-read $(_biggo_file $i)
}

# Hit del_csum a lot by overwriting lots of small new files
_victim() {
    echo $BASHPID > $GOOD_CG/cgroup.procs
    i=0;
    while (true)
    do
        local tmp=$mnt/tmp.$i

        dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
        i=$((i+1))
        [ $i -eq $NR_LITTLE ] && i=0
    done
}

_one_sync() {
    echo "sync..."
    before=$(date +%s)
    sync
    after=$(date +%s)
    echo "sync done in $((after - before))s"
    _stats
}

# sync in a loop
_sync() {
    echo "start sync loop"
    syncs=0
    echo $BASHPID > $GOOD_CG/cgroup.procs
    while true
    do
        [ -f .done ] && break
        _one_sync
        syncs=$((syncs + 1))
        [ -f .done ] && break
        sleep 10
    done
    if [ $syncs -eq 0 ]; then
        echo "do at least one sync!"
        _one_sync
    fi
    echo "sync loop done."
}

_sleep() {
    local time=${1-60}
    local now=$(date +%s)
    local end=$((now + time))

    while [ $now -lt $end ]
    do
        echo "SLEEP: $((end - now))s left. Sleep 10."
        sleep 10
        now=$(date +%s)
    done
}

echo "start $NR_VILLAINS villains"
for i in $(seq $NR_VILLAINS)
do
    _villain $i &
    disown # get rid of annoying log on kill (done via cgroup anyway)
done

echo "start $NR_VICTIMS victims"
for i in $(seq $NR_VICTIMS)
do
    _victim &
    disown
done

_sync &
SYNC_PID=$!

_sleep $1
_elapsed
touch .done
wait $SYNC_PID

echo "OK"
exit 0

Without this patch, that reproducer:
- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes

sync done in 388s
================
Wed Jul 9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274

419 hits state R, D comms big-read
 btrfs_tree_read_lock_nested
 btrfs_read_lock_root_node
 btrfs_search_slot
 btrfs_lookup_csum
 btrfs_lookup_bio_sums
 btrfs_submit_bbio

1 hits state D comms btrfs-transacti
 btrfs_tree_lock_nested
 btrfs_lock_root_node
 btrfs_search_slot
 btrfs_del_csums
 __btrfs_run_delayed_refs
 btrfs_run_delayed_refs

With the patch, the reproducer exits naturally, in 65s, completing a pretty decent 4 commits, despite the heavy memory pressure. Occasionally you can still trigger a rather long commit (a couple of seconds), but never one that is minutes long.

sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168

Some random rwalker samples showed the most common stack in reclaim, rather than the csum tree:

145 hits state R comms bash, sleep, dd, shuf
 shrink_folio_list
 shrink_lruvec
 shrink_node
 do_try_to_free_pages
 try_to_free_mem_cgroup_pages
 reclaim_high

Link: https://lpc.events/event/18/contributions/1883/
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
1 parent b60e5b4 commit 0b4ff13

5 files changed, +96 −0 lines changed

fs/btrfs/bio.c

Lines changed: 10 additions & 0 deletions

@@ -93,6 +93,8 @@ static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
 		refcount_inc(&orig_bbio->ordered->refs);
 		bbio->ordered = orig_bbio->ordered;
 	}
+	if (btrfs_bio_csum_search_commit_root(orig_bbio))
+		btrfs_bio_set_csum_search_commit_root(bbio);
 	atomic_inc(&orig_bbio->pending_ios);
 	return bbio;
 }
@@ -479,6 +481,14 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 static void btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
 			     struct btrfs_io_stripe *smap, int mirror_num)
 {
+	/*
+	 * It is important to clear the bits we used in bio->bi_flags.
+	 * Because bio->bi_flags belongs to the block layer, we should
+	 * avoid leaving stray bits set when we transfer ownership of
+	 * the bio by submitting it.
+	 */
+	btrfs_bio_clear_csum_search_commit_root(btrfs_bio(bio));
+
 	if (!bioc) {
 		/* Single mirror read/write fast path. */
 		btrfs_bio(bio)->mirror_num = mirror_num;

fs/btrfs/bio.h

Lines changed: 17 additions & 0 deletions

@@ -104,6 +104,23 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 				  btrfs_bio_end_io_t end_io, void *private);
 void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
 
+#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT (1U << (BIO_FLAG_LAST + 1))
+
+static inline void btrfs_bio_set_csum_search_commit_root(struct btrfs_bio *bbio)
+{
+	bbio->bio.bi_flags |= BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
+}
+
+static inline void btrfs_bio_clear_csum_search_commit_root(struct btrfs_bio *bbio)
+{
+	bbio->bio.bi_flags &= ~BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
+}
+
+static inline bool btrfs_bio_csum_search_commit_root(const struct btrfs_bio *bbio)
+{
+	return bbio->bio.bi_flags & BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
+}
+
 /* Submit using blkcg_punt_bio_submit. */
 #define REQ_BTRFS_CGROUP_PUNT			REQ_FS_PRIVATE
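These helpers borrow a bit above BIO_FLAG_LAST, i.e. outside the flag range the block layer itself defines for bio->bi_flags, which is why btrfs_submit_bio() above must clear it before the bio is handed over. A minimal, self-contained userspace sketch of this borrow-a-private-bit pattern (all names below are hypothetical, not kernel API):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Pretend bit numbers 0..10 belong to the "owner" (the block layer's role). */
#define OWNER_FLAG_LAST 10
/* Borrow a private bit just above the owner's flag space. */
#define PRIVATE_FLAG (1U << (OWNER_FLAG_LAST + 1))

struct fake_bio {
    uint32_t flags;
};

/* The owner must never observe our private bit. */
static void hand_off_to_owner(struct fake_bio *b)
{
    assert(!(b->flags & PRIVATE_FLAG));
}

int main(void)
{
    struct fake_bio b = { .flags = 0 };

    b.flags |= PRIVATE_FLAG;        /* mark while we still own the bio */
    if (b.flags & PRIVATE_FLAG)     /* use the bit for our own decisions */
        printf("would search the commit root\n");
    b.flags &= ~PRIVATE_FLAG;       /* clear before transferring ownership */
    hand_off_to_owner(&b);
    return 0;
}

The assert stands in for the invariant described in the btrfs_submit_bio() comment: by the time the structure crosses the ownership boundary, the borrowed bit is gone.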

fs/btrfs/compression.c

Lines changed: 2 additions & 0 deletions

@@ -602,6 +602,8 @@ void btrfs_submit_compressed_read(struct btrfs_bio *bbio)
 	cb->compressed_len = compressed_len;
 	cb->compress_type = btrfs_extent_map_compression(em);
 	cb->orig_bbio = bbio;
+	if (btrfs_bio_csum_search_commit_root(bbio))
+		btrfs_bio_set_csum_search_commit_root(&cb->bbio);
 
 	btrfs_free_extent_map(em);

fs/btrfs/extent_io.c

Lines changed: 38 additions & 0 deletions

@@ -101,6 +101,16 @@ struct btrfs_bio_ctrl {
 	enum btrfs_compression_type compress_type;
 	u32 len_to_oe_boundary;
 	blk_opf_t opf;
+	/*
+	 * For data read bios, we attempt to optimize csum lookups if the extent
+	 * generation is older than the current one. To make this possible, we
+	 * need to track the maximum generation of an extent in a bio_ctrl to
+	 * make the decision when submitting the bio.
+	 *
+	 * See the comment in btrfs_lookup_bio_sums for more detail on the
+	 * need for this optimization.
+	 */
+	u64 generation;
 	btrfs_bio_end_io_t end_io_func;
 	struct writeback_control *wbc;
 
@@ -113,6 +123,28 @@ struct btrfs_bio_ctrl {
 	struct readahead_control *ractl;
 };
 
+/*
+ * Helper to set the csum search commit root option for a bio_ctrl's bbio
+ * before submitting the bio.
+ *
+ * Only for use by submit_one_bio().
+ */
+static void bio_set_csum_search_commit_root(struct btrfs_bio_ctrl *bio_ctrl)
+{
+	struct btrfs_bio *bbio = bio_ctrl->bbio;
+
+	ASSERT(bbio);
+
+	if (!(btrfs_op(&bbio->bio) == BTRFS_MAP_READ && is_data_inode(bbio->inode)))
+		return;
+
+	if (bio_ctrl->generation &&
+	    bio_ctrl->generation < btrfs_get_fs_generation(bbio->inode->root->fs_info))
+		btrfs_bio_set_csum_search_commit_root(bio_ctrl->bbio);
+	else
+		btrfs_bio_clear_csum_search_commit_root(bio_ctrl->bbio);
+}
+
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 {
 	struct btrfs_bio *bbio = bio_ctrl->bbio;
@@ -123,6 +155,8 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 	/* Caller should ensure the bio has at least some range added */
 	ASSERT(bbio->bio.bi_iter.bi_size);
 
+	bio_set_csum_search_commit_root(bio_ctrl);
+
 	if (btrfs_op(&bbio->bio) == BTRFS_MAP_READ &&
 	    bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
 		btrfs_submit_compressed_read(bbio);
@@ -131,6 +165,8 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 
 	/* The bbio is owned by the end_io handler now */
 	bio_ctrl->bbio = NULL;
+	/* Reset the generation for the next bio */
+	bio_ctrl->generation = 0;
 }
 
 /*
@@ -1026,6 +1062,8 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		if (prev_em_start)
 			*prev_em_start = em->start;
 
+		bio_ctrl->generation = max(bio_ctrl->generation, em->generation);
+
 		btrfs_free_extent_map(em);
 		em = NULL;

fs/btrfs/file-item.c

Lines changed: 29 additions & 0 deletions

@@ -397,6 +397,33 @@ int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		path->skip_locking = 1;
 	}
 
+	/*
+	 * If we are searching for a csum of an extent from a past
+	 * transaction, we can search in the commit root and reduce
+	 * lock contention on the csum tree root node's extent buffer.
+	 *
+	 * This is important because that lock is an rwsem which gets
+	 * pretty heavy write load, unlike the commit_root_sem.
+	 *
+	 * Due to how rwsem is implemented, there is a possible
+	 * priority inversion where the readers holding the lock don't
+	 * get scheduled (say they're in a cgroup stuck in heavy reclaim)
+	 * which then blocks writers, including transaction commit. By
+	 * using a semaphore with fewer writers (only a commit switching
+	 * the roots), we make this issue less likely.
+	 *
+	 * Note that we don't rely on btrfs_search_slot to lock the
+	 * commit root csum. We call search_slot multiple times, which would
+	 * create a potential race where a commit comes in between searches
+	 * while we are not holding the commit_root_sem, and we get csums
+	 * from across transactions.
+	 */
+	if (btrfs_bio_csum_search_commit_root(bbio)) {
+		path->search_commit_root = 1;
+		path->skip_locking = 1;
+		down_read(&fs_info->commit_root_sem);
+	}
+
 	while (bio_offset < orig_len) {
 		int count;
 		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
@@ -442,6 +469,8 @@ int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		bio_offset += count * sectorsize;
 	}
 
+	if (btrfs_bio_csum_search_commit_root(bbio))
+		up_read(&fs_info->commit_root_sem);
 	return ret;
 }
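Taken together, the hunks give the flag a well-defined lifecycle over a read (a summary of this commit, in rough call order):
- btrfs_do_readpage() accumulates the maximum extent map generation into bio_ctrl->generation
- submit_one_bio() sets or clears the flag from that generation, then resets the generation for the next bio
- btrfs_submit_compressed_read() and btrfs_split_bio() propagate the flag to any derived bbio
- btrfs_lookup_bio_sums() tests the flag and, if set, searches the commit root while holding commit_root_sem for reading
- btrfs_submit_bio() clears the borrowed bit before the bio is handed to the block layer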
