
Commit 4385489

page_pool: Track DMA-mapped pages and unmap them when destroying the pool
jira LE-3526
Rebuild_History Non-Buildable kernel-6.12.0-55.20.1.el10_0
commit-author Toke Høiland-Jørgensen <[email protected]>
commit ee62ce7

When enabling DMA mapping in page_pool, pages are kept DMA mapped until they are released from the pool, to avoid the overhead of re-mapping the pages every time they are used. This causes resource leaks and/or crashes when there are pages still outstanding while the device is torn down, because page_pool will attempt an unmap through a non-existent DMA device on the subsequent page return.

To fix this, implement a simple tracking of outstanding DMA-mapped pages in page pool using an xarray. This was first suggested by Mina[0], and turns out to be fairly straightforward: We simply store pointers to pages directly in the xarray with xa_alloc() when they are first DMA mapped, and remove them from the array on unmap. Then, when a page pool is torn down, it can simply walk the xarray and unmap all pages still present there before returning, which also allows us to get rid of the get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional synchronisation is needed, as a page will only ever be unmapped once.

To avoid having to walk the entire xarray on unmap to find the page reference, we stash the ID assigned by xa_alloc() into the page structure itself, using the upper bits of the pp_magic field. This requires a couple of defines to avoid conflicting with the POISON_POINTER_DELTA define, but this is all evaluated at compile-time, so does not affect run-time performance. The bitmap calculations in this patch give the following number of bits for different architectures:

- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Stashing a value into the unused bits of pp_magic does have the effect that it can make the value stored there lie outside the unmappable range (as governed by the mmap_min_addr sysctl), for architectures that don't define ILLEGAL_POINTER_VALUE. This means that if one of the pointers that is aliased to the pp_magic field (such as page->lru.next) is dereferenced while the page is owned by page_pool, that could lead to a dereference into userspace, which is a security concern. The risk of this is mitigated by the fact that (a) we always clear pp_magic before releasing a page from page_pool, and (b) this would need a use-after-free bug for struct page, which can have many other risks since page->lru.next is used as a generic list pointer in multiple places in the kernel. As such, with this patch we take the position that this risk is negligible in practice. For more discussion, see[1].

Since all the tracking added in this patch is performed on DMA map/unmap, no additional code is needed in the fast path, meaning the performance overhead of this tracking is negligible there. A micro-benchmark shows that the total overhead of the tracking itself is about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]). Since this cost is only paid on DMA map and unmap, it seems like an acceptable cost to fix the late unmap issue. Further optimisation can narrow the cases where this cost is paid (for instance by eliding the tracking when DMA map/unmap is a no-op).

The extra memory needed to track the pages is neatly encapsulated inside xarray, which uses the 'struct xa_node' structure to track items. This structure is 576 bytes long, with slots for 64 items, meaning that a full node incurs only 9 bytes of overhead per slot it tracks (in practice, it probably won't be this efficient, but in any case it should be an acceptable overhead).

[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]

Reported-by: Yonglong Liu <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]
Fixes: ff7d6b2 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <[email protected]>
Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Jesper Dangaard Brouer <[email protected]>
Tested-by: Jesper Dangaard Brouer <[email protected]>
Tested-by: Qiuling Ren <[email protected]>
Tested-by: Yuying Ma <[email protected]>
Tested-by: Yonglong Liu <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit ee62ce7)
Signed-off-by: Jonathan Maple <[email protected]>
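The tracking pattern described above can be illustrated in isolation. Below is a minimal sketch of the idea, not the patch itself: the names (outstanding, track_map, track_unmap, track_teardown) and the 21-bit ID limit are hypothetical stand-ins for pool->dma_mapped and PP_DMA_INDEX_LIMIT.

#include <linux/xarray.h>
#include <linux/bits.h>
#include <linux/mm_types.h>

/* IDs start at 1 (XA_FLAGS_ALLOC1), so an ID of 0 can mean "not tracked". */
static DEFINE_XARRAY_ALLOC1(outstanding);
#define OUTSTANDING_LIMIT	XA_LIMIT(1, BIT(21) - 1)	/* assumed bit budget */

/* On DMA map: allocate an ID and remember the page under it. */
static int track_map(struct page *page, u32 *id, gfp_t gfp)
{
	return xa_alloc(&outstanding, id, page, OUTSTANDING_LIMIT, gfp);
}

/* On DMA unmap: only the caller that wins the cmpxchg may unmap,
 * so a page can never be unmapped twice. */
static bool track_unmap(struct page *page, u32 id)
{
	return xa_cmpxchg(&outstanding, id, page, NULL, 0) == page;
}

/* On pool teardown: release whatever is still outstanding. */
static void track_teardown(void (*release)(struct page *page))
{
	struct page *page;
	unsigned long id;

	xa_for_each(&outstanding, id, page)
		release(page);
	xa_destroy(&outstanding);
}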
Parent: 959fe9b

File tree: 5 files changed, +147 −18 lines

include/linux/mm.h

Lines changed: 42 additions & 4 deletions
@@ -4224,13 +4224,51 @@ static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
 }
 #endif /* CONFIG_MEM_ALLOC_PROFILING */
 
+/*
+ * DMA mapping IDs for page_pool
+ *
+ * When DMA-mapping a page, page_pool allocates an ID (from an xarray) and
+ * stashes it in the upper bits of page->pp_magic. We always want to be able to
+ * unambiguously identify page pool pages (using page_pool_page_is_pp()). Non-PP
+ * pages can have arbitrary kernel pointers stored in the same field as pp_magic
+ * (since it overlaps with page->lru.next), so we must ensure that we cannot
+ * mistake a valid kernel pointer with any of the values we write into this
+ * field.
+ *
+ * On architectures that set POISON_POINTER_DELTA, this is already ensured,
+ * since this value becomes part of PP_SIGNATURE; meaning we can just use the
+ * space between the PP_SIGNATURE value (without POISON_POINTER_DELTA), and the
+ * lowest bits of POISON_POINTER_DELTA. On arches where POISON_POINTER_DELTA is
+ * 0, we make sure that we leave the two topmost bits empty, as that guarantees
+ * we won't mistake a valid kernel pointer for a value we set, regardless of the
+ * VMSPLIT setting.
+ *
+ * Altogether, this means that the number of bits available is constrained by
+ * the size of an unsigned long (at the upper end, subtracting two bits per the
+ * above), and the definition of PP_SIGNATURE (with or without
+ * POISON_POINTER_DELTA).
+ */
+#define PP_DMA_INDEX_SHIFT	(1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA))
+#if POISON_POINTER_DELTA > 0
+/* PP_SIGNATURE includes POISON_POINTER_DELTA, so limit the size of the DMA
+ * index to not overlap with that if set
+ */
+#define PP_DMA_INDEX_BITS	MIN(32, __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT)
+#else
+/* Always leave out the topmost two; see above. */
+#define PP_DMA_INDEX_BITS	MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 2)
+#endif
+
+#define PP_DMA_INDEX_MASK	GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
+				PP_DMA_INDEX_SHIFT)
+
 /* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
  * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
- * the head page of compound page and bit 1 for pfmemalloc page.
- * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling
- * the pfmemalloc page.
+ * the head page of compound page and bit 1 for pfmemalloc page, as well as the
+ * bits used for the DMA index. page_is_pfmemalloc() is checked in
+ * __page_pool_put_page() to avoid recycling the pfmemalloc page.
  */
-#define PP_MAGIC_MASK	~0x3UL
+#define PP_MAGIC_MASK	~(PP_DMA_INDEX_MASK | 0x3UL)
 
 #ifdef CONFIG_PAGE_POOL
 static inline bool page_pool_page_is_pp(struct page *page)
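As a sanity check of the bit budget quoted in the commit message, the arithmetic in the macros above can be reproduced in a small userspace program. This is a sketch under the assumption that POISON_POINTER_DELTA is 0xdead000000000000 on x86-64 and 0x5deadbeef0000000 on PPC64 (per their ILLEGAL_POINTER_VALUE definitions), and 0 on a typical 32-bit arch; __builtin_clzll/__builtin_ctzll stand in for the kernel's __fls()/__ffs().

#include <stdio.h>

static unsigned int index_bits(unsigned long long poison_delta,
			       unsigned int bits_per_long)
{
	/* PP_SIGNATURE - POISON_POINTER_DELTA == 0x40, so the shift is
	 * 1 + __fls(0x40) == 7 on every architecture. */
	unsigned int shift = 1 + (63 - __builtin_clzll(0x40ULL));
	unsigned int bits;

	if (poison_delta)
		bits = __builtin_ctzll(poison_delta) - shift;	/* __ffs() */
	else
		bits = bits_per_long - shift - 2;	/* leave the two topmost bits */

	return bits > 32 ? 32 : bits;	/* MIN(32, ...) */
}

int main(void)
{
	printf("x86-64: %u bits\n", index_bits(0xdead000000000000ULL, 64)); /* 32 */
	printf("ppc64:  %u bits\n", index_bits(0x5deadbeef0000000ULL, 64)); /* 21 */
	printf("32-bit: %u bits\n", index_bits(0, 32));                     /* 23 */
	return 0;
}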

include/linux/poison.h

Lines changed: 4 additions & 0 deletions
@@ -70,6 +70,10 @@
 #define KEY_DESTROY		0xbd
 
 /********** net/core/page_pool.c **********/
+/*
+ * page_pool uses additional free bits within this value to store data, see the
+ * definition of PP_DMA_INDEX_MASK in mm.h
+ */
 #define PP_SIGNATURE		(0x40 + POISON_POINTER_DELTA)
 
 /********** net/core/skbuff.c **********/

include/net/page_pool/types.h

Lines changed: 6 additions & 0 deletions
@@ -6,6 +6,7 @@
 #include <linux/dma-direction.h>
 #include <linux/ptr_ring.h>
 #include <linux/types.h>
+#include <linux/xarray.h>
 #include <net/netmem.h>
 
 #define PP_FLAG_DMA_MAP		BIT(0) /* Should page_pool do the DMA
@@ -33,6 +34,9 @@
 #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \
 				 PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM)
 
+/* Index limit to stay within PP_DMA_INDEX_BITS for DMA indices */
+#define PP_DMA_INDEX_LIMIT	XA_LIMIT(1, BIT(PP_DMA_INDEX_BITS) - 1)
+
 /*
  * Fast allocation side cache array/stack
  *
@@ -216,6 +220,8 @@ struct page_pool {
 
 	void *mp_priv;
 
+	struct xarray dma_mapped;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;

net/core/netmem_priv.h

Lines changed: 27 additions & 1 deletion
@@ -5,7 +5,7 @@
 
 static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
 {
-	return __netmem_clear_lsb(netmem)->pp_magic;
+	return __netmem_clear_lsb(netmem)->pp_magic & ~PP_DMA_INDEX_MASK;
 }
 
 static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
@@ -15,6 +15,8 @@ static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
 
 static inline void netmem_clear_pp_magic(netmem_ref netmem)
 {
+	WARN_ON_ONCE(__netmem_clear_lsb(netmem)->pp_magic & PP_DMA_INDEX_MASK);
+
 	__netmem_clear_lsb(netmem)->pp_magic = 0;
 }
 
@@ -33,4 +35,28 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
 {
 	__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
 }
+
+static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
+{
+	unsigned long magic;
+
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return 0;
+
+	magic = __netmem_clear_lsb(netmem)->pp_magic;
+
+	return (magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
+}
+
+static inline void netmem_set_dma_index(netmem_ref netmem,
+					unsigned long id)
+{
+	unsigned long magic;
+
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return;
+
+	magic = netmem_get_pp_magic(netmem) | (id << PP_DMA_INDEX_SHIFT);
+	__netmem_clear_lsb(netmem)->pp_magic = magic;
+}
 #endif
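A quick way to see how the helpers above interact with PP_MAGIC_MASK is to run the packing round-trip in userspace. The constants below assume the x86-64 layout (shift 7, 32 index bits, POISON_POINTER_DELTA = 0xdead000000000000); they are illustrative stand-ins, not values taken from a build.

#include <assert.h>
#include <stdint.h>

#define POISON_POINTER_DELTA	0xdead000000000000ULL
#define PP_SIGNATURE		(0x40 + POISON_POINTER_DELTA)
#define PP_DMA_INDEX_SHIFT	7
#define PP_DMA_INDEX_MASK	(((1ULL << 32) - 1) << PP_DMA_INDEX_SHIFT)
#define PP_MAGIC_MASK		(~(PP_DMA_INDEX_MASK | 0x3ULL))

int main(void)
{
	uint64_t pp_magic = PP_SIGNATURE;	/* page owned by page_pool */
	uint64_t id = 12345;

	/* set_dma_index: OR the ID into the reserved bit range */
	pp_magic |= id << PP_DMA_INDEX_SHIFT;

	/* get_dma_index: mask and shift it back out */
	assert(((pp_magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT) == id);

	/* the page is still recognisable as a page_pool page */
	assert((pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE);
	return 0;
}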

net/core/page_pool.c

Lines changed: 68 additions & 13 deletions
@@ -272,8 +272,7 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
+	xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
 
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
@@ -311,9 +310,7 @@ static int page_pool_init(struct page_pool *pool,
 static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
-
-	if (pool->dma_map)
-		put_device(pool->p.dev);
+	xa_destroy(&pool->dma_mapped);
 
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
@@ -454,13 +451,21 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
 			      netmem_ref netmem,
 			      u32 dma_sync_size)
 {
-	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
-		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev)) {
+		rcu_read_lock();
+		/* re-check under rcu_read_lock() to sync with page_pool_scrub() */
+		if (pool->dma_sync)
+			__page_pool_dma_sync_for_device(pool, netmem,
+							dma_sync_size);
+		rcu_read_unlock();
+	}
 }
 
-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
 {
 	dma_addr_t dma;
+	int err;
+	u32 id;
 
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
@@ -474,15 +479,30 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;
 
-	if (page_pool_set_dma_addr_netmem(netmem, dma))
+	if (page_pool_set_dma_addr_netmem(netmem, dma)) {
+		WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
 		goto unmap_failed;
+	}
+
+	if (in_softirq())
+		err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
+			       PP_DMA_INDEX_LIMIT, gfp);
+	else
+		err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
+				  PP_DMA_INDEX_LIMIT, gfp);
+	if (err) {
+		WARN_ONCE(err != -ENOMEM, "couldn't track DMA mapping, please report to netdev@");
+		goto unset_failed;
+	}
 
+	netmem_set_dma_index(netmem, id);
 	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
 
 	return true;
 
+unset_failed:
+	page_pool_set_dma_addr_netmem(netmem, 0);
 unmap_failed:
-	WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
 	dma_unmap_page_attrs(pool->p.dev, dma,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
@@ -499,7 +519,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;
 
-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
 		put_page(page);
 		return NULL;
 	}
@@ -546,7 +566,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		netmem = pool->alloc.cache[i];
-		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) {
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
@@ -648,6 +668,8 @@ void page_pool_clear_pp_info(netmem_ref netmem)
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							  netmem_ref netmem)
 {
+	struct page *old, *page = netmem_to_page(netmem);
+	unsigned long id;
 	dma_addr_t dma;
 
 	if (!pool->dma_map)
@@ -656,13 +678,25 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 		 */
 		return;
 
+	id = netmem_get_dma_index(netmem);
+	if (!id)
+		return;
+
+	if (in_softirq())
+		old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0);
+	else
+		old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0);
+	if (old != page)
+		return;
+
 	dma = page_pool_get_dma_addr_netmem(netmem);
 
 	/* When page is unmapped, it cannot be returned to our pool */
 	dma_unmap_page_attrs(pool->p.dev, dma,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
 	page_pool_set_dma_addr_netmem(netmem, 0);
+	netmem_set_dma_index(netmem, 0);
 }
 
 /* Disconnects a page (from a page_pool). API users can have a need
@@ -1037,8 +1071,29 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 
 static void page_pool_scrub(struct page_pool *pool)
 {
+	unsigned long id;
+	void *ptr;
+
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	if (!pool->destroy_cnt++ && pool->dma_map) {
+		if (pool->dma_sync) {
+			/* Disable page_pool_dma_sync_for_device() */
+			pool->dma_sync = false;
+
+			/* Make sure all concurrent returns that may see the old
+			 * value of dma_sync (and thus perform a sync) have
+			 * finished before doing the unmapping below. Skip the
+			 * wait if the device doesn't actually need syncing, or
+			 * if there are no outstanding mapped pages.
+			 */
+			if (dma_dev_need_sync(pool->p.dev) &&
			    !xa_empty(&pool->dma_mapped))
+				synchronize_net();
+		}
+
+		xa_for_each(&pool->dma_mapped, id, ptr)
+			__page_pool_release_page_dma(pool, page_to_netmem(ptr));
+	}
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.
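The scrub hunk above depends on an ordering guarantee: concurrent page returns test pool->dma_sync inside an RCU read-side section, so clearing the flag and then waiting for a grace period ensures no return is still syncing a page when the teardown path unmaps it. A generic sketch of that pattern follows; the names are hypothetical, synchronize_rcu() stands in for synchronize_net(), and READ_ONCE()/WRITE_ONCE() are used because the stand-in flag here is a plain bool.

#include <linux/rcupdate.h>
#include <linux/dma-mapping.h>

static bool sync_enabled = true;	/* stand-in for pool->dma_sync */

/* Hot path: the flag is tested inside an RCU read-side critical section. */
static void maybe_sync_for_device(struct device *dev, dma_addr_t addr,
				  size_t len, enum dma_data_direction dir)
{
	rcu_read_lock();
	if (READ_ONCE(sync_enabled))
		dma_sync_single_for_device(dev, addr, len, dir);
	rcu_read_unlock();
}

/* Teardown: clear the flag, then wait until every reader that might have
 * seen the old value has left its critical section before unmapping. */
static void quiesce_before_unmap(void)
{
	WRITE_ONCE(sync_enabled, false);
	synchronize_rcu();	/* no maybe_sync_for_device() still in flight */
	/* ... now walk the tracking structure and dma_unmap the rest ... */
}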
