fast-path: structural fix — per-CPU scratch map for bpf_fib_lookup, kill MutationCtx padding#50

Merged
lunarthegrey merged 1 commit into main from v0.2.6-percpu-scratch on May 5, 2026

Conversation

@lunarthegrey
Contributor

Stop iterating on source patterns; use the right BPF idiom

The post-PR-#49 install on UniFi still trips the verifier: the dump shows three bpf-to-bpf memset subprogram calls (3 / 6 / 22 bytes), and LLVM keeps finding patterns to merge regardless of how I write the source. We need to stop fighting LLVM and use the canonical BPF design pattern.

Two structural changes

1. `bpf_fib_lookup` → per-CPU scratch map. Stack-allocated bpf_fib_lookup is what LLVM was zero-coalescing into memsets. Per-CPU array elements are pre-zeroed at map creation by the kernel and remain accessible at a stable address — fast_path writes only the input fields, the kernel helper fills smac/dmac on success, and the value persists per-CPU between calls. This is the textbook BPF pattern for scratch buffers (libbpf's CO-RE projects all do this; it shows up in cilium, sysdig, etc.).

```rust
#[map]
pub static FIB_LOOKUP_SCRATCH: PerCpuArray<bpf_fib_lookup> =
    PerCpuArray::with_max_entries(1, 0);
```
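For context, a sketch of how `fast_path` would use the scratch map under this pattern. This is not lifted from the actual source: it assumes aya-ebpf's `PerCpuArray::get_ptr_mut` API and the kernel UAPI `bpf_fib_lookup` layout, and elides packet parsing and error handling:

```rust
// Sketch only (assumed aya-ebpf API, not the real fast_path body).
// Borrow this CPU's pre-zeroed scratch element at a stable address.
let fib = unsafe {
    let ptr = FIB_LOOKUP_SCRATCH.get_ptr_mut(0).ok_or(())?;
    &mut *ptr
};

// Write only the input fields the helper reads; the element was zeroed
// by the kernel at map creation, so no stack memset is ever emitted.
fib.ifindex = ingress_ifindex;
fib.family = 2; // AF_INET for the IPv4 path
// ... fill tot_len / tos / src / dst from the parsed headers ...

let ret = unsafe {
    bpf_fib_lookup(
        ctx.as_ptr(),
        fib as *mut bpf_fib_lookup as *mut _,
        core::mem::size_of::<bpf_fib_lookup>() as i32,
        0,
    )
};
// On BPF_FIB_LKUP_RET_SUCCESS the kernel has filled fib.smac / fib.dmac.
```

Because the element persists per-CPU between calls, stale output fields from a previous lookup are harmless as long as they are only read after a success return.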

2. `MutationCtx { is_v4: u8, _pad: [u8; 3] }` → `is_v4: u32`. The padding bytes were what LLVM was alignment-completing into a 3-byte memset (visible at instruction 4410 in every dump). Collapsing to a single u32 field eliminates the padding entirely. Single store, no merge-able pattern. `finalize` reads `mctx.is_v4 != 0` which works identically for u8 and u32 low-byte semantics.
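A plain-Rust sketch of the layout equivalence. Only `ip_offset` and `is_v4` are named in this PR, so the leading field names below are hypothetical placeholders matching the stated 4+2+2+4 prefix; the point is that both layouts agree on total size and field offsets:

```rust
use std::mem::size_of;

#[repr(C)]
#[allow(dead_code)]
struct MutationCtxOld {
    dst_addr: u32,  // hypothetical name
    l4_offset: u16, // hypothetical name
    flags: u16,     // hypothetical name
    ip_offset: u32,
    is_v4: u8,
    _pad: [u8; 3], // LLVM alignment-completes this into a 3-byte memset
}

#[repr(C)]
#[allow(dead_code)]
struct MutationCtxNew {
    dst_addr: u32,  // hypothetical name
    l4_offset: u16, // hypothetical name
    flags: u16,     // hypothetical name
    ip_offset: u32,
    is_v4: u32, // single store, no trailing padding left to zero-fill
}

fn main() {
    // Same total size before and after.
    assert_eq!(size_of::<MutationCtxOld>(), 16);
    assert_eq!(size_of::<MutationCtxNew>(), 16);

    let old = MutationCtxOld {
        dst_addr: 0, l4_offset: 0, flags: 0, ip_offset: 0, is_v4: 1, _pad: [0; 3],
    };
    let new = MutationCtxNew {
        dst_addr: 0, l4_offset: 0, flags: 0, ip_offset: 0, is_v4: 1,
    };

    // Same offsets: ip_offset at byte 8, is_v4 starting at byte 12.
    let base_old = &old as *const MutationCtxOld as usize;
    let base_new = &new as *const MutationCtxNew as usize;
    assert_eq!(&old.ip_offset as *const u32 as usize - base_old, 8);
    assert_eq!(&new.ip_offset as *const u32 as usize - base_new, 8);
    assert_eq!(&old.is_v4 as *const u8 as usize - base_old, 12);
    assert_eq!(&new.is_v4 as *const u32 as usize - base_new, 12);

    // `is_v4 != 0` reads identically under u8 and u32 low-byte semantics.
    assert_eq!(old.is_v4 != 0, new.is_v4 != 0);

    println!("both layouts: 16 bytes, ip_offset at 8, is_v4 at 12");
}
```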

What this is not

Not a band-aid. Not "remove this one zero store and hope LLVM stops." Both changes are how BPF programs are supposed to be written:

  • Per-CPU maps for scratch buffers larger than ~32 bytes (avoids stack budget pressure AND zero-init issues).
  • `#[repr(C)]` structs without padding bytes (avoids alignment-completion stores).

Test plan

  • CI green.
  • Verifier dump on UniFi: zero `(85) call pc+N` instructions in fast_path or finalize.
  • `sudo bpftool prog show name fast_path` reports `jited:` non-zero.
  • `sudo packetframe feasibility --human`: all xdp.attach.ethN PASS.
  • Forwarding works (IPv4 + IPv6 + per-prefix mss-clamp).
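The verifier-dump check in the second bullet can be scripted. A sketch; the two-line dump excerpt below is fabricated to show what the pattern looks like, and on a real box you would pipe `sudo bpftool prog dump xlated name fast_path` into the grep instead:

```shell
# Fabricated dump excerpt; line 4410 shows the `(85) call pc+N` pattern
# (a bpf-to-bpf subprogram call) that a clean build must not emit.
cat > /tmp/fastpath-dump.txt <<'EOF'
4409: (b7) r1 = 0
4410: (85) call pc+122
EOF

# Count offending instructions: a clean fast_path build prints 0;
# this fabricated sample prints 1.
grep -c '(85) call pc+' /tmp/fastpath-dump.txt
```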

Wire-format note

`MutationCtx` size is unchanged (16 bytes both before and after). Field offsets through `ip_offset` are unchanged. Only the trailing 4 bytes' interpretation changes (`is_v4: u8 + _pad: [u8; 3]` → `is_v4: u32`). No userspace mirror struct exists to keep in sync.

🤖 Generated with Claude Code

…adding

Three iterations of source-level fixes (struct literal, MaybeUninit +
raw pointer writes, the no_mangle/inline shim attempt) have all failed
to eliminate LLVM's memset libcalls on this UniFi build. The verifier
dump from the post-PR-#49 install still shows three bpf-to-bpf memset
subprogram calls (3 / 6 / 22 bytes), and the user asked us to stop
iterating on band-aids and use a robust pattern.

The structural fix:

  1. Move `bpf_fib_lookup` to a per-CPU map (`FIB_LOOKUP_SCRATCH`).
     Per-CPU array elements are pre-zeroed by the kernel at map
     creation; we only ever write the input fields the kernel reads.
     This is the canonical BPF pattern for scratch buffers and
     completely sidesteps LLVM's stack zero-init optimization. No
     stack allocation, no MaybeUninit dance, no merge-able zero
     stores anywhere in the IR.

  2. Collapse `MutationCtx { is_v4: u8, _pad: [u8; 3] }` into a single
     `is_v4: u32` field. With the padding bytes gone, LLVM has no
     alignment-completion zero block to recognize and lower into a
     memset libcall. Single u32 store, no padding, modern repr.

These are textbook BPF design choices (per-CPU map for scratch,
zero-padding-free struct layouts), not workarounds for a specific
LLVM optimization.

Wire-format compatibility: MutationCtx stays at 16 bytes (4+2+2+4+1+3
before, 4+2+2+4+4 after): same total, same field offsets through
ip_offset; only the interpretation of the trailing 4 bytes changes
(u8 + 3 pad bytes -> u32, read via its low byte). `finalize` reads
`mctx.is_v4 != 0`, which works for both layouts. No userspace mirror
struct exists to keep in sync.

The new FIB_LOOKUP_SCRATCH map is added to the userspace pin list so
detach unpins it cleanly along with the other maps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lunarthegrey lunarthegrey merged commit 9f78f2e into main May 5, 2026
10 checks passed
@lunarthegrey lunarthegrey deleted the v0.2.6-percpu-scratch branch May 5, 2026 16:41