fast-path: structural fix — per-CPU scratch map for bpf_fib_lookup, kill MutationCtx padding#50

Merged
lunarthegrey merged 1 commit into main from v0.2.6-percpu-scratch on May 5, 2026

Conversation

@lunarthegrey
Contributor

Stop iterating on source patterns; use the right BPF idiom

The post-PR-#49 install on UniFi still trips the verifier: the dump shows three bpf-to-bpf memset subprogram calls (3 / 6 / 22 bytes), and LLVM keeps finding patterns to merge regardless of how I write the source. We need to stop fighting LLVM and use the canonical BPF design pattern.

Two structural changes

1. `bpf_fib_lookup` → per-CPU scratch map. Stack-allocated bpf_fib_lookup is what LLVM was zero-coalescing into memsets. Per-CPU array elements are pre-zeroed at map creation by the kernel and remain accessible at a stable address — fast_path writes only the input fields, the kernel helper fills smac/dmac on success, and the value persists per-CPU between calls. This is the textbook BPF pattern for scratch buffers (libbpf's CO-RE projects all do this; it shows up in cilium, sysdig, etc.).

```rust
#[map]
pub static FIB_LOOKUP_SCRATCH: PerCpuArray<bpf_fib_lookup> =
    PerCpuArray::with_max_entries(1, 0);
```
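For context, a sketch of how `fast_path` would use the scratch map under this pattern. This is not lifted from the actual source: it assumes aya-ebpf's `PerCpuArray::get_ptr_mut` API and the kernel UAPI `bpf_fib_lookup` layout, and elides packet parsing and error handling:

```rust
// Sketch only (assumed aya-ebpf API, not the real fast_path body).
// Borrow this CPU's pre-zeroed scratch element at a stable address.
let fib = unsafe {
    let ptr = FIB_LOOKUP_SCRATCH.get_ptr_mut(0).ok_or(())?;
    &mut *ptr
};

// Write only the input fields the helper reads; the element was zeroed
// by the kernel at map creation, so no stack memset is ever emitted.
fib.ifindex = ingress_ifindex;
fib.family = 2; // AF_INET for the IPv4 path
// ... fill tot_len / tos / src / dst from the parsed headers ...

let ret = unsafe {
    bpf_fib_lookup(
        ctx.as_ptr(),
        fib as *mut bpf_fib_lookup as *mut _,
        core::mem::size_of::<bpf_fib_lookup>() as i32,
        0,
    )
};
// On BPF_FIB_LKUP_RET_SUCCESS the kernel has filled fib.smac / fib.dmac.
```

Because the element persists per-CPU between calls, stale output fields from a previous lookup are harmless as long as they are only read after a success return.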

2. `MutationCtx { is_v4: u8, _pad: [u8; 3] }` → `is_v4: u32`. The padding bytes were what LLVM was alignment-completing into a 3-byte memset (visible at instruction 4410 in every dump). Collapsing to a single u32 field eliminates the padding entirely. Single store, no merge-able pattern. `finalize` reads `mctx.is_v4 != 0` which works identically for u8 and u32 low-byte semantics.
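A plain-Rust sketch of the layout equivalence. Only `ip_offset` and `is_v4` are named in this PR, so the leading field names below are hypothetical placeholders matching the stated 4+2+2+4 prefix; the point is that both layouts agree on total size and field offsets:

```rust
use std::mem::size_of;

#[repr(C)]
#[allow(dead_code)]
struct MutationCtxOld {
    dst_addr: u32,  // hypothetical name
    l4_offset: u16, // hypothetical name
    flags: u16,     // hypothetical name
    ip_offset: u32,
    is_v4: u8,
    _pad: [u8; 3], // LLVM alignment-completes this into a 3-byte memset
}

#[repr(C)]
#[allow(dead_code)]
struct MutationCtxNew {
    dst_addr: u32,  // hypothetical name
    l4_offset: u16, // hypothetical name
    flags: u16,     // hypothetical name
    ip_offset: u32,
    is_v4: u32, // single store, no trailing padding left to zero-fill
}

fn main() {
    // Same total size before and after.
    assert_eq!(size_of::<MutationCtxOld>(), 16);
    assert_eq!(size_of::<MutationCtxNew>(), 16);

    let old = MutationCtxOld {
        dst_addr: 0, l4_offset: 0, flags: 0, ip_offset: 0, is_v4: 1, _pad: [0; 3],
    };
    let new = MutationCtxNew {
        dst_addr: 0, l4_offset: 0, flags: 0, ip_offset: 0, is_v4: 1,
    };

    // Same offsets: ip_offset at byte 8, is_v4 starting at byte 12.
    let base_old = &old as *const MutationCtxOld as usize;
    let base_new = &new as *const MutationCtxNew as usize;
    assert_eq!(&old.ip_offset as *const u32 as usize - base_old, 8);
    assert_eq!(&new.ip_offset as *const u32 as usize - base_new, 8);
    assert_eq!(&old.is_v4 as *const u8 as usize - base_old, 12);
    assert_eq!(&new.is_v4 as *const u32 as usize - base_new, 12);

    // `is_v4 != 0` reads identically under u8 and u32 low-byte semantics.
    assert_eq!(old.is_v4 != 0, new.is_v4 != 0);

    println!("both layouts: 16 bytes, ip_offset at 8, is_v4 at 12");
}
```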

What this is not

Not a band-aid. Not "remove this one zero store and hope LLVM stops." Both changes are how BPF programs are supposed to be written:

  • Per-CPU maps for scratch buffers larger than ~32 bytes (avoids stack budget pressure AND zero-init issues).
  • `#[repr(C)]` structs without padding bytes (avoids alignment-completion stores).

Test plan

  • CI green.
  • Verifier dump on UniFi: zero `(85) call pc+N` instructions in fast_path or finalize.
  • `sudo bpftool prog show name fast_path` reports `jited:` non-zero.
  • `sudo packetframe feasibility --human`: all xdp.attach.ethN PASS.
  • Forwarding works (IPv4 + IPv6 + per-prefix mss-clamp).
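The verifier-dump check in the second bullet can be scripted. A sketch; the two-line dump excerpt below is fabricated to show what the pattern looks like, and on a real box you would pipe `sudo bpftool prog dump xlated name fast_path` into the grep instead:

```shell
# Fabricated dump excerpt; line 4410 shows the `(85) call pc+N` pattern
# (a bpf-to-bpf subprogram call) that a clean build must not emit.
cat > /tmp/fastpath-dump.txt <<'EOF'
4409: (b7) r1 = 0
4410: (85) call pc+122
EOF

# Count offending instructions: a clean fast_path build prints 0;
# this fabricated sample prints 1.
grep -c '(85) call pc+' /tmp/fastpath-dump.txt
```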

Wire-format note

`MutationCtx` size is unchanged (16 bytes both before and after). Field offsets through `ip_offset` are unchanged. Only the trailing 4 bytes' interpretation changes (`is_v4: u8 + _pad: [u8; 3]` → `is_v4: u32`). No userspace mirror struct exists to keep in sync.

🤖 Generated with Claude Code

…adding

Three iterations of source-level fixes (struct literal, MaybeUninit +
raw pointer writes, the no_mangle/inline shim attempt) have all failed
to eliminate LLVM's memset libcalls on this UniFi build. The verifier
dump from the post-PR-#49 install still shows three bpf-to-bpf memset
subprogram calls (3 / 6 / 22 bytes), and the user asked us to stop
iterating on band-aids and use a robust pattern.

The structural fix:

  1. Move `bpf_fib_lookup` to a per-CPU map (`FIB_LOOKUP_SCRATCH`).
     Per-CPU array elements are pre-zeroed by the kernel at map
     creation; we only ever write the input fields the kernel reads.
     This is the canonical BPF pattern for scratch buffers and
     completely sidesteps LLVM's stack zero-init optimization. No
     stack allocation, no MaybeUninit dance, no merge-able zero
     stores anywhere in the IR.

  2. Collapse `MutationCtx { is_v4: u8, _pad: [u8; 3] }` into a single
     `is_v4: u32` field. With the padding bytes gone, LLVM has no
     alignment-completion zero block to recognize and lower into a
     memset libcall. Single u32 store, no padding, modern repr.

These are textbook BPF design choices (per-CPU map for scratch,
zero-padding-free struct layouts), not workarounds for a specific
LLVM optimization.

Wire-format compatibility: MutationCtx stays at 16 bytes (4+2+2+4+1+3
before, 4+2+2+4+4 after): same total, same field offsets through
ip_offset; only the interpretation of the trailing 4 bytes changes
(u8 + 3 pad bytes -> u32, read via its low byte). `finalize` reads
`mctx.is_v4 != 0`, which works for both layouts. No userspace mirror
struct exists to keep in sync.

The new FIB_LOOKUP_SCRATCH map is added to the userspace pin list so
detach unpins it cleanly along with the other maps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lunarthegrey lunarthegrey merged commit 9f78f2e into main May 5, 2026
10 checks passed
@lunarthegrey lunarthegrey deleted the v0.2.6-percpu-scratch branch May 5, 2026 16:41