fast-path: structural fix — per-CPU scratch map for bpf_fib_lookup, kill MutationCtx padding#50
Merged
Conversation
Three iterations of source-level fixes (struct literal, MaybeUninit + raw pointer writes, the no_mangle/inline shim attempt) have all failed to eliminate LLVM's memset libcalls on this UniFi build. The verifier dump from the post-PR-#49 install still shows three bpf-to-bpf memset subprogram calls (3 / 6 / 22 bytes), and the user asked us to stop iterating on band-aids and use a robust pattern.

The structural fix:

1. Move `bpf_fib_lookup` to a per-CPU map (`FIB_LOOKUP_SCRATCH`). Per-CPU array elements are pre-zeroed by the kernel at map creation; we only ever write the input fields the kernel reads. This is the canonical BPF pattern for scratch buffers and completely sidesteps LLVM's stack zero-init optimization: no stack allocation, no MaybeUninit dance, no mergeable zero stores anywhere in the IR.
2. Collapse `MutationCtx { is_v4: u8, _pad: [u8; 3] }` into a single `is_v4: u32` field. With the padding bytes gone, LLVM has no alignment-completion zero block to recognize and lower into a memset libcall: a single u32 store, no padding, modern repr.

These are textbook BPF design choices (a per-CPU map for scratch space, zero-padding-free struct layouts), not workarounds for a specific LLVM optimization.

Wire-format compatibility: MutationCtx stays 16 bytes (4+2+2+4+1+3 before, 4+2+2+4+4 after): same total, same field offsets through ip_offset; only the trailing four bytes' interpretation changes (u8 plus padding → u32 low byte). Finalize reads `mctx.is_v4 != 0`, which works for both layouts. No userspace mirror struct exists to keep in sync.

The new FIB_LOOKUP_SCRATCH map is added to the userspace pin list so detach unpins it cleanly along with the other maps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stop iterating on source patterns; use the right BPF idiom
The post-PR-#49 install on UniFi still trips the verifier. Verifier dump still shows three bpf-to-bpf memset subprogram calls (3 / 6 / 22 bytes) — LLVM keeps finding patterns to merge regardless of how I write the source. We need to stop fighting LLVM and use the canonical BPF design pattern.
Two structural changes
1. `bpf_fib_lookup` → per-CPU scratch map. Stack-allocated bpf_fib_lookup is what LLVM was zero-coalescing into memsets. Per-CPU array elements are pre-zeroed at map creation by the kernel and remain accessible at a stable address — fast_path writes only the input fields, the kernel helper fills smac/dmac on success, and the value persists per-CPU between calls. This is the textbook BPF pattern for scratch buffers (libbpf's CO-RE projects all do this; it shows up in cilium, sysdig, etc.).
```rust
#[map]
pub static FIB_LOOKUP_SCRATCH: PerCpuArray<bpf_fib_lookup> =
PerCpuArray::with_max_entries(1, 0);
```
2. `MutationCtx { is_v4: u8, _pad: [u8; 3] }` → `is_v4: u32`. The padding bytes were what LLVM was alignment-completing into a 3-byte memset (visible at instruction 4410 in every dump). Collapsing to a single u32 field eliminates the padding entirely. Single store, no merge-able pattern. `finalize` reads `mctx.is_v4 != 0` which works identically for u8 and u32 low-byte semantics.
What this is not
Not a band-aid. Not "remove this one zero store and hope LLVM stops." Both changes are how BPF programs are supposed to be written: per-CPU maps for large scratch buffers instead of stack allocation, and struct layouts with no implicit padding.
Test plan
Wire-format note
`MutationCtx` size is unchanged (16 bytes both before and after). Field offsets through `ip_offset` are unchanged. Only the trailing 4 bytes' interpretation changes (`is_v4: u8 + _pad: [u8; 3]` → `is_v4: u32`). No userspace mirror struct exists to keep in sync.
🤖 Generated with Claude Code