
Conversation

@zjmletang
Member

Hello everyone — I’ve prepared a preliminary PR that adds support for the netkvm “mergeable” feature. This can effectively reduce memory usage; I’d appreciate it if you could take a moment to review it when you have time. Below are the details/explanations for this PR.

VirtIO Mergeable Receive Buffers Implementation Design

Overview

This document describes the implementation of VirtIO mergeable receive buffers (VIRTIO_NET_F_MRG_RXBUF) support for the Windows NetKVM driver. The implementation optimizes memory usage and reduces buffer allocation overhead for high-throughput network scenarios.

Background

Problem Statement

The traditional receive path allocates large buffers (up to 64KB) to accommodate maximum-sized packets. This approach:

  • Wastes memory when most packets are small (e.g., TCP ACKs, DNS queries)
  • Limits the number of available receive buffers due to memory constraints
  • Increases buffer pool exhaustion under heavy traffic

Solution: Mergeable Receive Buffers

VirtIO mergeable buffers allow:

  • Small fixed-size buffers (4KB pages) for all packets
  • Multi-buffer packet assembly for large packets
  • Better memory utilization and larger buffer pools
  • Reduced allocation overhead

Design Principles

1. Conditional Activation

Only enable mergeable buffers when BOTH features are present:

  • VIRTIO_NET_F_MRG_RXBUF (mergeable buffer support)
  • VIRTIO_F_ANY_LAYOUT (VirtIO 1.0+ combined header+data layout)

Rationale: This is a pragmatic engineering decision. While VIRTIO_NET_F_MRG_RXBUF and VIRTIO_F_ANY_LAYOUT are theoretically independent features per VirtIO specification, we require both for mergeable buffer support to:

  • Simplify implementation: Combined layout allows single scatter-gather entry per buffer
  • Reduce code paths: Eliminates need for separate header/data descriptor logic
  • Minimize testing matrix: Avoids testing mergeable mode with legacy split-descriptor layout
  • Target modern devices: VirtIO 1.0+ devices (released 2016+) support both features

Legacy VirtIO 0.95 devices lacking ANY_LAYOUT automatically fall back to the traditional non-mergeable path.
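
A minimal sketch of the activation gate, assuming an AckFeature-style negotiation helper (the helper name and the context field are placeholders for illustration, not necessarily the PR's exact code):

// Enable mergeable buffers only when the device offers BOTH features.
// AckFeature() and bUseMergedBuffers are assumed names.
BOOLEAN useMergeable = AckFeature(pContext, VIRTIO_NET_F_MRG_RXBUF) &&
                       AckFeature(pContext, VIRTIO_F_ANY_LAYOUT);

pContext->bUseMergedBuffers = useMergeable;
// Legacy devices lacking either feature keep the traditional RX path.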

2. Zero-Allocation Hot Path

All data structures for packet assembly are pre-allocated:

  • Inline arrays in merge context: Buffer references and physical page arrays use fixed-size inline storage (BufferSequence[17], PhysicalPages[18]) embedded in the context structure, eliminating per-packet heap allocation overhead.

  • Stack-based storage for buffer references: All temporary tracking uses stack variables or pre-allocated context members, avoiding dynamic memory management in the receive hot path.

  • Bounds enforced at compile time: Array sizes are compile-time constants (#define VIRTIO_NET_MAX_MRG_BUFS 17), not runtime variables. Fixed sizes keep the arrays inline in the context structure and let every bounds check compare against a constant limit, improving both performance and safety (see the sketch below).
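
A sketch of how the two compile-time constants relate, with a static_assert documenting the invariant (illustrative, not code from the PR):

#define VIRTIO_NET_MAX_MRG_BUFS   17  // max buffers per merged packet
#define MAX_MERGED_PHYSICAL_PAGES 18  // 2 (first buffer, aliased) + 16 extra

static_assert(MAX_MERGED_PHYSICAL_PAGES == VIRTIO_NET_MAX_MRG_BUFS + 1,
              "first buffer contributes two aliased logical pages, "
              "each additional buffer contributes one");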

3. Backward Compatibility

Maintain compatibility with existing code paths:

  • Physical page aliasing preserves ParaNdis_BindRxBufferToPacket logic: The existing MDL binding function always starts from PhysicalPages[1] (legacy design for traditional mode). By creating an alias where both PhysicalPages[0] and [1] point to the same physical memory, this function works correctly for mergeable buffers without modification, reducing regression risk.

  • Separate creation paths for mergeable vs. traditional buffers: Two independent buffer allocation functions (CreateMergeableRxDescriptor() for small 4KB buffers, CreateRxDescriptorOnInit() for large multi-page buffers) are selected based on feature flags. This separation keeps each mode's logic clean and isolated, avoiding complex conditional branches in shared code.

  • No changes to core packet processing: After assembly, both modes produce identical MDL chain structures. Upper-layer processing (checksum validation, RSS classification, hardware offload) operates on standard MDL chains regardless of buffer origin. This interface consistency ensures existing packet handling logic works unchanged for both modes.

Architecture

Key Data Structures

1. RxNetDescriptor Extensions

struct _tagRxNetDescriptor {
    // Modified fields:
    USHORT NumPages;                           // Logical page count (exposed to offload engines)
    USHORT NumOwnedPages;                      // Physical ownership (vs. logical pages)
    
    // New fields for mergeable support:
    tCompletePhysicalAddress *OriginalPhysicalPages;  // Saved for restoration
    USHORT MergedBufferCount;                  // Additional buffers (excluding this one)
    pRxNetDescriptor MergedBuffers[16];  // Inline storage (no allocation)
};

Design Notes:

  • NumPages: Semantic change - now represents logical page count for complete packet after assembly
    • Traditional mode: Physical page count = logical page count (same value)
    • Mergeable mode: Logical count > physical count (includes pages from additional buffers)
    • Must remain accurate for checksum/offload engines to process full packet correctly
  • NumOwnedPages: Prevents double-free during cleanup (only free owned pages)
    • Always equals the actual physical pages this descriptor owns
    • Mergeable mode: Always 2 for single-buffer descriptor
    • After assembly: Still 2 (doesn't change, only first buffer owns its pages)
  • OriginalPhysicalPages: Enables pointer restoration after merge assembly
  • MergedBuffers array: Avoids heap allocation; sized for the worst case of 16 additional buffers (a 17-buffer packet including this descriptor). A worked example follows below.
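
As a worked example of the counter semantics (hypothetical values following the rules above), consider a packet assembled from one first buffer plus three additional buffers:

// After AssembleMergedPacket() on 4 collected buffers:
//   NumPages          == 5   // 2 (first buffer, aliased) + 3 additional, logical
//   NumOwnedPages     == 2   // unchanged; only the first buffer's own pages
//   MergedBufferCount == 3   // additional buffers to hand back on reuse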

2. Merge Context Structure

struct _MergeBufferContext {
    pRxNetDescriptor BufferSequence[17];           // Buffer collection
    UINT32 BufferActualLengths[17];                // Received lengths
    UINT16 ExpectedBuffers;                        // From virtio header
    UINT16 CollectedBuffers;                       // Current count
    UINT32 TotalPacketLength;                      // Accumulated size
    tCompletePhysicalAddress PhysicalPages[18];  // Pre-allocated array
};

Design Notes:

  • VIRTIO_NET_MAX_MRG_BUFS = 17: Calculated from the maximum packet size, rounded up (65562 bytes / 4096 bytes per buffer ≈ 16.01, hence 17 buffers)
  • MAX_MERGED_PHYSICAL_PAGES = 18: First buffer (2 logical pages) + 16 additional buffers
  • All arrays are compile-time sized to avoid runtime allocation

Physical Page Aliasing Design

Challenge: ParaNdis_BindRxBufferToPacket always starts from PARANDIS_FIRST_RX_DATA_PAGE (index 1).

Solution: Create an alias where both PhysicalPages[0] and PhysicalPages[1] point to the same physical memory.

Traditional mode:                 Mergeable mode (aliasing):
PhysicalPages[0] → Header         PhysicalPages[0] ───┐
PhysicalPages[1] → Data page 1                        ├→ Same 4KB page
PhysicalPages[2] → Data page 2    PhysicalPages[1] ───┘

Safety: IsRegionInside() detects the alias during cleanup and skips freeing PhysicalPages[1].

Trade-off: +8 bytes per descriptor (1000 descriptors = 8KB overhead), acceptable for compatibility.
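
A minimal sketch of the aliasing setup, assuming the descriptor fields shown above and a placeholder page allocator:

// AllocateSharedPage() is a placeholder for the driver's DMA allocator.
tCompletePhysicalAddress page;
if (!AllocateSharedPage(&page))        // one real 4KB page
    return FALSE;

desc->PhysicalPages[0] = page;         // "header" slot
desc->PhysicalPages[1] = page;         // alias: same physical page

// ParaNdis_BindRxBufferToPacket starts at index 1 and therefore maps the
// correct page unmodified; IsRegionInside() detects the alias on cleanup
// so the page is freed exactly once.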

Implementation Details

Buffer Creation Path

CreateMergeableRxDescriptor()

Allocates simplified 4KB buffers for mergeable mode:

  1. Allocate single 4KB physical page
  2. Create 2-entry PhysicalPages array (aliasing design)
  3. Set both entries to point to the same physical page
  4. Create single scatter-gather entry (ANY_LAYOUT mode)
  5. Bind MDL starting from index 1 (compatibility)

Key Parameters:

  • NumPages = 2 (logical)
  • NumOwnedPages = 2 (same for single buffer)
  • BufferSGLength = 1 (combined header+data)
  • DataStartOffset = nVirtioHeaderSize
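
A condensed sketch of the creation path following the five steps above (the allocator helpers are placeholders; ParaNdis_BindRxBufferToPacket is the existing binding function):

pRxNetDescriptor CreateMergeableRxDescriptor()
{
    pRxNetDescriptor p = AllocateDescriptor();           // placeholder
    if (!p || !AllocateSharedPage(&p->PhysicalPages[0])) // step 1
        return NULL;

    p->PhysicalPages[1] = p->PhysicalPages[0];           // steps 2-3: alias
    p->NumPages = 2;                                     // logical pages
    p->NumOwnedPages = 2;                                // both alias entries
    p->BufferSGLength = 1;                               // step 4: ANY_LAYOUT
    p->DataStartOffset = nVirtioHeaderSize;              // header before data

    ParaNdis_BindRxBufferToPacket(p);                    // step 5: MDL at [1]
    return p;
}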

Packet Assembly Path

ProcessMergedBuffers()

Main entry point for mergeable packet handling:

  1. Read num_buffers from virtio header
  2. Validate range (1-17 buffers)
  3. Handle single-buffer case (fast path, no assembly)
  4. Initialize merge context
  5. Collect remaining buffers via CollectRemainingMergeBuffers()
  6. Assemble packet via AssembleMergedPacket()

Error Handling:

  • Invalid num_buffers: Drop packet, reuse first buffer
  • Collection failure: Drop packet, reuse all collected buffers
  • Assembly failure: Drop packet, reuse all collected buffers
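
A flow sketch of the coordinator matching the six steps and the error rules above (GetVirtioHeader and InitializeMergeContext are assumed helper names):

pRxNetDescriptor ProcessMergedBuffers(pRxNetDescriptor first, UINT32 firstLen)
{
    UINT16 numBuffers = GetVirtioHeader(first)->num_buffers;     // step 1

    if (numBuffers == 0 || numBuffers > VIRTIO_NET_MAX_MRG_BUFS) // step 2
    {
        ReuseReceiveBufferNoLock(first);   // invalid: drop, reuse first buffer
        return NULL;
    }
    if (numBuffers == 1)                   // step 3: fast path, no assembly
    {
        return first;
    }

    InitializeMergeContext(first, firstLen, numBuffers);         // step 4
    if (!CollectRemainingMergeBuffers())                         // step 5
    {
        ReuseCollectedBuffers();           // collection failure
        return NULL;
    }
    return AssembleMergedPacket();         // step 6 (NULL on failure)
}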

CollectRemainingMergeBuffers()

Retrieves buffers 2..N from virtqueue:

VirtIO Protocol Guarantee: the device makes all buffers of a merged packet available atomically, so a missing buffer during collection indicates a protocol violation.

Implementation:

  • Collect remaining buffers based on num_buffers from virtio header
  • Store actual received lengths for each buffer
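
A sketch of the collection loop; GetBuf is the queue accessor named in the error-handling section, while m_VirtQueue and the cast are assumptions, and the member names follow the merge context structure above:

bool CollectRemainingMergeBuffers()
{
    while (m_MergeContext.CollectedBuffers < m_MergeContext.ExpectedBuffers)
    {
        UINT len = 0;
        pRxNetDescriptor p = (pRxNetDescriptor)m_VirtQueue.GetBuf(&len);
        if (!p)
        {
            // Buffers of a merged packet must be available atomically,
            // so a NULL here is a protocol violation.
            return false;
        }
        UINT16 i = m_MergeContext.CollectedBuffers++;  // index 0 holds the first buffer
        m_MergeContext.BufferSequence[i] = p;
        m_MergeContext.BufferActualLengths[i] = len;
        m_MergeContext.TotalPacketLength += len;
    }
    return true;
}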

AssembleMergedPacket()

Combines multiple buffers into single packet:

  1. Save buffer references: Store additional buffers in MergedBuffers array
  2. Switch to inline PhysicalPages: Use pre-allocated PhysicalPages array from merge context
  3. Copy page references: First buffer (2 pages) + additional buffers (1 page each)
  4. Create MDLs: New MDLs for additional buffers covering full payload (no header offset)
  5. Update counts: NumPages (logical), MergedBufferCount (additional buffers)

Page Calculation:

totalPages = 2 (first buffer) + (CollectedBuffers - 1) (additional)
           = 1 + CollectedBuffers

MDL Creation: Additional buffers use PhysicalPages[PARANDIS_FIRST_RX_DATA_PAGE] (index 1, aliased to same page as [0] in mergeable mode, for consistency with ParaNdis_BindRxBufferToPacket).
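
A condensed sketch of assembly steps 1-5 (NdisAllocateMdl is the standard NDIS call; DataVirtualAddress and m_MiniportHandle are assumed names, and Holder is the descriptor's MDL field seen in the review snippets below):

pRxNetDescriptor AssembleMergedPacket()
{
    pRxNetDescriptor first = m_MergeContext.BufferSequence[0];
    UINT16 extra = m_MergeContext.CollectedBuffers - 1;

    // Steps 1-2: save the original page array, switch to the inline one
    first->OriginalPhysicalPages = first->PhysicalPages;
    first->PhysicalPages = m_MergeContext.PhysicalPages;

    // Step 3: first buffer keeps its two (aliased) entries
    m_MergeContext.PhysicalPages[0] = first->OriginalPhysicalPages[0];
    m_MergeContext.PhysicalPages[1] = first->OriginalPhysicalPages[1];

    PMDL tail = first->Holder;
    for (UINT16 i = 0; i < extra; i++)
    {
        pRxNetDescriptor more = m_MergeContext.BufferSequence[i + 1];
        first->MergedBuffers[i] = more;                            // step 1
        m_MergeContext.PhysicalPages[2 + i] =
            more->PhysicalPages[PARANDIS_FIRST_RX_DATA_PAGE];      // step 3

        // Step 4: new MDL over the full payload (no header offset)
        PMDL mdl = NdisAllocateMdl(m_MiniportHandle,
                                   more->DataVirtualAddress,
                                   m_MergeContext.BufferActualLengths[i + 1]);
        if (!mdl)
        {
            return NULL;   // caller reuses all collected buffers
        }
        NDIS_MDL_LINKAGE(tail) = mdl;
        tail = mdl;
    }

    // Step 5: logical page count and merged-buffer bookkeeping
    first->NumPages = 2 + extra;            // == 1 + CollectedBuffers
    first->MergedBufferCount = extra;
    return first;
}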

Buffer Reuse Path

ReuseReceiveBufferNoLock()

Enhanced to handle merged packets:

  1. Check MergedBufferCount > 0
  2. Recursively reuse all additional buffers
  3. Call DisassembleMergedPacket() to restore state
  4. Standard reuse logic
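
A sketch of the merged-packet branch; the recursion depth is bounded at one, because additional buffers always have MergedBufferCount == 0:

void ReuseReceiveBufferNoLock(pRxNetDescriptor p)
{
    if (p->MergedBufferCount > 0)                          // step 1
    {
        for (USHORT i = 0; i < p->MergedBufferCount; i++)
        {
            ReuseReceiveBufferNoLock(p->MergedBuffers[i]); // step 2
        }
        DisassembleMergedPacket(p);                        // step 3
    }
    // step 4: standard reuse logic (return descriptor to the virtqueue)
}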

DisassembleMergedPacket()

Inverse operation of AssembleMergedPacket():

  1. Free extended MDL chain (keep first buffer's original MDL)
  2. Restore PhysicalPages pointer from inline array to original
  3. Reset NumPages = 2, NumOwnedPages = 2
  4. Clear MergedBufferCount = 0

Result: Buffer returns to pristine single-buffer state for reuse.
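
A sketch mirroring the four restoration steps, using the Holder MDL field from the descriptor:

void DisassembleMergedPacket(pRxNetDescriptor p)
{
    // Step 1: unlink and free the extended MDL chain,
    // keeping the first buffer's original MDL
    PMDL chain = NDIS_MDL_LINKAGE(p->Holder);
    NDIS_MDL_LINKAGE(p->Holder) = NULL;
    while (chain)
    {
        PMDL next = NDIS_MDL_LINKAGE(chain);
        NdisFreeMdl(chain);
        chain = next;
    }

    p->PhysicalPages = p->OriginalPhysicalPages;  // step 2: restore pointer
    p->NumPages = 2;                              // step 3
    p->NumOwnedPages = 2;
    p->MergedBufferCount = 0;                     // step 4
}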

Memory Footprint

Per-Descriptor Structure Overhead

Mergeable mode adds new fields to RxNetDescriptor structure:

Additional fields compared to traditional mode:
   +8 bytes (OriginalPhysicalPages pointer)
   +2 bytes (NumOwnedPages)
   +2 bytes (MergedBufferCount)
   +128 bytes (MergedBuffers array, 16 * 8 bytes)
   = +140 bytes per descriptor

Per-Queue Context Overhead

Mergeable mode adds _MergeBufferContext to each RX queue:

BufferSequence[17]:         17 * 8  = 136 bytes
BufferActualLengths[17]:    17 * 4  = 68 bytes
ExpectedBuffers:            2 bytes
CollectedBuffers:           2 bytes
TotalPacketLength:          4 bytes
PhysicalPages[18]:          18 * 24 = 432 bytes
                            Total:    644 bytes per queue

Per-Buffer Physical Memory Allocation

Actual shared memory (DMA-capable) allocated per buffer:

Traditional mode:  Up to 18 physical pages per buffer (max ~72KB)
                   Layout: 1 header page + up to 17 data pages
                   Note: Must pre-allocate for worst-case packet size
                   
Mergeable mode:    1 physical page per buffer (4KB)
                   Layout: Single 4KB page (header + data combined)
                   Note: 2 logical pages via aliasing, but same physical page

Total Impact (Example: 4096 buffers/queue, 1 queue)

Additional metadata overhead:
  Descriptor structure: 4096 * 140 bytes = 560 KB
  Queue context:        1 * 644 bytes   = 0.6 KB
  Total overhead:                         ~561 KB

Shared memory savings:
  Traditional: 4096 * 18 pages * 4KB = ~288 MB (worst-case pre-allocation)
  Mergeable:   4096 * 1 page * 4KB   = ~16 MB
  Net savings:                          ~272 MB (94% reduction)

Conclusion: Minimal metadata overhead (~561 KB) enables significant shared memory savings (~272 MB).

Error Handling

Protocol Violations

  1. Invalid num_buffers (0 or >17):
    • Log error, drop packet
    • Reuse first buffer immediately
  2. Missing buffers (GetBuf returns NULL):
    • Log protocol violation error
    • Reuse all collected buffers
  3. Buffer overflow (>16 additional buffers):
    • Log critical error, drop packet
    • Reuse all collected buffers
    • Should never happen (pre-validated)

Resource Exhaustion

  1. MDL allocation failure:
    • Log error, abort packet assembly, return NULL
    • Caller reuses all collected buffers
    • Partial MDL chain is cleaned up automatically
    • Entire packet is dropped (ensures data integrity)
    • Rare in practice

Appendix: Code Changes Summary

Key Functions Added

  • CreateMergeableRxDescriptor(): Simplified buffer creation
  • ProcessMergedBuffers(): Main assembly coordinator
  • CollectRemainingMergeBuffers(): Buffer collection
  • AssembleMergedPacket(): Multi-buffer packet assembly
  • DisassembleMergedPacket(): State restoration for reuse
  • ReuseCollectedBuffers(): Batch buffer return
  • ProcessReceivedPacket(): Encapsulates packet analysis, filtering, and RSS processing

Key Functions Modified

  • ReuseReceiveBufferNoLock(): Added merged packet handling
  • CreateRxDescriptorOnInit(): Added path routing
  • ProcessRxRing(): Integrated mergeable path

Add NumOwnedPages to track descriptor-owned memory pages for cleanup. Add OriginalPhysicalPages to save the original page array pointer before merge operations. Add MergedBufferCount and a MergedBuffers array for inline storage of merged buffer pointers, eliminating dynamic allocation in the hot path. A maximum of 16 additional buffers is supported, derived from the VirtIO maximum packet size (65562 bytes / 4096 bytes per buffer, 17 buffers total).

Signed-off-by: Zhang JianMing <[email protected]>
Add _MergeBufferContext structure with pre-allocated arrays for mergeable packet assembly to avoid hot-path allocation. Includes buffer sequence tracking, actual lengths, and physical page storage (max 18 pages). Add method declarations: ProcessMergedBuffers for initial processing, CollectRemainingMergeBuffers for gathering additional buffers, AssembleMergedPacket for packet construction, ReuseCollectedBuffers for error cleanup, and DisassembleMergedPacket for buffer state restoration.

Signed-off-by: Zhang JianMing <[email protected]>
Implement CollectRemainingMergeBuffers to gather additional buffers from virtqueue based on num_buffers field. Implement AssembleMergedPacket to consolidate multiple buffers into single descriptor by expanding PhysicalPages array and creating MDL chain covering all payload data. Implement ReuseCollectedBuffers for error-path cleanup. Implement DisassembleMergedPacket to restore original buffer state and return merged buffers to pool. These methods enable zero-copy assembly of large packets split across multiple virtio buffers while maintaining backward compatibility with single-buffer mode.

Signed-off-by: Zhang JianMing <[email protected]>
@gemini-code-assist

Summary of Changes

Hello @zjmletang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization to the NetKVM driver by adding support for VirtIO mergeable receive buffers. This change fundamentally alters how network packets are received, moving from large, potentially wasteful buffers to smaller, 4KB page-sized buffers that can be dynamically merged to form larger packets. The primary goal is to drastically reduce memory consumption and allocation overhead, particularly in scenarios with high volumes of small packets. The implementation is carefully designed to ensure a high-performance, zero-allocation hot path for packet assembly while maintaining full backward compatibility with existing driver components and targeting modern VirtIO device capabilities.

Highlights

  • Mergeable Receive Buffers Support: Implements support for VirtIO mergeable receive buffers (VIRTIO_NET_F_MRG_RXBUF) in the Windows NetKVM driver, enabling more efficient memory usage for network packet reception.
  • Memory Optimization: Significantly reduces memory footprint by allocating small 4KB pages per buffer instead of large, fixed-size buffers (up to 64KB), especially beneficial for handling numerous small packets like TCP ACKs or DNS queries.
  • Conditional Activation: The mergeable buffer feature is activated only when both VIRTIO_NET_F_MRG_RXBUF and VIRTIO_F_ANY_LAYOUT are present, simplifying the implementation and targeting modern VirtIO 1.0+ devices.
  • Zero-Allocation Hot Path: Ensures high performance by utilizing pre-allocated inline arrays and stack-based storage for packet assembly, thereby avoiding dynamic memory allocation overhead in the critical receive hot path.
  • Backward Compatibility: Maintains compatibility with existing driver logic through techniques like physical page aliasing for legacy functions and separate buffer creation paths for mergeable and traditional modes, minimizing regression risk.
  • Enhanced Data Structures and Logic: Extends the RxNetDescriptor with new fields (NumOwnedPages, OriginalPhysicalPages, MergedBufferCount, MergedBuffers array) and introduces a _MergeBufferContext for per-queue state management. New functions are added for creating mergeable descriptors, processing, assembling, and disassembling merged packets.

@YanVugenfirer
Collaborator

[Jenkins CI]: Can one of the admins verify this patch?


@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for mergeable receive buffers, a significant feature to optimize memory usage. The implementation is well-designed, following the detailed plan in the PR description. Key aspects like pre-allocating context structures to avoid hot-path allocations, ensuring backward compatibility through physical page aliasing, and clear separation of logic for mergeable vs. traditional paths are well executed. The code is mostly clean and correct. I've found a few areas for improvement, mainly related to code clarity and removing redundant checks, which should enhance maintainability.

Comment on lines +486 to +506
PMDL pMDL = pBuffer->Holder;
USHORT mdlCount = 0;

while (pMDL && mdlCount < 1)
{
    pMDL = NDIS_MDL_LINKAGE(pMDL);
    mdlCount++;
}

while (pMDL)
{
    PMDL pNextMDL = NDIS_MDL_LINKAGE(pMDL);
    NdisFreeMdl(pMDL);
    pMDL = pNextMDL;
}

pMDL = pBuffer->Holder;
if (pMDL)
{
    NDIS_MDL_LINKAGE(pMDL) = NULL;
}


Severity: medium

The logic for freeing the MDL chain of merged buffers is correct but could be simplified for better readability and maintainability. The current implementation uses a while loop with a counter to skip the first MDL, then another while loop to free the rest, and finally a separate block to unlink the chain. This can be done more concisely by unlinking the chain first and then iterating through the part to be freed.

    if (pBuffer->Holder)
    {
        PMDL pMdlChainToFree = NDIS_MDL_LINKAGE(pBuffer->Holder);

        // Unlink the additional MDLs from the primary buffer's MDL
        NDIS_MDL_LINKAGE(pBuffer->Holder) = NULL;

        // Free the chain of additional MDLs
        while (pMdlChainToFree)
        {
            PMDL pNextMDL = NDIS_MDL_LINKAGE(pMdlChainToFree);
            NdisFreeMdl(pMdlChainToFree);
            pMdlChainToFree = pNextMDL;
        }
    }

Comment on lines +1126 to +1136
// CRITICAL: Prevent buffer overflow - inline array has limited capacity
if (additionalBuffers > MAX_MERGED_BUFFERS)
{
    DPrintf(0,
            "ERROR: Too many merged buffers %u (max: %u) - dropping packet",
            m_MergeContext.CollectedBuffers,
            MAX_MERGED_BUFFERS + 1);

    ReuseCollectedBuffers();
    return NULL;
}


Severity: medium

This check for additionalBuffers > MAX_MERGED_BUFFERS is redundant and represents unreachable code. The number of buffers is validated earlier in ProcessMergedBuffers against VIRTIO_NET_MAX_MRG_BUFS (17). Since additionalBuffers is m_MergeContext.CollectedBuffers - 1, and m_MergeContext.CollectedBuffers is at most 17, additionalBuffers can be at most 16. MAX_MERGED_BUFFERS is also 16, so this condition will never be true. Removing this dead code will improve clarity.

// Range: 0 (single buffer) to 16 (max merged packet)
// MergedBuffersInline: Array storing pointers to the 16 additional buffers
// (this descriptor itself is not stored in the array)
#define MAX_MERGED_BUFFERS 16


Severity: medium

The macro MAX_MERGED_BUFFERS is defined inside the _tagRxNetDescriptor struct. While syntactically valid, it's unconventional and harms readability. It's standard practice to define macros at the file scope, before the struct that uses them. Please move this definition outside and before the _tagRxNetDescriptor struct.

@kostyanf14
Member

ok to test

@ybendito
Collaborator

ybendito commented Nov 6, 2025

@zjmletang Thank you for the PR, I'll review it as soon as I have time, but not immediately.
