Generate payload for multiple datagrams at once#609

Open
kazuho wants to merge 23 commits into master from kazuho/scatter-stream

Conversation

@kazuho (Member) commented Jun 16, 2025

Up until now, the on_send_emit callback has been invoked once for each STREAM frame being built. This has become a bottleneck, for two reasons:

  • Applications might have a high static cost for generating each payload. For example, they might be calling pread for each call to on_send_emit.
  • Running accounting and prioritization logic for each packet being built is also expensive.

To mitigate the issue, this PR refactors the quicly_send_stream function to generate STREAM frames for as many as 10 packets at once.

This PR keeps calling the on_send_emit callback that already exists, and scatters the data being read by calling memmove.

There are two alternatives that we might consider:

  • Introduce a new callback that reads the payload into a vector of vectors (i.e., like readv) matching the payload sections of the multiple STREAM frames being generated.
  • Let the application provide a pointer to a contiguous temporary buffer that holds the data to be sent, and scatter that.

It might turn out that we'd want to try these alternatives, but they require changes to the API. Therefore, as a first cut, we are trying the approach using memmove.

@kazuho force-pushed the kazuho/scatter-stream branch from 0148c52 to 423532e on June 16, 2025 07:10
@kazuho (Member, Author) commented Dec 4, 2025

Performance analysis for using memmove:

A tiny benchmark on Zen 3 (Ryzen 7 5700G) tells us that, for each copy size and method, the following numbers of clocks are needed:

    copy size and method    clocks
    1400B * 4, rep movsb    294
    1400B,     rep movsb     92
    1400B,     memmove (a)   74

Assume we are building 4 datagrams at once. If we interpret these numbers naively, it means that the copying overhead of using read and memmove is 516 clocks combined (294 + 74 * 3), while that of readv is 368 clocks (92 * 4).

However, if we convert these numbers to per-byte overhead, the difference is 0.026 clocks / byte ((516 - 368) / (1400 * 4)), which is pretty small, if not negligible.

Also, rep movsb, the instruction sequence used by the Linux kernel for read and readv, has a performance issue that is not visible in this benchmark: it becomes 30x slower if the source and destination are on different pages but the delta between their offsets from the start of the page is below 32 bytes (b); see https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515.

To paraphrase: the difference is small, and there are unknowns that make us hesitant to change the API.

note a: To emulate the use case, we measured the throughput of memmove doing backward copies with tiny distances between the destination and the source addresses.
note b: The bug report does not clarify the maximum delta for which the slowdown is observed, but my benchmarks show that it occurs when the delta is below 32 bytes.

@kazuho (Member, Author) commented Jan 14, 2026

Introduce a new callback to read payload into vectors of vectors (i.e., like readv) that match to the payload section of multiple STREAM frames being generated.

FWIW we did try this; however, it turned out to be slower, most likely because the overhead of readv doing scattered reads is greater than the cost of quicly memmove-ing the payload.

Comment thread lib/quicly.c
/* If only a STREAM frame was to be built but `on_send_emit` returned BLOCKED, we might have built zero frames. Assuming
* that it is rare to see BLOCKED, send a PADDING-only packet (TODO skip sending the packet at all) */
if (s->dst == s->dst_payload_from)
*s->dst++ = QUICLY_FRAME_TYPE_PADDING;
kazuho (Member, Author) commented:

This is ugly, though when h2o is the application it might not matter in practice, due to the low probability of on_send_emit returning BLOCKED.

Do we want to fix the TODO before merging this PR?
