
Introduce Dedicated Wire Protocol Subsystem for Message Parsing and Serialization #298

Status: Open · wants to merge 5 commits into base: main
Conversation

@niclaflamme (Contributor) commented on Aug 1, 2025

Motivation: The Current State

Currently, the responsibility for parsing and serializing PostgreSQL wire protocol messages is distributed across various parts of the application. This leads to ad-hoc handling in places like the backend and frontend logic, where we often resort to low-level byte manipulations directly on streams or buffers.

  • For example, in admin/backend.rs, we check message.code() != 'Q' to validate message types, i.e., a raw byte inspection that mixes protocol decoding with business logic.

  • Similarly, scattered checks like peeking at individual bytes (e.g., if buf[0] == b'Z') for ReadyForQuery messages appear in connection handling, making the code harder to maintain, debug, and extend.

While this approach has worked for our needs so far, it scatters protocol knowledge throughout the codebase, increasing the risk of inconsistencies, bugs in edge cases (like malformed packets), and making it tougher for new contributors to grasp the flow.
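
To make the contrast concrete, here is a minimal sketch (illustrative only, not PgDog code) of the raw byte peek described above next to a typed equivalent. The `BackendProtocolMessage` name comes from this PR; its shape here is an assumption.

```rust
// Current style: inspect the tag byte directly on the buffer.
// ReadyForQuery on the wire is: 'Z', int32 length (5), status byte.
fn is_ready_for_query_raw(buf: &[u8]) -> bool {
    !buf.is_empty() && buf[0] == b'Z'
}

// With a dedicated protocol type, the intent is carried by the enum
// instead of a magic byte comparison at the call site.
#[derive(Debug, PartialEq)]
enum BackendProtocolMessage {
    ReadyForQuery { txn_status: u8 },
    Other(u8),
}

fn classify(buf: &[u8]) -> Option<BackendProtocolMessage> {
    match *buf.first()? {
        b'Z' => Some(BackendProtocolMessage::ReadyForQuery {
            // Status byte sits after the tag and 4-byte length.
            txn_status: *buf.get(5).unwrap_or(&b'I'),
        }),
        tag => Some(BackendProtocolMessage::Other(tag)),
    }
}

fn main() {
    let msg = [b'Z', 0, 0, 0, 5, b'I'];
    assert!(is_ready_for_query_raw(&msg));
    assert!(matches!(
        classify(&msg),
        Some(BackendProtocolMessage::ReadyForQuery { txn_status: b'I' })
    ));
    println!("ok");
}
```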

We should rein in complexity where we can.

The Proposed Change: A Dedicated wire_protocol Module

I believe it's a natural evolution for PgDog to introduce a dedicated subsystem focused solely on translating raw TCP byte streams into a stream of structured wire messages (and vice versa). This PR adds a new wire_protocol module that centralizes all protocol parsing and serialization logic.

Note that this PR focuses solely on creating the module and defining the message structures—no existing code has been migrated to use these messages yet; that's planned for follow-up PRs.

Key components:

  • Unified Message Types: Introduces FrontendProtocolMessage and BackendProtocolMessage enums to represent all supported messages in a type-safe way. This builds toward a split ProtocolMessage abstraction for bidirectional handling.
  • WireSerializable Trait: A common interface for messages to implement to_bytes() and from_bytes(), ensuring consistent encoding/decoding without leaking byte-level details elsewhere.
  • Modular Structure: Submodules for frontend, backend, shared_property_types, and helpers, covering messages like Startup, Query, Bind, Authentication, etc.
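
The key components above might look roughly like this minimal sketch. The trait name and method names come from the PR description; the signatures and the Query framing shown here are my assumptions, not the actual module.

```rust
/// Common interface for wire messages (names from the PR; signatures assumed).
trait WireSerializable: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Result<Self, String>;
}

/// A frontend Query message: tag 'Q', int32 length, NUL-terminated SQL.
#[derive(Debug, PartialEq)]
struct Query {
    sql: String,
}

impl WireSerializable for Query {
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.push(b'Q');
        // The length field covers itself, the SQL bytes, and the NUL byte.
        let len = 4 + self.sql.len() as u32 + 1;
        out.extend_from_slice(&len.to_be_bytes());
        out.extend_from_slice(self.sql.as_bytes());
        out.push(0);
        out
    }

    fn from_bytes(bytes: &[u8]) -> Result<Self, String> {
        if bytes.first() != Some(&b'Q') {
            return Err("not a Query message".into());
        }
        // Skip tag + length, drop the trailing NUL.
        let body = bytes
            .get(5..bytes.len().saturating_sub(1))
            .ok_or_else(|| "truncated Query frame".to_string())?;
        let sql = String::from_utf8(body.to_vec()).map_err(|e| e.to_string())?;
        Ok(Query { sql })
    }
}

fn main() {
    let q = Query { sql: "SELECT 1".into() };
    let roundtrip = Query::from_bytes(&q.to_bytes()).unwrap();
    assert_eq!(q, roundtrip);
    println!("roundtrip ok");
}
```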

Importantly, the PostgreSQL wire protocol is a finite, well-established standard. Once this subsystem is fully implemented, it should be largely "done"—requiring zero future changes beyond occasional updates for new protocol versions or extensions.

Benefits

  • Type Safety and Readability: The Rust type system now guides us. Instead of cryptic checks like if message[7] == -1 (e.g., for null indicators in parameter values), we can work with expressive types like BindFrame { parameters: Vec<Parameter> } where Parameter::Binary or Parameter::Text clearly convey intent. This reduces errors and makes intent obvious at a glance.
  • Clear Boundaries: We now strictly separate concerns: outside the wire_protocol module, we deal only in high-level messages; inside, we handle bytes. No more half-parsed buffers floating around—this minimizes partial states and simplifies testing (e.g., unit tests for individual message roundtrips are now straightforward).
  • Maintainability and Extensibility: Centralizing protocol logic makes it easier to support new features (like additional auth methods or extended query modes) without touching unrelated parts of the app. It also positions us better as more than just a PgBouncer alternative, toward a robust, unified protocol handling system.
  • Performance: By leveraging bytes::Bytes and borrowing where possible, we avoid unnecessary allocations in hot paths, though I've erred on the side of borrowing for now (more on this below).
  • Misc: Concerns like splitting multiple ;-separated queries (e.g., a Q message carrying "SELECT 1;SELECT 2;") could be handled at this level, so the fact that we don't support double queries wouldn't leak further into the code. (Speculative.)
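
As a toy illustration of the type-safety bullet, a hypothetical BindFrame with an explicit Parameter enum replaces the magic -1 null-indicator check. The BindFrame and Parameter names appear in this PR's description; the variants shown are assumptions.

```rust
/// One Bind parameter value. On the wire, a null is signalled by a
/// length of -1; here it becomes an explicit, self-documenting variant.
#[derive(Debug)]
enum Parameter {
    Null,             // wire length -1
    Text(String),     // format code 0
    Binary(Vec<u8>),  // format code 1
}

#[derive(Debug)]
struct BindFrame {
    parameters: Vec<Parameter>,
}

// Intent is readable at the call site: no byte-offset arithmetic needed.
fn describe(p: &Parameter) -> &'static str {
    match p {
        Parameter::Null => "null",
        Parameter::Text(_) => "text",
        Parameter::Binary(_) => "binary",
    }
}

fn main() {
    let frame = BindFrame {
        parameters: vec![Parameter::Null, Parameter::Text("42".into())],
    };
    assert_eq!(describe(&frame.parameters[0]), "null");
    assert_eq!(describe(&frame.parameters[1]), "text");
    println!("{} parameters", frame.parameters.len());
}
```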

Overall, this feels like a confident step toward making PgDog more robust and developer-friendly, without overcomplicating things.

Notes and Caveats

  • Rust Newbie Considerations: I'm still building my Rust skills, so I've leaned heavily on borrowing (&'a [u8], etc.) to minimize allocations. If this causes lifetime headaches or perf issues down the line, switching to owned data (e.g., Vec<u8> for payloads) should be a straightforward refactor—happy to iterate based on feedback!
  • Scope: This PR doesn't migrate any message handlers yet; it only adds the definitions.
  • Testing: Added unit tests for roundtrip serialization/deserialization of key messages, including error cases. Technically, if both the assumption and the tests are wrong, these will falsely pass; integration tests should catch that immediately once ProtocolMessage is used in application logic.
  • Potential Drawbacks: Slightly more code in the short term, but the long-term wins in clarity outweigh that.
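
A sketch of what such roundtrip tests could look like, using a hypothetical Terminate frame (tag 'X', length 4) rather than PgDog's real test suite:

```rust
// Serialize a Terminate message: tag byte plus a length field of 4.
fn terminate_to_bytes() -> Vec<u8> {
    let mut out = vec![b'X'];
    out.extend_from_slice(&4u32.to_be_bytes());
    out
}

// Parse strictly: anything other than the exact 5-byte frame is an error.
fn terminate_from_bytes(bytes: &[u8]) -> Result<(), String> {
    match bytes {
        [b'X', 0, 0, 0, 4] => Ok(()),
        _ => Err("malformed Terminate frame".into()),
    }
}

fn main() {
    // Roundtrip: serialize, then parse the result back.
    assert!(terminate_from_bytes(&terminate_to_bytes()).is_ok());
    // Error case: a truncated frame must be rejected, not silently accepted.
    assert!(terminate_from_bytes(&[b'X', 0, 0]).is_err());
    println!("tests pass");
}
```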

Next Steps: Building the Full Pipeline

This subsystem lays the foundation for a structured protocol processing pipeline, enabling more sophisticated query routing, sharding, and interception in PgDog. Future work will integrate it into a multi-stage flow:

  • Bytes to ProtocolMessages: Introduce a parser function (e.g., parse_stream) that consumes a BytesStream (raw TCP input) and yields a stream of ProtocolMessage instances, handling fragmentation and errors.
  • ProtocolMessages to Sequences: Group messages into RequestSequence or ResponseSequence structs, accumulating until terminators like Sync or Flush. This adds stateful validation without embedding logic in handlers.
  • Sequences to Commands: Validate and label sequences into high-level Command enums (e.g., Query, Prepare), inferring intent while keeping them agnostic to execution details.
  • Commands to ExecutionPlan: Generate an ExecutionPlan for each command, specifying routing (shards/pools), modifications, and backend interactions—replacing current branching in functions like client_messages.
  • Responses and Aggregation: Mirror the flow for backends with ShardedResponse and AggregationPlan, enabling result merging, error reconciliation, and schema checks; merge-sort and stream results wherever possible.
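
The middle stages of the pipeline above could be sketched like this. All type names come from the list, but every signature is a guess, not actual PgDog code.

```rust
#[derive(Debug, Clone)]
enum ProtocolMessage {
    Query(String),
    Sync,
}

struct RequestSequence {
    messages: Vec<ProtocolMessage>,
}

#[derive(Debug)]
enum Command {
    Query(String),
}

#[derive(Debug)]
struct ExecutionPlan {
    shard: usize,
    command: Command,
}

/// Stage 2: accumulate messages until a terminator (Sync) closes a sequence.
fn group(messages: Vec<ProtocolMessage>) -> Vec<RequestSequence> {
    let mut sequences = Vec::new();
    let mut current = Vec::new();
    for msg in messages {
        let terminal = matches!(msg, ProtocolMessage::Sync);
        current.push(msg);
        if terminal {
            sequences.push(RequestSequence {
                messages: std::mem::take(&mut current),
            });
        }
    }
    sequences
}

/// Stage 3: label a sequence with a high-level command, agnostic to execution.
fn label(seq: &RequestSequence) -> Option<Command> {
    seq.messages.iter().find_map(|m| match m {
        ProtocolMessage::Query(sql) => Some(Command::Query(sql.clone())),
        _ => None,
    })
}

/// Stage 4: produce an execution plan (trivial single-shard routing here).
fn plan(command: Command) -> ExecutionPlan {
    ExecutionPlan { shard: 0, command }
}

fn main() {
    let stream = vec![
        ProtocolMessage::Query("SELECT 1".into()),
        ProtocolMessage::Sync,
    ];
    let sequences = group(stream);
    assert_eq!(sequences.len(), 1);
    let cmd = label(&sequences[0]).unwrap();
    let plan = plan(cmd);
    assert_eq!(plan.shard, 0);
    println!("planned: {:?}", plan);
}
```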

I'd love feedback on this approach—does it align with where we want to take PgDog? If there are better patterns or oversights, I'm all ears! 🚀

I did this overnight with a very tired brain, so I might not have been operating at full mental capacity.

A lot of this might be cope for my smaller working memory, but I can't be the only one who would benefit from not interrupting my code scans to go find out what message.byte()[7] == -1 means conceptually.

@levkk (Collaborator) commented on Aug 1, 2025

> For example, in admin/backend.rs, we check message.code() != 'Q' to validate message types, ie. byte inspections, mixing concerns between protocol decoding and business logic.
>
> Similarly, scattered checks like peeking at individual bytes (e.g., if buf[0] == b'Z') for ReadyForQuery messages appear in connection handling, making the code harder to maintain, debug, and extend.

Deserialization is relatively expensive; that's why we sometimes peek into the message to avoid it when we can.

> While this approach has worked for our needs so far, it scatters protocol knowledge throughout the codebase, increasing the risk of inconsistencies, bugs in edge cases (like malformed packets), and making it tougher for new contributors to grasp the flow.

Do you have an example of this happening? We mostly use the structs in crate::net to create messages; those are all tested.
