protobuf framing: support varint prefix (wire format) #20156

@lspgn

Description

Use Cases

Thank you for this software!

When sources or sinks use protobuf encoding/decoding, there is currently no way to decode messages framed with a varint length prefix (protowire).
When serializing a stream of protobuf messages, the official Go library suggests prefixing each message with its length as a varint, treating the message like a nested message (though without a tag).
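
As a rough illustration of the prefix described above, the following Python sketch encodes a payload length as a base-128 varint and prepends it to the message bytes. The helper names `encode_varint` and `frame` are hypothetical, and the payloads are only stand-ins for serialized messages.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint length must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F  # lowest 7 bits go out first
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def frame(payload: bytes) -> bytes:
    """Prefix an already-serialized message with its length as a varint."""
    return encode_varint(len(payload)) + payload


# A 3-byte payload gets a 1-byte prefix; a 300-byte payload gets a 2-byte prefix.
short = frame(b"\x08\x96\x01")
long = frame(b"\x00" * 300)
```

Lengths below 128 take a single prefix byte, so small messages pay one byte of overhead instead of a fixed four.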

Some tools, such as ClickHouse, use length-prefixed messages (e.g. when consuming from Kafka):

ClickHouse inputs and outputs protobuf messages in the length-delimited format. It means before every message should be written its length as a varint. See also how to read/write length-delimited protobuf messages in popular languages.

I would like to suggest adding such a framing option.

Attempted Solutions

Currently, Vector offers two ways of decoding protobuf with framing: bytes or length_delimited.

When a source uses bytes framing (e.g. a socket buffer or file sources), there is a risk that a protobuf message is "cut" or skipped: if two messages arrive batched together, only the first one is decoded and the rest is discarded.
Furthermore, a message holding only default values serializes to zero bytes, so it would be missed entirely.

The length_delimited setting expects a fixed-size length prefix, which is not the usual convention for protobuf streams and is not compatible with a varint prefix.

sources:
  example:
    type: socket
    mode: unix_stream
    path: "mysock.socket"
    decoding:
      codec: protobuf
      protobuf:
        desc_file: "abc.desc"
        message_type: "abc.ABC"
    framing:
      method: length_delimited # needs a uint32 prefix
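
To make the incompatibility concrete, here is a small Python sketch comparing the two prefixes for the same payload. It assumes the fixed prefix is a 4-byte big-endian unsigned integer (the byte order is an assumption; the comment above only says uint32), and `varint_prefix` is a hypothetical helper.

```python
import struct


def varint_prefix(length: int) -> bytes:
    """Length encoded as a protobuf varint (7 bits per byte, low group first)."""
    out = bytearray()
    while length > 0x7F:
        out.append((length & 0x7F) | 0x80)
        length >>= 7
    out.append(length)
    return bytes(out)


payload = b"\x08\x01"  # stand-in for a serialized message

# Assumption: fixed 4-byte big-endian length, as a uint32 prefix might look.
fixed = struct.pack(">I", len(payload)) + payload
varint = varint_prefix(len(payload)) + payload

# A reader expecting uint32 misreads a varint-framed stream: it takes the
# varint, the payload, and the next prefix together as one 4-byte "length".
misread = struct.unpack(">I", (varint + varint)[:4])[0]
```

Here the two-byte message is framed in three bytes with a varint and six bytes with the fixed prefix, and the misread length comes out in the tens of millions, so the reader would wait for data that never arrives.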

Unfortunately, a "wrapper" protobuf message cannot be used as a workaround, since each entry's tag (field number 1 in the example below) must also be encoded as a varint ahead of the length:

message DEF {
  repeated ABC abc = 1;
}
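
The difference shows up when the bytes are built by hand. In this Python sketch (helper names are hypothetical, and the payloads are stand-ins for serialized ABC messages kept under 128 bytes so each length fits in one byte), every entry of the repeated field carries an extra tag byte 0x0A, which is field number 1 combined with wire type 2 (length-delimited).

```python
def tag(field_number: int, wire_type: int) -> int:
    """Protobuf field key: field number shifted left by 3, OR'd with the wire type."""
    return (field_number << 3) | wire_type


LEN = 2  # wire type for length-delimited fields (nested messages, bytes, strings)

messages = [b"\x08\x01", b"\x08\x02"]  # stand-ins for two serialized ABC messages

# Stream framed with a bare varint length (sketch limited to payloads < 128 bytes).
stream = b"".join(bytes([len(m)]) + m for m in messages)

# The same messages as the `abc` field of a DEF wrapper: tag + length + payload.
wrapper = b"".join(bytes([tag(1, LEN), len(m)]) + m for m in messages)
```

A producer that writes `stream` therefore cannot be read by decoding `DEF`, because the decoder expects a tag byte before each length.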

Proposal

My suggestion, for both sources and sinks, would be one of the following.

Either have the protobuf decoder assume that it first reads a varint and treat it as the message length. That said, I am not sure whether a decoder can work in a one-to-many way (several messages out of one input) and wait for the remaining bytes of a partial message.

sources:
  example:
    ...
    decoding:
      codec: protobuf
      protobuf:
        desc_file: "abc.desc"
        message_type: "abc.ABC"
        protowire: true
    framing:
      method: bytes

Or add a proper varint method to framing:

sources:
  example:
    ...
    decoding:
      codec: protobuf
      protobuf:
        desc_file: "abc.desc"
        message_type: "abc.ABC"
    framing:
      method: varint
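
A rough sketch of what such a framer would need to do, written in Python for brevity and not taken from Vector's code: buffer incoming chunks, read a varint length, and emit a frame only once all of its bytes have arrived. The class name `VarintFramer`, its methods, and the `max_frame_len` guard are hypothetical.

```python
class VarintFramer:
    """Incrementally split a byte stream into varint length-prefixed frames."""

    def __init__(self, max_frame_len: int = 8 * 1024 * 1024):
        self._buf = bytearray()
        self._max = max_frame_len  # hypothetical guard against oversized frames

    def feed(self, chunk: bytes) -> list:
        """Add a chunk and return every frame that is now complete."""
        self._buf += chunk
        frames = []
        while True:
            header = self._read_varint()
            if header is None:
                break  # the length prefix itself is still incomplete
            length, prefix_len = header
            if length > self._max:
                raise ValueError("frame exceeds maximum length")
            end = prefix_len + length
            if len(self._buf) < end:
                break  # wait for the rest of the payload
            frames.append(bytes(self._buf[prefix_len:end]))
            del self._buf[:end]
        return frames

    def _read_varint(self):
        """Return (value, bytes_consumed), or None if more input is needed."""
        value = 0
        # A 64-bit varint occupies at most 10 bytes.
        for i, byte in enumerate(self._buf[:10]):
            value |= (byte & 0x7F) << (7 * i)
            if not byte & 0x80:
                return value, i + 1
        if len(self._buf) >= 10:
            raise ValueError("malformed varint length prefix")
        return None
```

With this approach, two messages batched into one chunk are both emitted, a message split across chunks is held until complete, and a zero-length message still produces an (empty) frame instead of being dropped.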

Thank you!

Version

vector 0.36.1 (2857180 2024-03-11 14:32:52.417737479)

Labels

domain: codecs, type: feature