Create a bytecode interpreter to compactly represent text/JSON name maps. #1789

allevato · 2025-05-29T16:49:54Z

This is something I've been cooking for a bit, as a way to reduce codegen.

The problem with the ExpressibleByDictionaryLiteral approach we're using today is that its codegen is unpredictable. For what is really just static data, the Swift compiler always generates sequences of instructions in debug mode, and in release mode it may or may not be outlined into global data depending on various things outside of the author's control.

StaticStrings, on the other hand, are super predictable since they always become static read-only data (ignoring the oddball edge case involving a single code point). So, why not do what some other languages do and create blobs of binary data that we can process at runtime instead?

The catch is that since StaticStrings must be valid UTF-8, we have to invent our own little bytecode format a bit, to avoid encodings that might be invalid. The format is a stream of bytes as follows:

Integers are represented as varints with 6-bit shifts and an always cleared MSB. This is similar to protobuf's own 7-bit varints except we lose a bit to ensure that continuation bytes are in the range 0x40–0x7F instead of 0x80–0xFF.
Strings are represented as null-terminated UTF-8.
String arrays are represented by a count integer followed by that many null-terminated UTF-8 strings.
The stream always starts with an integer "version" number, which future-proofs us if we want to change the encoding. Right now, it's always zero.

Layered on top of the bytecode stream is an interpreter, which abstracts "instructions" that are RawRepresentable as UInt64, and the interpreter loop just reads the next instruction and invokes a callback that is supposed to read and process the operands.

The bytecode is an "ABI" that we can add to, but never remove from (for a particular major version of the runtime). For example, we could add new instructions in the future to represent things differently, as long as the runtime can still correctly interpret the bytecode from older generated code.

There are definitely more optimizations that we could do to the representation here to make it more compact. The most obvious thing would be to add specialized opcodes for the most common cases. Other things we could do would be to sort the field numbers in numerical order and represent them as adjacent differences instead of their absolute value—that would shrink things a bit for dense messages/enums that have numbers greater than 63.

But for now, I just wanted to put up this draft as a vibe check. I took one of our particularly large transitive proto closures internally and measured the optimized and dead-stripped output with and without this change, using a tiny binary that just imports a huge proto module, sets a single field in a message, and then black-holes the message:

Before:  3995696 bytes
After:   3823880 bytes
Diff:     171816 bytes (–4.3%)

But I think this is just the beginning of what we can do—namely, we could refactor some of the generated message storage internals and make visitation completely generalized over a similar bytecode that we would generate about a message's layout. That, I imagine, would yield far greater codegen savings (and likely significant compile time savings as well).

Side note: Many of the APIs presented by the bytecode interpreter and the generated code just scream for Span and UTF8Span. I suppose one day we'll have a high enough deployment target that we could use those 😭

This will be used to generate data compactly as static strings, which optimize much more consistently than some of the code we generate today like the dictionary literal used for `NameMap`s. Later on, I'd also like to look at how we could use this to tabularize parsing and serialization. With some other refactoring, I think we could get to a point where the visitation APIs are completely driven by bytecode strings and we wouldn't have to generate the large message-specific methods that do this.

to generate it.

Sources/SwiftProtobuf/BytecodeInterpreter.swift

thomasvl · 2025-05-29T17:07:37Z

Sources/SwiftProtobuf/BytecodeInterpreter.swift

+            // The only way this could happen is if the program were a single byte, meaning that it
+            // only has a 6-bits-or-fewer format specifier and nothing else. In other words, there
+            // are no instructions, and we can simply return as there is nothing to execute.
+            return


I still wonder if this should fatalError since it shouldn't happen and if something were generated with one character something went wrong?

An empty message produces a single code-point name map, since it's just the version prefix and then nothing else: "\u{0}"

That should just be _NameMap() like it is now, no?

We could, but do we care enough about special casing that in the generator? The overhead of interpreting the empty name map bytecode would be low, especially for such an uncommon situation.

It is somewhat common for things that use extensions over fields as a design pattern (don't ask), so it does happen. The main "savings" from my POV is just less "startup code" that has to run, yes, it's is less, but zero is always less then "a little" 😄

ps - it's also even less for the compiler to look at when type checking the compile. 😉

That's fair, and a little extra complexity in the generator for more savings at runtime is probably the right choice in general.

thomasvl · 2025-05-29T17:15:11Z

Sources/SwiftProtobufPluginLibrary/BytecodeWriter.swift

+    ///
+    /// - Parameter programFormat: The program format of the bytecode stream to write. Currently,
+    ///   the only valid value is zero.
+    public init(programFormat: Int = 0) {


Should there be a constant over in the main runtime library for the zero value that is used by these places also?

thomasvl · 2025-05-29T17:15:26Z

Sources/SwiftProtobufPluginLibrary/BytecodeWriter.swift

+    /// - Parameter programFormat: The program format of the bytecode stream to write. Currently,
+    ///   the only valid value is zero.
+    public init(programFormat: Int = 0) {
+        if programFormat != 0 {


thomasvl · 2025-05-29T17:16:35Z

Sources/SwiftProtobuf/BytecodeReader.swift

+            if byte & 0x40 == 0 {
+                return value
+            }
+            shift += 6


some guard that shift doesn't get too large?

thomasvl · 2025-05-29T17:26:10Z

Sources/SwiftProtobuf/BytecodeReader.swift

+    /// - Returns: An array of `UnsafeBufferPointer`s containing the strings that were read from the
+    ///   bytecode stream. See the documentation of `nextString()` for details on the lifetimes of
+    ///   these pointers.
+    package mutating func nextNullTerminatedStringArray() -> [UnsafeBufferPointer<UInt8>] {


Doesn't look like this got a test (or a helper to write one of these)

thomasvl · 2025-05-29T17:27:58Z

Sources/SwiftProtobuf/NameMap.swift

+    /// ## Operands
+    /// * An integer representing the lower bound (inclusive) of the reserved field number range.
+    /// * An integer representing the upper bound (exclusive) of the reserved field number range.
+    case reservedFields = 6


"reservedNumber" nameMap isn't specific to fields, and enums also have reserved numbers, we just haven't needed to capture them.

thomasvl · 2025-05-29T17:33:18Z

Sources/protoc-gen-swift/FieldGenerator.swift

@@ -110,6 +113,50 @@ class FieldGeneratorBase {
        }
    }

+    func writeProtoNameInstruction(to writer: inout ProtoNameInstructionWriter) {


Shouldn't var fieldMapNames come out now?

thomasvl · 2025-05-29T17:34:03Z

Sources/protoc-gen-swift/FieldGenerator.swift

+        }
+        let jsonName = fieldDescriptor.jsonName
+
+        // TODO: Create a separate instruction for the group being able to match


Seems like we should do this extra opcode now vs. keeping it as a TODO

thomasvl · 2025-05-29T17:39:47Z

Sources/SwiftProtobufPluginLibrary/BytecodeWriter.swift

+            if isprint(Int32(truncatingIfNeeded: value)) != 0 {
+                self.append(escapingIfNecessary: UnicodeScalar(UInt32(truncatingIfNeeded: value))!)
+            } else {
+                code.append(String(format: "\\u{%x}", value))


Is it worth picking off \0 as the one special case? (tab, newline, etc don't seem worth it, but if \0 is common, it would slightly shorten strings and there's no code on the decoding side.

allevato added 3 commits May 29, 2025 10:34

Add a bytecode-based NameMap initializer and update the generator

51609a6

to generate it.

Regenerate protos.

074f54e

thomasvl reviewed May 29, 2025

View reviewed changes

allevato mentioned this pull request Jun 16, 2025

Add parameter to control extension output #1795

Closed

Create a bytecode interpreter to compactly represent text/JSON name maps. #1789

Are you sure you want to change the base?

Create a bytecode interpreter to compactly represent text/JSON name maps. #1789

Uh oh!

Conversation

allevato commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

allevato commented May 29, 2025 •

edited

Loading