@OwenKephart (Contributor) commented Nov 8, 2025

Summary & Motivation

This is a proof of concept that allows us to detect repeated objects when serializing and store a lazy reference to them in a separate mapping object.

This results in faster serialization AND deserialization for large objects: you pay a small tax to hash objects while serializing, but you earn it all back by doing far less transformation / serialization. This is of course only advantageous for particular types of objects that are prone to high levels of repetition (the asset daemon cursor is the main example, but I imagine we could find others in the codebase).

It also massively decreases the size of the serialized output (nearly half for the example I had).

The rough algorithm is to hash each object as it comes in and check whether we've seen it before. If we have, we store its packed representation in a global map on the context and substitute a reference in its place; every subsequent occurrence can then just sub in that same reference (rough sketch after the list below). This means that:

  • If you have exactly one instance of a sub-object -> it never goes to the global map
  • If you have exactly two instances of a sub-object -> one instance will be serialized normally and the other will be stored as a reference (a slight size increase, but marginal)
  • If you have three or more instances -> after the first instance, all are stored as references, so you get pretty big savings
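
A minimal sketch of the approach (hypothetical names like `pack_inner`, `REF_KEY`, and `dedup_table`, not the actual serdes internals):

```python
REF_KEY = "__ref__"  # hypothetical marker key meaning "look this up in the shared table"


def pack_with_dedup(value, pack_inner, dedup_table, seen):
    """pack_inner is whatever produces the normal packed representation."""
    try:
        obj_id = hash((type(value), value))
    except TypeError:
        # unhashable values just get packed normally
        return pack_inner(value)

    if obj_id in dedup_table:
        # already promoted to the shared table: emit a reference
        return {REF_KEY: obj_id}
    if obj_id in seen:
        # second occurrence: promote the packed form to the shared table
        dedup_table[obj_id] = seen[obj_id]
        return {REF_KEY: obj_id}

    packed = pack_inner(value)
    seen[obj_id] = packed  # first occurrence: serialize inline and remember it
    return packed


def unpack_with_dedup(packed, unpack_inner, dedup_table, cache):
    if isinstance(packed, dict) and REF_KEY in packed:
        obj_id = packed[REF_KEY]
        if obj_id not in cache:
            cache[obj_id] = unpack_inner(dedup_table[obj_id])
        return cache[obj_id]  # every reference resolves to the same instance
    return unpack_inner(packed)
```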

Some stats:

INFO:root:Loading cursor from /Users/owen/Downloads/giant_cursor.txt...
INFO:root:  Loaded cursor
INFO:root:Benchmarking serialize_value...
INFO:root:       9.092s
INFO:root:       7.867s
INFO:root:       7.816s
INFO:root:  Results:
INFO:root:    Average time: 8.259s
INFO:root:    Size: 168,867,564 bytes (161.04 MB)
INFO:root:Benchmarking serialize_value_with_dedup...
INFO:root:       6.105s
INFO:root:       6.001s
INFO:root:       6.750s
INFO:root:  Results:
INFO:root:    Average time: 6.285s
INFO:root:    Size: 81,605,028 bytes (77.82 MB)
INFO:root:Benchmarking deserialize_value...
INFO:root:      4.723s
INFO:root:      5.839s
INFO:root:      4.699s
INFO:root:  Results:
INFO:root:    Average time: 5.087s
INFO:root:Benchmarking deserialize_value_with_dedup...
INFO:root:      4.445s
INFO:root:      3.390s
INFO:root:      3.392s
INFO:root:  Results:
INFO:root:    Average time: 3.742s
INFO:root:Checking equality of deserialized values...

Confirmed that this results in exact equality of the deserialized object after a round trip.
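
The round-trip check amounts to something like this sketch (the `_with_dedup` names come from the benchmark log above; the timing helper and import path are assumptions):

```python
import time

from dagster._serdes import deserialize_value, serialize_value  # import path assumed
# serialize_value_with_dedup / deserialize_value_with_dedup are the variants this PR adds


def benchmark(fn, arg, runs=3):
    # crude timing loop matching the log output above
    result, times = None, []
    for _ in range(runs):
        start = time.perf_counter()
        result = fn(arg)
        times.append(time.perf_counter() - start)
    return result, sum(times) / len(times)


def check_round_trip(cursor):
    baseline = deserialize_value(serialize_value(cursor))
    deduped = deserialize_value_with_dedup(serialize_value_with_dedup(cursor))
    assert baseline == deduped == cursor
```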

How I Tested These Changes

Changelog

NOCHANGELOG

This stack of pull requests is managed by Graphite.

@OwenKephart force-pushed the 11-07-_rfc_rip_dedup_serdes branch from 6e48d9c to 3e0ba57 on November 8, 2025 00:15
@gibsondan (Member) commented:
very interesting...

will want to roll out very carefully so that we can roll back to a previous release if needed!

@alangenfeld (Member) left a comment:

yowzers

First reaction is that this is effectively a whole new serialization format, so we should make sure we don't want to target something totally new instead of starting with serdes. If we do want to iterate on serdes, we should get in any other improvements that are worth making.

An interesting aspect of this direction, since we target immutable objects, is that we dedupe not only the serialized contents: the deserialized objects in memory can also point at the same deduped instance. We might need to lock down dataclass support to only allow frozen accordingly.

Personally I am probably too easily swayed into fun complicated serdes bullshit, so will want to get some other opinions in the mix if we want to actually go in this direction.
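
A tiny illustration of the aliasing concern (hypothetical types, not an actual serdes round trip):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen guards against surprise mutation through an alias
class SubObject:
    values: tuple


@dataclass(frozen=True)
class Parent:
    a: SubObject
    b: SubObject


shared = SubObject(values=(1, 2, 3))
parent = Parent(a=shared, b=shared)

# After a dedup-aware deserialization, parent.a and parent.b could point at the
# same instance; with frozen=True there is no way to mutate one "copy" and
# silently change the other.
assert parent.a is parent.b
```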

Comment on lines +952 to +954
if dedup_context is not None and is_whitelisted_for_serdes_object(val):
    try:
        obj_id = hash((type(val), val))
This current approach hashes as we descend, so we end up with the largest possible subtrees deduped but none of the contents within those subtrees deduped. I am curious what the difference in performance / size would be if we try to maximize deduping by handling it as we come back up.

Can imagine going as far as having the resulting end format just be one dictionary of id -> objects and a root id to start with (sketched below). Can also imagine then packing all the objects of the same type together in like an array-of-objects -> object-of-arrays transform to avoid duping class and field names repeatedly.

Might make sense to only target @record if being able to do custom stuff in those objects is advantageous
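
For concreteness, a hedged sketch of the flat id -> objects format floated above (all names here are hypothetical):

```python
# Every packed object lives in a single id -> object table; everything else is
# a reference, plus a root id to start from.
flat_document = {
    "root": "id_0",
    "objects": {
        "id_0": {"__class__": "Cursor", "entries": [{"__ref__": "id_1"}, {"__ref__": "id_1"}]},
        "id_1": {"__class__": "Entry", "status": "MATERIALIZED"},
    },
}


def resolve(obj_id, objects):
    """Rebuild a value by recursively resolving references (sketch only; no cycle handling)."""

    def walk(node):
        if isinstance(node, dict):
            if "__ref__" in node:
                return resolve(node["__ref__"], objects)
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v) for v in node]
        return node

    return walk(objects[obj_id])
```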

Comment on lines +955 to +957
    except TypeError:
        # unhashable object
        obj_id = None
I could see these exceptions being costly perf-wise if they occur enough.
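
One possible mitigation, sketched with a hypothetical helper (and only a heuristic, since hashability of a record/namedtuple depends on its field values, not just its type):

```python
_unhashable_types: set = set()


def try_hash(val):
    # skip the try/except entirely for types that have already failed once;
    # imperfect, since the same type can still be hashable for other instances
    if type(val) in _unhashable_types:
        return None
    try:
        return hash((type(val), val))
    except TypeError:
        _unhashable_types.add(type(val))
        return None
```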
