[rfc][rip] dedup serdes #32757
base: master
Conversation
Force-pushed from 6e48d9c to 3e0ba57.
very interesting... will want to roll out very carefully so that we can roll back to a previous release if needed!
alangenfeld
left a comment
yowzers
First reaction is that this is effectively a whole new serialization format, so we should make sure we don't want to target something totally new instead of starting with serdes. If we do want to iterate on serdes, we should get in any other improvements that are worth making.
An interesting aspect of this direction, since we target immutable objects, is that we dedupe the serialized contents, but the deserialized objects in memory can also point at the same deduped instance. We might need to lock down dataclass to only be ok with frozen accordingly (a small illustration of why is sketched below).
Personally I am probably too easily swayed into fun complicated serdes bullshit, so I will want to get some other opinions in the mix if we want to actually go in this direction.
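A tiny sketch of why shared deduped instances need to be immutable; this uses a toy frozen dataclass rather than the actual serdes/@record machinery, and `ToyValue` is a made-up name:

```python
from dataclasses import dataclass, FrozenInstanceError

# If deserialization hands the same deduped instance to multiple consumers,
# a mutation through one of them would silently leak into all the others.
# frozen=True turns that into an immediate error instead.


@dataclass(frozen=True)
class ToyValue:
    path: str


shared = ToyValue("a/b")
holders = [shared, shared]  # two consumers pointing at one deduped instance

try:
    holders[0].path = "mutated"  # would also change holders[1] if allowed
except FrozenInstanceError:
    print("mutation blocked: safe to share deduped instances")
```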
```python
if dedup_context is not None and is_whitelisted_for_serdes_object(val):
    try:
        obj_id = hash((type(val), val))
```
This current approach hashes as we descend, so we end up with the largest possible subtrees deduped but none of the contents within those subtrees deduped. I am curious what the difference in performance / size would be if we try to maximize deduping by handling objects as we come back up.
Can imagine going as far as having the resulting end format just be one dictionary of id -> objects plus a root id to start from (see the sketch below). Can also imagine then packing all the objects of the same type together, in an array-of-objects -> object-of-arrays transform, to avoid duplicating class and field names repeatedly.
Might make sense to only target @record if being able to do custom stuff in those objects is advantageous.
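A minimal sketch of what the "one dictionary of id -> objects plus a root id" shape could look like, assuming plain Python containers and hashable leaves; `pack_bottom_up` and `serialize_with_dedup` are made-up names, not existing serdes APIs:

```python
from typing import Any, Dict


def pack_bottom_up(val: Any, objects: Dict[int, Any]) -> Any:
    """Pack depth-first so nested subtrees get deduped before their parents."""
    if isinstance(val, (list, tuple)):
        packed: Any = [pack_bottom_up(item, objects) for item in val]
    elif isinstance(val, dict):
        packed = {key: pack_bottom_up(item, objects) for key, item in val.items()}
    else:
        packed = val  # leaf values pass through unchanged

    try:
        obj_id = hash((type(val), val))
    except TypeError:
        return packed  # unhashable (lists, dicts, ...): inline the packed form

    # every hashable subtree lands in the shared table exactly once
    objects.setdefault(obj_id, packed)
    return {"__ref__": obj_id}


def serialize_with_dedup(root: Any) -> Dict[str, Any]:
    objects: Dict[int, Any] = {}
    return {"root": pack_bottom_up(root, objects), "objects": objects}
```

The array-of-objects -> object-of-arrays idea would then further group the entries of `objects` by type, so class and field names are written once per type rather than once per instance.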
```python
    except TypeError:
        # unhashable object
        obj_id = None
```
I could see these exceptions being costly perf-wise if they occur often enough.
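One way to sanity-check that concern (a standalone sketch, not code from this PR) is to time the hash-with-fallback path on hashable vs. unhashable inputs:

```python
import timeit

hashable = (1, "a", 2.0)
unhashable = ([1], "a")  # tuple containing a list, so hash() raises TypeError


def try_hash(val):
    try:
        return hash((type(val), val))
    except TypeError:
        return None


# the unhashable case pays for raising and handling a TypeError on every call
print("hashable:  ", timeit.timeit(lambda: try_hash(hashable), number=1_000_000))
print("unhashable:", timeit.timeit(lambda: try_hash(unhashable), number=1_000_000))
```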

Summary & Motivation
This is a proof of concept that allows us to detect repeated objects when serializing them and store a lazy reference to them in a separate mapping object.
This results in faster serialization AND deserialization for large objects -- you pay a small tax to hash the objects while serializing, but you earn it all back by doing far less transformation / serialization. This is of course only advantageous for particular types of objects that are prone to high levels of repetition (the asset daemon cursor is the main example, but I imagine we could find others in the codebase).
It also massively decreases the size of the deserialized object (nearly half for the example I had).
The rough algorithm is to hash objects as they come in and check whether we've seen them before. If so, we grab the packed representation of that object and store it in a global map in the context. Then the next time we see that object, we can just directly sub in the packed representation (a minimal sketch of this scheme is included below). This means that:
Some stats:
Confirmed that this results in exact equality of the deserialized object after a round trip.
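For concreteness, here is a minimal sketch of the descend-and-reference scheme described above, with hypothetical names (`DedupContext`, `pack_with_dedup`); the real change lives in the serdes pack path and differs in detail:

```python
from typing import Any, Dict


class DedupContext:
    """Holds packed representations of objects we have already seen."""

    def __init__(self) -> None:
        self.packed_by_id: Dict[int, Any] = {}


def pack_with_dedup(val: Any, ctx: DedupContext) -> Any:
    try:
        obj_id = hash((type(val), val))
    except TypeError:
        obj_id = None  # unhashable: fall through and pack normally

    if obj_id is not None and obj_id in ctx.packed_by_id:
        # seen before: substitute a lazy reference instead of re-packing
        return {"__dedup_ref__": obj_id}

    # descend as usual (tuples stand in for whitelisted serdes objects here)
    if isinstance(val, tuple):
        packed: Any = [pack_with_dedup(item, ctx) for item in val]
    else:
        packed = val

    if obj_id is not None:
        ctx.packed_by_id[obj_id] = packed
    return packed
```

Deserialization would then walk the packed output and resolve each `__dedup_ref__` against the context's mapping, which is shipped alongside the payload.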
How I Tested These Changes
Changelog
NOCHANGELOG