@OwenKephart (Contributor) commented Nov 8, 2025

Summary & Motivation

This is a proof of concept that allows us to detect repeated objects when serializing and store a lazy reference to them in a separate mapping object.

This results in faster serialization AND deserialization for large objects: you pay a small tax to hash objects while serializing, but you earn it all back by doing far less transformation / serialization. This is of course only advantageous for particular types of objects that are prone to high levels of repetition (the asset daemon cursor is the main example, but I imagine we could find others in the codebase).

It also massively decreases the size of the serialized output (nearly half for the example I had).

The rough algorithm is to hash each object as it comes in and check whether we've seen it before. If we have, we store its packed representation in a global map on the context and substitute a reference in its place; every subsequent occurrence can then just sub in that same reference (rough sketch after the list below). This means that:

  • If you have exactly one instance of a sub-object -> it never goes to the global map
  • If you have exactly two instances of a sub-object -> one instance will be serialized normally and the other will be stored as a reference (a slight size increase, but marginal)
  • If you have three or more instances -> after the first instance, all are stored as references, so you get pretty big savings
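
A minimal sketch of the approach (hypothetical names like `pack_inner`, `REF_KEY`, and `dedup_table`, not the actual serdes internals):

```python
REF_KEY = "__ref__"  # hypothetical marker key meaning "look this up in the shared table"


def pack_with_dedup(value, pack_inner, dedup_table, seen):
    """pack_inner is whatever produces the normal packed representation."""
    try:
        obj_id = hash((type(value), value))
    except TypeError:
        # unhashable values just get packed normally
        return pack_inner(value)

    if obj_id in dedup_table:
        # already promoted to the shared table: emit a reference
        return {REF_KEY: obj_id}
    if obj_id in seen:
        # second occurrence: promote the packed form to the shared table
        dedup_table[obj_id] = seen[obj_id]
        return {REF_KEY: obj_id}

    packed = pack_inner(value)
    seen[obj_id] = packed  # first occurrence: serialize inline and remember it
    return packed


def unpack_with_dedup(packed, unpack_inner, dedup_table, cache):
    if isinstance(packed, dict) and REF_KEY in packed:
        obj_id = packed[REF_KEY]
        if obj_id not in cache:
            cache[obj_id] = unpack_inner(dedup_table[obj_id])
        return cache[obj_id]  # every reference resolves to the same instance
    return unpack_inner(packed)
```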

Some stats:

INFO:root:Loading cursor from /Users/owen/Downloads/giant_cursor.txt...
INFO:root:  Loaded cursor
INFO:root:Benchmarking serialize_value...
INFO:root:       9.092s
INFO:root:       7.867s
INFO:root:       7.816s
INFO:root:  Results:
INFO:root:    Average time: 8.259s
INFO:root:    Size: 168,867,564 bytes (161.04 MB)
INFO:root:Benchmarking serialize_value_with_dedup...
INFO:root:       6.105s
INFO:root:       6.001s
INFO:root:       6.750s
INFO:root:  Results:
INFO:root:    Average time: 6.285s
INFO:root:    Size: 81,605,028 bytes (77.82 MB)
INFO:root:Benchmarking deserialize_value...
INFO:root:      4.723s
INFO:root:      5.839s
INFO:root:      4.699s
INFO:root:  Results:
INFO:root:    Average time: 5.087s
INFO:root:Benchmarking deserialize_value_with_dedup...
INFO:root:      4.445s
INFO:root:      3.390s
INFO:root:      3.392s
INFO:root:  Results:
INFO:root:    Average time: 3.742s
INFO:root:Checking equality of deserialized values...

Confirmed that this results in exact equality of the deserialized object after a round trip.
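
The round-trip check amounts to something like this sketch (the `_with_dedup` names come from the benchmark log above; the timing helper and import path are assumptions):

```python
import time

from dagster._serdes import deserialize_value, serialize_value  # import path assumed
# serialize_value_with_dedup / deserialize_value_with_dedup are the variants this PR adds


def benchmark(fn, arg, runs=3):
    # crude timing loop matching the log output above
    result, times = None, []
    for _ in range(runs):
        start = time.perf_counter()
        result = fn(arg)
        times.append(time.perf_counter() - start)
    return result, sum(times) / len(times)


def check_round_trip(cursor):
    baseline = deserialize_value(serialize_value(cursor))
    deduped = deserialize_value_with_dedup(serialize_value_with_dedup(cursor))
    assert baseline == deduped == cursor
```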

How I Tested These Changes

Changelog

NOCHANGELOG

This stack of pull requests is managed by Graphite.

@OwenKephart force-pushed the 11-07-_rfc_rip_dedup_serdes branch from 6e48d9c to 3e0ba57 on November 8, 2025 00:15
@gibsondan (Member) commented:
very interesting...

will want to roll out very carefully so that we can roll back to a previous release if needed!

@alangenfeld (Member) left a comment:

yowzers

First reaction is that this is effectively a whole new serialization format, so we should make sure we don't want to target something totally new instead of starting with serdes. If we do want to iterate on serdes, we should get in any other improvements that are worth making.

An interesting aspect of this direction, since we target immutable objects, is that we dedupe not only the serialized contents: the deserialized objects in memory can also point at the same deduped instance. We might need to lock down dataclass support to only allow frozen accordingly.

Personally I am probably too easily swayed into fun complicated serdes bullshit, so will want to get some other opinions in the mix if we want to actually go in this direction.
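
A tiny illustration of the aliasing concern (hypothetical types, not an actual serdes round trip):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen guards against surprise mutation through an alias
class SubObject:
    values: tuple


@dataclass(frozen=True)
class Parent:
    a: SubObject
    b: SubObject


shared = SubObject(values=(1, 2, 3))
parent = Parent(a=shared, b=shared)

# After a dedup-aware deserialization, parent.a and parent.b could point at the
# same instance; with frozen=True there is no way to mutate one "copy" and
# silently change the other.
assert parent.a is parent.b
```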

Comment on lines +952 to +954
if dedup_context is not None and is_whitelisted_for_serdes_object(val):
    try:
        obj_id = hash((type(val), val))
This current approach hashes as we descend, so we end up with the largest possible subtrees deduped but none of the contents within those subtrees deduped. I am curious what the difference in performance / size would be if we try to maximize deduping by handling it as we come back up.

Can imagine going as far as having the resulting end format just be one dictionary of id -> objects and a root id to start with (sketched below). Can also imagine then packing all the objects of the same type together in like an array-of-objects -> object-of-arrays transform to avoid duping class and field names repeatedly.

Might make sense to only target @record if being able to do custom stuff in those objects is advantageous
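
For concreteness, a hedged sketch of the flat id -> objects format floated above (all names here are hypothetical):

```python
# Every packed object lives in a single id -> object table; everything else is
# a reference, plus a root id to start from.
flat_document = {
    "root": "id_0",
    "objects": {
        "id_0": {"__class__": "Cursor", "entries": [{"__ref__": "id_1"}, {"__ref__": "id_1"}]},
        "id_1": {"__class__": "Entry", "status": "MATERIALIZED"},
    },
}


def resolve(obj_id, objects):
    """Rebuild a value by recursively resolving references (sketch only; no cycle handling)."""

    def walk(node):
        if isinstance(node, dict):
            if "__ref__" in node:
                return resolve(node["__ref__"], objects)
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v) for v in node]
        return node

    return walk(objects[obj_id])
```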

Comment on lines +955 to +957
    except TypeError:
        # unhashable object
        obj_id = None
I could see these exceptions being costly perf-wise if they occur enough.
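
One possible mitigation, sketched with a hypothetical helper (and only a heuristic, since hashability of a record/namedtuple depends on its field values, not just its type):

```python
_unhashable_types: set = set()


def try_hash(val):
    # skip the try/except entirely for types that have already failed once;
    # imperfect, since the same type can still be hashable for other instances
    if type(val) in _unhashable_types:
        return None
    try:
        return hash((type(val), val))
    except TypeError:
        _unhashable_types.add(type(val))
        return None
```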
