When using S3PickleIOManager, how do I set the name of the pickled object at output?
Subsequent materializations of an asset will overwrite previous materializations of that asset. With a base directory of "/my/base/path", an asset with key AssetKey(["one", "two", "three"]) would be stored in a file called "three" in a directory with path "/my/base/path/one/two/".
I want to re-materialize downstream assets using previous versions of upstream assets, but since an upstream asset will have been overwritten, this isn't possible. For example, I might want to examine the effect on a downstream asset of using a different sample size or subset of the upstream asset.
I currently hash an upstream DataFrame asset and add its hash as DataVersion metadata. Along the lines of:
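Roughly like this (a hedged sketch: `hash_bytes` and the commented-out asset body are illustrative names I've chosen, not from my actual code; `Output` and `DataVersion` are real Dagster imports):

```python
import hashlib
import pickle


def hash_bytes(payload: bytes) -> str:
    """Return a stable hex digest for a serialized asset value."""
    return hashlib.sha256(payload).hexdigest()


# Inside a Dagster asset, the digest would be attached as the data version,
# e.g. (illustrative, not my exact code):
#
#   from dagster import asset, Output, DataVersion
#
#   @asset
#   def upstream():
#       df = build_dataframe()  # hypothetical helper
#       digest = hash_bytes(pickle.dumps(df))
#       return Output(df, data_version=DataVersion(digest))

digest = hash_bytes(pickle.dumps({"a": [1, 2, 3]}))
```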
I am thinking of passing the exact hash string into run config via dg.Config, with some other function that loads a past upstream asset by name. But I can't find anything in the docs about overriding the name of the pickled artefact. I have a hunch I'd have to implement my own IO manager based on S3PickleIOManager, but it's not clear how to add the output data_version to the output key.
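To make the question concrete, here is the kind of key scheme I have in mind. This is purely a hypothetical illustration (the function name and suffix format are mine, not Dagster API); a custom IO manager would presumably compute something like this when overriding the path logic:

```python
def versioned_s3_key(base: str, asset_key_parts: list[str], data_version: str) -> str:
    """Build an object key like base/one/two/three-<hash>, so each
    DataVersion of an asset gets its own S3 object rather than
    overwriting the previous materialization."""
    *prefix, leaf = asset_key_parts
    return "/".join([base, *prefix, f"{leaf}-{data_version}"])


# Using the base directory and asset key from the example above:
key = versioned_s3_key("/my/base/path", ["one", "two", "three"], "ab12cd")
# → "/my/base/path/one/two/three-ab12cd"
```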