Skip to content

Track merging datatree into xarray #8572

@TomNicholas

Description

@TomNicholas
Member

What is your issue?

Master issue to track progress of merging xarray-datatree into xarray main. Would close #4118 (and many similar issues), as well as one of the goals of our development roadmap.

Also see the project board for DataTree integration.


On calls in the last few dev meetings, we decided to forget about a temporary cross-repo from xarray import datatree (so this issue supercedes #7418), and just begin merging datatree into xarray main directly.

Weekly meeting

See #8747

Task list:

To happen in order:

  • open_datatree in xarray. This doesn't need to be performant initially, and it would initially return a datatree.DataTree object. EDIT: We decided it should return an xarray.DataTree object, or even xarray.core.datatree.DataTree object. So we can start by just copying the basic version in datatree/io.py right now which just calls open_dataset many times. add open_datatree to xarray #8697

    Triage and fix issues: figure out which of the issues on xarray-contrib/datatree need to be fixed before the merge (if any).

  • Merge in code for DataTree class. I suggest we do this by making one PR for each module, and ideally discussing and merging each before opening a PR for the next module. (Open to other workflow suggestions though.) The main aim here being lowering the bus factor on the code, confirming high-level design decisions, and improving details of the implementation as it goes in.

    Suggested order of modules to merge:

  • Expose datatree API publicly. Actually expose open_datatree and DataTree in xarray's public API as top-level imports. The full list of things to expose is:

    • open_datatree
    • DataTree
    • map_over_subtree
    • assert_isomorphic
    • register_datatree_accessor
  • Refactor class inheritance - Dataset/DataArray share some mixin classes (e.g. DataWithCoords), and we could probably refactor DataTree to use these too. This is low-priority but would reduce code duplication.

Can happen basically at any time or maybe in parallel with other efforts:


Anyone is welcome to help with any of this, including but not limited to @owenlittlejohns , @eni-awowale, @flamingbear (@etienneschalk maybe?).

cc also @shoyer @keewis for any thoughts as to the process.

Activity

jhamman

jhamman commented on Dec 22, 2023

@jhamman
Member

Copy xarray-contrib/datatree issues over to xarray's main repository. I think this is quite important and worth doing as a record of why decisions were made. (@jhamman and @TomNicholas)

I think this will require temporarily moving the datatree repository to the pydata org then transferring issues one at a time to the xarray repo. I can help with the repo move when the time comes.

Look into merging commit history of xarray-contrib/datatree. I think this would be cool but is less important than keeping the issues. (@jhamman suggested we could do this using some git wizardry that I hadn't heard of before)

There are various ways to do this and I think it would be worth attempting. It would help preserve some of the iteration that datatree went through and make sure the attribution is carried through. This blog post explains one way to do this: https://gfscott.com/blog/merge-git-repos-and-keep-commit-history/

TomNicholas

TomNicholas commented on Dec 22, 2023

@TomNicholas
MemberAuthor

It would help preserve some of the iteration that datatree went through and make sure the attribution is carried through.

That would be nice! But at least this method merges the entire history in one go it seems. What would our process of feedback be in that case? I'm worried about just merging the whole thing in and everyone just being like "yeah looks good 👍" without anyone else actually understanding how the code works...

keewis

keewis commented on Dec 22, 2023

@keewis
Collaborator

you could create a new feature branch on the xarray repo (just to be safe) and put the datatree code in a "staging" area. Then copying over the modules one by one might work? Not sure if that breaks git blame, though.

Edit: the merge of that feature branch into main should not be a squash-merge though

lsterzinger

lsterzinger commented on Dec 26, 2023

@lsterzinger

Thanks for putting this together @TomNicholas

Happy to help out with this however I can. Like I mentioned in the meeting last week, I'm not super familiar with the xarray backend but definitely willing to learn.

owenlittlejohns

owenlittlejohns commented on Dec 26, 2023

@owenlittlejohns
Contributor

open_datatree in xarray. This doesn't need to be performant initially, and it would initially return a datatree.DataTree object. So we can start by just copying the basic version in datatree/io.py right now which just calls open_dataset many times.

I was taking a quick look at this. Are you essentially saying we just need to copy the contents of datatree/io.py into xarray/backends/api.py (plus necessary tests to equivalent places)? Or do some of the things in io.py need to be migrated somewhere lower level than api.py?

TomNicholas

TomNicholas commented on Dec 26, 2023

@TomNicholas
MemberAuthor

Are you essentially saying we just need to copy the contents of datatree/io.py into xarray/backends/api.py (plus necessary tests to equivalent places)? Or do some of the things in io.py need to be migrated somewhere lower level than api.py?

There are 3 levels of integration at which we could do this:

  1. Add a standalone open_datatree function to xarray/backends/api.py that directly copies code from datatree/io.py.
  2. Add an open_datatree method to xarray/backends/common.py::BackendEntrypoint and implement it for Zarr and netCDF backends using the same approach as in datatree/io.py (i.e. calling open_dataset once for each group).
  3. Same as (2) but refactor the method to only open the file once.

I think you should try (2), falling back to (1) if that's too tricky, but deliberately leave (3) for a later PR.

eschalkargans

eschalkargans commented on Jan 3, 2024

@eschalkargans

Hello,

If it can help, I found myself in a situation, quite similar however slightly different, where I had to merge two repos A and B into one (keeping A and archiving B), moving contents of A and B into new subfolders of the A repo, eg A/a, A/b. This differs in that we don't want to put the xarray code into a new xarray subdirectory. But maybe the procedure would be hopefully similar.

Here is a gist summarizing my procedure to do so: https://gist.github.com/eschalkargans/318d83e58d63d83454d1f8a497786a8d

keewis

keewis commented on Jan 17, 2024

@keewis
Collaborator

I tried my hand at doing the merge, here's the result: datatree on keewis/xarray. This required two extra commits: one for moving the whole repository to a subdirectory (xarray/datatree_, note the underscore to mark it as temporary), and one merge commit.

If anyone wants to try, after adding xarray-contrib/datatree as a remote named datatree and switching to a feature branch I called:

git merge datatree/prepare-for-migration --no-commit --allow-unrelated-histories

where datatree/prepare-for-migration contains the commit moving the repository to the subdirectory.

flamingbear

flamingbear commented on Jan 18, 2024

@flamingbear
Member

So I wasn't able to come up with anything more clever1 than what is above. If bringing the history over in one go to a temporary location is fine, I presume renaming the files into locations as we migrate will preserve the history as we move forward. I guess the next question is whether to feature branch after the import and merge to main or to try to do the migration steps into main proper? Pros and cons to both, but would lean towards directly into main. Thoughts/feelings?

  1. I wanted to bring in the history with each PR to main, but after actually thinking about how that would work, it doesn't.
TomNicholas

TomNicholas commented on Jan 18, 2024

@TomNicholas
MemberAuthor

Thanks all three of you for trying this!

I tried my hand at doing the merge, here's the result: datatree on keewis/xarray. This required two extra commits: one for moving the whole repository to a subdirectory (xarray/datatree_, note the underscore to mark it as temporary), and one merge commit.

preserve the history

Is it just me or has this approach not actually preserved the history at all? All the datatree code seems to be squashed into one massive commit: keewis@4227b38

I guess the next question is whether to feature branch after the import and merge to main or to try to do the migration steps into main proper?

I think we don't really need a feature branch as there are no backwards compatibility issues? (@shoyer unless you have a preference?) Also the datatree code can be merged into main without actually exposing DataTree as public API until we're ready.


@eschalkargans we appreciate your interest in this! FYI the rest of us on this thread met yesterday in xarray's bi-weekly community dev call, which you would be more than welcome to join for. 😄 Regardless we will try to write out any decisions we make in that meeting also on github for reference.

40 remaining items

TomNicholas

TomNicholas commented on Oct 24, 2024

@TomNicholas
MemberAuthor

Release ToDos for myself:

  • Make sure the critical outstanding PRs are merged:
  • Release xarray, with zarr v3 and datatree announcements in release notes. Maybe actually do an email announcement this time too.
    Create datatree announcement discussion on xarray, linking to the migration guide
    Finish and merge Warning about repo being archived xarray-contrib/datatree#344, ensuring the warning points to the xarray discussion
    Release xarray-contrib/datatree one final time
    Archive xarray-contrib/datatree
    Update xarray discussion to say that original repo has been archived
    Tweet?

cc @keewis

TomNicholas

TomNicholas commented on Oct 24, 2024

@TomNicholas
MemberAuthor

Released! 🍾

Thank you so much everyone - anything else that comes up can go in a new issue.

(closed by https://github.com/pydata/xarray/releases/tag/v2024.10.0 and #9680)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @flamingbear@jhamman@kmuehlbauer@lsterzinger@owenlittlejohns

        Issue actions

          Track merging datatree into xarray · Issue #8572 · pydata/xarray