Description
What is your issue?
Master issue to track progress of merging xarray-datatree into xarray `main`. This would close #4118 (and many similar issues), and would also fulfil one of the goals of our development roadmap.
Also see the project board for DataTree integration.
On calls in the last few dev meetings, we decided to forget about a temporary cross-repo `from xarray import datatree` (so this issue supersedes #7418), and just begin merging datatree into xarray `main` directly.
Weekly meeting
See #8747
Task list:
To happen in order:
- `open_datatree` in xarray. This doesn't need to be performant initially, and ~~it would initially return a `datatree.DataTree` object.~~ EDIT: We decided it should return an `xarray.DataTree` object, or even `xarray.core.datatree.DataTree` object. So we can start by just copying the basic version in `datatree/io.py` right now, which just calls `open_dataset` many times. (add open_datatree to xarray #8697)
- Triage and fix issues: figure out which of the issues on xarray-contrib/datatree need to be fixed before the merge (if any).
- Merge in code for `DataTree` class. I suggest we do this by making one PR for each module, and ideally discussing and merging each before opening a PR for the next module. (Open to other workflow suggestions though.) The main aims here are lowering the bus factor on the code, confirming high-level design decisions, and improving details of the implementation as it goes in. Suggested order of modules to merge:
  - `datatree/treenode.py` - defines the tree structure, without any dimensions/data attached (Migrate treenode module. #8757),
  - `datatree/datatree.py` - adds data to the tree structure (Migrate datatree.py module into xarray.core. #8789),
  - `datatree/iterators.py` - iterates over a single tree in various ways, currently copied from anytree (Migrate iterators.py for datatree. #8879),
  - `datatree/mapping.py` - implements `map_over_subtree` by iterating over N trees at once (Migrate datatree mapping.py #8948),
  - `datatree/ops.py` - uses `map_over_subtree` to map methods like `.mean` over whole trees (Migration of datatree/ops.py -> datatree_ops.py #8976),
  - `datatree/formatting_html.py` - HTML repr, works but could do with some optimization (Migrate formatting_html.py into xarray core #8930),
  - `datatree/{extensions/common}.py` - miscellaneous other features e.g. attribute-like access (Migrate datatreee assertions/extensions/formatting #8967).
- Expose datatree API publicly. Actually expose `open_datatree` and `DataTree` in xarray's public API as top-level imports. The full list of things to expose is:
  - `open_datatree`
  - `DataTree`
  - `map_over_subtree`
  - `assert_isomorphic`
  - `register_datatree_accessor`
- Refactor class inheritance - `Dataset`/`DataArray` share some mixin classes (e.g. `DataWithCoords`), and we could probably refactor `DataTree` to use these too. This is low-priority but would reduce code duplication.
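
The module order in the list (a bare tree structure, then data attached to it, then iteration over one tree, then mapping a function over N trees at once) can be illustrated with a toy sketch. This is not xarray's or datatree's implementation: `Node`, `subtree` and `map_over_trees` are hypothetical names invented for illustration, and the "same node paths" check is only a rough stand-in for what `assert_isomorphic` is described as doing.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Iterator


@dataclass
class Node:
    """Hypothetical tree node; a stand-in, NOT xarray's DataTree."""

    name: str
    data: Any = None  # payload attached on top of the bare tree structure
    children: dict[str, Node] = field(default_factory=dict)

    def add(self, child: Node) -> Node:
        self.children[child.name] = child
        return child


def subtree(node: Node, path: str = "/") -> Iterator[tuple[str, Node]]:
    """Yield (path, node) pairs depth-first, starting with the node itself."""
    yield path, node
    for name, child in node.children.items():
        yield from subtree(child, f"{path.rstrip('/')}/{name}")


def map_over_trees(func: Callable[..., Any], *trees: Node) -> dict[str, Any]:
    """Apply func to the data of corresponding nodes in N same-shaped trees."""
    walks = [list(subtree(tree)) for tree in trees]
    paths = [[path for path, _ in walk] for walk in walks]
    if any(p != paths[0] for p in paths[1:]):
        raise ValueError("trees do not share the same node paths")
    return {
        path: func(*(walk[i][1].data for walk in walks))
        for i, path in enumerate(paths[0])
    }


def make_tree(scale: int) -> Node:
    """Build a small three-node tree whose payloads are scaled integers."""
    root = Node("root", data=1 * scale)
    a = root.add(Node("a", data=2 * scale))
    a.add(Node("b", data=3 * scale))
    return root


result = map_over_trees(lambda x, y: x + y, make_tree(1), make_tree(10))
# result == {"/": 11, "/a": 22, "/a/b": 33}
```

Keeping the structure (`Node`), the single-tree walk (`subtree`) and the multi-tree mapping as separate pieces mirrors the one-PR-per-module workflow suggested in the list, since each layer can be reviewed without the ones above it.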
Can happen basically at any time or maybe in parallel with other efforts:
- Generalize backends to support groups. Once a basic version of `xr.open_datatree` exists, we can start refactoring xarray's backend classes to support a general `Backend.open_datatree` method for any backend that can open multiple groups. Then we can make sure this is more performant than the naive implementation, i.e. by only opening the file once. See also Improving performance of open_datatree #8994.
- Support backends other than netCDF and Zarr - e.g. grib, see DRAFT: Implement `open_datatree` in BackendEntrypoint for preliminary DataTree support #7437.
- Support dask properly - Issue Dask-specific methods xarray-contrib/datatree#97 and the (stale) PR Implement dask-specific methods xarray-contrib/datatree#196 are about dask parallelization over separate nodes in the tree.
- Add other new high-level API methods - Things like `.reorder_nodes` and ideas we've only discussed like API for filtering / subsetting xarray-contrib/datatree#79 and Tree-aware dataset handling/selection xarray-contrib/datatree#254 (cc @dcherian who has had useful ideas here).
- Copy xarray-contrib/datatree issues over to xarray's main repository. I think this is quite important and worth doing as a record of why decisions were made. (@jhamman and @TomNicholas)
- Copy over any recent bug fixes from original `datatree` repository.
- Look into merging commit history of xarray-contrib/datatree. I think this would be cool but is less important than keeping the issues. (@jhamman suggested we could do this using some git wizardry that I hadn't heard of before)
- `xarray.tutorial.open_datatree` - I've been meaning to make a tutorial datatree object for ages. There's an issue about it, but actually now I think something close to the CMIP6 ensemble data that @jbusecke and I used in our pangeo blog post would already be pretty good. Once we have this it becomes much easier to write docs about some advanced features.
- Merge Docs - I've tried to write these pages so that they should slot neatly into xarray's existing docs structure. Careful reading, additions and improvements would be great though. Summary of what docs exist on this issue: Documentation plans xarray-contrib/datatree#61
- Write a blog post on the xarray blog highlighting xarray's new functionality, and explicitly thanking the NASA team for their work. Doesn't have to be long, it can just point to the documentation. DataTree release blog post xarray-contrib/xarray.dev#708
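
As a rough illustration of the "generalize backends to support groups" item, the sketch below shows a backend-level `open_datatree` method that opens the underlying file once and reads every group from that single handle. Everything here is hypothetical: `ToyStore`, `ToyBackendEntrypoint` and `ToyGroupedBackend` are invented names, the in-memory dict stands in for a real file, and none of it reflects xarray's actual backend classes.

```python
from typing import Any


class ToyStore:
    """Hypothetical file handle that counts how many times it is opened."""

    opens = 0

    def __init__(self, path: str, contents: dict[str, Any]) -> None:
        type(self).opens += 1  # each construction models one file open
        self.path = path
        self._contents = contents

    def groups(self) -> list[str]:
        return list(self._contents)

    def read(self, group: str) -> Any:
        return self._contents[group]


class ToyBackendEntrypoint:
    """Hypothetical base class; backends without group support inherit the default."""

    def open_dataset(self, path: str, *, group: str = "/") -> Any:
        raise NotImplementedError

    def open_datatree(self, path: str) -> dict[str, Any]:
        raise NotImplementedError(f"{type(self).__name__} does not support groups")


class ToyGroupedBackend(ToyBackendEntrypoint):
    """Hypothetical backend that can read several groups from one open handle."""

    def __init__(self, files: dict[str, dict[str, Any]]) -> None:
        self._files = files  # maps a path to its {group: data} contents

    def open_dataset(self, path: str, *, group: str = "/") -> Any:
        return ToyStore(path, self._files[path]).read(group)

    def open_datatree(self, path: str) -> dict[str, Any]:
        store = ToyStore(path, self._files[path])  # opened exactly once
        return {group: store.read(group) for group in store.groups()}


backend = ToyGroupedBackend({"example.zarr": {"/": {"t": 0}, "/a": {"t": 1}}})
tree = backend.open_datatree("example.zarr")
# ToyStore.opens == 1 here, however many groups the file contains
```

Putting the method on the entrypoint lets each backend decide how to share a handle across groups, while backends that cannot open multiple groups simply keep the default that raises `NotImplementedError`.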
Anyone is welcome to help with any of this, including but not limited to @owenlittlejohns, @eni-awowale, @flamingbear (@etienneschalk maybe?).
jhamman commented on Dec 22, 2023
I think this will require temporarily moving the datatree repository to the pydata org then transferring issues one at a time to the xarray repo. I can help with the repo move when the time comes.
There are various ways to do this and I think it would be worth attempting. It would help preserve some of the iteration that datatree went through and make sure the attribution is carried through. This blog post explains one way to do this: https://gfscott.com/blog/merge-git-repos-and-keep-commit-history/
TomNicholas commented on Dec 22, 2023
That would be nice! But at least this method merges the entire history in one go it seems. What would our process of feedback be in that case? I'm worried about just merging the whole thing in and everyone just being like "yeah looks good 👍" without anyone else actually understanding how the code works...
keewis commented on Dec 22, 2023
you could create a new feature branch on the xarray repo (just to be safe) and put the datatree code in a "staging" area. Then copying over the modules one by one might work? Not sure if that breaks `git blame`, though.
Edit: the merge of that feature branch into `main` should not be a squash-merge though
lsterzinger commented on Dec 26, 2023
Thanks for putting this together @TomNicholas
Happy to help out with this however I can. Like I mentioned in the meeting last week, I'm not super familiar with the xarray backend but definitely willing to learn.
owenlittlejohns commented on Dec 26, 2023
I was taking a quick look at this. Are you essentially saying we just need to copy the contents of `datatree/io.py` into `xarray/backends/api.py` (plus necessary tests to equivalent places)? Or do some of the things in `io.py` need to be migrated somewhere lower level than `api.py`?
TomNicholas commented on Dec 26, 2023
There are 3 levels of integration at which we could do this:
1. Add an `open_datatree` function to `xarray/backends/api.py` that directly copies code from `datatree/io.py`.
2. Add an `open_datatree` method to `xarray/backends/common.py::BackendEntrypoint` and implement it for Zarr and netCDF backends using the same approach as in `datatree/io.py` (i.e. calling `open_dataset` once for each group).
I think you should try (2), falling back to (1) if that's too tricky, but deliberately leave (3) for a later PR.
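
The "call `open_dataset` once for each group" approach mentioned in this comment can be sketched roughly as follows. This is a toy under stated assumptions, not the code in `datatree/io.py`: `open_tree_naive`, `fake_list_groups` and `fake_open_group` are hypothetical names, and an in-memory dict stands in for a real netCDF or Zarr file so the sketch runs without any I/O.

```python
from typing import Any, Callable, Iterable


def open_tree_naive(
    path: str,
    list_groups: Callable[[str], Iterable[str]],
    open_group: Callable[[str, str], Any],
) -> dict[str, Any]:
    """Build a {group_path: dataset} mapping by opening each group separately.

    Both callables are hypothetical stand-ins for what a real backend would
    provide; every open_group call is assumed to reopen the file.
    """
    return {group: open_group(path, group) for group in list_groups(path)}


# Toy in-memory "file" keyed by group path.
FAKE_FILE = {"/": {"t": 0}, "/coarse": {"t": 1}, "/fine": {"t": 2}}
open_calls: list[tuple[str, str]] = []


def fake_list_groups(path: str) -> list[str]:
    return list(FAKE_FILE)


def fake_open_group(path: str, group: str) -> dict[str, int]:
    open_calls.append((path, group))  # record one "file open" per group
    return FAKE_FILE[group]


tree = open_tree_naive("example.nc", fake_list_groups, fake_open_group)
# len(open_calls) == 3: one open per group
```

The number of opens grows with the number of groups, which is why this version is simple to write but is a natural candidate for optimization later.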
eschalkargans commented on Jan 3, 2024
Hello,
If it can help, I found myself in a similar, though slightly different, situation where I had to merge two repos A and B into one (keeping A and archiving B), moving the contents of A and B into new subfolders of the A repo, e.g. `A/a`, `A/b`. This differs in that we don't want to put the `xarray` code into a new `xarray` subdirectory. But hopefully the procedure would be similar.
Here is a gist summarizing my procedure to do so: https://gist.github.com/eschalkargans/318d83e58d63d83454d1f8a497786a8d
keewis commented on Jan 17, 2024
I tried my hand at doing the merge, here's the result: `datatree` on keewis/xarray. This required two extra commits: one for moving the whole repository to a subdirectory (`xarray/datatree_`, note the underscore to mark it as temporary), and one merge commit.
If anyone wants to try, after adding `xarray-contrib/datatree` as a remote named `datatree` and switching to a feature branch I called:
where `datatree/prepare-for-migration` contains the commit moving the repository to the subdirectory.
flamingbear commented on Jan 18, 2024
So I wasn't able to come up with anything more clever than what is above. If bringing the history over in one go to a temporary location is fine, I presume renaming the files into locations as we migrate will preserve the history as we move forward. I guess the next question is whether to feature branch after the import and merge to main or to try to do the migration steps into main proper? Pros and cons to both, but would lean towards directly into main. Thoughts/feelings?
TomNicholas commented on Jan 18, 2024
Thanks all three of you for trying this!
Is it just me or has this approach not actually preserved the history at all? All the datatree code seems to be squashed into one massive commit: keewis@4227b38
I think we don't really need a feature branch as there are no backwards compatibility issues? (@shoyer unless you have a preference?) Also the datatree code can be merged into main without actually exposing `DataTree` as public API until we're ready.
@eschalkargans we appreciate your interest in this! FYI the rest of us on this thread met yesterday in xarray's bi-weekly community dev call, which you would be more than welcome to join. 😄 Regardless, we will try to write out any decisions we make in that meeting on GitHub too, for reference.
`open_groups` for zarr backends #9469
TomNicholas commented on Oct 24, 2024
Release ToDos for myself:
- `open_datatree` #9666
- `dask` methods on `DataTree` #9670
cc @keewis
TomNicholas commented on Oct 24, 2024
Released! 🍾
Thank you so much everyone - anything else that comes up can go in a new issue.
(closed by https://github.com/pydata/xarray/releases/tag/v2024.10.0 and #9680)