Skip to content

healpix moc index #151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 57 commits into
base: main
Choose a base branch
from
Open

healpix moc index #151

wants to merge 57 commits into from

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Jun 3, 2025

Note: This is a highly experimental PR (and a lot still has to be done for this to be usable)

This tries to use the RangeMOCIndex class from EOPF-DGGS/healpix-geo#31 instead of the standard pandas index.

In theory, this should allow opening and decoding datasets up to level 29 without having to keep the cell ids in memory (and quite possibly we can skip loading the cell ids if we see that we have $12 \cdot 4^{\mathrm{level}}$ cells).

The current state does not support dask (so the cell ids will still be kept in memory), but this is definitely something that will be added as the PR progresses – I still have to understand how the coordinate transform indexes work in xarray.

cc @benbovy

@benbovy
Copy link
Member

benbovy commented Jun 4, 2025

Nice!

Not sure this PR fully closes #143, though, since this is Healpix specific.

The current state does not support dask (so the cell ids will still be kept in memory), but this is definitely something that will be added as the PR progresses – I still have to understand how the coordinate transform indexes work in xarray.

I'll take a stab at adding coordinate transforms in a dggs-agnostic way in another PR (at least for lat/lon auxiliary coordinates), hopefully this will provide a good baseline for adding a lazy coordinate for cell ids here.

@keewis
Copy link
Collaborator Author

keewis commented Jun 11, 2025

@benbovy, this appears to work properly now. The current behavior is:

  • if we detect a full domain (obj.sizes["cells"] == 12 * 4**level), we use a special constructor and don't look at the values
  • if the data of cell_ids is not a dask array, we use RangeMOCIndex.from_cell_ids
  • else we create a RangeMOCIndex for each chunk, then reduce using RangeMOCIndex.union

Missing functionality (compared to PandasIndex) is isel with arrays and sel, because I couldn't figure out how to implement those on MOCs so far.

I'll look into adding a few more tests before this is ready for merging.

@benbovy
Copy link
Member

benbovy commented Jun 12, 2025

I guess that means we should mandate that create_variables must be implemented if isel was overridden?

In the case of HealpixMocIndex yes, but maybe not in all cases. This is the kind of things that we might want to clarify in Xarray's documentation or via blog posts using, e.g., some flowcharts.

@keewis
Copy link
Collaborator Author

keewis commented Jun 12, 2025

This is the kind of things that we might want to clarify in Xarray's documentation or via blog posts

agreed, although I believe that in the end it should end up somewhere in the documentation (blog posts can easily become out-of-date).

Either way, I think we have all the tests necessary, so this should be ready for review / merging.

Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @keewis, this will be a great addition!

I did a 1st pass and left a few comments. I'll do a more in-depth review next week.

"cdshealpix",
"healpix-geo>=0.0.3",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to make it an optional dependency? I'm concerned by the number of dependencies (compiled packages) xdggs would require as we add more support for various DGGS.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, I think we should figure out how to resolve this. When we added plotting support we decided to install everything by default (it was easier that way), but that won't scale (see also #128). However, I think we should limit the builtin DGGS to just h3 and healpix, and have separate extension packages for others.

healpix-geo is a bit special in that you will need it to support ellipsoidal healpix, so that may just stay required (or in the same extra as cdshealpix, should we modify the code to make h3ronpy and cdshealpix optional)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I think we should limit the builtin DGGS to just h3 and healpix

Yes we can limit builtin support to just a few DGGSs (S2 might probably also make sense), but even so users who may be interested in only H3 would need to download a few Healpix dependencies, which is probably not ideal.

We can address that later, though.

@keewis
Copy link
Collaborator Author

keewis commented Jun 16, 2025

I think I addressed the review comments (and I'll punt the decision on optional vs required dependencies to a different PR), so this should be ready for another round of reviews?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants