-
Notifications
You must be signed in to change notification settings - Fork 17
ZEP10: Generic extensions proposal #67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Hi @joshmoore, just a process question, it would seem beneficial to get this PR merged asap so it becomes visible as a draft zep on the zeps website. Who needs to approve that, and what checks would need to be done at this stage to allow merging? E.g., does someone just need to check that the document has the right structure for a ZEP? If so, I'd be happy to approve. |
I know we've done that in the past for ZEPs but then it is actually harder to comment on it --- I'd need to open a separate issue for each comment.. |
For merging in the "Draft", yes, that suffices. From https://zarr.dev/zeps/active/ZEP0000.html#submitting-a-zep
I'm certainly all for leaving it open for a bit, especially for the discussion of the material that is only here (as @jbms has done above). I can manage having it open and synchronizing with the specs PR. That being said, if possible, I'd like to get it merged as a "Draft" and then will also keep updating it as necessary to stay in step with discussions on zarr-developers/zarr-specs#344 |
seconding @jbms, I rate the ability to discuss the ZEP as a single PR much higher than seeing it listed on the ZEP web site, so I would rather we keep this PR open until it's clear that all the questions have been answered. |
Note that in this example of the extension is ``must_understand=true`` meaning | ||
an implementation which does not support the ``example.offset`` extension | ||
should raise an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when should that error be raised? when reading metadata, or when reading chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the impl doesn't know the example.offset
extension, it must fail when parsing the metadata.
It may fail with a out-of-bounds error when reading/writing data outside the domain. But that would be up to the specification for this extension to define.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the impl doesn't know the example.offset extension, it must fail when parsing the metadata.
It seems to me that a zarr-compatible application should be able to say, for example, "this is an array with shape <shape>
, but I can't load chunks for you because of <unknown extension>
". Your suggesting that the metadata document should be effectively unreadable prevents this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that a zarr-compatible application should be able to say, for example, "this is an array with shape
<shape>
, but I can't load chunks for you because of<unknown extension>
".
I think that would be a good implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a good implementation.
Since the behavior I described relies on reading the metadata without an error, this PR should clarify the distinction between reading metadata documents and other IO operations (e.g., reading chunks, in this example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are purely displaying information to a user and including a warning that an unknown extension was encountered, then displaying whatever information can be heuristically extracted from the metadata successfully may be reasonable.
In general though if there is an unknown extension, you can't really make any assumptions about the meaning of the metadata and any programmatic use is problematic.
For example, the offset
extension may mean that the upper bound of the array is no longer indicated by shape
but by offset + shape
, and the chunk grid starts at offset
rather than (0, ...)
. Maybe there is some program that partitions zarr arrays according to the chunking and then hands off those zarr arrays to worker processes. If the partition program does not support the offset
extension, but the worker program does support the offset
extension, then the partition program will perform the partitioning incorrectly, but the worker processes may process them without errors, but not correctly aligned to the chunk grid.
Concretely, I'd say that if there is an unknown must_understand=true extension, zarr.open
and similar interfaces should not appear to succeed and allow querying properties like the chunk grid, dtype, etc. unless the user explicitly opts into ignoring unknown extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general though if there is an unknown extension, you can't really make any assumptions about the meaning of the metadata and any programmatic use is problematic.
I find this outcome concerning, as it amounts to fragmenting the zarr ecosystem.
I think this document should explain why the pre-existing |
For |
Co-authored-by: Davis Bennett <[email protected]>
Co-authored-by: Davis Bennett <[email protected]>
to be clear, the specific thing that would not work if all extensions were in This makes me wonder: how important is it really to exclude non-compliant implementations from accessing (and possible misinterpreting) data? I.e., how much weight should we assign to this feature. Are there real examples of negative outcomes from misinterpreting specialized zarr data? Or is this purely hypothetical? |
Thanks for the feedback, all. I've pushed a number of clarification commits based on them, and tried to resolve the threads appropriately. I have ideas on further examples (esp. encryption as recently discussed on Zulip), but I'd very much welcome any others that may be floating around (as PRs, comments, etc.) There are a few remaining conversations:
These may be easier on a call to work toward consensus rather than extended back and forth here. Since the previous ZEP meeting spot was cancelled, I'd suggest we start with a one-off. Finding time this coming week (May 26+) may be difficult but two options are:
I'd still also like to encourage other implementer voices, @zarr-developers/implementation-council. To ensure everyone feels comfortable contributing, it might be helpful for those who have already shared their perspective to give others space to chime in without feeling the need to immediately respond or defend their thoughts points. |
Thanks Josh and Norman, this looks pretty great! My thoughts based on the PR and comments so far:
|
Can someone give an example of how order-dependent extensions might be used/specified? It seems to me that it would be potentially confusing to have most extensions be order-independent but in certain cases have order dependence. It also seems like it would be quite challenging to specify the order-dependent behavior unless you can map the extension to some composable "interface" like a codec or storage transformer. But if there is such a composable interface, the extension should just be defined as specifying a list of "things" that conform to that interface, and that interface becomes a new extension extension point, e.g. |
..., | ||
"extensions": [ | ||
{ | ||
"name": "example.offset", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the example
part of this name meaningful? I think it would be useful to either define in this document what the naming conventions are or link to the relevant external convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that's from that "Extensions naming" section added by ZEP9.
Is there anything in this proposal that motivates its restriction to Zarr specification 3 rather than both Zarr specification 2 and 3? |
At least from my pov, there is no desire to further evolve the v2 specification. Extensibility was one major motivation of the v3 specification. I think it would be confusing to continue evolving both. |
I've not come up with a compelling one, @jbms. My intuition is that there would be a chance for one member of the pipeline (extA) to be able to update some state (the metadata?) before a later one (extB). Practically, though, I don't see how extA could know enough about extB to inject itself into the list at the right point. So other than "high-priority" extensions which add themselves at the beginning and "low-priority" ones which add themselves at the end, I still don't have a concrete example. But, generally, 👍 on a general "SequenceExtension" style that others can adopt. Perhaps this speaks to a "generic extensions conventions" (or "idioms") section. |
I agree with what @LDeakin said about must_understand --- must understand for writing should always implicitly be true and must_understand applies only for reading. That simplifies things nicely. I am not in favor of using extensions as an attribute namespace for things like ome-zarr that are logically layered on top of zarr itself and don't require changes/deep integration with the zarr implementation, for several reasons:
Instead we should add an attribute section to the zarr-extensions repo (related to the zarr conventions proposal). While technically this is a breaking change in that currently no part of the attribute namespace is reserved for registered names, in practice we could use some prefix or other naming convention for registered attributes such that conflicts with existing uses are very unlikely. As I see it, extensions do have a high cost as far as fragmenting the ecosystem and therefore should be introduced with care, mostly for things that could reasonably be added to the core spec also. |
Co-authored-by: Sanket Verma <[email protected]>
As mentioned in zarr-developers/zarr-specs#344 (comment), since we didn’t manage to have an in-person discussion about any of the above topics and we’re now running into holiday times, the timeline for a vote has been put on hold. Here I want to summarize the main decisions and raise a few lingering questions.
|
one thing that is missing for me is a clear demonstration of an individual or group who needs the functionality added in this ZEP. As in: what, in concrete terms, is blocked by the lack of this feature? Alternatively, what onerous thing are we doing today, that we could simplify with this proposal? The PR does contain examples but they are contrived. I would appreciate real examples, i.e. references to real projects or code. If there are none, then I guess that's fine, but I would find the contents of this proposal 100000 times easier to think about if I could see it as a solution to an acute problem. |
Thanks for the 👍, @LDeakin, but I probably put too much in one comment 😄 So for clarity, let me spell out point 3 in an additional comment to get reactions to it: Is there support for updating the text of ZEP10 to:
cc: @jbms |
This is a follow on to ZEP9 (#65) since #66 limits the scope of ZEP9 solely to phase 1 such that it can be moved to accepted (since zarr-developers/zarr-specs#330 is merged and v3.1 released). This ZEP is equivalent to phase 2 of the original ZEP9 draft and introduces a top-level generic
extensions
field.This ZEP will follow the process laid out in ZEP0 and invites votes from the newly refreshed @zarr-developers/implementation-council. This PR may be proactively merged as a draft, but will not be moved to "accepted" until the related PR on zarr-specs is voted on, merged, and v3.2 released.
Please see zarr-developers/zarr-specs#344 for detailed changes.