diff --git a/draft/ZEP0010.md b/draft/ZEP0010.md
new file mode 100644
index 0000000..a56f911
--- /dev/null
+++ b/draft/ZEP0010.md
@@ -0,0 +1,440 @@
+---
+layout: default
+title: ZEP0010
+description: This ZEP proposes a new generic extensions field.
+parent: draft ZEPs
+nav_order: 10
+---
+
+# ZEP 10 — Zarr Generic Extensions
+
+Authors:
+
+- [Norman Rzepka](https://github.com/normanrz), scalable minds
+- [Josh Moore](https://github.com/joshmoore), German BioImaging
+
+Status: Draft
+
+Type: Specification
+
+Created: 2025-05-12
+
+## Abstract
+
+This proposal defines a new generic extension point, ``extensions``, to be
+included in the metadata of Zarr v3 arrays and groups. The ``extensions`` field
+provides a consistent mechanism for attaching additional metadata that does not
+fit into existing extension points defined by the core specification. Extension
+entries within this field follow the naming and structure rules established in
+ZEP0009. This mechanism enables third parties to define and share metadata
+extensions without requiring changes to the core specification or introducing
+new top-level keys.
+
+## Introduction
+
+Zarr specification version 3 currently defines four extension points, each
+associated with a specific (array) metadata field. Additional extension points
+may be added by future ZEPs. Until that time, however, third parties may want
+to add arbitrary extension objects to either arrays or groups. This proposal
+introduces a generic ``extensions`` field that serves as a container for a
+list of such extensions.
+
+These general-purpose extensions are not limited by the scopes of existing
+extension points and require no heavyweight process to add functionality or alter
+the behavior of arrays and groups.
+The intent is to facilitate decentralized and low-friction
+innovation within the Zarr ecosystem by enabling third parties to experiment
+with new features without requiring immediate changes
+to the core specification.
+By tolerating a broader range of experimental extensions, the community can
+explore diverse use cases and patterns. Over time, widely adopted extensions
+may serve as the foundation for future standardization through new ZEPs that
+introduce new extension points or even core features.
+
+## Proposal
+
+To provide for more flexible, immediate, and decentralized use cases, we
+propose to add a generic extension point ``extensions`` on
+both arrays and groups into which extensions MAY be added.
+
+This field is similar in flexibility to the ``attributes`` field. Conceptually,
+``extensions`` is intended primarily for use by software and automated
+processes, with the potential to influence behavior or processing logic,
+whereas ``attributes`` are generally intended for human interpretation and
+serve as passive metadata or provenance information, though the boundaries are
+not always distinct.
+
+By adding a new field, the specification can assert restrictions that, if added
+to ``attributes``, would amount to a breaking change. If present, the
+``extensions`` field MUST contain an array of extension definitions. The
+array MUST contain one or more extensions; otherwise the field MUST be
+omitted entirely. Specifying metadata within ``extensions`` as opposed to
+``attributes`` allows the clear registration of the extension name, provides a
+namespace for the metadata to prevent collisions, and activates the
+``must_understand`` handling logic.
+
+Further details on the specification changes can be found in
+.
+
+### Definition and naming
+
+Each extension object will follow the rules laid out in the "Zarr extensions"
+section of the v3 specification.
+
+### Processing
+
+Zarr implementers
+are expected to inspect the extensions for each node and determine whether each listed
+extension is supported.
+If an extension includes ``"must_understand": true``
+(the default) and the implementation does not support it, the node must not
+be loaded and an appropriate error should be raised. For extensions with
+``"must_understand": false``, implementers may safely ignore unrecognized entries.
+
+To support a given extension, an implementation may hard-code a check for known
+extension names and invoke appropriate logic according to the extension’s
+specification at the correct point in its processing pipeline (e.g., during
+metadata interpretation, data access, or layout resolution).
+Where possible, however, implementations are encouraged to delegate
+that logic via a callback or plugin mechanism that allows third-party code to
+handle the extension dynamically.
+
+As the set of extensions evolves, certain interfaces may arise which allow
+this modular approach for a subset of extensions. Where possible, these
+interfaces will be added to the specification. Feedback from implementers
+on such matters is highly encouraged.
+
+### Examples
+
+The following examples represent a few realistic use cases of the top-level
+``extensions`` container. This ZEP puts the mechanism in place so that the
+community can experiment with such extensions before their standardization.
+
+#### Offset (array)
+
+```javascript
+{
+    "zarr_format": 3,
+    "node_type": "array",
+    ...,
+    "extensions": [
+        {
+            "name": "example.offset",
+            "configuration": { "offset": [ 12, 24 ] }
+        }
+    ]
+}
+```
+
+The ``example.offset`` extension contains an array of the same order as the
+shape of the containing array specifying which element of the array should be
+considered as the origin, e.g., `[0, 0]`. This allows the reuse of subregions
+of an array without the need to rewrite the data.
+
+Note that in this example the extension omits ``must_understand`` and is
+therefore ``must_understand=true`` (the default), meaning an implementation
+which does not support the ``example.offset`` extension should raise an error.
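The processing rules described above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the proposal: `SUPPORTED`, `UnsupportedExtensionError`, and `check_extensions` are hypothetical names, and treating a bare string entry as shorthand for an object that has only a `name` is an assumption.

```python
from typing import Any

# Hypothetical: the extension names this implementation knows how to handle.
SUPPORTED = frozenset({"example.offset"})


class UnsupportedExtensionError(ValueError):
    """Raised when a must_understand extension is not supported."""


def check_extensions(
    metadata: dict[str, Any], supported: frozenset = SUPPORTED
) -> list[dict[str, Any]]:
    """Return the supported extensions of a node, or raise if loading must fail."""
    active = []
    for entry in metadata.get("extensions", []):
        # Assumption: a bare string is shorthand for {"name": <string>}.
        ext = {"name": entry} if isinstance(entry, str) else entry
        name = ext["name"]
        if name in supported:
            active.append(ext)
        elif ext.get("must_understand", True):
            # must_understand defaults to true: the node must not be loaded.
            raise UnsupportedExtensionError(f"unsupported extension: {name}")
        # Otherwise must_understand is false and the entry is ignored.
    return active


offset_node = {
    "zarr_format": 3,
    "node_type": "array",
    "extensions": [
        {"name": "example.offset", "configuration": {"offset": [12, 24]}}
    ],
}
optional_node = {
    "zarr_format": 3,
    "node_type": "array",
    "extensions": [{"name": "example.unknown", "must_understand": False}],
}

print([ext["name"] for ext in check_extensions(offset_node)])
print(check_extensions(optional_node))
```

Calling `check_extensions(offset_node, supported=frozenset())` raises `UnsupportedExtensionError`, which mirrors the case of an implementation that does not support ``example.offset``.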
+
+#### Statistics (array)
+
+```javascript
+{
+    "zarr_format": 3,
+    "node_type": "array",
+    ...,
+    "extensions": [
+        {
+            "name": "example.array-statistics",
+            "must_understand": false,
+            "configuration": {
+                "min": 5,
+                "max": 1023
+            }
+        }
+    ]
+}
+```
+
+The ``example.array-statistics`` extension contains two fields, ``min``
+and ``max``, specifying the range of values which are present in the array,
+reducing the need to read every byte. ``must_understand`` is false, so
+implementations can safely ignore the extension.
+
+#### Skip empty chunks (array)
+
+```javascript
+{
+    "zarr_format": 3,
+    ...,
+    "extensions": [
+        "example.skip_empty_chunks"
+    ]
+}
+```
+
+Currently the "write_empty_chunks" flag in zarr-python is not propagated
+to the zarr.json file. An extension like ``example.skip_empty_chunks``
+could serve as a no-configuration flag in the metadata to inform
+implementations that empty chunks should not be written.
+
+
+
+#### Multiscale arrays (group)
+
+```javascript
+{
+    "zarr_format": 3,
+    "node_type": "group",
+    ...,
+    "extensions": [
+        {
+            "name": "example.multiscale-arrays",
+            "must_understand": false,
+            "configuration": {
+                "multiscale": {
+                    "datasets": [
+                        "path/to/array/1",
+                        "path/to/array/2",
+                        "path/to/array/3"
+                    ]
+                }
+            }
+        }
+    ]
+}
+```
+
+The ``example.multiscale-arrays`` extension introduces metadata that encodes
+a relationship between multiple arrays at the
+group level. This defines a "multiscale pyramid" of arrays which is
+a common idiom in both the geospatial and bioimaging uses of Zarr.
+Implementations may choose to return a different subclass or backend when
+detecting such metadata. In this case, a "datatree" which allows similar
+operations on all levels of the pyramid might be preferred.
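One way to picture "return a different subclass or backend" is a small registry that maps extension names to handler callbacks, in the spirit of the callback or plugin mechanism encouraged under Processing. The sketch below is only an illustration under stated assumptions: `Group`, `MultiscaleGroup`, `register`, and `open_group` are hypothetical names and do not correspond to the API of zarr-python or any other implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Group:
    """Hypothetical plain wrapper around parsed group metadata."""
    metadata: dict[str, Any]


@dataclass
class MultiscaleGroup(Group):
    """Hypothetical subclass that exposes the pyramid levels."""
    levels: list[str] = field(default_factory=list)


# Hypothetical plugin registry: extension name -> callback.
HANDLERS: dict[str, Callable[[Group, dict[str, Any]], Group]] = {}


def register(name: str):
    """Decorator that registers a handler for one extension name."""
    def decorator(func):
        HANDLERS[name] = func
        return func
    return decorator


@register("example.multiscale-arrays")
def _multiscale(group: Group, configuration: dict[str, Any]) -> Group:
    paths = configuration["multiscale"]["datasets"]
    return MultiscaleGroup(metadata=group.metadata, levels=list(paths))


def open_group(metadata: dict[str, Any]) -> Group:
    """Build a group object, letting registered handlers specialize it."""
    group = Group(metadata)
    for entry in metadata.get("extensions", []):
        # Assumption: a bare string is shorthand for {"name": <string>}.
        ext = {"name": entry} if isinstance(entry, str) else entry
        handler = HANDLERS.get(ext["name"])
        if handler is not None:
            group = handler(group, ext.get("configuration", {}))
        elif ext.get("must_understand", True):
            raise ValueError(f"unsupported extension: {ext['name']}")
    return group


meta = {
    "zarr_format": 3,
    "node_type": "group",
    "extensions": [
        {
            "name": "example.multiscale-arrays",
            "must_understand": False,
            "configuration": {
                "multiscale": {
                    "datasets": [
                        "path/to/array/1",
                        "path/to/array/2",
                        "path/to/array/3",
                    ]
                }
            },
        }
    ],
}
group = open_group(meta)
print(type(group).__name__, group.levels)
```

Keeping the handlers in a registry means third-party code can add support for a new extension by registering a callback, without changes to `open_group` itself.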
+
+#### Tiered storage (group)
+
+```javascript
+{
+    "zarr_format": 3,
+    "node_type": "group",
+    ...,
+    "extensions": [
+        {
+            "name": "example.tiered-storage",
+            "must_understand": false,
+            "configuration": {
+                "slow-arrays": [
+                    "path/to/array/1"
+                ]
+            }
+        }
+    ]
+}
+```
+
+Related to the multiscales example above, an ``example.tiered-storage``
+extension could identify arrays within a group which have been put on
+slower or even archived filesystems, where access will incur more overhead
+and potentially costs. An implementation might
+warn users before opening the array.
+
+### Application to sub-nodes
+
+This ZEP does not try to define the behavior for application to sub-nodes
+itself, but defers this to actual extensions.
+
+Conceptually, we propose that extensions defined on groups may be valid for
+their child nodes. However, the details of how an implementation should
+identify which extensions are active within a hierarchy are unclear. Relying
+on traversing the hierarchy towards the root node is undesirable from a
+performance point of view.
+
+As a workaround, extension authors can choose to write *some* metadata within
+the contained subgroups and arrays to make this easier. Options for what
+this metadata could be include:
+
+1. A copy of the metadata
+
+```javascript
+{
+    "extensions": [
+        {
+            "name": "example.my-extension",
+            "configuration": { ... full copy of the metadata ...}
+        }
+    ]
+
+}
+```
+
+2. A reference to the metadata as part of the extension itself
+
+```javascript
+{
+    "extensions": [
+        {
+            "name": "example.my-extension",
+            "configuration": {
+                "reference": "../.."
+            }
+        }
+    ]
+
+}
+```
+
+3. A complementary reference extension
+
+```javascript
+{
+    "extensions": [
+        {
+            "name": "example.my-extension-ref",
+            "configuration": {
+                "reference": "../.."
+            }
+        }
+    ]
+
+}
+```
+
+4. A shared or even core reference extension
+
+```javascript
+{
+    "extensions": [
+        {
+            "name": "example.parent-ref",
+            "configuration": {
+                "reference": "../.."
+            }
+        }
+    ]
+}
+```
+
+As further experience is gained by the community of extension authors,
+one or more of these methods may be adopted into the core spec.
+
+### Alternatives for the `extensions` extension point
+
+The current design allows having the same
+extension definition syntax across all extension points and reduces pollution
+of the top-level namespace in a `zarr.json`. Thus, the addition of top-level
+metadata keys remains reserved to changes in the core spec. This MAY happen as
+part of the core spec adopting functionality of an extension.
+
+Alternative designs that were considered are listed below along with their
+pros and cons.
+
+#### Top-level metadata keys
+
+Instead of a generic extension point, new top-level
+extension keys could be added to the metadata:
+
+```javascript
+{
+    "zarr_format": 3,
+    ...
+    "example.offset": { "offset": [ 12 ] },
+    "example.array-statistics": {
+        "min": 5,
+        "max": 12
+    },
+    "example.consolidated-metadata": {
+        "must_understand": false,
+        ...
+    }, // optional extension
+    ...
+}
+```
+
+In this case, there would be no explicit `configuration` key within an
+extension definition, but instead all the keys of such a configuration would be
+in the object itself. Using an object rather than, for example, a bare
+array of values would allow the extension to evolve.
+
+This would mean, however, that there are two separate types of
+extension definitions, i.e. `{"name":"", "configuration": {...}}` in
+specialized extension points (e.g. `codecs`) and `"": {...}` for other
+extensions.
+
+A benefit would be that if an extension becomes adopted into the core spec, implementations
+would not need to be updated to support its move out of the ``extensions`` field.
+
+#### Simple `extensions` object
+
+Instead of an array that holds the extension definitions, an object could alternatively be used:
+
+```javascript
+{
+    "zarr_format": 3,
+    ...
+    "extensions": {
+        "example.offset": { "offset": [ 12 ] },
+        "example.array-statistics": {
+            "min": 5,
+            "max": 12
+        },
+        "example.consolidated-metadata": {
+            "must_understand": false,
+            ...
+        } // optional extension
+    },
+    ...
+}
+```
+
+This alternative is similar to the top-level keys, with mostly the same implications.
+
+This alternative would continue to reserve the top-level namespace for changes to
+the core spec and, therefore, reduce pollution of the top-level namespace. The downsides
+are that each extension could be used only once, since the key is the extension
+name, and that the extensions would have no defined order.
+
+#### Complex `extensions` object
+
+Finally, a more complex ``extensions`` object could be defined:
+
+```javascript
+{
+    "zarr_format": 3,
+    ...
+    "extensions": {
+        "version": 1,
+        "contents": [
+            {
+                "name": "example.offset",
+                "configuration": { "offset": [ 12 ] }
+            },
+            {
+                "name": "example.array-statistics",
+                "configuration": {
+                    "min": 5,
+                    "max": 12
+                }
+            },
+            {
+                "name": "example.consolidated-metadata",
+                "must_understand": false,
+                "configuration": {
+                    ...
+                }
+            }
+        ]
+    },
+    ...
+}
+```
+
+This strategy combines the object strategy for extensibility with the uniformity
+of using a list of extension definitions, at the cost of a more complex object to parse.
+
+## Changelog
+
+ - 2025-05-12: Migrate phase 2 of the original ZEP9
+
+## Copyright
+
+This proposal is licensed under [the Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).