-
Notifications
You must be signed in to change notification settings - Fork 63
Initial attempt at 2.0 spec #278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| The `crs` field of GeoParquet MUST reflect the crs of the Parquet `crs` property on the GEOMETRY or GEOGRAPHY logical type. If the `srid` | ||
| option is used in Parquet then it MUST reference a CRS identifier from the [Spatial reference identifier](https://spatialreference.org/projjson_index.json) catalog. The GeoParquet `crs` field MUST have the same projjson that is in the corresponding spatial reference identifier catalog. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I contemplating putting more language in here, to try to really guide people more than the core Parquet spec does, as it's a bit vague / underspecified, but ran out of time and am not super sure what to do. I borrowed from #276 the language on having the CRS identifier - I think we should make it required that it's an actual identified CRS, not just some random authority and code.
I still don't really understand the projjson option in the Parquet spec - where that projjson is supposed to live. Seems like we could narrow the options here and give people a clearer recommendation, but I don't yet understand what that would be. I also was looking at what @rouault did in GDAL did and it seemed like maybe he just had it stick the projjson inline, I see:
required binary field_id=-1 crop:code_list (String);
optional binary field_id=-1 geometry
(Geometry(crs={"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"ProjectedCRS","name":"MGI / Austria
Lambert","base_crs":{"type":"GeographicCRS","name":"MGI","datum":{"type":"GeodeticReferenceFrame","name":"Militar-Geographische
Institut","ellipsoid":{"name":"Bessel
1841","semi_major_axis":6377397.155,"inverse_flattening":299.1528128}},"coordinate_system":{"subtype":"ellipsoidal","axis":[{"name":"Geodeti
c latitude","abbreviation":"Lat","direction":"north","unit":"degree"},{"name":"Geodetic
longitude","abbreviation":"Lon","direction":"east","unit":"degree"}]},"id":{"authority":"EPSG","code":4312}},"conversion":{"name":"Austria
Lambert","method":{"name":"Lambert Conic Conformal (2SP)","id":{"authority":"EPSG","code":9802}},"parameters":[{"name":"Latitude of false
origin","value":47.5,"unit":"degree","id":{"authority":"EPSG","code":8821}},{"name":"Longitude of false
origin","value":13.3333333333333,"unit":"degree","id":{"authority":"EPSG","code":8822}},{"name":"Latitude of 1st standard
parallel","value":49,"unit":"degree","id":{"authority":"EPSG","code":8823}},{"name":"Latitude of 2nd standard
parallel","value":46,"unit":"degree","id":{"authority":"EPSG","code":8824}},{"name":"Easting at false
origin","value":400000,"unit":"metre","id":{"authority":"EPSG","code":8826}},{"name":"Northing at false
origin","value":400000,"unit":"metre","id":{"authority":"EPSG","code":8827}}]},"coordinate_system":{"subtype":"Cartesian","axis":[{"name":"N
orthing","abbreviation":"X","direction":"north","unit":"metre"},{"name":"Easting","abbreviation":"Y","direction":"east","unit":"metre"}]},"s
cope":"Topographic mapping (medium and small
scale).","area":"Austria.","bbox":{"south_latitude":46.4,"west_longitude":9.53,"north_latitude":49.02,"east_longitude":17.17},"id":{"authori
ty":"EPSG","code":31287}}));
}
Or maybe that's just how it's showing up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long answer below, but the tl;dr is that I think inlining PROJJSON into the logical type is the most likely way to ensure the CRS is understood by the consumer. Keeping the PROJJSON requirement at the top level is probably good (although allowing the string "authority:code" might be helpful to unify it with the raster spec).
The Arrow C++ implementation (and probably the Rust implementation will do this too) inlines PROJJSON (or: inlines the GeoArrow CRS, which is usually PROJJSON but is permitted to be other things) into the CRS field because it turns out that zero Parquet implementations expose a way to add to the global key/value metadata store while converting into Parquet types. The other direction is also true...Parquet implementations didn't provide a reference to the global metadata in their type conversion code, so the projjson:metadata_key thing doesn't always work. (It does for Arrow C++ and GDAL I believe but I had to add it to Arrow C++ specifically to make this work).
I prototyped the top-level key/value metadata thing in Arrow C++ once and the other issue I ran into is that, because schema metadata is actively propagated, you have to be very careful what you name the CRS key or you end up with a reference to the wrong CRS if you read/modify/write a file. The best solution I could come up with was naming the key/value metadata projjson_crs_uglyHashOfTheJsonToAvoidCollisions (and then I ended up with more than one of them in some reasonably simple read/modify/write scenarios where only some of those values were referred to in the latest version of the file).
srid:<must be an integer according to the spec> doesn't actively refer to EPSG in the spec so there's no guarantee a consumer will interpret it that way (although I did add an example file and a note in the Parquet example files explaining that most producers will probably mean this and to please avoid actual EPSG codes if you're trying to use this an opaque identifier).
Non Arrow readers and writers have some more flexibility here because they don't have to depend on the GeoArrow--Parquet Type conversion in Arrow C++ (or soon arrow-rs).
I put examples in the Parquet test files (and a readme explaining how various CRS values might be written/interpreted, including inlined PROJJSON). https://github.com/apache/parquet-testing/tree/master/data/geospatial#geospatial-test-files .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paleolimbot - this all makes good sense. Looking at https://github.com/apache/parquet-testing/blob/master/data/geospatial/crs-projjson.parquet I just see:
Schema:
<pyarrow._parquet.ParquetSchema object at 0x1051b3a00>
required group field_id=-1 schema {
optional binary field_id=-1 wkt (String);
optional binary field_id=-1 geometry (Geometry(crs=projjson:projjson_epsg_5070));
}
Am I right in understanding that you didn't actually put the projjson in? Shouldn't it be what's in https://spatialreference.org//ref/epsg/5070/projjson.json instead of projjson_epsg_5070?
Like it should be:
<pyarrow._parquet.ParquetSchema object at 0x1051b3a00>
required group field_id=-1 schema {
optional binary field_id=-1 wkt (String);
optional binary field_id=-1 geometry (Geometry(crs=projjson:{"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"ProjectedCRS","name":"NAD83 / Conus Albers","base_crs":{"type":"GeographicCRS","name":"NAD83","datum":{"type":"GeodeticReferenceFrame","name":"North American Datum 1983","ellipsoid":{"name":"GRS 1980","semi_major_axis":6378137,"inverse_flattening":298.257222101}},"coordinate_system":{"subtype":"ellipsoidal","axis":[{"name":"Geodetic latitude","abbreviation":"Lat","direction":"north","unit":"degree"},{"name":"Geodetic longitude","abbreviation":"Lon","direction":"east","unit":"degree"}]},"id":{"authority":"EPSG","code":4269}},"conversion":{"name":"Conus Albers","method":{"name":"Albers Equal Area","id":{"authority":"EPSG","code":9822}},"parameters":[{"name":"Latitude of false origin","value":23,"unit":"degree","id":{"authority":"EPSG","code":8821}},{"name":"Longitude of false origin","value":-96,"unit":"degree","id":{"authority":"EPSG","code":8822}},{"name":"Latitude of 1st standard parallel","value":29.5,"unit":"degree","id":{"authority":"EPSG","code":8823}},{"name":"Latitude of 2nd standard parallel","value":45.5,"unit":"degree","id":{"authority":"EPSG","code":8824}},{"name":"Easting at false origin","value":0,"unit":"metre","id":{"authority":"EPSG","code":8826}},{"name":"Northing at false origin","value":0,"unit":"metre","id":{"authority":"EPSG","code":8827}}]},"coordinate_system":{"subtype":"Cartesian","axis":[{"name":"Easting","abbreviation":"X","direction":"east","unit":"metre"},{"name":"Northing","abbreviation":"Y","direction":"north","unit":"metre"}]},"scope":"Data analysis and small scale data presentation for contiguous lower 48 states.","area":"United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming.","bbox":{"south_latitude":24.41,"west_longitude":-124.79,"north_latitude":49.38,"east_longitude":-66.91},"id":{"authority":"EPSG","code":5070}}));
}
Or is my metadata reader not quite working right or I have some other misunderstanding?
Also looking at GDAL I think it doesn't quite follow your recommendation here. It just seems to say crs=, but it should be crs=projjson:.
Like it's
optional binary field_id=-1 geometry (Geometry(crs={"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"ProjectedCRS","name":"MGI
/ Austria Lambert","base_crs":{"type":"GeographicCRS","name":"MGI","datum":{"type":"GeodeticReferenceFrame","name":"Militar-Geographische
Institut","ellipsoid":{"name":"Bessel
1841","semi_major_axis":6377397.155,"inverse_flattening":299.1528128}},"coordinate_system":{"subtype":"ellipsoidal","axis":[{"name":"Geodetic
latitude","abbreviation":"Lat","direction":"north","unit":"degree"},{"name":"Geodetic
longitude","abbreviation":"Lon","direction":"east","unit":"degree"}]},"id":{"authority":"EPSG","code":4312}},"conversion":{"name":"Austria
Lambert","method":{"name":"Lambert Conic Conformal (2SP)","id":{"authority":"EPSG","code":9802}},"parameters":[{"name":"Latitude of false
origin","value":47.5,"unit":"degree","id":{"authority":"EPSG","code":8821}},{"name":"Longitude of false
origin","value":13.3333333333333,"unit":"degree","id":{"authority":"EPSG","code":8822}},{"name":"Latitude of 1st standard
parallel","value":49,"unit":"degree","id":{"authority":"EPSG","code":8823}},{"name":"Latitude of 2nd standard
parallel","value":46,"unit":"degree","id":{"authority":"EPSG","code":8824}},{"name":"Easting at false
origin","value":400000,"unit":"metre","id":{"authority":"EPSG","code":8826}},{"name":"Northing at false
origin","value":400000,"unit":"metre","id":{"authority":"EPSG","code":8827}}]},"coordinate_system":{"subtype":"Cartesian","axis":[{"name":"Northing","
abbreviation":"X","direction":"north","unit":"metre"},{"name":"Easting","abbreviation":"Y","direction":"east","unit":"metre"}]},"scope":"Topographic
mapping (medium and small
scale).","area":"Austria.","bbox":{"south_latitude":46.4,"west_longitude":9.53,"north_latitude":49.02,"east_longitude":17.17},"id":{"authority":"EPSG"
,"code":31287}}));
```
But it should be
```
optional binary field_id=-1 geometry (Geometry(crs=projjson:{"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"ProjectedCRS","name":"MGI
/ Austria Lambert","base_crs":{"type":"GeographicCRS","name":"MGI","datum":{"type":"GeodeticReferenceFrame","name":"Militar-Geographische
Institut","ellipsoid":{"name":"Bessel
...
```
Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! I think I see, it's in https://github.com/apache/parquet-testing/blob/master/data/geospatial/crs-arbitrary-value.parquet That one does it the same way as gdal.
What are your thoughts on putting some prefix on it? Like crs=projjson_inline:{"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"ProjectedCRS",... Or is there a good reason to just dump it in straightaway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad you got there...crs-arbitrary-value.parquet is in the test files to ensure that Parquet implementation know that this could happen and are prepared to handle that case.
If we add a prefix today, the files won't work in some released implementations (Arrow C++, GDAL). A good reason to add a prefix is if you specifically want a value's de-facto interpretation as a CRS to not happen (i.e., if you were putting some other type of JSON representation of a CRS into that field).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, cool. Happy to just stick with that.
|
|
||
| The bbox values MUST be in the same coordinate reference system as the geometry. | ||
|
|
||
| #### covering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we want to explain something about the native stats? I figure in a blog post we can explain why covering went away, but seems like we don't need to mention it. But just wondering if there's any advice we could / should give people? We could perhaps recommend some spatial sorting, like bring it in from https://github.com/opengeospatial/geoparquet/blob/main/format-specs/distributing-geoparquet.md#spatial-ordering Similarly we could recommend zstd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think release notes/blog is a good idea. I can add an example of file size savings (I recently calculated that there are about 25GB of coverings in Overture buildings but the percentage is likely much higher for point data with few attributes).
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this! A more formal draft like this one has been on my and Jia's TODO list for a long time but we clearly haven't gotten there 😬
|
|
||
| The `crs` field of GeoParquet MUST reflect the crs of the Parquet `crs` property on the GEOMETRY or GEOGRAPHY logical type. If the `srid` | ||
| option is used in Parquet then it MUST reference a CRS identifier from the [Spatial reference identifier](https://spatialreference.org/projjson_index.json) catalog. The GeoParquet `crs` field MUST have the same projjson that is in the corresponding spatial reference identifier catalog. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long answer below, but the tl;dr is that I think inlining PROJJSON into the logical type is the most likely way to ensure the CRS is understood by the consumer. Keeping the PROJJSON requirement at the top level is probably good (although allowing the string "authority:code" might be helpful to unify it with the raster spec).
The Arrow C++ implementation (and probably the Rust implementation will do this too) inlines PROJJSON (or: inlines the GeoArrow CRS, which is usually PROJJSON but is permitted to be other things) into the CRS field because it turns out that zero Parquet implementations expose a way to add to the global key/value metadata store while converting into Parquet types. The other direction is also true...Parquet implementations didn't provide a reference to the global metadata in their type conversion code, so the projjson:metadata_key thing doesn't always work. (It does for Arrow C++ and GDAL I believe but I had to add it to Arrow C++ specifically to make this work).
I prototyped the top-level key/value metadata thing in Arrow C++ once and the other issue I ran into is that, because schema metadata is actively propagated, you have to be very careful what you name the CRS key or you end up with a reference to the wrong CRS if you read/modify/write a file. The best solution I could come up with was naming the key/value metadata projjson_crs_uglyHashOfTheJsonToAvoidCollisions (and then I ended up with more than one of them in some reasonably simple read/modify/write scenarios where only some of those values were referred to in the latest version of the file).
srid:<must be an integer according to the spec> doesn't actively refer to EPSG in the spec so there's no guarantee a consumer will interpret it that way (although I did add an example file and a note in the Parquet example files explaining that most producers will probably mean this and to please avoid actual EPSG codes if you're trying to use this an opaque identifier).
Non Arrow readers and writers have some more flexibility here because they don't have to depend on the GeoArrow--Parquet Type conversion in Arrow C++ (or soon arrow-rs).
I put examples in the Parquet test files (and a readme explaining how various CRS values might be written/interpreted, including inlined PROJJSON). https://github.com/apache/parquet-testing/tree/master/data/geospatial#geospatial-test-files .
|
|
||
| The bbox values MUST be in the same coordinate reference system as the geometry. | ||
|
|
||
| #### covering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think release notes/blog is a good idea. I can add an example of file size savings (I recently calculated that there are about 25GB of coverings in Overture buildings but the percentage is likely much higher for point data with few attributes).
|
|
||
| # Optionally both commit and push | ||
| default_stages: [commit] | ||
| default_stages: [pre-commit] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, not sure why this changed, I'll revert - I was just trying to get the pre-commit hooks to run locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible that the latest version of pre-commit renamed the stage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, possible.
First draft of a 2.0 specification.
Changes:
The general philosophy of this is to require the full metadata if you want to be GeoParquet 2.0 compliant, and it must match the Parquet metadata. I could imagine a different philosophy where we remove the duplicate information - GeoParquet would not have crs, geometry types, edges / algorithm, and would provide the additional info - primary column, orientation, epoch.
And another approach I've been contemplating is that if a file adds a geometry or geography in Parquet and does it right we call it 'GeoParquet 2.0 compliant', even if they don't write any additional metadata. I stopped short of trying to figure out how exactly to specify that, but I lean towards going in that direction - use the 'brand name' of GeoParquet to encourage Parquet geospatial implementation.
I also have been contemplating if it'd be a good idea to do a GeoParquet 1.2, where we leave in covering and arrow types as optional (though maybe deprecated). @Maxxen spurred that thinking as I believe he said somewhere it could help with backwards compatibility as duckdb code looks for version 1.x in GeoParquet. But I'm more recently leaning away from that, since older libraries using older versions of parquet readers look to mostly just crash when they encounter a geometry or geography logical type. If they wouldn't crash I'd be inclined for a compatibility mode, but it seems like we'd just be aiming at a narrow slice when people upgraded their parquet reader but are too lazy to change the geospatial handling. So I'm more inclined to just push towards getting all geospatial libraries to implement reading 2.0 / parquet types. But I'm definitely open to arguments as to a 1.2 having a value. It wouldn't be hard to do, but the messaging might be a bit weird.
I have more open questions, but will put them in the follow-up.