Description
As the new generation of weights is coming to fruition (#534), I wanted to drop an issue to collate a proposal and some ideas @martinfleis and I have been fleshing out over the past couple of years. If anything, at least this can be a space to discuss whether it'd make sense to have on the roadmap at all and, if so, how it can be squared with all the ongoing parallel plans on weights.
Idea: support (at least) part of the weights functionality on parallelisable and distributed data structures
Why: building weights for larger-than-memory datasets would lay the foundation for spatial analytics (i.e., PySAL functionality) at planetary scale. My own take is that over the last ten years we've gotten away without this because RAM and CPUs have grown faster than the data we were using. I think this is changing: we can now access significantly bigger datasets (even just at the national scale) on which we'd like to run pysal.
How: our proposal (and this is very much up for debate too!) is to build functionality that represents weights as adjacency tables stored as `dask.dataframe` objects, which don't need to live in memory in full. To build them, we could rely on `dask_geopandas.GeoDataFrame` objects (see the sketch after this list). This approach has a couple of advantages:
- Neither the input nor the output needs to be fully in memory, so, in theory, you can scale horizontally (more cores, more machines, same RAM per machine) as much as you want
- It offloads most of the hard non-spatial work to existing tools that are already pretty robust (e.g., Dask), further integrates PySAL into the broader ecosystem, and allows us to focus only on the blockers that are particular to the problem at hand (i.e., computing and storing spatial relationships)
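To make the idea concrete, here is a minimal sketch of what the within-chunk part could look like. Everything here is an assumption for illustration, not a design: `partition_adjacency` is a hypothetical helper, and the long-form focal/neighbor/weight layout simply mirrors what `W.to_adjlist()` returns today.

```python
# Minimal sketch (names and layout are hypothetical): weights stored as a
# long-form adjacency table, one row per (focal, neighbor) pair, backed by
# a dask.dataframe that is never materialised in full.
import dask_geopandas
import geopandas
import pandas as pd
from libpysal import weights


def partition_adjacency(gdf):
    """Queen contiguity for the geometries within a single partition."""
    w = weights.Queen.from_dataframe(gdf, use_index=True)
    return w.to_adjlist()  # pandas.DataFrame: focal / neighbor / weight


ddf = dask_geopandas.from_geopandas(
    geopandas.read_file("polygons.gpkg"),  # hypothetical input
    npartitions=8,
)

# dask needs the output schema up front because nothing is computed yet
meta = pd.DataFrame(
    {
        "focal": pd.Series(dtype="int64"),
        "neighbor": pd.Series(dtype="int64"),
        "weight": pd.Series(dtype="float64"),
    }
)

# Within-partition relationships only; pairs spanning two partitions are
# the open problem discussed below.
adjacency = ddf.map_partitions(partition_adjacency, meta=meta)
```

The nice property is that `adjacency` stays lazy: it could be written straight to Parquet with `adjacency.to_parquet(...)` or fed partition-by-partition into downstream computations without the full table ever living in RAM.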
If the above seems reasonable, probably the main technical blocker is defining spatial relationships across different chunks (for geometries within the same chunk it's straightforward), and possibly merging results at the end. I know Martin had some ideas here, and there is precedent we built out of necessity, in very much ad-hoc ways, for the Urban Grammar here (note this all predates the 0.1 release of dask-geopandas, so we didn't have a `dask_geopandas.GeoDataFrame` to build upon; some of that code might be redundant, or it might be possible to streamline it notably).