
Add documentation/examples for new data loaders and help with use case #52

Open
@djhoese

Description

@jhamman just presented on some updates to xbatcher, including the new data loader interfaces from #25. I tried to find a documented way of using them and I don't see one. If some documentation could be added, that would be great: I've been helping people at my work use Satpy to prepare data for their machine learning projects, and I think the data loader could be a nice optimization. Their preparation work has always ended with saving to NetCDF or zarr. My understanding of these interfaces in xbatcher is that the saving-to-disk step shouldn't be needed (except for future caching functionality). Is that correct?

The pseudo-code of the most recent project I helped with looks something like this:

import satpy

dates_of_interest = [...]
geographic_bounds_of_interest = [...]
channels_of_interest = [...]

for dt in dates_of_interest:
    # Project-specific helper that collects the ABI L1b files for this time step
    abi_filenames = get_goes16_abi_filenames(dt)
    scn = satpy.Scene(reader='abi_l1b', filenames=abi_filenames)
    scn.load(channels_of_interest)

    for bbox in geographic_bounds_of_interest:
        # Cut out each geographic region and save it to its own NetCDF file
        cropped_scn = scn.crop(xy_bbox=bbox)
        cropped_scn.save_datasets(filename="some_bbox_specific_file.nc")

And then they do their ML work based on those NetCDF files. Satpy is all xarray[dask]-based, and the actual code for the above does a lot of client.map work (with distributed's Client) to process the individual pieces. I can't speak for the researcher I'm helping, but if there's an optimization here, using a data loader to hand these "patches" (their term) to pytorch/tensorflow without needing to save to NetCDF first, that would be a really good example for a certain NASA project we're a part of. A rough sketch of what I'm imagining is below.
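
For reference, here's a minimal, untested sketch of what I mean, using xbatcher's BatchGenerator plus the torch MapDataset loader from #25. The 256x256 patch size, the single-channel selection, and the reuse of the same generator for both inputs and targets are placeholders for illustration, not real project choices:

import satpy
import xbatcher
from torch.utils.data import DataLoader
from xbatcher.loaders.torch import MapDataset

scn = satpy.Scene(reader='abi_l1b', filenames=abi_filenames)
scn.load(channels_of_interest)
cropped_scn = scn.crop(xy_bbox=bbox)

# Pull one channel out as an xarray DataArray; handling multiple
# channels at once is left out of this sketch.
data_arr = cropped_scn[channels_of_interest[0]]

# Lazily slice the cropped scene into fixed-size patches (256x256 is
# an arbitrary example size) instead of writing them to NetCDF.
patch_gen = xbatcher.BatchGenerator(data_arr, input_dims={'x': 256, 'y': 256})

# Hand the patches to PyTorch. Passing the same generator for inputs
# and targets is only a placeholder; real targets would come from elsewhere.
torch_ds = MapDataset(patch_gen, patch_gen)
loader = DataLoader(torch_ds, batch_size=4)

If something like that works end-to-end, skipping the intermediate files entirely, that's exactly the kind of example I'd love to see in the docs.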
