
Nan values when saving parq files with virtualize.to_kerchunk() #339

@QuentinMaz

Description


Hi,

I have used virtualizarr to concatenate several .nc files into a single parquet reference file.
I noticed that when I then open the saved dataset, the first value of its index is replaced with nan.
I therefore suspect that virtualize.to_kerchunk() might have a bug.

Here is how to replicate the issue:

import numpy as np
import xarray as xr
from virtualizarr import open_virtual_dataset

filename = "test"
# synthetic xarray.Dataset inspired by xarray's documentation
temperature = 15 + 8 * np.random.randn(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
depths = np.arange(150, step=50)  # [0, 50, 100]
da = xr.DataArray(
    data=temperature,
    dims=["x", "y", "depth"],
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        depth=depths,
    ),
    attrs=dict(
        description="Ambient temperature.",
        units="degC",
    ),
)
ds = da.to_dataset(name="temperature")
ds.to_netcdf(f"{filename}.nc")
vds = open_virtual_dataset(
    f"{filename}.nc",
    indexes={},
    decode_times=True,
    loadable_variables=["lon", "lat", "depth"],
)
print("depth index of vds:\t\t", vds.depth.to_numpy())
# depth index of vds:		 [  0  50 100]
print("depth index of vds:\t\t", vds.depth.to_numpy())
# depth index of vds:		 [  0  50 100]

# saves as a parq/ folder
vds.virtualize.to_kerchunk(f"{filename}.parq", format="parquet")
loaded_ds = xr.open_dataset(f"{filename}.parq", engine="kerchunk", chunks={})
print("depth index of the loaded vds:\t", loaded_ds.depth.to_numpy())
# depth index of the loaded vds:	 [ nan  50. 100.]

# temporary fix
loaded_ds.coords["depth"].values[0] = 0.
print("index after fix:\t\t", loaded_ds.depth.to_numpy())
# index after fix:		 [  0.  50. 100.]
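As a sanity check (a hypothetical snippet, not part of the original report, using the values printed above), one can verify that only the zero entry of the coordinate differs after the parquet round trip:

```python
import numpy as np

# values printed above: the original coordinate vs. what the round trip returns
original = np.arange(150, step=50).astype(float)  # [  0.  50. 100.]
loaded = np.array([np.nan, 50.0, 100.0])          # first value replaced with nan

# positions where the round-tripped coordinate no longer matches the original
# (np.isclose treats nan as not-close, so the nan shows up as a mismatch)
mismatch = np.flatnonzero(~np.isclose(original, loaded))
print(mismatch)  # only the index holding the zero is affected
```

That only the value 0 is affected is consistent with the symptom described above: 50 and 100 survive, while the zero comes back as nan.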

I am a beginner, so I have no idea what the cause might be.
