Description
Hey guys!
Nice work with the SEG-Y loader! At our team, we use our own library to interact with SEG-Y data, so I decided to give MDIO a try and compare the results of multiple approaches and libraries.
Setup
For my tests, I used a ~21 GB SEG-Y file with IEEE float32 values (no IBM float shenanigans here).
The cube is post-stack, i.e. meant for seismic interpretation, so it has a meaningful regular 3D structure.
Using the same `get_size` function you provided in the tutorials, I got the following results:
```
SEG-Y:            21504.17 MB
MDIO:              3943.37 MB
MDIO LOSSY+:       1723.78 MB
SEG-Y QUANTIZED:   5995.96 MB
```
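For reference, by size I mean the on-disk footprint, measured with a helper along these lines (my own approximation of the tutorial's `get_size`; MDIO/Zarr stores are directories, so they are walked recursively):

```python
import os

def get_size(path):
    """On-disk size in MB: a single file, or a directory walked recursively."""
    if os.path.isfile(path):
        nbytes = os.path.getsize(path)
    else:
        nbytes = sum(
            os.path.getsize(os.path.join(root, name))
            for root, _, files in os.walk(path)
            for name in files
        )
    return nbytes / 1024 ** 2

# e.g. print(f"SEG-Y: {get_size('cube.sgy'):.2f} MB")
```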
The LOSSY+ MDIO file was created with `compression_tolerance` set to the standard deviation of the amplitude values.
The SEG-Y QUANTIZED file was created by quantizing the data and writing a SEG-Y file (according to the standard) with int8 values.
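For clarity, the LOSSY+ conversion call looked roughly like this (a sketch: paths and header byte locations are placeholders, and the parameter names are from my reading of the docs, so they may differ between MDIO versions):

```python
from mdio import segy_to_mdio

amplitude_std = 1.0  # placeholder: std of the amplitude values, computed beforehand

segy_to_mdio(
    "cube.sgy",                           # input SEG-Y (placeholder path)
    "cube_lossy.mdio",                    # output MDIO store (placeholder path)
    index_bytes=(189, 193),               # inline / crossline byte locations (placeholders)
    lossless=False,                       # switch to the lossy codec
    compression_tolerance=amplitude_std,  # tolerance = std of amplitudes
)
```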
The system runs on an Intel(R) Xeon(R) Gold 6242R CPU, in case that is of interest.
The tests
Multiple formats are tested on the task of loading slices (2D arrays) along three dimensions: INLINE, CROSSLINE, and SAMPLES.
The ability to load sub-volumes (3D arrays) is also tested.
For more advanced usage, I have tests for loading batches of data: more on that later.
For the tests, I use the following engines:
- vanilla `segyio` -- public functions from this great library
- `segfast` -- our in-house library for loading any SEG-Y cubes
- `segfast` with `segyio` engine -- essentially, a better cooked `segyio`, where we use their private methods
- `seismiqb` -- our library for seismic interpretation (optimized for post-stack cubes only)
- `seismiqb HDF5` -- converts the data to HDF5 (very similar to the Zarr you use)
- `segfast quantized` -- automatically quantized (optimal in some information sense) SEG-Y data, written with int8 dtype
To this slew of engines, I've added the MDIO loader, which looks very simple:
```python
slide_i = mdio_file[idx, :, :][2]
slide_x = mdio_file[:, idx, :][2]
slide_d = mdio_file[:, :, idx][2]
```
I also tried `mdio_file._traces[idx, :, :]`, but did not notice significant differences.
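Timings for every engine were collected with a simple loop roughly along these lines (a sketch, not the exact benchmark code; `load_fn` is whichever engine's slice loader is being measured):

```python
import time
import numpy as np

def time_slices(load_fn, indices):
    """Time loading 2D slices by index; returns mean seconds per slice."""
    timings = []
    for idx in indices:
        start = time.perf_counter()
        _ = load_fn(idx)
        timings.append(time.perf_counter() - start)
    return float(np.mean(timings))

# e.g. for MDIO inline slices:
# time_slices(lambda i: mdio_file[i, :, :][2], range(0, 500, 50))
```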
The results
An image is worth a thousand words, so here is a bar plot of timings for loading INLINE slices:
The situation does not get better on the CROSSLINE / SAMPLES axes either:
Note that even naive `segyio`, which has to sweep the entire file to get a depth slice, is just as fast.
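To be explicit about what "naive `segyio`" means here: a depth slice needs one sample from every trace, so the whole file gets traversed, roughly like this (path and sample index are placeholders):

```python
import segyio

# A depth (time) slice takes one sample from every trace,
# so segyio ends up sweeping the entire file to build it.
with segyio.open("cube.sgy", mode="r") as f:
    slide_d = f.depth_slice[100]  # 2D array over (inline, crossline) at sample index 100
```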
The why
Some of the reasons for this slowness are apparent: during the conversion process, the default `chunk_size` for Zarr is 64x64x64. To read a 2D slice, every chunk it intersects has to be fetched and decompressed in full, so loading slices is not the forte of this exact chunking.
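Back-of-the-envelope, that chunking means about 64x read amplification for any 2D slice, assuming float32 samples:

```python
import numpy as np

chunk = (64, 64, 64)
itemsize = np.dtype("float32").itemsize

chunk_bytes = np.prod(chunk) * itemsize        # 1 MiB per chunk
needed_bytes = chunk[1] * chunk[2] * itemsize  # 16 KiB of an inline slice lives in each chunk

print(chunk_bytes / needed_bytes)  # -> 64.0, i.e. ~64x more data read than needed
```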
Unfortunately, even when it comes to 3D sub-volumes, the situation is not much better:
Even though this is the best (and really the only favorable) scenario for chunked storage, it is still not as fast as plain SEG-Y storage, even without quantization.
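For reference, the sub-volume load for MDIO was just a 3D crop through the same accessor (the crop ranges below are placeholders):

```python
# Hypothetical crop ranges along inline / crossline / samples
i0, i1, x0, x1, d0, d1 = 100, 164, 200, 264, 300, 364

# Sub-volume (3D crop) with the same MDIO accessor as the 2D slices above
subvolume = mdio_file[i0:i1, x0:x1, d0:d1][2]
```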
Questions
This leaves a few questions:
- is it possible to somehow speed up the loading times? Maybe I am just not using the right methods from the library. Or maybe this is not the area you focus your format on, and the current loading times are fine for the use cases you plan to develop;
- is there a way to make a multipurpose file? The way I see it now, I can make a file for somewhat fast 2D INLINE slices (by setting `chunk_size` to 1x64x64 or something like that, see the sketch after this list), but that would turn into a mess of lots of files;
- is there a way to preallocate memory for the data to load into? That is a huge speedup for all ML applications;
- is there a way to get the values of a particular trace header for the entire cube?
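What I have in mind for the INLINE-oriented file is roughly this sketch (I assume `chunksize` is the right parameter to pass; the header byte locations are placeholders):

```python
from mdio import segy_to_mdio

# Hypothetical inline-optimized conversion: chunks of shape (1, 64, 64)
segy_to_mdio(
    "cube.sgy",
    "cube_inline.mdio",
    index_bytes=(189, 193),  # inline / crossline byte locations (placeholders)
    chunksize=(1, 64, 64),   # one inline per chunk
)
```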
I hope you can help me with those questions!