Description
Hey guys!
Nice work with the SEG-Y loader! At our team, we use our own library to interact with SEG-Y data, so I decided to give MDIO a try and compare the results of multiple approaches and libraries.
Setup
For my tests, I used a ~21 GB SEG-Y file with IEEE float32 values (no IBM float shenanigans here).
The cube is post-stack, i.e. meant for seismic interpretation, so it has a meaningful regular 3D structure.
Using the same `get_size` function you provided in the tutorials, I got the following results:
```
SEG-Y:            21504.17 MB
MDIO:              3943.37 MB
MDIO LOSSY+:       1723.78 MB
SEG-Y QUANTIZED:   5995.96 MB
```
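For reference, by size I mean the on-disk footprint, measured with a helper along these lines (my own approximation of the tutorial's `get_size`; MDIO/Zarr stores are directories, so they are walked recursively):

```python
import os

def get_size(path):
    """On-disk size in MB: a single file, or a directory walked recursively."""
    if os.path.isfile(path):
        nbytes = os.path.getsize(path)
    else:
        nbytes = sum(
            os.path.getsize(os.path.join(root, name))
            for root, _, files in os.walk(path)
            for name in files
        )
    return nbytes / 1024 ** 2

# e.g. print(f"SEG-Y: {get_size('cube.sgy'):.2f} MB")
```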
The LOSSY+ MDIO file was created with `compression_tolerance` set to the standard deviation of the amplitude values.
The SEG-Y QUANTIZED file was created by quantizing the data and writing a SEG-Y file (according to the standard) with int8 values.
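For clarity, the LOSSY+ conversion call looked roughly like this (a sketch: paths and header byte locations are placeholders, and the parameter names are from my reading of the docs, so they may differ between MDIO versions):

```python
from mdio import segy_to_mdio

amplitude_std = 1.0  # placeholder: std of the amplitude values, computed beforehand

segy_to_mdio(
    "cube.sgy",                           # input SEG-Y (placeholder path)
    "cube_lossy.mdio",                    # output MDIO store (placeholder path)
    index_bytes=(189, 193),               # inline / crossline byte locations (placeholders)
    lossless=False,                       # switch to the lossy codec
    compression_tolerance=amplitude_std,  # tolerance = std of amplitudes
)
```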
The system runs on an Intel(R) Xeon(R) Gold 6242R CPU, in case that is of interest.
The tests
Multiple formats are tested on the task of loading slices (2D arrays) along three dimensions: INLINE, CROSSLINE, and SAMPLES.
The ability to load sub-volumes (3D arrays) is also tested.
For more advanced usage, I have tests for loading batches of data: more on that later.
For the tests, I use the following engines:
- vanilla `segyio` -- public functions from this great library
- `segfast` -- our in-house library for loading any SEG-Y cubes
- `segfast` with `segyio` engine -- essentially, a better cooked `segyio`, where we use their private methods
- `seismiqb` -- our library for seismic interpretation (optimized for post-stack cubes only)
- `seismiqb HDF5` -- converts the data to HDF5 (very similar to the Zarr you use)
- `segfast quantized` -- automatically quantized (optimal in some information sense) SEG-Y data, written with int8 dtype
To this slew of engines, I've added the MDIO loader, which looks very simple:
```python
slide_i = mdio_file[idx, :, :][2]
slide_x = mdio_file[:, idx, :][2]
slide_d = mdio_file[:, :, idx][2]
```
I also tried `mdio_file._traces[idx, :, :]`, but did not notice significant differences.
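Timings for every engine were collected with a simple loop roughly along these lines (a sketch, not the exact benchmark code; `load_fn` is whichever engine's slice loader is being measured):

```python
import time
import numpy as np

def time_slices(load_fn, indices):
    """Time loading 2D slices by index; returns mean seconds per slice."""
    timings = []
    for idx in indices:
        start = time.perf_counter()
        _ = load_fn(idx)
        timings.append(time.perf_counter() - start)
    return float(np.mean(timings))

# e.g. for MDIO inline slices:
# time_slices(lambda i: mdio_file[i, :, :][2], range(0, 500, 50))
```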
The results
An image is worth a thousand words, so here is a bar plot of timings for loading INLINE slices:
The situation does not get better on the CROSSLINE / SAMPLES axes either:
Note that even naive `segyio`, which has to sweep the entire file to get a depth slice, is just as fast.
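To be explicit about what "naive `segyio`" means here: a depth slice needs one sample from every trace, so the whole file gets traversed, roughly like this (path and sample index are placeholders):

```python
import segyio

# A depth (time) slice takes one sample from every trace,
# so segyio ends up sweeping the entire file to build it.
with segyio.open("cube.sgy", mode="r") as f:
    slide_d = f.depth_slice[100]  # 2D array over (inline, crossline) at sample index 100
```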
The why
Some of the reasons for this slowness are apparent: during the conversion process, the default `chunk_size` for Zarr is 64x64x64. To read a 2D slice, every chunk it intersects has to be fetched and decompressed in full, so loading slices is not the forte of this exact chunking.
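Back-of-the-envelope, that chunking means about 64x read amplification for any 2D slice, assuming float32 samples:

```python
import numpy as np

chunk = (64, 64, 64)
itemsize = np.dtype("float32").itemsize

chunk_bytes = np.prod(chunk) * itemsize        # 1 MiB per chunk
needed_bytes = chunk[1] * chunk[2] * itemsize  # 16 KiB of an inline slice lives in each chunk

print(chunk_bytes / needed_bytes)  # -> 64.0, i.e. ~64x more data read than needed
```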
Unfortunately, even when it comes to 3D sub-volumes, the situation is not much better:
Even though this is the best (and really the only favorable) scenario for chunked storage, it is still not as fast as plain SEG-Y storage, even without quantization.
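For reference, the sub-volume load for MDIO was just a 3D crop through the same accessor (the crop ranges below are placeholders):

```python
# Hypothetical crop ranges along inline / crossline / samples
i0, i1, x0, x1, d0, d1 = 100, 164, 200, 264, 300, 364

# Sub-volume (3D crop) with the same MDIO accessor as the 2D slices above
subvolume = mdio_file[i0:i1, x0:x1, d0:d1][2]
```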
Questions
This leaves a few questions:
- is it possible to somehow speed up the loading times? Maybe I am just not using the right methods from the library. Or maybe this is not the area you focus your format on, and the current loading times are fine for the use cases you plan to develop;
- is there a way to make a multipurpose file? The way I see it now, I can make a file for somewhat fast 2D INLINE slices (by setting `chunk_size` to 1x64x64 or something like that, see the sketch after this list), but that would turn into a mess of lots of files;
- is there a way to preallocate memory for the data to load into? That is a huge speedup for all ML applications;
- is there a way to get the values of a particular trace header for the entire cube?
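What I have in mind for the INLINE-oriented file is roughly this sketch (I assume `chunksize` is the right parameter to pass; the header byte locations are placeholders):

```python
from mdio import segy_to_mdio

# Hypothetical inline-optimized conversion: chunks of shape (1, 64, 64)
segy_to_mdio(
    "cube.sgy",
    "cube_inline.mdio",
    index_bytes=(189, 193),  # inline / crossline byte locations (placeholders)
    chunksize=(1, 64, 64),   # one inline per chunk
)
```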
I hope you can help me with those questions!