-
It's been a long time since I used HDF databases. When you say performance degradation, do you mean compared to reading from a CSV file, or that creating the database itself is slow? Lowering the compression might speed it up a bit.
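I don't know exactly how you're writing the files, but if it's h5py with gzip, the compression level is the knob to try (a minimal sketch; the dataset name and data are just for illustration):

```python
import h5py
import numpy as np

data = np.random.rand(1_000_000)  # stand-in for your actual records

with h5py.File("example.h5", "w") as f:
    # gzip levels run 0-9; lower levels write noticeably faster
    # at the cost of a larger file. "lzf" is another fast codec.
    f.create_dataset("data", data=data, compression="gzip", compression_opts=1)
```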
-
Hi,
I'm trying to convert several TSV files from the C4 200M dataset into HDF5 format, basing my conversion on your notebook.
The dataset consists of 10 files, each containing approximately 18 million records with 2 string columns.
Given the size of the dataset, I expected that converting it to HDF5 would pay off: it would let me know the shape of each file up front and give a significant performance boost when reading chunks of the dataset.
In a first trial I converted 1 million records in about 3 minutes, so at that rate a full file should take roughly an hour; however, converting all 18 million records is taking more than 6 hours per file.
I am currently loading my TSV in chunks, with num_lines set to the total number of lines in each file and chunksize = 10000; a rough sketch of the loop is below.
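Something along these lines (a sketch rather than my exact code; I'm showing h5py with a dataset preallocated from num_lines so the shape is known up front, and the path and column handling are placeholders):

```python
import h5py
import pandas as pd

num_lines = 18_000_000   # total records in this file (placeholder value)
chunksize = 10_000

tsv_path = "c4_200m_part0.tsv"  # placeholder path
hdf_path = "c4_200m_part0.h5"   # placeholder path

with h5py.File(hdf_path, "w") as f:
    # Preallocate so the final shape is stored up front.
    dset = f.create_dataset(
        "data",
        shape=(num_lines, 2),
        dtype=h5py.string_dtype(),   # variable-length UTF-8 strings
        compression="gzip",
        compression_opts=9,
    )
    reader = pd.read_csv(tsv_path, sep="\t", header=None, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Write each 10000-row chunk into its slot in the dataset.
        start = i * chunksize
        dset[start:start + len(chunk)] = chunk.to_numpy(dtype=object)
```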
I did not expect this performance degradation. Have you ever tried your code on a dataset of this size?
Thanks in advance.