Describe the enhancement requested
I've been looking at why Arrow's access to parquet files on an S3 store is slower compared to Polars and ClickHouse. A packet capture highlighted the problem. For a single parquet file read, the following S3 requests are made (a minimal reproduction follows the list):
- HEAD
- HEAD
- HEAD
- Oversized ranged GET of the tail of the object to read the metadata block
- HEAD
- Ranged GETs to read the object data
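For reference, this is the kind of read that produces the sequence above. A minimal sketch using pyarrow, which wraps the C++ S3 filesystem; the endpoint, credentials, and bucket/key are placeholders for a local MinIO setup:

```python
import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder endpoint/credentials for a local MinIO server;
# capturing traffic on port 9000 shows the request sequence listed above.
s3 = fs.S3FileSystem(
    endpoint_override="localhost:9000",
    scheme="http",
    access_key="minioadmin",
    secret_key="minioadmin",
)

# A single read_table() call issues the HEAD/GET sequence above.
table = pq.read_table("bucket/data.parquet", filesystem=s3)
```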
If there's any significant latency between the Arrow client and the S3 store (which is likely), all these requests translate into a performance bottleneck. I'm using MinIO and there's a very noticeable difference in overall read performance between a client that's 1 ms away from the server and one that's 30 ms away. That's 150 ms (five round trips) before any data is transferred when it could be 60 ms (two round trips). The impact gets worse the further apart the client and server are, with AWS S3 and GCS likely being the worst cases.
Compare to what Polars does to read the same parquet file:
- HEAD
- 8 byte read at end to get metadata size
- Precise tail read to get metadata
- Ranged GETs from the start to read the table metadata
Arguably it could be even smarter to just read the last 64KB and save a request instead of doing an exact read of the metadata.
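The footer protocol Polars relies on is simple: the last 8 bytes of a parquet file are a 4-byte little-endian metadata length followed by the `PAR1` magic. A rough sketch of that minimal sequence using pyarrow's filesystem API (`read_parquet_footer` is a hypothetical helper; note that `open_input_file` itself currently issues a HEAD, which is part of the overhead in question):

```python
import struct

def read_parquet_footer(filesystem, path):
    # One HEAD (issued when the file is opened) plus two ranged GETs,
    # matching the Polars sequence above.
    with filesystem.open_input_file(path) as f:
        size = f.size()
        # Last 8 bytes: 4-byte little-endian footer length + b"PAR1" magic.
        footer_len, magic = struct.unpack("<I4s", f.read_at(8, size - 8))
        assert magic == b"PAR1"
        # Precise ranged GET of the serialized Thrift FileMetaData.
        return f.read_at(footer_len, size - 8 - footer_len)
```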
ClickHouse is smart when it comes to smaller objects, doing a HEAD and just grabbing the whole object in one go if the size is below some threshold. For larger objects, it does what Arrow does with too many HEAD requests (one less than Arrow).
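A sketch of that heuristic (the threshold value and the `read_parquet` helper are assumptions for illustration; I haven't checked ClickHouse's actual cut-off):

```python
import io
import pyarrow.parquet as pq

SMALL_OBJECT_THRESHOLD = 8 * 1024 * 1024  # assumed value for illustration

def read_parquet(filesystem, path):
    info = filesystem.get_file_info(path)  # HEAD
    if info.size <= SMALL_OBJECT_THRESHOLD:
        # Small object: one plain GET for the whole body, no ranged reads.
        with filesystem.open_input_stream(path) as f:
            return pq.read_table(io.BytesIO(f.read()))
    # Large object: fall back to Arrow's usual ranged-GET path.
    return pq.read_table(path, filesystem=filesystem)
```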
I've tried `allow_delayed_open`, but this seems to make no difference to S3 read requests despite the documentation hinting that it might. `allow_delayed_open` does help with the efficiency of writing smaller objects, though.
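For reference, a sketch of how the option can be set (assuming the pyarrow keyword of the same name, which forwards to the C++ `S3Options`; the endpoint and scheme are placeholders for a local MinIO setup):

```python
from pyarrow import fs

# Sketch: S3FileSystem forwards allow_delayed_open to the C++ S3Options.
# The endpoint/scheme below are placeholders for a local MinIO server.
s3 = fs.S3FileSystem(
    endpoint_override="localhost:9000",
    scheme="http",
    allow_delayed_open=True,  # no observed change to the read-side request count
)
```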
Are there any plans to improve the efficiency of Arrow's S3 reads?
Component(s)
C++