Replies: 3 comments
I'm not too worried about the u16 / u32 switch (this was probably premature optimization, there are generally not that many chunks). However, the chunk size is also the minimum unit of read and so a larger compression window means higher read amplification and worse random access performance. So I don't think large read windows are a good default. I'm open to making the chunk size more configurable (it will probably need to be a 2.2 thing at this point since 2.1 is basically out the door and I think changing the metadata size will be difficult to do in a backward compatible way) as long as the larger chunk sizes are opt-in.
I wonder if you can provide or suggest a good sample dataset? There are some interesting ideas for log compression in https://www.vldb.org/pvldb/vol18/p2362-wang.pdf which might be able to yield the best of both worlds.
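To make the read-amplification point concrete, here is a rough back-of-envelope sketch. The chunk sizes and row width below are illustrative numbers I picked, not Lance's actual defaults:

```python
# Rough illustration of read amplification for a point lookup:
# the chunk is the minimum unit of read, so fetching a single row
# requires reading (and decompressing) the whole chunk containing it.
def read_amplification(chunk_bytes: int, row_bytes: int) -> float:
    """Bytes read from storage per byte actually requested."""
    return chunk_bytes / row_bytes

# Hypothetical 100-byte rows:
small = read_amplification(chunk_bytes=32 * 1024, row_bytes=100)    # 327.68
large = read_amplification(chunk_bytes=1024 * 1024, row_bytes=100)  # 10485.76
print(f"32 KB chunk: {small:.0f}x, 1 MB chunk: {large:.0f}x")
```

The ratio scales linearly with chunk size, which is why a larger compression window directly worsens random-access cost even as it improves the compression ratio.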
That's great, I'll give it a try and see if I can figure it out.
Currently, I use a sample access log from Kaggle for testing (https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs). For other types of logs, I sometimes use datasets from Loghub (https://github.com/logpai/loghub).
Thanks for the reference! I've come across this paper before but haven't had the chance to read it carefully yet; I'll take a closer look later.
@westonpace I went ahead and created a PR for this. Would love your feedback when you have a moment: #4959
In many compression algorithms (e.g., Zstandard), larger blocks often yield better compression ratios. However, in the current miniblock design, metadata constraints prevent flexible trade-offs between time and space.
Specifically:
The MiniBlockChunk structure restricts each chunk's buffer length to u16 (maximum 64 KB). In practice only 12 + 3 bits are used, further limiting the buffer length to 32 KB.
To explore the impact of lifting these limits, I modified the implementation to use u32 for the miniblock chunk metadata and conducted an experiment. The results clearly show that larger chunk sizes lead to better compression ratios. In scenarios where data is written significantly more often than it is queried, storage cost dominates, and the ability to increase chunk size could provide meaningful benefits. Currently, limiting the chunk metadata to u16 saves a small amount of space but prevents such trade-offs.
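For reference, the ceilings implied by those field widths can be checked with a little arithmetic. This is a sketch based only on the sizes mentioned above (u16, 12 + 3 bits, u32), not on Lance's actual struct layout:

```python
# Maximum buffer length addressable by each metadata width.
U16_MAX_BYTES = 2 ** 16           # 65536 -> 64 KB ceiling from a u16 length
USED_BITS = 12 + 3                # only 15 bits actually used
USED_MAX_BYTES = 2 ** USED_BITS   # 32768 -> effective 32 KB limit
U32_MAX_BYTES = 2 ** 32           # widening to u32 lifts the ceiling to 4 GiB

print(U16_MAX_BYTES // 1024, "KB")       # 64 KB
print(USED_MAX_BYTES // 1024, "KB")      # 32 KB
print(U32_MAX_BYTES // 2 ** 30, "GiB")   # 4 GiB
```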
I wonder whether there is a way to make this tunable, allowing more flexibility to adapt Lance to different usage scenarios. Thank you.
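The compression-ratio effect is easy to reproduce with any window-based codec. Here is a toy demonstration using Python's stdlib zlib (zstd behaves similarly, just with larger windows); the synthetic log lines are my own invention, not drawn from the datasets mentioned above:

```python
import zlib

# Synthetic, log-like data: each line is unique but structurally repetitive.
lines = [f"2024-01-01 12:00:{i % 60:02d} GET /api/resource/{i} 200 {i * 7}\n"
         for i in range(10_000)]
data = "".join(lines).encode()

def compressed_size(payload: bytes, block_bytes: int) -> int:
    """Compress `payload` in independent blocks and sum the sizes.

    Each block restarts the codec, modeling independently
    decompressible chunks.
    """
    return sum(len(zlib.compress(payload[i:i + block_bytes], 9))
               for i in range(0, len(payload), block_bytes))

for block in (4 * 1024, 32 * 1024, len(data)):
    size = compressed_size(data, block)
    print(f"block={block:>7} bytes -> {size:>7} compressed "
          f"({len(data) / size:.2f}x ratio)")
```

Smaller independent blocks pay per-block header overhead and lose cross-block matches, so the summed compressed size shrinks as the block size grows, which mirrors the chunk-size trade-off being discussed.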