Replies: 3 comments
I'm not too worried about the u16 / u32 switch (this was probably premature optimization, there are generally not that many chunks). However, the chunk size is also the minimum unit of read and so a larger compression window means higher read amplification and worse random access performance. So I don't think large read windows are a good default. I'm open to making the chunk size more configurable (it will probably need to be a 2.2 thing at this point since 2.1 is basically out the door and I think changing the metadata size will be difficult to do in a backward compatible way) as long as the larger chunk sizes are opt-in.
I wonder if you can provide or suggest a good sample dataset? There are some interesting ideas for log compression in https://www.vldb.org/pvldb/vol18/p2362-wang.pdf which might be able to yield the best of both worlds.
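To make the read-amplification point concrete, here is a rough back-of-envelope sketch. The chunk sizes and row width below are illustrative numbers I picked, not Lance's actual defaults:

```python
# Rough illustration of read amplification for a point lookup:
# the chunk is the minimum unit of read, so fetching a single row
# requires reading (and decompressing) the whole chunk containing it.
def read_amplification(chunk_bytes: int, row_bytes: int) -> float:
    """Bytes read from storage per byte actually requested."""
    return chunk_bytes / row_bytes

# Hypothetical 100-byte rows:
small = read_amplification(chunk_bytes=32 * 1024, row_bytes=100)    # 327.68
large = read_amplification(chunk_bytes=1024 * 1024, row_bytes=100)  # 10485.76
print(f"32 KB chunk: {small:.0f}x, 1 MB chunk: {large:.0f}x")
```

The ratio scales linearly with chunk size, which is why a larger compression window directly worsens random-access cost even as it improves the compression ratio.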
That's great, I'll give it a try and see if I can figure it out.
Currently, I use a sample access log from Kaggle for testing (https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs). For other types of logs, I sometimes use datasets from Loghub (https://github.com/logpai/loghub).
Thanks for the reference! I've come across this paper before but haven't had the chance to read it carefully yet; I'll take a closer look later.
@westonpace I went ahead and created a PR for this. Would love your feedback when you have a moment: #4959
In many compression algorithms (e.g., Zstandard), larger blocks often yield better compression ratios. However, in the current miniblock design, metadata constraints prevent flexible trade-offs between time and space.
Specifically:
The MiniBlockChunk structure restricts each chunk's buffer length to u16 (maximum 64 KB). In practice only 12 + 3 bits are used, further limiting the buffer length to 32 KB.
To explore the impact of lifting these limits, I modified the implementation to use u32 for the miniblock chunk metadata and conducted an experiment. The results clearly show that larger chunk sizes lead to better compression ratios. In scenarios where data is written significantly more often than it is queried, storage cost dominates, and the ability to increase chunk size could provide meaningful benefits. Currently, limiting the chunk metadata to u16 saves a small amount of space but prevents such trade-offs.
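For reference, the ceilings implied by those field widths can be checked with a little arithmetic. This is a sketch based only on the sizes mentioned above (u16, 12 + 3 bits, u32), not on Lance's actual struct layout:

```python
# Maximum buffer length addressable by each metadata width.
U16_MAX_BYTES = 2 ** 16           # 65536 -> 64 KB ceiling from a u16 length
USED_BITS = 12 + 3                # only 15 bits actually used
USED_MAX_BYTES = 2 ** USED_BITS   # 32768 -> effective 32 KB limit
U32_MAX_BYTES = 2 ** 32           # widening to u32 lifts the ceiling to 4 GiB

print(U16_MAX_BYTES // 1024, "KB")       # 64 KB
print(USED_MAX_BYTES // 1024, "KB")      # 32 KB
print(U32_MAX_BYTES // 2 ** 30, "GiB")   # 4 GiB
```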
I wonder whether there is a way to make this tunable, allowing more flexibility to adapt Lance to different usage scenarios. Thank you.
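The compression-ratio effect is easy to reproduce with any window-based codec. Here is a toy demonstration using Python's stdlib zlib (zstd behaves similarly, just with larger windows); the synthetic log lines are my own invention, not drawn from the datasets mentioned above:

```python
import zlib

# Synthetic, log-like data: each line is unique but structurally repetitive.
lines = [f"2024-01-01 12:00:{i % 60:02d} GET /api/resource/{i} 200 {i * 7}\n"
         for i in range(10_000)]
data = "".join(lines).encode()

def compressed_size(payload: bytes, block_bytes: int) -> int:
    """Compress `payload` in independent blocks and sum the sizes.

    Each block restarts the codec, modeling independently
    decompressible chunks.
    """
    return sum(len(zlib.compress(payload[i:i + block_bytes], 9))
               for i in range(0, len(payload), block_bytes))

for block in (4 * 1024, 32 * 1024, len(data)):
    size = compressed_size(data, block)
    print(f"block={block:>7} bytes -> {size:>7} compressed "
          f"({len(data) / size:.2f}x ratio)")
```

Smaller independent blocks pay per-block header overhead and lose cross-block matches, so the summed compressed size shrinks as the block size grows, which mirrors the chunk-size trade-off being discussed.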