Consolidation issue fix. #7
base: master
Conversation
k0ushal commented on May 30, 2025
- Fixed the tier-based candidate selection
- Default tiers are powers of 4, with the first tier being 0-4M, followed by 4-16M, 16-64M, and so on (see the sketch below)
- Fixed consolidation window of size 4
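For illustration, a minimal sketch of the default tier bounds described above (the exact 4 MiB value for the first tier and the standalone program are assumptions for this example):

#include <cstddef>
#include <cstdio>

int main() {
  // Assumed first-tier upper bound of 4 MiB; every following tier is 4x
  // larger, giving the 0-4M, 4-16M, 16-64M, ... ranges described above.
  std::size_t bound = std::size_t{4} << 20;
  for (int tier = 1; tier <= 5; ++tier) {
    std::printf("tier %d upper bound: %zu bytes\n", tier, bound);
    bound <<= 2;  // next power of 4
  }
  return 0;
}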
Force-pushed from 57714c9 to c1e6ebb
Comments as we talked about. Looks good to me!
core/utils/index_utils.cpp (outdated)
mergeBytes += itrMeta->byte_size;
skew = static_cast<double>(itrMeta->byte_size) / mergeBytes;
delCount += (itrMeta->docs_count - itrMeta->live_docs_count);
mergeScore = skew + (1.0 / (1 + delCount));
cost = mergeBytes * mergeScore;

size_t size_before_consolidation = 0;
size_t size_after_consolidation = 0;
size_t size_after_consolidation_floored = 0;
for (auto& segment_stat : consolidation) {
  size_before_consolidation += segment_stat.meta->byte_size;
  size_after_consolidation += segment_stat.size;
  size_after_consolidation_floored +=
    std::max(segment_stat.size, floor_segment_bytes);
} while (itr++ != end);
Probably inconsequential, but it would suffice to calculate skew, mergeScore and cost once after the loop, for the last element.
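A minimal sketch of that suggestion, assuming a plain container of segment metas with the byte_size / docs_count / live_docs_count fields used above (the struct, the function name and the std::vector input are illustrative, not the PR's actual interface):

#include <cstddef>
#include <vector>

// Hypothetical stand-in for the segment meta fields used in the snippet above.
struct SegmentMeta {
  std::size_t byte_size;
  std::size_t docs_count;
  std::size_t live_docs_count;
};

// Accumulate inside the loop; compute skew, mergeScore and cost only once,
// after the loop, for the last candidate. Expects a non-empty candidate set.
double consolidationCost(const std::vector<SegmentMeta>& candidates) {
  std::size_t mergeBytes = 0;
  std::size_t delCount = 0;
  for (const auto& meta : candidates) {
    mergeBytes += meta.byte_size;
    delCount += meta.docs_count - meta.live_docs_count;
  }
  const double skew =
      static_cast<double>(candidates.back().byte_size) / mergeBytes;
  const double mergeScore = skew + 1.0 / (1 + delCount);
  return mergeBytes * mergeScore;  // cost
}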
core/utils/index_utils.cpp (outdated)
size_t nextTier = ConsolidationConfig::tier1;
while (nextTier < num)
  nextTier = nextTier << 2;
Minor: You could probably use std::countl_zero and get rid of the loop.
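A rough sketch of a loop-free variant (assuming num > 0 and that ConsolidationConfig::tier1 is a power of two; std::bit_width is used here, which is defined in terms of std::countl_zero):

#include <bit>
#include <cstddef>

// Loop-free equivalent of:
//   nextTier = tier1; while (nextTier < num) nextTier = nextTier << 2;
// Assumes tier1 is a power of two (e.g. the 4 MiB first tier) and num > 0.
std::size_t nextConsolidationTier(std::size_t num, std::size_t tier1) {
  if (num <= tier1) {
    return tier1;
  }
  // Exponent gap between the smallest power of two >= num and tier1.
  const int diff =
      static_cast<int>(std::bit_width(num - 1)) - std::countr_zero(tier1);
  // Round the shift up to an even number so the result stays a power of 4
  // relative to tier1.
  const int shift = (diff + 1) & ~1;
  return tier1 << shift;
}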
core/utils/index_utils.cpp (outdated)
mergeBytes = mergeBytes - removeMeta->byte_size + addMeta->byte_size;
skew = static_cast<double>(addMeta->byte_size) / mergeBytes;
delCount = delCount - getDelCount(removeMeta) + getDelCount(addMeta);
mergeScore = skew + (1 / (1 + delCount));
As already discussed: We should think about whether calculating the mergeScore this way is sensible. What seems strange is that while the skew is a ratio (of byte-sizes), the second summand is an inverse count. This seems off: intuitively I'd expect e.g. a ratio of live and total documents to be considered alongside the skew.
This is actually quite bad the way it is, worse than we noticed yesterday @k0ushal.
Note that 1 / (1 + delCount) is an integer division here, so the second summand is 0 for every delCount >= 1. So this way we are always allowed to consolidate if only one document has been deleted, no matter the size of the files or the number of documents therein.
Let us at least do
mergeScore = skew + live_docs_count / total_docs_count;
instead, as discussed - this has more reasonable properties.
And as a second observation @neunhoef made today while discussing this: adding these two values is probably not right, either. They should be multiplied instead; the maxMergeScore will need to be adjusted to 0.5 to get a similar effect. So we should actually do
mergeScore = skew * live_docs_count / total_docs_count;
(and adapt maxMergeScore accordingly).
To understand this better, we should still do some formal worst-case analysis and some tests (specifically unit tests of the consolidation algorithm that play out certain usage scenarios).
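A sketch of the two variants proposed above, side by side (the free-standing helpers and parameter names are illustrative; the real code maintains these counters incrementally while sliding the window):

#include <cstddef>

// Additive variant: skew plus the live/total document ratio.
double mergeScoreAdditive(double skew, std::size_t live_docs_count,
                          std::size_t total_docs_count) {
  return skew + static_cast<double>(live_docs_count) / total_docs_count;
}

// Multiplicative variant: both factors lie in (0, 1], so the product does
// too, and the acceptance threshold (maxMergeScore) has to be lowered
// accordingly (0.5 was suggested above).
double mergeScoreMultiplicative(double skew, std::size_t live_docs_count,
                                std::size_t total_docs_count) {
  return skew * static_cast<double>(live_docs_count) / total_docs_count;
}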
core/utils/index_utils.hpp (outdated)
for (auto idx = start; idx != sorted_segments.end();) {

  if (getSize(*idx) <= currentTier) {
    idx++;
    continue;
  }

  tiers.emplace_back(start, idx - 1);

  // The next tier may not necessarily be in the
  // next power of 4.
  // Consider this example,
  // [2, 4, 6, 8, 900]
  // While the 2, 4 fall in the 0-4 tier and 6, 8 fall
  // in the 4-16 tier, the last segment falls in
  // the [256-1024] tier.

  currentTier = getConsolidationTier(getSize(*idx));
  start = idx++;
}
As discussed: finding the tier boundaries could be done by binary search, possibly utilizing std::lower_bound / std::upper_bound.
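A minimal sketch of the binary-search lookup, assuming the segment sizes are available as an ascending std::vector (a simplification of the sorted_segments range above; the helper name is made up):

#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the first segment that no longer fits into the current tier,
// i.e. the first size strictly greater than the tier's upper bound.
// sorted_sizes must be sorted in ascending order.
std::vector<std::size_t>::const_iterator firstOfNextTier(
    const std::vector<std::size_t>& sorted_sizes,
    std::size_t currentTierUpperBound) {
  return std::upper_bound(sorted_sizes.begin(), sorted_sizes.end(),
                          currentTierUpperBound);
}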
Force-pushed from 07286d8 to 872d553
Force-pushed from fb73fcd to f6305e3
Force-pushed from f6305e3 to 21a2f95
Force-pushed from 21a2f95 to d91b909
Changed consolidation config defaults
Disabled irrelevant tests
Force-pushed from 79070ae to 5165f01
Force-pushed from 5165f01 to 9cfc1fc