Conversation


@k0ushal k0ushal commented May 30, 2025

  • Fixed the tier-based candidate selection
  • Default tiers are powers of 4, with the first tier covering 0-4M, followed by 4-16M, 16-64M, and so on.
  • Fixed the consolidation window at a size of 4
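The tier layout described above can be illustrated with a small sketch (names like `tierIndex` and `kMiB` are hypothetical, not from the PR): tier boundaries grow by powers of 4 starting at 4M.

```cpp
#include <cstddef>

// Hypothetical illustration of the default tier layout: tier 0 covers
// (0, 4M], tier 1 covers (4M, 16M], tier 2 (16M, 64M], and so on, with
// each boundary 4x the previous one.
constexpr std::size_t kMiB = std::size_t{1} << 20;

std::size_t tierIndex(std::size_t bytes) {
  std::size_t upper = 4 * kMiB;  // first tier boundary: 4M
  std::size_t tier = 0;
  while (bytes > upper) {
    upper <<= 2;  // next boundary is 4x the previous one
    ++tier;
  }
  return tier;
}
```

For example, a 900M segment lands in the fifth tier, [256M, 1024M], even though the segments before it may occupy the first two tiers.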

@k0ushal k0ushal self-assigned this May 30, 2025
Author

k0ushal commented May 30, 2025

Documentation:
https://github.com/arangodb/documents/pull/145

@k0ushal k0ushal requested a review from alexbakharew May 30, 2025 08:09
@k0ushal k0ushal marked this pull request as draft July 9, 2025 08:18
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 57714c9 to c1e6ebb Compare July 11, 2025 12:34
@k0ushal k0ushal changed the base branch from master to bugfix/iresearch-address-table-tests July 14, 2025 09:03
@k0ushal k0ushal changed the base branch from bugfix/iresearch-address-table-tests to master July 14, 2025 09:04
@k0ushal k0ushal changed the base branch from master to bugfix/iresearch-address-table-tests July 14, 2025 09:05
Member

@goedderz goedderz left a comment

Comments as we talked about. Looks good to me!

Comment on lines 51 to 57
mergeBytes += itrMeta->byte_size;
skew = static_cast<double>(itrMeta->byte_size) / mergeBytes;
delCount += (itrMeta->docs_count - itrMeta->live_docs_count);
mergeScore = skew + (1.0 / (1 + delCount));
cost = mergeBytes * mergeScore;

size_t size_before_consolidation = 0;
size_t size_after_consolidation = 0;
size_t size_after_consolidation_floored = 0;
for (auto& segment_stat : consolidation) {
  size_before_consolidation += segment_stat.meta->byte_size;
  size_after_consolidation += segment_stat.size;
  size_after_consolidation_floored +=
      std::max(segment_stat.size, floor_segment_bytes);
} while (itr++ != end);
Member

Probably inconsequential, but it would suffice to calculate skew, mergeScore and cost once after the loop for the last element.
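A minimal sketch of that suggestion (types and names are hypothetical, modeled on the excerpt above; it keeps the review-time additive formula that later comments propose changing): only the additive accumulators are updated per segment, so the derived values can be computed once afterwards, for the last element.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the segment metadata in the excerpt.
struct SegmentMeta {
  std::size_t byte_size;
  std::uint64_t docs_count;
  std::uint64_t live_docs_count;
};

struct Score {
  double skew, mergeScore, cost;
};

// Assumes a non-empty candidate set (mergeBytes would otherwise be 0).
Score scoreCandidates(const std::vector<SegmentMeta>& segments) {
  std::size_t mergeBytes = 0;
  std::uint64_t delCount = 0;
  std::size_t lastBytes = 0;
  for (const auto& meta : segments) {
    // Only running sums inside the loop; nothing derived yet.
    mergeBytes += meta.byte_size;
    delCount += meta.docs_count - meta.live_docs_count;
    lastBytes = meta.byte_size;
  }
  // Derived values computed once, after the loop, for the last element.
  double skew = static_cast<double>(lastBytes) / mergeBytes;
  double mergeScore = skew + 1.0 / (1 + delCount);
  double cost = mergeBytes * mergeScore;
  return {skew, mergeScore, cost};
}
```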

Comment on lines 90 to 92
size_t nextTier = ConsolidationConfig::tier1;
while (nextTier < num)
  nextTier = nextTier << 2;
Member

Minor: You could probably use std::countl_zero and get rid of the loop.

mergeBytes = mergeBytes - removeMeta->byte_size + addMeta->byte_size;
skew = static_cast<double>(addMeta->byte_size) / mergeBytes;
delCount = delCount - getDelCount(removeMeta) + getDelCount(addMeta);
mergeScore = skew + (1.0 / (1 + delCount));
Member

As already discussed:

We should think about whether calculating the mergeScore this way is sensible. What seems strange is that while the skew is a ratio (of byte-sizes), the second summand is an inverse count. This seems off: intuitively I'd expect e.g. a ratio of live and total documents to be considered alongside the skew.

Member

@goedderz goedderz Jul 15, 2025

This is actually quite bad the way it is, worse than we noticed yesterday @k0ushal.

Note that $\mathrm{skew} \in (0, 1)$. With $\mathrm{delCount} = 1$, we get

$$\mathrm{mergeScore} = \mathrm{skew} + \frac{1}{1 + \mathrm{delCount}} = \mathrm{skew} + \frac{1}{2} \leq \frac{3}{2} = \mathrm{maxMergeScore}.$$

So this way we are always allowed to consolidate if only one document has been deleted, no matter the size of the files or number of documents therein.

Let us at least do

    mergeScore = skew + live_docs_count / total_docs_count;

instead, as discussed - this has more reasonable properties.

And as a second observation @neunhoef made today while discussing this: Adding these two values is probably not right, either. They should be multiplied instead; the maxMergeScore will need to be adjusted to 0.5 to get a similar effect.

So we should actually do

    mergeScore = skew * live_docs_count / total_docs_count;

(and adapt maxMergeScore).

To understand this better, we should still do some formal worst-case analysis and some tests (specifically unit tests of the consolidation algorithm that play out certain usage scenarios).

Comment on lines 162 to 241
for (auto idx = start; idx != sorted_segments.end();) {
  if (getSize(*idx) <= currentTier) {
    idx++;
    continue;
  }

  tiers.emplace_back(start, idx - 1);

  // The next tier may not necessarily be in the
  // next power of 4.
  // Consider this example,
  // [2, 4, 6, 8, 900]
  // While the 2, 4 fall in the 0-4 tier and 6, 8 fall
  // in the 4-16 tier, the last segment falls in
  // the [256-1024] tier.

  currentTier = getConsolidationTier(getSize(*idx));
  start = idx++;
}
Member

As discussed: finding the tier-boundaries could be done by binary search, possibly utilizing std::lower_bound / std::upper_bound.
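A sketch of that suggestion, assuming the segment sizes are sorted ascending (the function and parameter names here are hypothetical): `std::upper_bound` finds the first segment exceeding the current tier's upper boundary in O(log n) comparisons instead of a linear scan.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns an iterator to the first segment whose size exceeds
// tierUpperBound, searching in [start, sizes.end()). Requires the
// sizes to be sorted in ascending order.
std::vector<std::size_t>::const_iterator
findTierEnd(const std::vector<std::size_t>& sizes,
            std::vector<std::size_t>::const_iterator start,
            std::size_t tierUpperBound) {
  return std::upper_bound(start, sizes.end(), tierUpperBound);
}
```

For the example [2, 4, 6, 8, 900] with tier boundary 4, this returns the iterator to 6, i.e. the start of the next tier.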


@k0ushal k0ushal force-pushed the bugfix/iresearch-address-table-tests branch from 07286d8 to 872d553 Compare July 16, 2025 19:35
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch 2 times, most recently from fb73fcd to f6305e3 Compare July 17, 2025 07:34
@k0ushal k0ushal deleted the branch master July 18, 2025 07:56
@k0ushal k0ushal closed this Jul 18, 2025
@goedderz goedderz reopened this Jul 23, 2025
@goedderz goedderz changed the base branch from bugfix/iresearch-address-table-tests to master July 23, 2025 13:02
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from f6305e3 to 21a2f95 Compare July 23, 2025 13:05
k0ushal added 2 commits July 24, 2025 15:42
- Fixed the tier based candidate selection
- Default tiers are powers of 4 with the first tier
being 0-4M followed by 4-16M, 16-64M and so on.
- Fixed consolidation window of size 4
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 21a2f95 to d91b909 Compare July 24, 2025 15:43
@k0ushal k0ushal requested a review from goedderz August 25, 2025 07:49
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 79070ae to 5165f01 Compare August 26, 2025 07:41
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 5165f01 to 9cfc1fc Compare August 26, 2025 07:56
@k0ushal k0ushal marked this pull request as ready for review August 26, 2025 11:36