checker, scatter, server: add load-based split-scatter with range-aware group baseline #10621

lhy1024 wants to merge 17 commits into tikv:master from
Conversation
Skipping CI for Draft Pull Request.

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the corresponding setting.
📝 Walkthrough

Records load-triggered batch splits, tracks pending split-scatter tasks with TTL/retry/waitVersion, dispatches internal scatter operators with shared group IDs using range hints, adds a Checker patrol phase metric, and includes unit and integration tests plus an internal ScatterInternal API.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Store as TiKV Store
    participant ClusterWorker as Cluster Worker
    participant CheckerController as Checker Controller
    participant SplitScatter as Split-Scatter Controller
    participant Scatterer as Region Scatterer
    Store->>ClusterWorker: AskBatchSplit (SplitReason_LOAD)
    ClusterWorker->>ClusterWorker: collect newRegionIDs
    ClusterWorker->>CheckerController: RecordSplitScatterBatch(sourceRegionID, newRegionIDs)
    CheckerController->>SplitScatter: create pending entries (group, TTL, waitVersion)
    Store->>ClusterWorker: Region heartbeats (new regions)
    ClusterWorker->>CheckerController: patrol tick (dispatch phase)
    CheckerController->>SplitScatter: dispatchSplitScatterRegions()
    SplitScatter->>SplitScatter: filter non-expired & eligible, check waitVersion
    SplitScatter->>SplitScatter: verify replica counts, resolve range hint
    SplitScatter->>Scatterer: ScatterInternal(region, group, startKey, endKey)
    Scatterer->>Scatterer: select peers/leaders (local state), create operator
    Scatterer-->>SplitScatter: operator or error
    alt success
        SplitScatter->>SplitScatter: remove pending entry
    else insufficient replicas or throttled or failure
        SplitScatter->>SplitScatter: set retryAt/backoff
    end
```
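As a reading aid for the flow above: a minimal sketch of what a pending split-scatter entry could look like. The field names follow the review comments further down (`splitScatterPendingItem`, `waitVersion`, `expireAt`, `retryAt`); the exact layout in the PR may differ.

```go
package checker

import "time"

// Sketch only — approximates the pending entry this PR tracks per split
// region; not the merged code.
type splitScatterPendingItem struct {
	regionID    uint64    // region (new child or split source) awaiting scatter
	group       string    // shared group ID: "split-scatter-{sourceID}-{firstNewID}"
	waitVersion uint64    // dispatch only once the region epoch reaches this version
	expireAt    time.Time // TTL after which the entry is silently dropped
	retryAt     time.Time // backoff gate set after a failed dispatch attempt
}
```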
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
…re group baseline Signed-off-by: lhy1024 <admin@liudos.us>
5721e5b to 1d54bda (Compare)
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (6)
pkg/schedule/checker/checker_controller.go (1)
171-174: `measure(... phaseCheckPending ...)` now covers two distinct phases.

The histogram bucket `phaseCheckPending` previously timed only `checkPendingProcessedRegions`. Including `dispatchSplitScatterRegions` in the same `measure` block silently changes the metric's semantics and could mask regressions in either phase. Consider a dedicated phase for split-scatter dispatch (or at least a separate `measure` call inside this lambda).

♻️ Proposed split

```diff
-	measure(c.metrics.patrolPhaseHistograms[phaseCheckPending], func() {
-		c.splitScatter.dispatchSplitScatterRegions()
-		c.checkPendingProcessedRegions()
-	})
+	measure(c.metrics.patrolPhaseHistograms[phaseDispatchSplitScatter], func() {
+		c.splitScatter.dispatchSplitScatterRegions()
+	})
+	measure(c.metrics.patrolPhaseHistograms[phaseCheckPending], func() {
+		c.checkPendingProcessedRegions()
+	})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/checker/checker_controller.go` around lines 171 - 174, The current measure call around c.splitScatter.dispatchSplitScatterRegions() and c.checkPendingProcessedRegions() mixes two phases into c.metrics.patrolPhaseHistograms[phaseCheckPending]; separate them so each phase is timed independently: call measure(...) once around c.splitScatter.dispatchSplitScatterRegions() (using a new histogram key like phaseDispatchSplitScatter or another appropriate histogram in c.metrics.patrolPhaseHistograms) and keep a separate measure(...) around c.checkPendingProcessedRegions() that continues to use phaseCheckPending; update any histogram key enums/constants to include the new phase name and ensure both measure invocations reference the correct histogram entries.

pkg/schedule/checker/split_scatter.go (2)
149-165: `recordSplitScatterBatch` calls `cluster.GetRegion` while holding `pendingMu` — minor risk.

Line 161 reads from the cluster while the split-scatter write lock is held. `GetRegion` typically takes a separate basic-cluster RW lock, so today this is safe, but it's a non-trivial cross-lock acquisition pattern. Consider reading `sourceRegion`'s version before taking `pendingMu` to keep the critical section purely local to the pending map and lock-ordering trivially correct.

♻️ Suggested ordering

```diff
 func (c *splitScatterController) recordSplitScatterBatch(sourceRegionID uint64, newRegionIDs []uint64) {
 	if len(newRegionIDs) == 0 {
 		return
 	}
 	group := makeSplitScatterGroup(sourceRegionID, newRegionIDs[0])
 	expireAt := time.Now().Add(splitScatterPendingTTL)
+	sourcePending := splitScatterPendingItem{regionID: sourceRegionID, group: group, waitVersion: 1, expireAt: expireAt}
+	if sourceRegion := c.cluster.GetRegion(sourceRegionID); sourceRegion != nil && sourceRegion.GetRegionEpoch() != nil {
+		sourcePending.waitVersion = sourceRegion.GetRegionEpoch().GetVersion() + 1
+	}
 	c.pendingMu.Lock()
 	defer c.pendingMu.Unlock()
 	for _, regionID := range newRegionIDs {
 		c.pending[regionID] = splitScatterPendingItem{regionID: regionID, group: group, expireAt: expireAt}
 	}
-	sourcePending := splitScatterPendingItem{regionID: sourceRegionID, group: group, waitVersion: 1, expireAt: expireAt}
-	if sourceRegion := c.cluster.GetRegion(sourceRegionID); sourceRegion != nil && sourceRegion.GetRegionEpoch() != nil {
-		sourcePending.waitVersion = sourceRegion.GetRegionEpoch().GetVersion() + 1
-	}
 	c.pending[sourceRegionID] = sourcePending
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/checker/split_scatter.go` around lines 149 - 165, The function recordSplitScatterBatch currently calls c.cluster.GetRegion while holding c.pendingMu which creates a cross-lock acquisition; to fix, call c.cluster.GetRegion(sourceRegionID) and compute sourcePending.waitVersion (using sourceRegion.GetRegionEpoch().GetVersion()+1 if non-nil) before acquiring c.pendingMu, then lock c.pendingMu and populate c.pending with splitScatterPendingItem entries (including the precomputed sourcePending) and set expireAt; keep all other logic (makeSplitScatterGroup, splitScatterPendingTTL) the same and only move the cluster read out of the critical section.
33-37: Hard-coded tuning constants — PR objective calls for them to be configurable.
`splitScatterDispatchLimit`, `splitScatterRetryBackoff`, and `splitScatterPendingTTL` are package-level constants. Issue #10592 mentions "configurable limit" for the per-tick dispatch. For a feature that may need ops tuning under load, please thread these through `config.CheckerConfigProvider` (or a new sub-config) so they're adjustable without recompile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/checker/split_scatter.go` around lines 33 - 37, Replace the package-level constants splitScatterDispatchLimit, splitScatterRetryBackoff, and splitScatterPendingTTL with configurable fields supplied by the existing config.CheckerConfigProvider (or add a small SplitScatter sub-config on that provider); update split_scatter.go to read defaults from the provider (using existing sensible defaults equal to the current constants) and reference the provider values wherever splitScatterDispatchLimit, splitScatterRetryBackoff, and splitScatterPendingTTL were used (e.g., in the dispatch loop, retry logic, and TTL checks). Add accessor methods or struct fields on CheckerConfigProvider (or a new SplitScatterConfig struct) so callers can obtain int dispatchLimit, time.Duration retryBackoff, and time.Duration pendingTTL; ensure unit tests or call sites are updated to use the provider-based values and preserve current behavior when the provider returns zero/nil by falling back to the previous constants as defaults.
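For illustration, one way such a sub-config could look. `SplitScatterConfig` and its field names are hypothetical; only the dispatch limit (4) and the TTL (3m) are values mentioned elsewhere in this thread, and the backoff default is a pure placeholder.

```go
package config

import "time"

// Hypothetical sub-config sketching what the prompt above asks for.
type SplitScatterConfig struct {
	DispatchLimit int           // max pending entries dispatched per patrol tick
	RetryBackoff  time.Duration // delay before retrying a failed dispatch
	PendingTTL    time.Duration // lifetime of a pending entry before it is dropped
}

// withDefaults falls back to the current constants when fields are unset,
// preserving today's behavior for existing configs.
func (c SplitScatterConfig) withDefaults() SplitScatterConfig {
	if c.DispatchLimit <= 0 {
		c.DispatchLimit = 4 // per-tick limit observed in the review thread below
	}
	if c.RetryBackoff <= 0 {
		c.RetryBackoff = 30 * time.Second // placeholder; real default unknown
	}
	if c.PendingTTL <= 0 {
		c.PendingTTL = 3 * time.Minute // "3m TTL" mentioned in a later comment
	}
	return c
}
```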
pkg/mcs/scheduling/server/grpc_service.go (1)

443-445: TODO is fine, but please consider opening a tracking issue.

The TODO references the NEXT_GEN/scheduling-service path — acceptable for this PR since the load-based split-scatter only runs in monolithic PD mode today. To avoid drift, consider linking a tracking issue ID in the comment so the gap doesn't get lost once this lands.
Want me to draft a follow-up issue referencing this TODO and the matching one in server/grpc_service.go?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/scheduling/server/grpc_service.go` around lines 443 - 445, The TODO about load-based split-scatter in grpc_service.go should be linked to a tracking issue: create a short follow-up issue describing the gap (recording split-scatter batches and plumbing SplitReason for scheduling-service/NEXT_GEN) and then update the TODO comment near the existing "TODO(split-scatter)" (and the matching TODO in server/grpc_service.go) to include the issue number/URL (e.g. "TODO(split-scatter): tracking: ISSUE-1234") so future readers can find the follow-up; ensure the comment mentions the feature ("load-based split-scatter"), the affected path ("scheduling-service/NEXT_GEN"), and the intended work ("record split-scatter batches and plumb SplitReason").

pkg/schedule/scatter/region_scatterer.go (2)
374-400: `newInternalScatterState` cost scales with total running ops, not just this group.

Every call to `ScatterInternal` rebuilds the scatter state from scratch:

- `seedScatterStateByRange` walks all stores and calls `GetStorePeerCountByRange`/`GetStoreLeaderCountByRange` per store.
- `applyRunningScatterOpsDelta` iterates the entire `opController.GetOperators()` set, even though only ops with `Desc() == InternalScatterOperatorDesc` and matching `group` contribute. For each match, it then calls `r.cluster.GetRegion(op.RegionID())`.

For a split-scatter batch dispatching N regions sequentially, this is O(N * total_running_ops). In a busy cluster (10k+ pending operators), this can become a noticeable per-region tax on a hot dispatch path.

Consider one of:

- caching `state` per `group` for the duration of a batch and only re-applying the delta of newly added ops since the last call;
- maintaining an index of internal-scatter ops keyed by `group` in the controller (or a lightweight registry inside `RegionScatterer`) to avoid the linear scan.

Optional — not a correctness issue, but worth a perf check before this lands behind a feature flag.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/scatter/region_scatterer.go` around lines 374 - 400, The hot path in newInternalScatterState currently scans all ops via opController.GetOperators() in applyRunningScatterOpsDelta (filtering by InternalScatterOperatorDesc and group) and calls r.cluster.GetRegion for each match, causing O(N * total_running_ops) cost; fix by avoiding a full scan — either cache the built scatterState per group for the lifetime of a scatter batch and only apply deltas for newly added ops, or maintain an index of internal-scatter operators keyed by group (in the controller or a lightweight map inside RegionScatterer) so applyRunningScatterOpsDelta can iterate only relevant ops; update newInternalScatterState to use the group-cache or indexed lookup and change applyRunningScatterOpsDelta to iterate the filtered set (still using finalPlacementAfterOperator and r.cluster.GetRegion for matched ops) and ensure cache invalidation when ops finish or group batch ends.
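One possible shape for the group-keyed registry idea, sketched with simplified types; none of these names exist in pd today.

```go
package scatter

import "sync"

// internalScatterIndex tracks running internal-scatter operators per group so
// the delta pass can iterate only relevant ops instead of every operator.
type internalScatterIndex struct {
	mu      sync.Mutex
	byGroup map[string]map[uint64]struct{} // group -> set of running op region IDs
}

func (i *internalScatterIndex) add(group string, regionID uint64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if i.byGroup == nil {
		i.byGroup = make(map[string]map[uint64]struct{})
	}
	set := i.byGroup[group]
	if set == nil {
		set = make(map[uint64]struct{})
		i.byGroup[group] = set
	}
	set[regionID] = struct{}{}
}

// remove must run when an operator finishes or is canceled; otherwise the
// delta applied to the scatter state drifts.
func (i *internalScatterIndex) remove(group string, regionID uint64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if set := i.byGroup[group]; set != nil {
		delete(set, regionID)
		if len(set) == 0 {
			delete(i.byGroup, group)
		}
	}
}
```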
765-808: Internal-scatter tie-break is hard to follow; consider extracting a comparator.

The compound condition

```go
if storeCount < minCount ||
	(internalScatter && storeCount == minCount &&
		(storeRegionCount < bestStoreRegionCount ||
			(storeRegionCount == bestStoreRegionCount &&
				(newPeer == nil || store.GetID() < newPeer.GetStoreId())))) {
```

mixes "strictly better" with three internal-only tie-breakers, making the intent (lower group count → fewer total regions → smaller store ID, deterministic) hard to verify at a glance. A small helper (e.g., `betterCandidate(internalScatter, storeCount, minCount, storeRegionCount, bestStoreRegionCount, store, newPeer) bool`) or a tuple comparison would make both the admin and internal paths self-documenting and easier to extend if a future tiebreaker (e.g., score) is added.

Also note that `bestStoreRegionCount` is logically meaningful only on the internal-scatter path, but it's being maintained on the admin path too. Splitting the admin/internal selection logic would remove that incidental coupling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/scatter/region_scatterer.go` around lines 765 - 808, The selection condition in the loop inside (region_scatterer.go) is doing two concerns at once (admin vs internal tie-breakers) and should be extracted into a clear comparator; implement a helper like betterCandidate(internalScatter bool, storeCount, minCount uint64, storeRegionCount, bestStoreRegionCount int, store metapb.Store, newPeer *metapb.Peer) bool that returns true when the current store is strictly a better candidate (first by lower storeCount, then, only when internalScatter and storeCount==minCount, by lower storeRegionCount, then by smaller store ID), replace the compound if with a call to that helper, and ensure bestStoreRegionCount is only read/updated in the internalScatter path so admin selection does not depend on region counts.
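A sketch of the proposed comparator with simplified parameters (plain IDs instead of pd's store/peer types; `haveBest` stands in for the `newPeer == nil` check):

```go
package scatter

// betterCandidate reports whether the current store strictly beats the best
// candidate so far: first by lower per-group peer count, then — only on the
// internal-scatter path when counts tie — by fewer total regions, then by
// smaller store ID for determinism. Parameter types are simplified.
func betterCandidate(
	internalScatter bool,
	storeCount, minCount uint64,
	storeRegionCount, bestStoreRegionCount int,
	storeID, currentBestStoreID uint64, haveBest bool,
) bool {
	if storeCount < minCount {
		return true // strictly fewer peers of this group on the store
	}
	if !internalScatter || storeCount != minCount {
		return false // admin path keeps only the strict comparison
	}
	if storeRegionCount != bestStoreRegionCount {
		return storeRegionCount < bestStoreRegionCount
	}
	return !haveBest || storeID < currentBestStoreID
}
```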
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/schedule/checker/split_scatter_group.go`:
- Around line 54-63: Add a brief clarifying comment immediately before the
conditional that compares region.GetEndKey() to indexRange.endKey stating that
both endKey (from region.GetEndKey()) and indexRange.startKey/indexRange.endKey
(from codec.EncodeBytes()) are MemComparable-encoded, so the lexicographic
bytes.Compare is valid; leave the logic unchanged (the if using bytes.Compare
and the call to splitScatterPrefixRange(tablePrefix) should remain as-is). (A self-contained sketch of why bytes.Compare is valid on MemComparable-encoded keys follows this comment list.)
In `@pkg/schedule/checker/split_scatter.go`:
- Around line 167-212: dispatchSplitScatterRegions currently retries pending
entries without backoff on two failure paths (ScatterInternal error and
opController.AddOperator false), causing busy-retry/log spam; ensure every
non-success path calls delayPendingSplitScatter with splitScatterRetryBackoff
before continuing. Specifically, inside dispatchSplitScatterRegions handle the
error return from c.regionScatterer.ScatterInternal by calling
c.delayPendingSplitScatter(pending.regionID, splitScatterRetryBackoff) before
the log/continue, and likewise when c.opController.AddOperator(op) returns false
call c.delayPendingSplitScatter(pending.regionID, splitScatterRetryBackoff)
before logging and continue; keep existing pending deletion logic unchanged for
success paths.
- Around line 78-127: collectTopPendingSplitScatter currently selects candidates
from the map c.pending in randomized map order and truncates to limit, so add a
clear TODO referencing issue `#10592` to indicate priority ordering is
intentionally deferred; specifically, inside collectTopPendingSplitScatter after
building the candidates slice (before truncating or returning) insert a TODO
comment mentioning `#10592` and that candidates should be sorted by the region
priority score (CPU/byte traffic) or stored in a heap for bounded priority
behavior (or implement sorting of candidates by score before truncation if you
choose to implement now), referencing symbols candidates, pending, and
splitScatterDispatchLimit to guide where to change.
In `@pkg/schedule/scatter/region_scatterer_test.go`:
- Around line 1330-1335: The test currently allows op to be nil and only checks
its descriptor conditionally, which can hide regressions in ScatterInternal;
change the assertion to require op to be non-nil (use re.NotNil(op)) immediately
after re.NoError(err) and then assert re.Equal(InternalScatterOperatorDesc,
op.Desc()); if a (nil, nil) result is actually acceptable in some setups (e.g.,
due to isSameDistribution), instead make the test pre-arrange the region/cluster
state so a scatter is required (or assert the scatter counter is incremented)
before relaxing the NotNil requirement for ScatterInternal in
region_scatterer_test.go.
In `@server/cluster/split_scatter_test.go`:
- Around line 137-141: The test currently relies on a 50ms sleep inside
dispatchSplitScatterInPatrol to hope the patrol run completes, which is flaky;
modify the test to wait deterministically for at least one patrol pending-check
pass before asserting no operator was created by polling a visible signal such
as cluster.GetPatrolRegionsDuration() (or a newly exposed patrol run counter)
until it advances, or loop with a short timeout checking that
GetPatrolRegionsDuration() has increased, then call
cluster.GetOperatorController().GetOperator(splitRegionID) and assert nil; keep
references to dispatchSplitScatterInPatrol, GetPatrolRegionsDuration (or the new
counter), GetOperatorController().GetOperator, and splitRegionID so the change
is localized and deterministic.
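As referenced in the first prompt above, here is a self-contained sketch of why `bytes.Compare` is valid on MemComparable-encoded keys. `encodeBytes` below is a simplified reimplementation of the TiKV group-of-8 encoding for illustration only; pd's real code uses its `codec` package.

```go
package main

import (
	"bytes"
	"fmt"
)

// encodeBytes: groups of 8 bytes, zero-padded, marker byte = 0xFF - pad count.
// Because padding bytes are 0x00 and the marker grows with the number of real
// bytes, lexicographic order of the encoded form matches the raw form.
func encodeBytes(data []byte) []byte {
	var out []byte
	for i := 0; ; i += 8 {
		remain := len(data) - i
		if remain >= 8 {
			out = append(out, data[i:i+8]...)
			out = append(out, 0xFF) // full group marker
			if remain == 8 {
				// length is a multiple of 8: terminate with an empty group
				out = append(out, make([]byte, 8)...)
				out = append(out, 0xF7)
				break
			}
			continue
		}
		pad := 8 - remain
		out = append(out, data[i:]...)
		out = append(out, make([]byte, pad)...) // zero padding
		out = append(out, 0xFF-byte(pad))       // marker encodes real byte count
		break
	}
	return out
}

func main() {
	a := encodeBytes([]byte("t\x80\x00\x01"))
	b := encodeBytes([]byte("t\x80\x00\x02"))
	fmt.Println(bytes.Compare(a, b) < 0) // true: byte order survives encoding
}
```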
📒 Files selected for processing (10)
- pkg/mcs/scheduling/server/grpc_service.go
- pkg/schedule/checker/checker_controller.go
- pkg/schedule/checker/split_scatter.go
- pkg/schedule/checker/split_scatter_group.go
- pkg/schedule/checker/split_scatter_test.go
- pkg/schedule/scatter/region_scatterer.go
- pkg/schedule/scatter/region_scatterer_test.go
- server/cluster/cluster_worker.go
- server/cluster/split_scatter_test.go
- server/grpc_service.go
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master   #10621      +/-   ##
==========================================
+ Coverage   78.96%   79.02%    +0.05%
==========================================
  Files         532      534        +2
  Lines       71883    72443      +560
==========================================
+ Hits        56766    57251      +485
- Misses      11093    11144       +51
- Partials     4024     4048       +24
```

Flags with carried forward coverage won't be shown.
Signed-off-by: lhy1024 <admin@liudos.us>
29a8841 to 77a5808 (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/schedule/scatter/region_scatterer_test.go (1)
521-573: ⚠️ Potential issue | 🟠 Major

Register per-case cleanup immediately after enabling `scatterFail` and creating the stream/context.

If any `require` fails after Line 543, the failpoint stays enabled and the cleanup at Lines 572-573 never runs. That can contaminate later tests in this package and leak goroutines/streams. Wrap each case in a subtest/closure with deferred cleanup, or register cleanup right after allocation/enabling.

As per coding guidelines, `**/*_test.go`: Tests must handle failpoints using `github.com/pingcap/failpoint`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/scatter/region_scatterer_test.go` around lines 521 - 573, After creating ctx/cancel and stream and immediately after enabling the "github.com/tikv/pd/pkg/schedule/scatter/scatterFail" failpoint for a test case, register per-case deferred cleanup so resources and failpoints are always cleared: i.e., right after ctx, cancel, stream := ... add defer stream.Close() and defer cancel(), and when you call failpoint.Enable(".../scatterFail", ...) add a matching deferred failpoint.Disable(".../scatterFail") (only when enable succeeded) so the failpoint is disabled even if requires/asserts fail later; refer to symbols ctx, cancel, stream, and the failpoint name "github.com/tikv/pd/pkg/schedule/scatter/scatterFail" and ensure the disable call runs before the loop continues.
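A minimal sketch of that cleanup pattern with the real `github.com/pingcap/failpoint` API; the helper name and the `return(true)` term are illustrative, and the surrounding test body is elided.

```go
package scatter

import (
	"testing"

	"github.com/pingcap/failpoint"
	"github.com/stretchr/testify/require"
)

// enableScatterFail turns on the failpoint and registers its disable via
// t.Cleanup, so a failed assertion cannot leak the failpoint into later tests.
// The failpoint path is taken from the comment above.
func enableScatterFail(t *testing.T) {
	const fp = "github.com/tikv/pd/pkg/schedule/scatter/scatterFail"
	require.NoError(t, failpoint.Enable(fp, `return(true)`))
	t.Cleanup(func() {
		require.NoError(t, failpoint.Disable(fp))
	})
}
```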
♻️ Duplicate comments (1)
server/cluster/split_scatter_test.go (1)
100-115: ⚠️ Potential issue | 🟡 Minor

Derive the expected group from `splitRegionID` instead of hardcoding `1`.

Line 115 bakes in the allocator's current first ID, but the contract is `split-scatter-{sourceRegionID}-{firstNewRegionID}`. If another setup path consumes IDs earlier, this test fails even though the behavior is correct.

🛠️ Suggested fix

```diff
+	"fmt"
 	"strings"
 	"testing"
 	"time"
@@
-	re.Equal("split-scatter-100-1", opGroup)
+	re.Equal(fmt.Sprintf("split-scatter-100-%d", splitRegionID), opGroup)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/cluster/split_scatter_test.go` around lines 100 - 115, The test hardcodes the expected operator group string ("split-scatter-100-1"); instead build it using the computed splitRegionID so the test reads the actual firstNewRegionID: create an expectedGroup string (e.g. via fmt.Sprintf("split-scatter-100-%d", splitRegionID) or strconv.FormatUint) and replace the literal in the re.Equal check that compares opGroup from op.GetAdditionalInfo("group") to the hardcoded value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/schedule/checker/split_scatter.go`:
- Around line 131-140: The fix is to ensure delayPendingSplitScatter (and
similar delete/update handlers) only back off or remove a pending entry if it
still matches the same snapshot/batch that the patrol pass observed: capture an
identifier or pointer from collectTopPendingSplitScatter (e.g., the pending
entry's batchID or the pending struct pointer) and pass that expected identity
into delayPendingSplitScatter / delete handlers, then inside those functions
obtain the current c.pending[regionID] under c.pendingMu and compare the
identities before mutating; if they differ, do nothing. This prevents an older
patrol snapshot from overwriting a newer RecordSplitScatter() update for the
same region.
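A compact sketch of that guard under simplified types; `batchID` is a hypothetical identity field, and the real code might compare the captured group or item value instead.

```go
package checker

import (
	"sync"
	"time"
)

type pendingItem struct {
	batchID uint64
	retryAt time.Time
}

type controller struct {
	pendingMu sync.Mutex
	pending   map[uint64]pendingItem
}

// delayPendingSplitScatter backs off a pending entry only if it is still the
// same entry the patrol snapshot observed; a newer batch is left untouched.
func (c *controller) delayPendingSplitScatter(regionID, expectedBatchID uint64, backoff time.Duration) {
	c.pendingMu.Lock()
	defer c.pendingMu.Unlock()
	cur, ok := c.pending[regionID]
	if !ok || cur.batchID != expectedBatchID {
		return // a newer RecordSplitScatterBatch replaced the entry
	}
	cur.retryAt = time.Now().Add(backoff)
	c.pending[regionID] = cur
}
```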
📒 Files selected for processing (7)
- pkg/schedule/checker/checker_controller.go
- pkg/schedule/checker/metrics.go
- pkg/schedule/checker/split_scatter.go
- pkg/schedule/checker/split_scatter_group.go
- pkg/schedule/checker/split_scatter_test.go
- pkg/schedule/scatter/region_scatterer_test.go
- server/cluster/split_scatter_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/schedule/checker/checker_controller.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/cluster/split_scatter_test.go`:
- Around line 59-80: The test stops patrol as soon as one split child gets an
operator, allowing the other child to be never dispatched; change the wait
condition so patrol is only cancelled when both children have operators: update
the predicate passed to dispatchSplitScatterInPatrol to return true only when
cluster.GetOperatorController().GetOperator(splitRegionIDs[0]) != nil AND
cluster.GetOperatorController().GetOperator(splitRegionIDs[1]) != nil (or
alternatively assert both operators are non-nil after patrol by removing the nil
skip and failing the test if either op is nil); adjust references to
splitRegionIDs and GetOperatorController/GetOperator accordingly.
📒 Files selected for processing (4)
- pkg/schedule/checker/split_scatter.go
- pkg/schedule/checker/split_scatter_test.go
- pkg/schedule/scatter/region_scatterer_test.go
- server/cluster/split_scatter_test.go
🧹 Nitpick comments (1)
pkg/schedule/checker/split_scatter_test.go (1)
302-347: Consider de-duplicating region metadata construction in test helpers.

`putSplitScatterRegionWithStores` and `putSplitScatterRegionWithoutLeader` repeat the same region payload assembly; extracting one builder helper would reduce drift risk in future test updates.

♻️ Optional refactor sketch

```diff
+func newSplitScatterRegionMeta(regionID uint64, startKey, endKey string, peers []*metapb.Peer) *metapb.Region {
+	return &metapb.Region{
+		Id:       regionID,
+		StartKey: []byte(startKey),
+		EndKey:   []byte(endKey),
+		Peers:    peers,
+		RegionEpoch: &metapb.RegionEpoch{
+			ConfVer: 1,
+			Version: 1,
+		},
+	}
+}
+
 func putSplitScatterRegionWithStores(tc *mockcluster.Cluster, regionID uint64, startKey, endKey string, cpu uint64, stores ...uint64) {
 	peers := make([]*metapb.Peer, 0, len(stores))
 	for i, storeID := range stores {
 		peers = append(peers, &metapb.Peer{
 			Id:      regionID*10 + uint64(i) + 1,
 			StoreId: storeID,
 		})
 	}
-	region := &metapb.Region{
-		Id:       regionID,
-		StartKey: []byte(startKey),
-		EndKey:   []byte(endKey),
-		Peers:    peers,
-		RegionEpoch: &metapb.RegionEpoch{
-			ConfVer: 1,
-			Version: 1,
-		},
-	}
+	region := newSplitScatterRegionMeta(regionID, startKey, endKey, peers)
 	tc.PutRegion(core.NewRegionInfo(
 		region,
 		peers[0],
 		core.SetCPUUsage(cpu),
 	))
 }

 func putSplitScatterRegionWithoutLeader(tc *mockcluster.Cluster, regionID uint64, startKey, endKey string, cpu uint64) {
 	peers := []*metapb.Peer{
 		{Id: regionID*10 + 1, StoreId: 1},
 		{Id: regionID*10 + 2, StoreId: 2},
 		{Id: regionID*10 + 3, StoreId: 3},
 	}
-	region := &metapb.Region{
-		Id:       regionID,
-		StartKey: []byte(startKey),
-		EndKey:   []byte(endKey),
-		Peers:    peers,
-		RegionEpoch: &metapb.RegionEpoch{
-			ConfVer: 1,
-			Version: 1,
-		},
-	}
+	region := newSplitScatterRegionMeta(regionID, startKey, endKey, peers)
 	tc.PutRegion(core.NewRegionInfo(
 		region,
 		nil,
 		core.SetCPUUsage(cpu),
 	))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/checker/split_scatter_test.go` around lines 302 - 347, Both putSplitScatterRegionWithStores and putSplitScatterRegionWithoutLeader duplicate region/RegionInfo assembly; extract a shared builder helper (e.g., buildSplitScatterRegion or newRegionInfoForTest) that accepts regionID, startKey, endKey, peers []*metapb.Peer, leader *metapb.Peer, and cpu uint64, constructs the metapb.Region and returns core.NewRegionInfo(..., core.SetCPUUsage(cpu)); then refactor putSplitScatterRegionWithStores to generate peers from stores (keeping the Id/StoreId scheme) and call the new helper with peers and leader=peers[0], and refactor putSplitScatterRegionWithoutLeader to call the helper with the fixed peers and leader=nil.
📒 Files selected for processing (1)
pkg/schedule/checker/split_scatter_test.go
Signed-off-by: lhy1024 <admin@liudos.us>
15549f0 to 6af472e (Compare)
Signed-off-by: lhy1024 <admin@liudos.us>
```go
}
cli := forwardCli.getClient()
if cli != nil {
	// TODO(split-scatter): propagate split reason and support load-based split-scatter
```
This makes load-split scatter behave differently between monolithic PD and scheduling-service/NEXT_GEN deployments. If that gap is intentional for this PR, can we call it out with a feature gate or release note so operators are not surprised?
I think it's okay to add it. The reason I didn't add it before was because I was worried that this PR would be too big since it was intended as a patch.
```go
now := time.Now()
c.pendingMu.RLock()
// ...
candidates := make([]splitScatterPendingItem, 0, len(c.pending))
```
This runs on every patrol tick, but it scans the whole pending map before taking only 4 items. Under a load-split burst, that can become steady O(pending) work for up to the 3m TTL. Can we use a bounded queue keyed by retry/expire/priority, or at least cap the pending size?
In the initial version, I added a priority queue sorted by CPU, but I removed it for reasons similar to avoiding excessively high pull requests. I will adopt a simpler implementation method.
```diff
 	return nil, nil
 }
-op, err := operator.CreateScatterRegionOperator(scatterOperatorDesc, r.cluster, region, targetPeers, targetLeader, skipStoreLimit)
+op, err := operator.CreateScatterRegionOperator(desc, r.cluster, region, targetPeers, targetLeader, skipStoreLimit)
```
This internal path still creates an OpAdmin scatter operator with high priority. Since it is now triggered automatically by load splits, it may compete with regular hot/balance scheduling more aggressively than expected. Should this use a lower priority or be guarded by a config switch?
Internal scatter is a transient, one-off operation, so I think it's reasonable to give it higher priority?
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
```go
}
cli := forwardCli.getClient()
if cli != nil {
	// TODO(split-scatter): propagate split reason and support load-based split-scatter
```
This TODO is not enough to cover the user-visible deployment-mode difference. Please also state in the PR body / release note that load-split auto scatter only works on monolithic PD, and NEXT_GEN/scheduling-service is not supported yet.
```diff
 	return nil, nil
 }
-op, err := operator.CreateScatterRegionOperator(scatterOperatorDesc, r.cluster, region, targetPeers, targetLeader, skipStoreLimit)
+op, err := operator.CreateScatterRegionOperator(desc, r.cluster, region, targetPeers, targetLeader, skipStoreLimit)
```
This still creates an OpAdmin operator, but the trigger is an automatic background split-scatter flow. Please either use a non-admin kind, or add tests that pin down the kind / priority / store-limit / metrics semantics.
internal scatter operator: SchedulerKind=OpAdmin priority=High
OpAdmin is usually triggered by calling the API manually.
> OpAdmin is usually triggered by calling the API manually.
I changed the internal split-scatter path to build a non-admin scatter operator whose kind is derived from the generated steps. The public/admin scatter path still uses OpAdmin.
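To illustrate the "derive the kind from the generated steps" idea (not pd's actual operator package — the step and kind types below are hypothetical stand-ins):

```go
package main

import "fmt"

type opKind uint32

const (
	opLeader opKind = 1 << iota // operator moves a leader
	opRegion                    // operator moves peers
)

type step interface{ kind() opKind }

type transferLeader struct{}
type movePeer struct{}

func (transferLeader) kind() opKind { return opLeader }
func (movePeer) kind() opKind       { return opRegion }

// kindFromSteps ORs together the kinds of all generated steps, so an internal
// scatter operator is tagged by what it actually does instead of being forced
// to an admin kind.
func kindFromSteps(steps []step) opKind {
	var k opKind
	for _, s := range steps {
		k |= s.kind()
	}
	return k
}

func main() {
	fmt.Printf("%b\n", kindFromSteps([]step{transferLeader{}, movePeer{}})) // 11
}
```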
```go
	candidates = append(candidates, pending)
}
if len(candidates) > limit {
	candidates = candidates[:limit]
```
pending comes from a map, so truncating before sorting makes dispatch random. Please sort by (group, regionID) before truncation to make it deterministic and group-friendly.
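If it helps, a minimal sketch of that ordering under simplified types (`pendingEntry` here is hypothetical):

```go
package checker

import "sort"

type pendingEntry struct {
	group    string
	regionID uint64
}

// topPending sorts by (group, regionID) before truncating to the per-tick
// limit, so dispatch is deterministic and regions from the same split batch
// stay together.
func topPending(candidates []pendingEntry, limit int) []pendingEntry {
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].group != candidates[j].group {
			return candidates[i].group < candidates[j].group
		}
		return candidates[i].regionID < candidates[j].regionID
	})
	if len(candidates) > limit {
		candidates = candidates[:limit]
	}
	return candidates
}
```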
```go
re.NoError(cluster.processRegionHeartbeat(core.ContextTODO(), newSplitScatterRegion(splitRegionIDs[1], []byte("t"), []byte(""), 80)))

dispatchSplitScatterInPatrol(t, cluster, cancelPatrol, func() bool {
	return cluster.GetOperatorController().GetOperator(splitRegionIDs[0]) != nil ||
```
This stops once either child has an operator, and the later loop skips nil operators. The test can pass even if the other child is never scattered. Please wait for both children, or lower the test claim.
```go
if !ok {
	continue
}
if !pending.expireAt.IsZero() && !now.Before(pending.expireAt) {
```
TTL expiration is a silent skip here. Please add split_scatter_pending_expired_total so TTL/capacity-related missed scatters are observable.
```go
	}
}
if len(c.pending)+newPendingCount > splitScatterPendingLimit {
	log.Info("skip recording split scatter batch due to pending limit",
```
Dropping pending entries because of the cap should not be log-only. Please add split_scatter_pending_dropped_total, together with the expired counter, to expose the v1 capacity limit.
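For reference, a sketch of the two counters using the standard Prometheus client. The metric names come from the comments above; the namespace/subsystem and registration style are assumptions.

```go
package checker

import "github.com/prometheus/client_golang/prometheus"

var (
	// Incremented when a pending entry is skipped because its TTL elapsed.
	splitScatterPendingExpired = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "pd",
		Subsystem: "checker",
		Name:      "split_scatter_pending_expired_total",
		Help:      "Counter of pending split-scatter entries dropped by TTL expiry.",
	})
	// Incremented when a whole batch is rejected by the pending-size cap.
	splitScatterPendingDropped = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "pd",
		Subsystem: "checker",
		Name:      "split_scatter_pending_dropped_total",
		Help:      "Counter of split-scatter batches rejected by the pending-size cap.",
	})
)

func init() {
	prometheus.MustRegister(splitScatterPendingExpired, splitScatterPendingDropped)
}
```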
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us> (cherry picked from commit c706a13) Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
```go
	c.splitScatter.dispatchSplitScatterRegions()
}

func TestRecordSplitScatterBatchCollectsPendingRegions(t *testing.T)
```
Is there a test to cover that one region keeps splitting?
Signed-off-by: lhy1024 <admin@liudos.us>
36e505e to cc4a132 (Compare)
@lhy1024: The following test failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: Ref #10592
What is changed and how does it work?
This PR adds an internal split-scatter flow for newly split regions on the monolithic PD path.
Scope in this PR:
- Only batch splits with reason `SplitReason_LOAD` enter the new split-scatter flow.
- `HandleAskBatchSplit` records the source region and newly allocated region IDs into a short-lived pending state.
- … pingcap/master.

Out of scope in this PR:
The scheduling-service / NEXT_GEN gap is left as explicit TODOs in the forwarding code.
Check List
Tests
- Manual test with `9.0.0-beta.2` (28890b9736b1553a162c63373fdc052bb60c9172) on testbed (8069489523a77e2a0f5656c9d6cc78b1dc6b4564, 9147dad, `tikv-client.copr-cache.capacity-mb = 0`)
- Schedulers (`balance-leader`, `balance-region`, `balance-hot-region`, `evict-*`) and `merge-schedule-limit` set to `0`
- `UNIFIED_THRESHOLD=0.3`, single-region baseline: `1 -> 5` regions, PD created and finished `internal-scatter-region` operators
- `UNIFIED_THRESHOLD=0.2`, single-region baseline: `1 -> 33` regions, split-scatter activity concentrated in the first ~3 minutes and then converged

Release note
Summary by CodeRabbit
New Features
Improvements
Tests
Chores