Skip to content

Conversation

ywangd
Copy link
Member

@ywangd ywangd commented Aug 27, 2025

If a shard starts on the target node before the next ClusterInfo polling, today we don't include it for the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in simulation. The tracking is reset when a new ClusterInfo arrives.

Resolves: ES-12723

@ywangd ywangd added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v9.2.0 labels Aug 27, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

@ywangd
Copy link
Member Author

ywangd commented Aug 27, 2025

I had some back-and-forth with the way to track shardStarted events. At the end, I decided to do it with mostly DesiredBalanceComputer since (1) it is the only place where it is needed and (2) less wiring changes compared to tracking inside InternalClusterInfoService. I am raising it as a draft to seek agreement on the approach. I will work on more tests if we are OK to proceed or I can take a different approach if folks are not happy with the current one. Thanks!

Comment on lines +297 to +306
// Check whether the shard has actually started on the target node
final var startedShard = routingNodes.assignedShards(shardForSimulation.shardId())
.stream()
.filter(
shard -> shard.started()
&& shard.primary() == shardForSimulation.primary()
&& shard.currentNodeId().equals(shardForSimulation.currentNodeId())
)
.findFirst()
.orElse(null);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might feel a bit hacky since normally we should check allocationId to be certain. But simulation and actual shard start event have different allocationId and cannot be compared. Since this check is in a tight loop of balance computation and tracking is reset every ClusterInfo polling, it seems sufficient to rely on other properties, e.g. no two copies of the same shard can be allocated on the same node. There could be some fuzziness here in edge cases. But I think they are not really concerning atm. Happy to take advice.

Comment on lines +121 to +125
// A new ClusterInfo has arrived, clear the tracking for started shards
if (lastClusterInfo != desiredBalanceInput.routingAllocation().clusterInfo()) {
lastClusterInfo = desiredBalanceInput.routingAllocation().clusterInfo();
shardsStartedByAllocate = new HashMap<>();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one issue with this approach is that a shard allocated in last polling of ClusterInfo but started in this polling cycle is not accounted for, i.e. the following sequence of events:

  1. 1st CluserInfo poll
  2. Allocator moves a shard, but it has not started on the target node
  3. 2nd ClusterInfo poll
  4. The shard starts on the target node

I think we want the started shard contributing to the simulations performed in the 2nd ClusterInfo polling cycle?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that not be simulated by the shard started done in the beginning of compute?

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the approach but I think it adds some complexity and state to an already quite complex/stateful bit of code

shardsStartedByAllocate = new HashMap<>();
}

final var alreadySimulatedStartedShards = new HashSet<ShardForSimulation>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we ask the simulator for the "alreadySimulatedShards" instead of tracking them alongside the simulateShardStarted calls? or am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean tracking it inside ClusterInfoSimulator? Yes that's possible. It does not do that currently but we can make it do so.

@ywangd
Copy link
Member Author

ywangd commented Aug 28, 2025

it adds some complexity and state to an already quite complex/stateful bit of code

I think it will have to add some complexity. But if we track the real shard started events, the complexity might be a bit less in DesiredBalanceComputer. I am thinking switching to that also because of this comment

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few initial comments, did not get into the weeds of the started simulations yet

return currentClusterInfo;
}

private void updateAndGetCurrentClusterInfo() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method name here hints that it should return the cluster info? That would seem nice to do, but I'd also be fine to just call it updateClusterInfo

Comment on lines +121 to +125
// A new ClusterInfo has arrived, clear the tracking for started shards
if (lastClusterInfo != desiredBalanceInput.routingAllocation().clusterInfo()) {
lastClusterInfo = desiredBalanceInput.routingAllocation().clusterInfo();
shardsStartedByAllocate = new HashMap<>();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that not be simulated by the shard started done in the beginning of compute?

@@ -77,6 +80,8 @@ public class DesiredBalanceComputer {
private long lastConvergedTimeMillis;
private long lastNotConvergedLogMessageTimeMillis;
private Level convergenceLogMsgLevel;
private ClusterInfo lastClusterInfo;
private Map<ShardForSimulation, ShardRouting> shardsStartedByAllocate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we see this as input instead? I think I prefer to track this outside this class rather than being this stateful here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could even reside on or be accessible through the ClusterInfo object?

@ywangd
Copy link
Member Author

ywangd commented Aug 28, 2025

I got some new idea after talking to Henning. I'll rework this PR. Please hold on your reviews. Thanks! 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants