-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Description
Is your feature request related to a problem? Please describe
TransportBroadcastByNodeAction
is an abstract class that runs an action on shards (shardOperation()
which is overridden by inheriting classes), and afterwards combines shard-level results on each node, and sends that NodeResponse
back to the coordinating node.
I was thinking it might be useful to have a hook that would run once per node, after all that node's shard-level operations have completed, before sending the aggregated NodeResponse
back.
This is motivated by TransportClearIndicesCacheAction
(see #19118). Today in the request cache, each invocation of shardOperation()
queues a cleanup key for that shard (marking it for removal in the next cleanup), and then immediately force cleans the node-cache, triggering a full iteration through all its keys. Since we do this once per shard on the node it can extend cleanup times with unnecessary iterating, especially if the disk tier is used (since its removal flow is much slower). It would be ideal if each shardOperation()
only queued the cleanup key for that shard, and then we iterated a single time per node by force cleaning the cache during this node-level hook. I imagine this might be useful for other actions too.
Describe the solution you'd like
We could run the hook within TransportBroadcastByNodeAction.BroadcastByNodeTransportRequestHandler.messageReceived()
, at line 481. This is right after shardOperation()
has run for each shard, and all results/exceptions have been added to the list of results, but right before creating the NodeResponse
and sending it back.
The method signature could be something like protected default nodeOperation(NodeRequest request, List<ShardOperationResult> results, List<BroadcastShardOperationFailedException> accumulatedExceptions, Task task)
. The default body would be noop. For cache clear specifically we would only need request
but the shard results + task might be useful for other implementers.
One thing to consider is exception handling. I'm inclined to have exceptions in the hook result in the failure of the node-level action, and implementers can swallow exceptions themselves in nodeOperation()
if they don't want that to happen. It seems more flexible than the other way around, and the body of nodeOperation()
may be critical enough to the whole action that we would want to surface exceptions there.
Related component
Other
Describe alternatives you've considered
For the specific cache clear problem, another solution is to just remove the force clear from the action altogether, and wait for the next scheduled cleanup to pick up all the marked keys. This would take ~1min. I think this is an acceptable solution too, but adding a node-level hook would allow us to both make cache clear more efficient and not have to wait for clearing to begin after the API is hit. The same logic applies for the field data cache which is currently being reworked.
Additional context
No response