-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Fix missing removal of query cancellation callback in QueryPhase #130279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix missing removal of query cancellation callback in QueryPhase #130279
Conversation
I think there is a high probability that the issue #123568 is related to this PR. |
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
@elasticmachine test this please |
I don't think so, the issue you link relates to the timeout mechanism built into the transport layer but searches do not use this mechanism AFAICT. This comment lists the places that might be affected. |
Hi @piergm Did I miss something? |
Hey @jessepeixoto,
No, with the command above I triggered the CI checks. I'll trigger them again, not sure why they are still pending. |
@elasticmachine test this please |
buildkite test this |
Hey @piergm I think these test failures seems to be unrelated to the code change. Those Backward Compatibility tests and the Unit one failed in CI but pass consistently when I run it locally:
|
@elasticsearchmachine run elasticsearch-ci |
@jessepeixoto Thanks for working on this. Now I have some cycles to look at this more deeply. I'll update you soon. |
@jessepeixoto Thanks for working on this. On the tests errors: elasticsearch-ci/part-1 is failing in the test you added (not sure you have access to the logs/stack trace) so here's what I see:
|
@piergm, thanks for the detailed analysis. You're absolutely right, the error occurs during the My hypothesis is the following: When fetch is intensive (for example, due to sorting or other costly operations), the leftover timeout callback from the However, when we remove the timeout callback right after the Here is an example timeline:
|
While reviewing PR #98715, I noticed that the removal of So it looks like it is not a good idea to simply revert the removal as I initially did but to keep letting timeout active for all phases, except when When partial results are allowed, I believe it means that later phases like fetch have to be fully executed and not interrupted, so the user still gets whatever results are available. In that case, only the query phase should be subject to timeout cancellation, and fetch should be allowed to complete normally. This small change below keeps the intended propagation of cancellation when
|
Hey @jessepeixoto , I understand that with your change, the fetch issue no longer manifests. Do you still get partial results though, and the Out of curiosity, without the change, how often can you reproduce the timeout in the fetch phase (hence the issue)? Out of the times that the search goes well, how often do you get partial results and how often you do not? Thanks! |
I don't believe so, the search timeout should be honoured no matter where it happens. In that case, the fetch will time out and only the hits that were fetched until then will be returned. That was the purpose of handling timeouts in #116676 . |
I spent some time reviewing the code around the timeout callback and checking whether We don't reuse search context instances across search phases. The only place where we may use the same searcher instance is Could you share more details about what led you to open this PR, and how you got to the conclusion that this was the problem? Do you perhaps have custom plugins installed in your cluster that may reuse the context/searcher instance unexpectedly? Does my explanation make sense to you? Could you perhaps share more details about your search requests? Any specific query / aggs, or just what you had above, with an expensive sort? With the expensive sort though, perhaps we should rather look at why that does not time out during query phase, where the sorting actually happens. It may help to share a profile output of your query to better understand. |
@javanna, thanks for the detailed inputs.
Yes, I get partial results and the
On November 21, 2024, I upgraded our elasticsearch cluster from
Reviewing the changes in that release
From what I’ve seen, the issue could be in
In this scenario, I think the same search context instance is reused from the query phase into the fetch phase, so any timeout callback registered during the query phase is still active. I checked this by adding logs and observing the following sequence: 🟢️ QueryPhase.execute starts Perhaps not in the ideal way, but these changes prevent timeouts from firing during the fetch phase. Note: Running the exact same query on the same data, but with an index that has more than one shard, does not reproduce the issue, so the problem only occurs with a single-shard index. Does that makes sense? |
By the way, as I realized during this last investigation that the issue only occurs when the index has only one primary shard, I’ll update the affected indices to have more than one as a quick workaround. 🙂 |
Thanks a lot for the additional details @jessepeixoto ! I understand what is happening! The single shard is definitely what triggers the issue because in that case we query and fetch at once, given no reduction is needed on the coordinating node. I completely overlooked that codepath :) In that case we do reuse the same search context instance hence I understand why your fix does address the issue for you. But there's more. The timeout and cancellation checks are performed using the same mechanisms internally, via a callback that periodically checks 1) if the query timed out 2) if the search has been cancelled . This is a bit hard to follow and there's some subtleties in that we use cancellation and timeout interchangeably in the code, but the two are actually two different things built on the same low level mechanism. I see now that the fetch phase still does not honour the search timeout, which goes back to your previous comment, and that makes me reconsider the answer I posted above:
indeed, if the query phase times out, and we time out fetch execution too, we won't be able to return partial top hits but only partial aggs results. That would be a problem. That is why we don't register the timeout callback to the searcher of the fetch phase. Yet in your specific scenario, the fetch phase will get it because it has not been cleaned after the query phase. That is where your proposed fix helps in your case. The fetch phase gets otherwise only interrupted if the search was cancelled and it should not get interrupted because of a search timeout. The change I made in #116676 added handling of a timeout error in the fetch phase , but in reality the only reason why we'd get that is a search cancellation as opposed to a search timeout. This is where #130071 comes in: if the search has been cancelled, we should stop the fetch phase, but returning partial results from it will cause inconsistencies in the coordinating node merging responses (the error you were getting). This still needs fixing regardless of whether we remove the query cancellation check for the single shard query and fetch. My conclusion is that we need both fixes:
Thanks for your patience here, it has taken a bit to figure out what's going on, but I am happy we did. And thanks a lot for your investigation, a really good catch about the timeout callback not removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of comments, thanks a lot for looking into this and opening your PR!
* preserving the cross-phase timeout propagation introduced in PR #98715. | ||
*/ | ||
if (searchContext.request().allowPartialSearchResults() && timeoutRunnable != null) { | ||
searcher.removeQueryCancellation(timeoutRunnable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is not wrong, but perhaps incomplete: we duplicate the low level cancellation runnable registered in DefaultSearchContext#preProcess
as well, so I think that we should rather clear all the cancellation checks.
One way to do that would be to recreate the searcher before executing the fetch phase for the single shard case, but I see that requires quite some changes, as we don't want to rebuild the entire search context which is a rather heavy object.
Another way would be to call the existing ContextIndexSearcher#close
method, but reusing the searcher after closing it sounds like an anti-pattern, although it would work in this case (relying on the fact that close only clears the cancellation runnables).
Maybe a better way would be to replace the current removeQueryCancellation
method with a removeQueryCancellations
that clears them all like close does. I would still call it tough only where needed, meaning in the only place where we effectively reuse the context/searcher. Otherwise, it is not evident why we need to do so in query phase as opposed to other places.
I prefer this explicit treatment before executing fetch for the single shard scenario, because it addresses the edge case, and I am not sure we want to handle it in a generic manner by removing cancellation checks where searchers are normally not reused across phases.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I don't think we need the conditional based on allowPartialSearchResults
. If we do not allow partial search results, a hard error will be thrown at the end of the query phase before doing fetch if there's a timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer this explicit treatment before executing fetch for the single shard scenario, because it addresses the edge case, and I am not sure we want to handle it in a generic manner by removing cancellation checks where searchers are normally not reused across phases.
I updated the code with those changes.
By the way, I don't think we need the conditional based on allowPartialSearchResults. If we do not allow partial search results, a hard error will be thrown at the end of the query phase before doing fetch if there's a timeout.
Absolutely, I completely agree!
I believe the latest changes address the feedback, but let me know if you'd like anything refined further.
QueryPhase.executeQuery(ctx); | ||
assertNotNull("callback should be registered", searcher.added); | ||
} | ||
assertFalse("callback must stay registered for later phases", searcher.removed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could also have checked that there are no cancellation runnables using hasCancellations
? Would that remove the need for the TrackingSearcher
? Anyways, I believe if you follow my guidance in the other comment, I think that this will have to become more of a fetch phase test than a query phase test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the previous test because, with the current changes, the logic now lives in SearchService
, not in the query phase anymore.
I wasn't sure where exactly to test whether context.searcher().removeQueryCancellations()
is called, maybe in SearchServiceSingleNodeTests
, but I wasn't confident about the best approach.
Happy to add a test with some guidance, or of course feel free to adjust the PR directly if you'd like.
Hi @javanna,
That distinction really helped me to see the full picture of the problem.
I see and agree! In my case, the I've updated the code based on your feedback and added a couple of comments. Thanks again for the review and all the insights! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another review and left new comments. This is going in the right direction.
As for testing, I would suggest adding some test to SearchTimeoutIT
. I would test two separate scenarios that are currently missing:
- fetch should never timeout, which is what you fix addresses
- fetch still reacts gracefully to search cancellations
These are not simple tests to write, I can help writing those if needed.
@@ -907,7 +907,9 @@ private SearchPhaseResult executeQueryPhase(ShardSearchRequest request, Cancella | |||
tracer.stopTrace(task); | |||
} | |||
if (request.numberOfShards() == 1 && (request.source() == null || request.source().rankBuilder() == null)) { | |||
// we already have query results, but we can run fetch at the same time | |||
// in this case, we reuse search context across search and fetch phases. | |||
// so we need to remove all cancelation callbacks from query phase before starting fetch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you leave the initial comment, and perhaps rewrite the new part to something like
in this case we reuse the search context across search and fetch phase, hence we need to clear the cancellation checks that were applied by the query phase before running fetch. Note that the timeout checks are not applied to the fetch phase, while the cancellation checks are.
@@ -187,6 +187,11 @@ public void close() { | |||
this.cancellable.clear(); | |||
} | |||
|
|||
// remove all registered cancellation callbacks to prevent them from leaking into other phases | |||
public void removeQueryCancellations() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have second thoughts on the naming I picked, sorry! Should we go with clearQueryCancellations
?
// we already have query results, but we can run fetch at the same time | ||
// in this case, we reuse search context across search and fetch phases. | ||
// so we need to remove all cancelation callbacks from query phase before starting fetch. | ||
context.searcher().removeQueryCancellations(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two main scenarios where we get into this codepath:
- multiple shards: new search context just created,
#preProcess
was called, the cancellation checks are added - single shard: search context being reused, we need to clean up the timeout check but not the cancellation checks. Alternatively, we can clear it all up and then add back the cancellation check only.
We are good for the first scenario.
I think that in the second case, calling preProcess
again may fix it but other cause other side effects. Perhaps we should restore the cancellation check as follows after the removal:
if (context.lowLevelCancellation()) {
context.searcher().addQueryCancellation(() -> {
if (task != null) {
task.ensureNotCancelled();
}
});
}
I still prefer this over selectively removing only the timeout check, although it's not super clear. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this solution, it addresses the problem in a focused way, and the comment makes it very clear.
Regarding the tests, I'd be happy to give it a try, but since they involve deeper internals, feel free to take that part if you prefer, I'd really appreciate it, and it is also a good chance to understand this part better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, I will take care of the harder bits of testing, going to open a PR that effectively removes timeout handling in the fetch phase, because it's not necessary. Will add the tests directly there.
Perhaps one thing that could be added is a small unit test that verifies clearQueryCancellations
does what it's supposed to do. That would fit in ContextIndexSearcherTests
, is this something you'd like adding?
I added the clear query cancellations test using |
Description
This PR aims to address the issue #130071. The cause seems to be the timout cancellation callback registered by
QueryPhase
viaaddQueryCancellation
is not removed after query phase. If the callback remains, it may interfere with other phases by triggering unintended timeouts or cancellations.Looking at the history, this behavior may have been introduced by #98715 in
v8.11.0
, which removed afinally
block inQueryPhase
that previously handled callback cleanup. Reintroducing the cleanup logic appears to resolve the issue and ensures predictable behavior.Steps to Reproduce
To reproduce the issue, the following setup is required:
allow_partial_search_results: true
timeout: "1ms"
)Example query:
Expected (patched behavior):
Actual (buggy behavior):
In affected versions, running the query multiple times may eventually trigger: