Fix missing removal of query cancellation callback in QueryPhase #130279

Open
wants to merge 10 commits into main

Conversation

jessepeixoto
Contributor

@jessepeixoto jessepeixoto commented Jun 28, 2025

Description

This PR aims to address issue #130071. The cause seems to be that the timeout cancellation callback registered by QueryPhase via addQueryCancellation is not removed after the query phase. If the callback remains, it may interfere with other phases by triggering unintended timeouts or cancellations.

Looking at the history, this behavior may have been introduced by #98715 in v8.11.0, which removed a finally block in QueryPhase that previously handled callback cleanup. Reintroducing the cleanup logic appears to resolve the issue and ensures predictable behavior.
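
For illustration, here is a minimal toy sketch of the cleanup pattern being reintroduced. This is not the actual QueryPhase code: the list below merely stands in for the searcher's registered cancellation runnables, and all names are invented for the sketch.

import java.util.ArrayList;
import java.util.List;

// Toy model of the fix (not the real QueryPhase code): the timeout check is a
// Runnable registered for the duration of the query phase and removed in a
// finally block, so it cannot keep firing in later phases that reuse the searcher.
public class TimeoutCallbackCleanupToy {
    static final List<Runnable> registeredCancellations = new ArrayList<>();

    public static void main(String[] args) {
        Runnable timeoutRunnable = () -> { /* would throw once the search timeout elapses */ };
        registeredCancellations.add(timeoutRunnable);        // stands in for addQueryCancellation
        try {
            // ... the query phase search would run here, periodically invoking the callbacks ...
        } finally {
            registeredCancellations.remove(timeoutRunnable); // the cleanup this PR restores
        }
        System.out.println("callbacks left after the query phase: " + registeredCancellations.size()); // 0
    }
}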

Steps to Reproduce

To reproduce the issue, the following setup is required:

  • An index with a reasonably large data volume (≥ 30GB);
  • A search query that includes:
    • allow_partial_search_results: true
    • A very low timeout value to stress the system (e.g., timeout: "1ms")
    • A sort clause that increases fetch-phase cost (e.g., sort by _score followed by a high-cardinality field)

Example query:

GET my-heavy-index/_search?allow_partial_search_results=true
{
  "size": 1,
  "timeout": "1ms",
  "query": {
    "match_all": {}
  },
  "sort": [
    "_score",
    {
      "high_cardinality_field": "asc"
    }
  ]
}

Expected (patched behavior):

{
  "timed_out": true,
  "_shards": {
    "failed": 0
  },
  "hits": {
    "hits": [ ... ]
  }
}

Actual (buggy behavior):
In affected versions, running the query multiple times may eventually trigger:

{
  "error": {
    "type": "search_phase_execution_exception",
    "phase": "fetch",
    ...
  },
  "status": 500
}

@elasticsearchmachine elasticsearchmachine added v9.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jun 28, 2025
@jessepeixoto jessepeixoto marked this pull request as ready for review June 30, 2025 12:04
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 30, 2025
@jessepeixoto
Contributor Author

I think there is a high probability that the issue #123568 is related to this PR.

@piergm piergm self-assigned this Jun 30, 2025
@piergm piergm added :Search Foundations/Search Catch all for Search Foundations and removed needs:triage Requires assignment of a team area label labels Jun 30, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jun 30, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@piergm piergm added v8.18.4 v9.1.1 v8.19.1 v9.0.4 auto-backport Automatically create backport pull requests when merged labels Jun 30, 2025
@piergm piergm added the >bug label Jun 30, 2025
@piergm
Member

piergm commented Jun 30, 2025

@elasticmachine test this please

@DaveCTurner
Contributor

I think there is a high probability that the issue #123568 is related to this PR.

I don't think so; the issue you link relates to the timeout mechanism built into the transport layer, but searches do not use this mechanism AFAICT. This comment lists the places that might be affected.

@jessepeixoto
Contributor Author

Hi @piergm

Did I miss something?
I see there are two pending checks, but I'm not sure how to run them, could you please advise?

@piergm
Member

piergm commented Jul 2, 2025

Hey @jessepeixoto,

Did I miss something?

No, with the command above I triggered the CI checks. I'll trigger them again, not sure why they are still pending.

@piergm
Member

piergm commented Jul 2, 2025

@elasticmachine test this please

@piergm
Member

piergm commented Jul 2, 2025

buildkite test this

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 3, 2025

Hey @piergm

I think these test failures seem to be unrelated to the code change.

The following backward compatibility tests and the unit test failed in CI but pass consistently when I run them locally:

  • bwcTestPart3: :x-pack:plugin:sql:qa:jdbc:single-node:v8.17.9#bwcTest
  • mixedClusterTest: :qa:mixed-cluster:v9.0.4#mixedClusterTest
  • BucketedSortForIntsTests.testManyBucketsManyHits: :server:test
    • org.elasticsearch.search.sort.BucketedSortForIntsTests.testManyBucketsManyHits

@jessepeixoto
Contributor Author

@elasticsearchmachine run elasticsearch-ci

@piergm
Member

piergm commented Jul 7, 2025

@jessepeixoto Thanks for working on this. Now I have some cycles to look at this more deeply. I'll update you soon.

@piergm
Member

piergm commented Jul 8, 2025

@jessepeixoto Thanks for working on this.
The PR looks to solve a different bug from what we see in #130071. In the linked issue we can see from the logs that the error occurred during the fetch phase (Failed to execute phase [fetch]) while the fix you are proposing is on the query phase, therefore even if this PR gets merged we would sill see the Failed to execute phase [fetch] error.
The problem we are seeing for #130071 is that the part of the code responsible for merging the hits after we get all the documents from the fetch phase expects either the exact number of documents requested by the coordinating node or a timeout exception (if allow_partial_search_results=false).
The bug reported here (#130071) happens because during the fetch phase in the data node there is a check for the timeout that, if we are over the time limit set, would either throw an exception (with allow_partial_search_results=false) or return an empty (or partial) response (with allow_partial_search_results=true).
When merging the hits from the different shards if we find an empty response while expecting a non-empty array we get the exception we see in the issue linked: ArrayIndexOutOfBoundsException .
What we would have to handle for the issue #130071 is originates in FetchPhaseDocsIterator.java, where we return an EMPTY SearchHits that is not expected during this merge phase.
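
To make the failure mode concrete, here is a toy sketch (not the actual Elasticsearch merge code; all names are invented) of how an unexpectedly empty per-shard fetch result can turn into an ArrayIndexOutOfBoundsException when the coordinating side assumes every requested document came back:

// Toy illustration only: the merge step walks the hits it expects from each
// shard; an empty array from a shard that timed out mid-fetch makes the index
// run past the end, mirroring the ArrayIndexOutOfBoundsException in #130071.
public class FetchMergeToy {
    public static void main(String[] args) {
        int[] expectedPerShard = {2, 2};                  // docs the coordinator asked each shard to fetch
        String[][] fetchedPerShard = {
            {"doc0", "doc1"},                             // shard 0 returned everything
            {}                                            // shard 1 hit the timeout and returned nothing
        };
        for (int shard = 0; shard < expectedPerShard.length; shard++) {
            for (int i = 0; i < expectedPerShard[shard]; i++) {
                System.out.println(fetchedPerShard[shard][i]); // throws for shard 1, i = 0
            }
        }
    }
}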


On the test errors: elasticsearch-ci/part-1 is failing in the test you added (not sure whether you have access to the logs/stack trace), so here's what I see:


java.lang.AssertionError:
Expected: an empty collection
     but: <[LEAK: resource was not cleaned up before it was garbage-collected.
Recent access records:
Created at:
in [TEST-QueryPhaseTimeoutTests.testCancellationCallbackRemoved-seed#[B55532AB704984B4]][testCancellationCallbackRemoved]
org.elasticsearch.search.internal.SearchContext.<init>(SearchContext.java:78)
org.elasticsearch.test.TestSearchContext.<init>(TestSearchContext.java:110)
org.elasticsearch.test.TestSearchContext.<init>(TestSearchContext.java:102)
org.elasticsearch.search.query.QueryPhaseTimeoutTests$6.<init>(QueryPhaseTimeoutTests.java:494)
org.elasticsearch.search.query.QueryPhaseTimeoutTests.testCancellationCallbackRemoved(QueryPhaseTimeoutTests.java:494)
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
java.base/java.lang.reflect.Method.invoke(Method.java:565)
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1763)
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
org.junit.rules.RunRules.evaluate(RunRules.java:20)
org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
org.junit.rules.RunRules.evaluate(RunRules.java:20)
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
org.junit.rules.RunRules.evaluate(RunRules.java:20)
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
java.base/java.lang.Thread.run(Thread.java:1447)]>

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 9, 2025

@piergm, thanks for the detailed analysis. You're absolutely right: the error occurs during the FetchPhase, while the proposed fix is applied in the QueryPhase. Interestingly though, once this fix is in place, the error no longer occurs.

My hypothesis is the following:

When fetch is intensive (for example, due to sorting or other costly operations), the leftover timeout callback from the QueryPhase can still be active and trigger an unexpected cancellation during fetch. This causes the FetchPhase to return an empty or incomplete SearchHit[], which then leads to a crash.

However, when we remove the timeout callback right after the QueryPhase, the FetchPhase is no longer exposed to this late interruption, and the error disappears.

Here is an example timeline:

  • timeout = 150ms
  • allowPartialSearchResults = true
  • 0ms – 100ms: Query phase completes successfully
  • 100ms – 200ms: Fetch phase 💥 timeout fires at 150ms → fetch interrupted → crash

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 9, 2025

While reviewing PR #98715, I noticed that the removal of the removeQueryCancellation(...) block from the QueryPhase seems intentional, to allow the timeout cancellation to propagate not only to the query phase but also to later phases like fetch, rescore, and suggest.

So it looks like it is not a good idea to simply revert the removal, as I initially did; instead, the timeout should stay active for all phases, except when allowPartialSearchResults == true.

When partial results are allowed, I believe it means that later phases like fetch have to be fully executed and not interrupted, so the user still gets whatever results are available. In that case, only the query phase should be subject to timeout cancellation, and fetch should be allowed to complete normally.

This small change below keeps the intended propagation of cancellation when allowPartialSearchResults == false, but removes it when allowPartialSearchResults == true:

} finally {
    if (searchContext.request().allowPartialSearchResults() && timeoutRunnable != null) {
        searcher.removeQueryCancellation(timeoutRunnable);
    }
}

@javanna
Member

javanna commented Jul 10, 2025

Hey @jessepeixoto ,
I wanted to point out that sorting isn't heavy for the fetch phase. Fetch happens once all top hits have been identified.

I understand that with your change, the fetch issue no longer manifests. Do you still get partial results though, and the timed_out flag set to true? I wonder if the timeout happens earlier, or it no longer happens.

Out of curiosity, without the change, how often can you reproduce the timeout in the fetch phase (hence the issue)? Out of the times that the search goes well, how often do you get partial results and how often do you not?

Thanks!

@javanna
Member

javanna commented Jul 10, 2025

When partial results are allowed, I believe it means that later phases like fetch have to be fully executed and not interrupted

I don't believe so; the search timeout should be honoured no matter where it happens. In that case, the fetch will time out and only the hits that were fetched until then will be returned. That was the purpose of handling timeouts in #116676.

@javanna
Member

javanna commented Jul 10, 2025

I spent some time reviewing the code around the timeout callback and checking whether removeQueryCancellation should be called in a finally block. My assessment is that the removeQueryCancellation method should have been removed when we stopped calling it in prod code; I need to check if tests still need it.

We don't reuse search context instances across search phases. The only place where we may use the same searcher instance is between AggregationPhase and QueryPhase, where the latter calls the former. Yet the query phase is the last bit that happens, and that searcher, where you are restoring the remove call, is no longer reused after that. We do keep readers around for the fetch phase after the query phase, but the searcher instance is recreated out of those reader contexts, hence the query cancellation callback starts empty again and gets configured according to the provided timeout, just as at the beginning of the query phase.

Could you share more details about what led you to open this PR, and how you got to the conclusion that this was the problem? Do you perhaps have custom plugins installed in your cluster that may reuse the context/searcher instance unexpectedly? Does my explanation make sense to you?

Could you perhaps share more details about your search requests? Any specific query / aggs, or just what you had above, with an expensive sort? With the expensive sort though, perhaps we should rather look at why that does not time out during query phase, where the sorting actually happens. It may help to share a profile output of your query to better understand.

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 11, 2025

@javanna, thanks for the detailed inputs.

I understand that with your change, the fetch issue no longer manifests. Do you still get partial results though, and the timed_out flag set to true? I wonder if the timeout happens earlier, or it no longer happens.

Yes, I get partial results and the timed_out flag is set to true

Could you share more details about what led you to open this PR

On November 21, 2024, I upgraded our elasticsearch cluster from v8.10.4 to v8.15.2. Immediately after the upgrade, I started seeing timeout errors in about 1% of search requests. To find the cause, I tested several intermediate versions between v8.10.4 and v8.15.2 and confirmed the issue was introduced in v8.11.0.

and how you got to the conclusion that this was the problem?

Reviewing the changes in that release v8.11.0, especially in this PR #98715, I saw that the finally block that removed the query cancellation callback had been removed. After reintroducing that block, the issue no longer reproduced, partial results were returned correctly, and the errors stopped.

We don't reuse search context instances across search phases.

From what I’ve seen, the issue could be in SearchService.java when the index has only one shard, which is exactly my scenario.

 if (request.numberOfShards() == 1 && (request.source() == null || request.source().rankBuilder() == null)) {
    // we already have query results, but we can run fetch at the same time
    context.addFetchResult();
    return executeFetchPhase(readerContext, context, afterQueryTime);
 }

In this scenario, I think the same search context instance is reused from the query phase into the fetch phase, so any timeout callback registered during the query phase is still active. I checked this by adding logs and observing the following sequence:

🟢️ QueryPhase.execute starts
🔥 TIMEOUT FIRED during [query] phase
🟡️ QueryPhase.execute finishes
🟢️ FetchPhase.execute starts
🔥 TIMEOUT FIRED during [fetch] phase
🟡️ FetchPhase.execute finishes

Perhaps not in the ideal way, but these changes prevent timeouts from firing during the fetch phase.

Note: Running the exact same query on the same data, but with an index that has more than one shard, does not reproduce the issue, so the problem only occurs with a single-shard index.

Does that make sense?

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 11, 2025

By the way, since I realized during this last investigation that the issue only occurs when the index has a single primary shard, I'll update the affected indices to have more than one as a quick workaround. 🙂

@javanna
Member

javanna commented Jul 11, 2025

Thanks a lot for the additional details @jessepeixoto ! I understand what is happening! The single shard is definitely what triggers the issue because in that case we query and fetch at once, given no reduction is needed on the coordinating node. I completely overlooked that codepath :)

In that case we do reuse the same search context instance, hence I understand why your fix addresses the issue for you. But there's more. The timeout and cancellation checks are performed using the same mechanism internally, via a callback that periodically checks 1) whether the query timed out and 2) whether the search has been cancelled. This is a bit hard to follow and there are some subtleties, in that we use cancellation and timeout interchangeably in the code, but the two are actually different things built on the same low-level mechanism.
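
As a rough illustration of that shared mechanism, here is a self-contained toy model (names and structure invented for this sketch, not the actual Elasticsearch classes): both checks are plain runnables invoked by the same periodic check, and either one can abort the search by throwing.

import java.util.ArrayList;
import java.util.List;

// Toy model: timeout and cancellation are different concerns, but both are just
// runnables invoked by the same periodic check during collection; whichever
// condition trips first throws and stops the search early.
public class SharedCancellationMechanismToy {
    static final List<Runnable> cancellations = new ArrayList<>();

    static void checkCancelled() {                                  // invoked periodically while collecting docs
        for (Runnable check : cancellations) {
            check.run();
        }
    }

    public static void main(String[] args) {
        long deadlineNanos = System.nanoTime() + 150_000_000L;      // a 150ms "search timeout"
        boolean[] taskCancelled = {false};                          // stands in for the search task state
        cancellations.add(() -> {                                   // timeout check
            if (System.nanoTime() > deadlineNanos) throw new RuntimeException("time exceeded");
        });
        cancellations.add(() -> {                                   // task cancellation check
            if (taskCancelled[0]) throw new RuntimeException("task cancelled");
        });
        checkCancelled();                                           // passes while neither condition holds
    }
}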

I see now that the fetch phase still does not honour the search timeout, which goes back to your previous comment, and that makes me reconsider the answer I posted above:

When partial results are allowed, I believe it means that later phases like fetch have to be fully executed and not interrupted

Indeed, if the query phase times out and we time out fetch execution too, we won't be able to return partial top hits but only partial aggs results. That would be a problem. That is why we don't register the timeout callback with the searcher of the fetch phase. Yet in your specific scenario, the fetch phase still gets it because it was not cleaned up after the query phase. That is where your proposed fix helps in your case. Otherwise, the fetch phase only gets interrupted if the search was cancelled; it should not get interrupted because of a search timeout.

The change I made in #116676 added handling of a timeout error in the fetch phase, but in reality the only reason we'd get that is a search cancellation as opposed to a search timeout. This is where #130071 comes in: if the search has been cancelled, we should stop the fetch phase, but returning partial results from it will cause inconsistencies when the coordinating node merges responses (the error you were getting). This still needs fixing regardless of whether we remove the query cancellation check for the single-shard query-and-fetch case.

My conclusion is that we need both fixes:

  • don't apply the search timeout to the fetch phase: if the query phase times out, we go and fetch what the top hits were until then, the fetch phase should not time out. I will review your PR and tests.
  • if a search gets cancelled, handle the inconsistency between expected results and fetched results in the coordinating node as opposed to blowing up like we do now (ArrayIndexOutOfBoundsException for timeouts during the fetch phase #130071). We would not have seen this in your case without the bug you found and proposed a fix for.

Thanks for your patience here, it has taken a bit to figure out what's going on, but I am happy we did. And thanks a lot for your investigation, a really good catch about the timeout callback not removed.

Member

@javanna javanna left a comment

I left a couple of comments, thanks a lot for looking into this and opening your PR!

* preserving the cross-phase timeout propagation introduced in PR #98715.
*/
if (searchContext.request().allowPartialSearchResults() && timeoutRunnable != null) {
searcher.removeQueryCancellation(timeoutRunnable);
Member

I think that this is not wrong, but perhaps incomplete: we duplicate the low level cancellation runnable registered in DefaultSearchContext#preProcess as well, so I think that we should rather clear all the cancellation checks.

One way to do that would be to recreate the searcher before executing the fetch phase for the single shard case, but I see that requires quite some changes, as we don't want to rebuild the entire search context which is a rather heavy object.

Another way would be to call the existing ContextIndexSearcher#close method, but reusing the searcher after closing it sounds like an anti-pattern, although it would work in this case (relying on the fact that close only clears the cancellation runnables).

Maybe a better way would be to replace the current removeQueryCancellation method with a removeQueryCancellations that clears them all like close does. I would still call it though only where needed, meaning in the only place where we effectively reuse the context/searcher. Otherwise, it is not evident why we need to do so in the query phase as opposed to other places.

I prefer this explicit treatment before executing fetch for the single shard scenario, because it addresses the edge case, and I am not sure we want to handle it in a generic manner by removing cancellation checks where searchers are normally not reused across phases.

What do you think?

Member

By the way, I don't think we need the conditional based on allowPartialSearchResults. If we do not allow partial search results, a hard error will be thrown at the end of the query phase before doing fetch if there's a timeout.

Contributor Author

I prefer this explicit treatment before executing fetch for the single shard scenario, because it addresses the edge case, and I am not sure we want to handle it in a generic manner by removing cancellation checks where searchers are normally not reused across phases.

I updated the code with those changes.

By the way, I don't think we need the conditional based on allowPartialSearchResults. If we do not allow partial search results, a hard error will be thrown at the end of the query phase before doing fetch if there's a timeout.

Absolutely, I completely agree!

I believe the latest changes address the feedback, but let me know if you'd like anything refined further.

QueryPhase.executeQuery(ctx);
assertNotNull("callback should be registered", searcher.added);
}
assertFalse("callback must stay registered for later phases", searcher.removed);
Member

You could also have checked that there are no cancellation runnables using hasCancellations? Would that remove the need for the TrackingSearcher? Anyway, if you follow my guidance in the other comment, I think this will have to become more of a fetch phase test than a query phase test.

Contributor Author

I removed the previous test because, with the current changes, the logic now lives in SearchService, not in the query phase anymore.

I wasn't sure where exactly to test whether context.searcher().removeQueryCancellations() is called, maybe in SearchServiceSingleNodeTests, but I wasn't confident about the best approach.

Happy to add a test with some guidance, or of course feel free to adjust the PR directly if you'd like.

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 13, 2025

Hi @javanna,
Thanks a lot for the very clear review, I agree with all the points you raised.

The timeout and cancellation checks are performed using the same mechanism internally, via a callback that periodically checks 1) whether the query timed out and 2) whether the search has been cancelled. This is a bit hard to follow and there are some subtleties, in that we use cancellation and timeout interchangeably in the code, but the two are actually different things built on the same low-level mechanism.

That distinction really helped me to see the full picture of the problem.

My conclusion is that we need both fixes:

  • don't apply the search timeout to the fetch phase: if the query phase times out, we go and fetch what the top hits were until then, the fetch phase should not time out. I will review your PR and tests.
  • if a search gets cancelled, handle the inconsistency between expected results and fetched results in the coordinating node as opposed to blowing up like we do now (ArrayIndexOutOfBoundsException for timeouts during the fetch phase #130071). We would not have seen this in your case without the bug you found and proposed a fix for.

I see and agree! In my case, the ArrayIndexOutOfBoundsException was always caused by timeout-based cancellations. Switching from 1 to 2 primaries made it go away. :)

I've updated the code based on your feedback and added a couple of comments.

Thanks again for the review and all the insights!

Member

@javanna javanna left a comment

I did another review and left new comments. This is going in the right direction.

As for testing, I would suggest adding some test to SearchTimeoutIT. I would test two separate scenarios that are currently missing:

  1. fetch should never time out, which is what your fix addresses
  2. fetch still reacts gracefully to search cancellations

These are not simple tests to write, I can help writing those if needed.

@@ -907,7 +907,9 @@ private SearchPhaseResult executeQueryPhase(ShardSearchRequest request, Cancella
tracer.stopTrace(task);
}
if (request.numberOfShards() == 1 && (request.source() == null || request.source().rankBuilder() == null)) {
// we already have query results, but we can run fetch at the same time
// in this case, we reuse search context across search and fetch phases.
// so we need to remove all cancelation callbacks from query phase before starting fetch.
Member

could you leave the initial comment, and perhaps rewrite the new part to something like

in this case we reuse the search context across search and fetch phase, hence we need to clear the cancellation checks that were applied by the query phase before running fetch. Note that the timeout checks are not applied to the fetch phase, while the cancellation checks are.

@@ -187,6 +187,11 @@ public void close() {
this.cancellable.clear();
}

// remove all registered cancellation callbacks to prevent them from leaking into other phases
public void removeQueryCancellations() {
Member

I have second thoughts on the naming I picked, sorry! Should we go with clearQueryCancellations ?

// we already have query results, but we can run fetch at the same time
// in this case, we reuse search context across search and fetch phases.
// so we need to remove all cancelation callbacks from query phase before starting fetch.
context.searcher().removeQueryCancellations();
Member

There are two main scenarios where we get into this codepath:

  1. multiple shards: new search context just created, #preProcess was called, the cancellation checks are added
  2. single shard: search context being reused, we need to clean up the timeout check but not the cancellation checks. Alternatively, we can clear it all up and then add back the cancellation check only.

We are good for the first scenario.

I think that in the second case, calling preProcess again may fix it but could cause other side effects. Perhaps we should restore the cancellation check as follows after the removal:

if (context.lowLevelCancellation()) {
    context.searcher().addQueryCancellation(() -> {
        if (task != null) {
            task.ensureNotCancelled();
        }
    });
}

I still prefer this over selectively removing only the timeout check, although it's not super clear. What do you think?

Contributor Author

I like this solution, it addresses the problem in a focused way, and the comment makes it very clear.

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 14, 2025

Regarding the tests, I'd be happy to give it a try, but since they involve deeper internals, feel free to take that part if you prefer; I'd really appreciate it, and it's also a good chance for me to understand this part better.

Member

@javanna javanna left a comment

This looks good. I will take care of the harder bits of testing and open a PR that effectively removes timeout handling in the fetch phase, because it's not necessary. I will add the tests directly there.

Perhaps one thing that could be added is a small unit test that verifies clearQueryCancellations does what it's supposed to do. That would fit in ContextIndexSearcherTests; is this something you'd like to add?

@jessepeixoto
Contributor Author

jessepeixoto commented Jul 15, 2025

I added the clear query cancellations test using searcher.hasCancellations(); it felt clearer to me. Let me know if it works for you.
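
For reference, a rough sketch of what that test could assert; the setup is omitted and the method names are the ones discussed in this thread, so treat it as illustrative rather than the actual ContextIndexSearcherTests code:

// Illustrative fragment: register a couple of cancellation runnables, clear
// them all, and verify via hasCancellations() that nothing is left behind.
ContextIndexSearcher searcher = newContextIndexSearcher(reader); // hypothetical test helper
searcher.addQueryCancellation(() -> {});
searcher.addQueryCancellation(() -> {});
assertTrue(searcher.hasCancellations());
searcher.clearQueryCancellations();
assertFalse(searcher.hasCancellations());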

Labels
auto-backport Automatically create backport pull requests when merged >bug external-contributor Pull request authored by a developer outside the Elasticsearch team :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.18.5 v8.19.1 v9.0.5 v9.1.1 v9.2.0