SOLR-17319 : Combined Query Feature for Multi Query Execution #3418

Open
wants to merge 12 commits into
base: main

Conversation

@ercsonusharma ercsonusharma commented Jul 4, 2025

https://issues.apache.org/jira/browse/SOLR-17319

Description

This feature executes multiple queries of different kinds across the shards of a collection and combines their results using an algorithm such as Reciprocal Rank Fusion (RRF). It also helps resolve the issues raised on the previous PR, mainly around merging documents across shards. It extends the JSON Query DSL to give more flexibility in querying, ultimately enabling hybrid search without those shortcomings.

This feature is currently not supported for non-distributed requests or grouping queries.

Solution

  • Extended QueryComponent to create the new CombinedQueryComponent, and ResponseBuilder to create the new CombinedQueryResponseBuilder, which holds multiple response builders to keep state and execute multiple queries.
  • Added a parameter to the JSON Query DSL that identifies a combined query request; based on it, the new CombinedQueryComponent is invoked.
  • CombinedQueryComponent has one response builder assigned per query. The queries are first executed at the SolrIndexSearcher level and their results are then combined, using RRF for now.
  • At the shard level, the responses for the multiple queries are merged as well.

Tests

  • Added tests that exercise the RRF logic independently.
  • Added tests covering both the search index level and distributed requests.
  • Added tests asserting the existing behaviour of the search handler's QueryComponent, as well as the newly added CombinedQueryComponent, depending on the flag in the JSON Query DSL.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

*/
@Override
public void prepare(ResponseBuilder rb) throws IOException {
if (rb instanceof CombinedQueryResponseBuilder crb) {
Contributor

I had not seen this (newer) Java pattern matching with instanceof before, nice!
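
For readers who have not met this syntax either: pattern matching for `instanceof` (a standard feature since Java 16) tests the type and binds a typed variable in one step. A minimal sketch with stand-in classes; `BaseBuilder` and `CombinedBuilder` are hypothetical placeholders, not the Solr types:

```java
// Hypothetical stand-ins for ResponseBuilder / CombinedQueryResponseBuilder.
class BaseBuilder {}

class CombinedBuilder extends BaseBuilder {
  final int subQueryCount;

  CombinedBuilder(int subQueryCount) {
    this.subQueryCount = subQueryCount;
  }
}

class InstanceofDemo {
  // Old style: test the type, then cast explicitly.
  static int countOldStyle(BaseBuilder rb) {
    if (rb instanceof CombinedBuilder) {
      CombinedBuilder crb = (CombinedBuilder) rb;
      return crb.subQueryCount;
    }
    return 1;
  }

  // Pattern matching: `crb` is bound only when the type test succeeds.
  static int countWithPattern(BaseBuilder rb) {
    if (rb instanceof CombinedBuilder crb) {
      return crb.subQueryCount;
    }
    return 1;
  }
}
```

Both methods behave identically; the second simply removes the redundant cast.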

Comment on lines 1183 to 1189
for (int i = resultSize - 1; i >= 0; i--) {
ShardDoc shardDoc = queue.pop();
shardDoc.positionInResponse = i;
// Need the toString() for correlation with other lists that must
// be strings (like keys in highlighting, explain, etc)
resultIds.put(shardDoc.id.toString(), shardDoc);
}
Contributor

Wondering if factoring out a protected QueryComponent method for this block (and the resultSize and resultIds above) would allow the CombinedQueryComponent to override the method, avoiding the need for rb instanceof CombinedQueryResponseBuilder above e.g.

Suggested change
- for (int i = resultSize - 1; i >= 0; i--) {
-   ShardDoc shardDoc = queue.pop();
-   shardDoc.positionInResponse = i;
-   // Need the toString() for correlation with other lists that must
-   // be strings (like keys in highlighting, explain, etc)
-   resultIds.put(shardDoc.id.toString(), shardDoc);
- }
+ Map<Object, ShardDoc> resultIds = createResultIds(queue, ss.getOffset());

Author

Thanks for the input. This makes sense to me, and I have refactored out the method to leverage overriding.
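
The refactor discussed here is a template-method extraction: the base component exposes the id-mapping step as a protected method and the subclass overrides it, so the shared flow needs no `instanceof` check. A minimal sketch; `BaseComponent`, `CombinedComponent`, and the list-based signature are hypothetical simplifications, not the actual `QueryComponent` code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for QueryComponent.
class BaseComponent {
  // The shared merge flow calls the overridable hook.
  Map<String, Integer> merge(List<String> rankedIds) {
    return createResultIds(rankedIds);
  }

  // Default behaviour: map each id to its position in the response.
  protected Map<String, Integer> createResultIds(List<String> rankedIds) {
    Map<String, Integer> resultIds = new LinkedHashMap<>();
    for (int i = 0; i < rankedIds.size(); i++) {
      resultIds.put(rankedIds.get(i), i);
    }
    return resultIds;
  }
}

// Hypothetical stand-in for CombinedQueryComponent.
class CombinedComponent extends BaseComponent {
  // Override: an id that appears in several lists keeps its first position only.
  @Override
  protected Map<String, Integer> createResultIds(List<String> rankedIds) {
    Map<String, Integer> resultIds = new LinkedHashMap<>();
    for (String id : rankedIds) {
      resultIds.putIfAbsent(id, resultIds.size());
    }
    return resultIds;
  }
}
```

The caller only ever invokes `merge`; which `createResultIds` runs is decided by the component's runtime type.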

Comment on lines +79 to +81
boolean partialResults = false;
boolean segmentTerminatedEarly = false;
List<QueryResult> queryResults = new ArrayList<>();
Contributor

Am not familiar with RRF on partial results, if that is a concept? But wondering if conceptually it's up to the combiner to decide e.g.

Suggested change
- boolean partialResults = false;
- boolean segmentTerminatedEarly = false;
- List<QueryResult> queryResults = new ArrayList<>();
+ List<Boolean> partialResults = new ArrayList<>(crb.responseBuilders.size());
+ List<Boolean> segmentTerminatedEarly = new ArrayList<>(crb.responseBuilders.size());
+ List<QueryResult> queryResults = new ArrayList<>(crb.responseBuilders.size());

and then later (pseudo code)

combinedPartialResults, combinedSegmentTerminatedEarly, combinedQueryResult = combinerStrategy.combine(partialResults, segmentTerminatedEarly, queryResults);

Author

RRF should just merge multiple doc results irrespective of whether they are partial or complete IMHO. If any of the ResponseBuilder QueryResults contain partial results, the whole merged QueryResults should be marked as partialResults. Same should be the case with segmentTerminatedEarly.
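
The rule described in this reply is a logical OR over the per-query flags. A minimal sketch, assuming each sub-query reports its two flags; `SubQueryOutcome` and `FlagCombiner` are hypothetical names, not classes from the PR:

```java
import java.util.List;

// Hypothetical per-query outcome; not a Solr class.
record SubQueryOutcome(boolean partialResults, boolean segmentTerminatedEarly) {}

class FlagCombiner {
  // The merged result is partial if any sub-query result is partial.
  static boolean anyPartial(List<SubQueryOutcome> outcomes) {
    return outcomes.stream().anyMatch(SubQueryOutcome::partialResults);
  }

  // Same rule for early termination.
  static boolean anyTerminatedEarly(List<SubQueryOutcome> outcomes) {
    return outcomes.stream().anyMatch(SubQueryOutcome::segmentTerminatedEarly);
  }
}
```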

@ercsonusharma
Author

@alessandrobenedetti @dsmiley, please help review it whenever you can. Thanks!

* The CombinedQueryComponent class extends QueryComponent and provides support for executing
* multiple queries and combining their results.
*/
public class CombinedQueryComponent extends QueryComponent {
Contributor

QueryComponent is specifically designed for Solr's distributed search processing. We override the prepare method but then invoke super.prepare with the sub-response. This could quickly get out of control for a query with a large number of clauses.

I would suggest overriding SearchComponent and defining explicit subBuilder.process and subBuilder.prepare methods.

* @throws IOException if an I/O error occurs during preparation
*/
@Override
public void prepare(ResponseBuilder rb) throws IOException {
Contributor

How does this work with grouping, highlighting and faceting? Those methods from QueryComponent are not overridden here, so updated ResponseBuilders are not propagated there.

Author

@ercsonusharma ercsonusharma Jul 10, 2025

Highlighting and Faceting are separate components, so not affected, but as far as grouping is concerned, merge logic has to be there. Adding...

Contributor

> ... so updated ResponseBuilders are not propagated there. ...

So to perhaps illustrate with an example, https://github.com/apache/solr/blob/releases/solr/9.8.1/solr/core/src/java/org/apache/solr/handler/component/HighlightComponent.java#L97 sets the rb.doHighlights flag, and this would be set on the (CombinedQuery)ResponseBuilder builder but not on the CombinedQueryResponseBuilder.responseBuilders builders.

Author

yes, and (CombinedQuery)ResponseBuilder builder is already populated with all the parameters including highlights query here.

* QueryAndResponseCombiner strategy, and sets the appropriate results and metadata in the
* CombinedQueryResponseBuilder.
*
* @param rb the ResponseBuilder object to process
Contributor

I wonder if we can abstract the subquery propagated execution in a separate class:

public class SubQueryExecutor {
  private final SolrQueryRequest sharedReq;
  private final List<SearchComponent> components;

  public SubQueryExecutor(SolrQueryRequest req, List<SearchComponent> components) {
    this.sharedReq = req;
    this.components = components;
  }

  public void execute(List<ResponseBuilder> builders) throws IOException {
    for (ResponseBuilder rb : builders) {
      for (SearchComponent c : components) {
        c.prepare(rb);  // or distributedPrepare
      }
    }
    for (ResponseBuilder rb : builders) {
      for (SearchComponent c : components) {
        c.process(rb);  // or distributedProcess
      }
    }
  }
}

This will avoid nesting like super.prepare(rbNew);

Author

This kind of orchestration is already happening in SearchHandler - IMO, iterating through each ResponseBuilder is not needed for every SearchComponent. Only the CombinedQueryComponent needs the multiple queries ResponseBuilder for multi-query execution.

Contributor

> ... Only the CombinedQueryComponent needs the multiple queries ResponseBuilder for multi-query execution.

From code reading it appears that if one wanted to have highlighting for the results being combined then the highlighting component would also need access.

But then again, perhaps that and various things could be initially deferred as out-of-scope (and documented as such) e.g. no combining with highlighting or faceting or cursor mark functionality.

Author

Multiple queries (just the queries), which are set in the queries field of the JSON Query DSL, are executed in CombinedQueryComponent. After that, all the other components, such as faceting and highlighting, work from the (CombinedQuery)ResponseBuilder builder.

> From code reading it appears that if one wanted to have highlighting for the results being combined then the highlighting component would also need access.

Not exactly; the highlighting component works by highlighting the rb.getResults() set, which is already created in CombinedQueryComponent.

for (int i = 0; i < resultSize; i++) {
ShardDoc shardDoc = combinedShardDocs.get(i);
shardDoc.positionInResponse = i;
maxScore = Math.max(maxScore, shardDoc.score);
Contributor

How does this normalise across different query types (KNN, BM25, filters)?

Author

Normalisation is not applicable for RRF
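
The reason is that RRF scores a document by summing 1 / (k + rank) over the lists it appears in, so only rank positions enter the formula and the raw BM25 or KNN scores never do. A minimal sketch, assuming 1-based ranks and string ids; `RrfSketch`, `fuse`, and the value of `k` are illustrative and not the PR's implementation:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RrfSketch {
  // Sums 1 / (k + rank) per document over every ranked list (rank starts at 1).
  static Map<String, Float> fuse(List<List<String>> rankedLists, int k) {
    Map<String, Float> scores = new LinkedHashMap<>();
    for (List<String> ranked : rankedLists) {
      int rank = 1;
      for (String docId : ranked) {
        // Lists may differ in length; a shorter list just contributes fewer terms.
        scores.merge(docId, 1f / (k + rank), Float::sum);
        rank++;
      }
    }
    return scores;
  }
}
```

Because the inputs are plain orderings, two lists produced by scorers with very different score ranges are fused on equal footing.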

String docId = scoredDoc.getKey();
Float score = scoredDoc.getValue();
ShardDoc shardDoc = docIdToShardDoc.get(docId);
shardDoc.score = score;
Contributor

This is dangerous - this is mutating the original ShardDoc object. It might be referred to by another component, and is a bad idea to modify in place.

Author

ShardDoc is local to the mergeIds method and is not shared with any other object available to other components. Also, the SolrDocumentList is created later from the ShardDoc objects. Please help me understand if it is being shared anywhere else.
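
If in-place mutation did turn out to be a problem, one way to sidestep it would be to write the fused score onto a copy. A minimal sketch; `ScoredDoc` and `withScore` are hypothetical and do not exist on Solr's `ShardDoc`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical immutable stand-in for ShardDoc.
record ScoredDoc(String id, float score) {
  ScoredDoc withScore(float newScore) {
    return new ScoredDoc(id, newScore);
  }
}

class CopyOnRescore {
  // Returns new objects carrying the fused scores; the inputs are left untouched.
  static List<ScoredDoc> rescore(List<ScoredDoc> docs, Map<String, Float> fusedScores) {
    List<ScoredDoc> out = new ArrayList<>(docs.size());
    for (ScoredDoc doc : docs) {
      Float fused = fusedScores.get(doc.id());
      // Documents without a fused score are passed through unchanged.
      out.add(fused == null ? doc : doc.withScore(fused));
    }
    return out;
  }
}
```

The trade-off is one extra allocation per rescored document.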

public static QueryAndResponseCombiner getImplementation(SolrParams requestParams) {
String algorithm =
requestParams.get(CombinerParams.COMBINER_ALGORITHM, CombinerParams.RECIPROCAL_RANK_FUSION);
if (algorithm.equals(CombinerParams.RECIPROCAL_RANK_FUSION)) {
Contributor

This is hardcoded. Why not have a plugin interface here, so that implementations can be loaded dynamically?
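
One shape this suggestion could take is a name-to-factory registry in place of an `if` on the algorithm name. A minimal sketch; `Combiner`, `CombinerRegistry`, and the `"rrf"` key are hypothetical, and Solr's own plugin-loading machinery is deliberately not modelled:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical strategy interface.
interface Combiner {
  String name();
}

class CombinerRegistry {
  private final Map<String, Supplier<Combiner>> factories = new ConcurrentHashMap<>();

  // Registers a factory under an algorithm name.
  void register(String algorithm, Supplier<Combiner> factory) {
    factories.put(algorithm, factory);
  }

  // Looks the algorithm up by name and fails loudly on an unknown one.
  Combiner create(String algorithm) {
    Supplier<Combiner> factory = factories.get(algorithm);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown combiner algorithm: " + algorithm);
    }
    return factory.get();
  }
}
```

New algorithms are then added by registering a factory rather than by editing the lookup method.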

* @return a list of explanations for the given queries and results
* @throws IOException if an I/O error occurs during the explanation retrieval process
*/
public abstract NamedList<Explanation> getExplanations(
Contributor

Please implement support for debug as well

@@ -240,7 +241,7 @@ public void changed(SolrPackageLoader.SolrPackage pkg, Ctx ctx) {
}

@SuppressWarnings({"unchecked"})
- private void initComponents() {
+ private void initComponents(boolean isCombinedQuery) {
Contributor

This is a code smell - initComponents should not be changing behaviour based on a flag specific to a component.

This would be solved if we inherited from SearchHandler or dynamically injected CombinedQueryComponent using a factory pattern

Contributor

Strongly agree with the smell!

import org.junit.BeforeClass;

/**
* The CombinedQueryComponentTest class is a unit test suite for the CombinedQueryComponent in Solr.
Contributor

We should add tests for queries returning no results and score ties

Author

Going to add user input for ordering the docs across multiple queries in case of tie.
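
Until a user-supplied ordering exists, ties can at least be made deterministic with a secondary sort key. A minimal sketch that sorts by fused score descending and breaks ties by document id; the id tie-breaker is an assumption made for illustration, not the behaviour the author plans to add:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class TieBreakSketch {
  // Orders ids by score (highest first); equal scores fall back to ascending id.
  static List<String> order(Map<String, Float> scores) {
    List<Map.Entry<String, Float>> entries = new ArrayList<>(scores.entrySet());
    entries.sort(
        Map.Entry.<String, Float>comparingByValue(Comparator.reverseOrder())
            .thenComparing(Map.Entry.<String, Float>comparingByKey()));
    List<String> ids = new ArrayList<>(entries.size());
    for (Map.Entry<String, Float> e : entries) {
      ids.add(e.getKey());
    }
    return ids;
  }
}
```

A test for score ties can then assert an exact order instead of accepting either permutation.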

// cosine distance vector1= 0.970
docs.get(6).addField(vectorField, Arrays.asList(5f, 10f, 20f, 40f));
// cosine distance vector1= 0.515
docs.get(7).addField(vectorField, Arrays.asList(120f, 60f, 30f, 15f));
Contributor

@atris atris Jul 9, 2025

Please add test for the RRF score calculation explainability

while (docs.hasNext() && ranking <= upTo) {
int docId = docs.nextDoc();
float rrfScore = 1f / (k + ranking);
docIdToScore.compute(docId, (id, score) -> (score == null) ? rrfScore : score + rrfScore);
Contributor

This assumes that each query returns upTo documents. What happens when a query returns fewer documents?

Author

Then, only docs.size() number of documents are ranked.

totalMatches = Math.max(totalMatches, rankedList.matches());
int ranking = 1;
while (docs.hasNext() && ranking <= upTo) {
int docId = docs.nextDoc();
Contributor

The upTo limit is per query, not a global top N. In the fusion part, we return all unique docs across all subqueries. Where are we enforcing the user-specified top N limit?

Author

As usual, the user-specified top N is enforced at the shard level here, and at the search-index level in SolrIndexSearcher by setting the number of rows and the offset in SortSpec.
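
The windowing described in this reply amounts to slicing the fused, sorted list with the request's offset and row count. A minimal sketch on a plain list; `page`, `offset`, and `rows` are illustrative names rather than the `SortSpec` API:

```java
import java.util.List;

class PagingSketch {
  // Returns the [offset, offset + rows) window of an already-sorted list, clamped to its size.
  static <T> List<T> page(List<T> sorted, int offset, int rows) {
    int from = Math.min(Math.max(offset, 0), sorted.size());
    // Widen to long so offset + rows cannot overflow before clamping.
    int to = (int) Math.min((long) from + Math.max(rows, 0), sorted.size());
    return List.copyOf(sorted.subList(from, to));
  }
}
```

Applied after fusion, this bounds the response size no matter how many unique documents the sub-queries contributed.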

}
}
List<Map.Entry<Integer, Float>> sortedByScoreDescending =
docIdToScore.entrySet().stream()
Contributor

This is essentially number of queries * upto. Have we scale tested this?

Author

We need to enforce some limit on the number of queries to avoid a burst of queries. Adding...
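
Such a guard could be a simple check before any sub-query runs. A minimal sketch; `QueryLimitGuard` and the `maxQueries` parameter are hypothetical, and the PR may choose a different limit or error type:

```java
import java.util.List;

class QueryLimitGuard {
  // Rejects requests that carry more sub-queries than the configured maximum.
  static void check(List<String> queries, int maxQueries) {
    if (queries.size() > maxQueries) {
      throw new IllegalArgumentException(
          "Too many queries to combine: " + queries.size() + " > " + maxQueries);
    }
  }
}
```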


int combinedResultsLength = docIdToScore.size();
int[] combinedResultsDocIds = new int[combinedResultsLength];
float[] combinedResultScores = new float[combinedResultsLength];
Contributor

How about early termination for non competitive iterators?

Author

Early termination is not applicable in this context, as the complete set of documents is required for the RRF (Reciprocal Rank Fusion) algorithm to function correctly.

Contributor

@dsmiley dsmiley left a comment

Really glad to see this work began by acknowledging the existing work and trying to address the pitfalls!

* @return a map of shard documents, where the keys are the shard IDs as strings, and the values
* are the corresponding ShardDoc objects
*/
protected Map<Object, ShardDoc> createShardResult(
Contributor

Shouldn't the key be a String and not an Object?

Author

It should be String but the ResponseBuilder has type Object so had to keep it Object.

ShardFieldSortedHitQueue queue,
Map<String, List<ShardDoc>> shardDocMap,
SolrDocumentList responseDocs) {
Map<Object, ShardDoc> resultIds = new HashMap<>();
Contributor

See org.apache.solr.common.util.CollectionUtil#newHashMap and pre-size
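
Pre-sizing matters because `new HashMap<>(n)` sets the table capacity, not the number of entries the map can hold before it resizes (the default load factor is 0.75). A minimal JDK-only sketch of the arithmetic; `capacityFor` and `presized` are hypothetical helpers shown for illustration and are not the `CollectionUtil` source:

```java
import java.util.HashMap;
import java.util.Map;

class PresizeSketch {
  // Capacity large enough that `expectedSize` entries fit without a rehash at load factor 0.75.
  static int capacityFor(int expectedSize) {
    return (int) Math.ceil(expectedSize / 0.75d);
  }

  // Creates a HashMap that can take `expectedSize` entries without resizing.
  static <K, V> Map<K, V> presized(int expectedSize) {
    return new HashMap<>(capacityFor(expectedSize));
  }
}
```

For the snippet under review, the expected size would be the number of result documents, which is known before the loop starts.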

Comment on lines 35 to 36
* The QueryAndResponseCombiner class is an abstract base class for combining query results and
* responses. It provides a framework for different algorithms to be implemented for merging ranked
Contributor

results & responses -- seem synonymous to me.

@alessandrobenedetti
Contributor

Hi @ercsonusharma , thanks for resurrecting this, didn't have time to dedicate to the feature in the last few months, good to see some movement!

In the next couple of weeks, I should be able to give it a go and review it!
