SOLR-17319 : Combined Query Feature for Multi Query Execution #3418
base: main
     */
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      if (rb instanceof CombinedQueryResponseBuilder crb) {
Not seen this (newer) Java pattern matching with instanceof before, nice!
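For readers who have not met the feature either, here is a minimal sketch of pattern matching for instanceof (standard since Java 16). The classes BaseBuilder and CombinedBuilder are hypothetical stand-ins, not the Solr types from this PR.

```java
import java.util.List;

// Hypothetical stand-ins for the PR's builder types, for illustration only.
class BaseBuilder {}

class CombinedBuilder extends BaseBuilder {
  final List<String> subQueries;

  CombinedBuilder(List<String> subQueries) {
    this.subQueries = subQueries;
  }
}

class InstanceofPatternDemo {
  // The type test and the cast happen in one step, binding `cb` on success.
  static int subQueryCount(BaseBuilder rb) {
    if (rb instanceof CombinedBuilder cb) {
      return cb.subQueries.size();
    }
    return 0; // plain builders carry no sub-queries in this sketch
  }

  public static void main(String[] args) {
    System.out.println(subQueryCount(new CombinedBuilder(List.of("q1", "q2"))));
    System.out.println(subQueryCount(new BaseBuilder()));
  }
}
```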
    for (int i = resultSize - 1; i >= 0; i--) {
      ShardDoc shardDoc = queue.pop();
      shardDoc.positionInResponse = i;
      // Need the toString() for correlation with other lists that must
      // be strings (like keys in highlighting, explain, etc)
      resultIds.put(shardDoc.id.toString(), shardDoc);
    }
Wondering if factoring out a protected QueryComponent method for this block (and the resultSize and resultIds above) would allow the CombinedQueryComponent to override the method, avoiding the need for the rb instanceof CombinedQueryResponseBuilder check above, e.g.
-    for (int i = resultSize - 1; i >= 0; i--) {
-      ShardDoc shardDoc = queue.pop();
-      shardDoc.positionInResponse = i;
-      // Need the toString() for correlation with other lists that must
-      // be strings (like keys in highlighting, explain, etc)
-      resultIds.put(shardDoc.id.toString(), shardDoc);
-    }
+    Map<Object, ShardDoc> resultIds = createResultIds(queue, ss.getOffset());
Thanks for the input. This makes sense to me, and I have refactored out the method to leverage overriding.
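The refactoring discussed in this thread could look roughly like the sketch below. It is not the PR's actual code: SketchDoc, BaseComponent and CombinedComponent are hypothetical, simplified stand-ins for ShardDoc, QueryComponent and CombinedQueryComponent, and a plain Deque replaces the shard hit queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical stand-in for Solr's ShardDoc.
class SketchDoc {
  final Object id;
  int positionInResponse = -1;

  SketchDoc(Object id) {
    this.id = id;
  }
}

class BaseComponent {
  // Overridable hook; mirrors the loop quoted in the review comment.
  protected Map<Object, SketchDoc> createResultIds(Deque<SketchDoc> queue, int offset) {
    int resultSize = Math.max(0, queue.size() - offset);
    Map<Object, SketchDoc> resultIds = new LinkedHashMap<>();
    for (int i = resultSize - 1; i >= 0; i--) {
      SketchDoc doc = queue.pop(); // assumed: lowest-ranked doc comes out first
      doc.positionInResponse = i;
      resultIds.put(doc.id.toString(), doc);
    }
    return resultIds;
  }

  public Map<Object, SketchDoc> mergeIds(Deque<SketchDoc> queue, int offset) {
    return createResultIds(queue, offset); // no instanceof check needed here
  }
}

class CombinedComponent extends BaseComponent {
  // Hypothetical combined behaviour: the input is already in final rank order.
  @Override
  protected Map<Object, SketchDoc> createResultIds(Deque<SketchDoc> queue, int offset) {
    Map<Object, SketchDoc> resultIds = new LinkedHashMap<>();
    int position = 0;
    for (SketchDoc doc : queue) {
      doc.positionInResponse = position++;
      resultIds.put(doc.id.toString(), doc);
    }
    return resultIds;
  }
}

class OverrideHookDemo {
  public static void main(String[] args) {
    Deque<SketchDoc> queue = new ArrayDeque<>(List.of(new SketchDoc("b"), new SketchDoc("a")));
    System.out.println(new BaseComponent().mergeIds(queue, 0).keySet());
  }
}
```

The caller only ever invokes mergeIds, so the choice of behaviour moves from a type check into ordinary dynamic dispatch.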
    boolean partialResults = false;
    boolean segmentTerminatedEarly = false;
    List<QueryResult> queryResults = new ArrayList<>();
I'm not familiar with RRF on partial results, if that is even a concept, but I wonder whether conceptually it should be up to the combiner to decide, e.g.
-    boolean partialResults = false;
-    boolean segmentTerminatedEarly = false;
-    List<QueryResult> queryResults = new ArrayList<>();
+    List<Boolean> partialResults = new ArrayList<>(crb.responseBuilders.size());
+    List<Boolean> segmentTerminatedEarly = new ArrayList<>(crb.responseBuilders.size());
+    List<QueryResult> queryResults = new ArrayList<>(crb.responseBuilders.size());
and then later (pseudo code)
    combinedPartialResults, combinedSegmentTerminatedEarly, combinedQueryResult = combinerStrategy.combine(partialResults, segmentTerminatedEarly, queryResults);
RRF should just merge multiple doc results irrespective of whether they are partial or complete, IMHO. If any of the ResponseBuilder QueryResults contains partial results, the whole merged QueryResult should be marked as partialResults. The same should be the case with segmentTerminatedEarly.
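The rule described in this reply (any partial sub-result makes the merged result partial, and likewise for early termination) amounts to a logical OR over the sub-results. A minimal sketch, assuming a hypothetical SubResult record rather than Solr's QueryResult:

```java
import java.util.List;

class PartialFlags {
  // Hypothetical per-sub-query outcome; not a Solr class.
  record SubResult(boolean partialResults, boolean segmentTerminatedEarly) {}

  // The merged flags are the logical OR of the corresponding sub-result flags.
  static SubResult merge(List<SubResult> subResults) {
    boolean partial = false;
    boolean terminatedEarly = false;
    for (SubResult r : subResults) {
      partial |= r.partialResults();
      terminatedEarly |= r.segmentTerminatedEarly();
    }
    return new SubResult(partial, terminatedEarly);
  }

  public static void main(String[] args) {
    SubResult merged =
        merge(List.of(new SubResult(false, false), new SubResult(true, false)));
    System.out.println(merged);
  }
}
```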
@alessandrobenedetti @dsmiley, please help review it whenever you can. Thanks!
     * The CombinedQueryComponent class extends QueryComponent and provides support for executing
     * multiple queries and combining their results.
     */
    public class CombinedQueryComponent extends QueryComponent {
QueryComponent is specifically designed for Solr's distributed search processing. We override the prepare method, but then invoke super.prepare with the sub response. This could quickly get out of control for a query with a large number of clauses.
I would suggest extending SearchComponent instead and defining explicit subBuilder.process and subBuilder.prepare methods.
     * @throws IOException if an I/O error occurs during preparation
     */
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
How does this work with grouping, highlighting and faceting? Those methods from QueryComponent are not overridden here, so updated ResponseBuilders are not propagated there.
Highlighting and Faceting are separate components, so not affected, but as far as grouping is concerned, merge logic has to be there. Adding...
"... so updated ResponseBuilders are not propagated there. ..."
So to perhaps illustrate with an example, https://github.com/apache/solr/blob/releases/solr/9.8.1/solr/core/src/java/org/apache/solr/handler/component/HighlightComponent.java#L97 sets the rb.doHighlights flag, and this would be set on the (CombinedQuery)ResponseBuilder builder but not on the CombinedQueryResponseBuilder.responseBuilders builders.
Yes, and the (CombinedQuery)ResponseBuilder builder is already populated with all the parameters, including the highlighting query, here.
     * QueryAndResponseCombiner strategy, and sets the appropriate results and metadata in the
     * CombinedQueryResponseBuilder.
     *
     * @param rb the ResponseBuilder object to process
I wonder if we can abstract the propagated sub-query execution into a separate class:
    public class SubQueryExecutor {
      private final SolrQueryRequest sharedReq;
      private final List<SearchComponent> components;

      public SubQueryExecutor(SolrQueryRequest req, List<SearchComponent> components) {
        this.sharedReq = req;
        this.components = components;
      }

      public void execute(List<ResponseBuilder> builders) throws IOException {
        // Prepare every builder with every component first...
        for (ResponseBuilder rb : builders) {
          for (SearchComponent c : components) {
            c.prepare(rb); // or distributedPrepare
          }
        }
        // ...then run the process phase for each builder.
        for (ResponseBuilder rb : builders) {
          for (SearchComponent c : components) {
            c.process(rb); // or distributedProcess
          }
        }
      }
    }
This will avoid nesting like super.prepare(rbNew);
This kind of orchestration is already happening in SearchHandler - IMO, iterating through each ResponseBuilder is not needed for every SearchComponent. Only the CombinedQueryComponent needs the multiple queries ResponseBuilder for multi-query execution.
"... Only the CombinedQueryComponent needs the multiple queries ResponseBuilder for multi-query execution."
From code reading it appears that if one wanted to have highlighting for the results being combined then the highlighting component would also need access.
But then again, perhaps that and various things could be initially deferred as out-of-scope (and documented as such) e.g. no combining with highlighting or faceting or cursor mark functionality.
The multiple queries (just the queries), which are set inside the queries field of the JSON Query DSL, are executed in CombinedQueryComponent. After that, all the other components, such as faceting and highlighting, work from the (CombinedQuery)ResponseBuilder builder.
"From code reading it appears that if one wanted to have highlighting for the results being combined then the highlighting component would also need access."
Not exactly: the highlighting component works by highlighting the rb.getResults() set, which has already been created in the CombinedQueryComponent.
    for (int i = 0; i < resultSize; i++) {
      ShardDoc shardDoc = combinedShardDocs.get(i);
      shardDoc.positionInResponse = i;
      maxScore = Math.max(maxScore, shardDoc.score);
How does this normalise across different query types (KNN, BM25, filters)?
Normalisation is not applicable for RRF
    String docId = scoredDoc.getKey();
    Float score = scoredDoc.getValue();
    ShardDoc shardDoc = docIdToShardDoc.get(docId);
    shardDoc.score = score;
This is dangerous - this is mutating the original ShardDoc object. It might be referred to by another component, and is a bad idea to modify in place.
ShardDoc is local to the mergeIds method and is not shared through any other object available to other components. Also, SolrDocumentList is created later using ShardDoc. Please help me understand if it's being shared anywhere else.
    public static QueryAndResponseCombiner getImplementation(SolrParams requestParams) {
      String algorithm =
          requestParams.get(CombinerParams.COMBINER_ALGORITHM, CombinerParams.RECIPROCAL_RANK_FUSION);
      if (algorithm.equals(CombinerParams.RECIPROCAL_RANK_FUSION)) {
This is hardcoded. Why not have a plugin interface here, allowing combiner plugins to be loaded dynamically?
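One way to read the plugin suggestion is a name-to-factory registry in place of the if/else chain. The sketch below is purely illustrative: Combiner, RrfCombiner and CombinerRegistry are hypothetical names, and it does not use Solr's actual plugin-loading machinery.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical combiner contract; the PR's abstract class has a richer API.
interface Combiner {
  String name();
}

class RrfCombiner implements Combiner {
  @Override
  public String name() {
    return "rrf";
  }
}

class CombinerRegistry {
  private final Map<String, Supplier<Combiner>> factories = new ConcurrentHashMap<>();

  // Implementations register themselves under an algorithm name.
  void register(String algorithm, Supplier<Combiner> factory) {
    factories.put(algorithm, factory);
  }

  // Lookup replaces the hardcoded equals() chain.
  Combiner create(String algorithm) {
    Supplier<Combiner> factory = factories.get(algorithm);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown combiner algorithm: " + algorithm);
    }
    return factory.get();
  }

  public static void main(String[] args) {
    CombinerRegistry registry = new CombinerRegistry();
    registry.register("rrf", RrfCombiner::new);
    System.out.println(registry.create("rrf").name());
  }
}
```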
     * @return a list of explanations for the given queries and results
     * @throws IOException if an I/O error occurs during the explanation retrieval process
     */
    public abstract NamedList<Explanation> getExplanations(
Please implement support for debug as well
@@ -240,7 +241,7 @@ public void changed(SolrPackageLoader.SolrPackage pkg, Ctx ctx) {
   }

   @SuppressWarnings({"unchecked"})
-  private void initComponents() {
+  private void initComponents(boolean isCombinedQuery) {
This is a code smell - initComponents should not be changing behaviour based on a flag specific to a component.
This would be solved if we inherited from SearchHandler or dynamically injected CombinedQueryComponent using a factory pattern
Strongly agree with the smell!
    import org.junit.BeforeClass;

    /**
     * The CombinedQueryComponentTest class is a unit test suite for the CombinedQueryComponent in Solr.
We should add tests for queries returning no results and score ties
I'm going to add a user-supplied option for ordering the docs across multiple queries in case of a tie.
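Until such an option exists, a deterministic fallback for ties is easy to sketch. The example below is an assumption about one reasonable ordering (higher score first, then lower doc id), not what the PR implements; TieBreakDemo and rank are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TieBreakDemo {
  // Sorts by score descending; equal scores fall back to ascending doc id,
  // so the order is stable from run to run.
  static List<Map.Entry<Integer, Float>> rank(Map<Integer, Float> docIdToScore) {
    List<Map.Entry<Integer, Float>> entries = new ArrayList<>(docIdToScore.entrySet());
    entries.sort(
        (a, b) -> {
          int byScore = Float.compare(b.getValue(), a.getValue());
          return byScore != 0 ? byScore : Integer.compare(a.getKey(), b.getKey());
        });
    return entries;
  }

  public static void main(String[] args) {
    // Docs 1 and 3 tie on score; doc 2 has the highest score.
    for (Map.Entry<Integer, Float> e : rank(Map.of(3, 0.5f, 1, 0.5f, 2, 0.9f))) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
```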
    // cosine distance vector1= 0.970
    docs.get(6).addField(vectorField, Arrays.asList(5f, 10f, 20f, 40f));
    // cosine distance vector1= 0.515
    docs.get(7).addField(vectorField, Arrays.asList(120f, 60f, 30f, 15f));
Please add a test for the explainability of the RRF score calculation.
    while (docs.hasNext() && ranking <= upTo) {
      int docId = docs.nextDoc();
      float rrfScore = 1f / (k + ranking);
      docIdToScore.compute(docId, (id, score) -> (score == null) ? rrfScore : score + rrfScore);
This assumes that each query returns upTo documents. What happens when a query returns fewer documents?
Then, only docs.size() number of documents are ranked.
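To make that answer concrete, here is a small self-contained sketch of the fusion loop quoted above, using plain lists of doc ids instead of Solr's DocList. RrfSketch and fuse are hypothetical names. A list shorter than upTo simply contributes fewer terms of the form 1 / (k + rank).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RrfSketch {
  // Each ranked list contributes 1 / (k + rank) for its first `upTo` docs;
  // shorter lists just stop early, so nothing is assumed about their length.
  static Map<Integer, Float> fuse(List<List<Integer>> rankedLists, int k, int upTo) {
    Map<Integer, Float> docIdToScore = new HashMap<>();
    for (List<Integer> ranked : rankedLists) {
      int ranking = 1;
      for (int docId : ranked) {
        if (ranking > upTo) {
          break;
        }
        float rrfScore = 1f / (k + ranking);
        docIdToScore.merge(docId, rrfScore, Float::sum);
        ranking++;
      }
    }
    return docIdToScore;
  }

  public static void main(String[] args) {
    // The second query returns a single hit; doc 2 appears in both lists.
    Map<Integer, Float> scores = fuse(List.of(List.of(1, 2, 3), List.of(2)), 60, 10);
    System.out.println(scores);
  }
}
```

In this example doc 2 ends up with the highest fused score because it is the only document present in both lists.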
    totalMatches = Math.max(totalMatches, rankedList.matches());
    int ranking = 1;
    while (docs.hasNext() && ranking <= upTo) {
      int docId = docs.nextDoc();
The upTo limit is per query, not a global top N. In the fusion part, we return all unique docs across all subqueries. Where are we enforcing the user-specified top N limit?
As usual, the user-specified top N is enforced at the shard level here, and at the search index level in SolrIndexSearcher, by setting the nums and offset in SortSpec.
      }
    }
    List<Map.Entry<Integer, Float>> sortedByScoreDescending =
        docIdToScore.entrySet().stream()
This is essentially number of queries * upto. Have we scale tested this?
We need to enforce some limit on the number of queries to avoid a burst of queries. Adding...
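Such a guard could be as small as the sketch below. The names QueryLimitGuard and checkQueryCount and the default of 5 are assumptions made for illustration; the thread does not say which limit or parameter the PR will use.

```java
import java.util.List;

class QueryLimitGuard {
  // Hypothetical default; the PR does not state which limit it will use.
  static final int DEFAULT_MAX_QUERIES = 5;

  // Rejects a request up front instead of fanning out an unbounded number of sub-queries.
  static void checkQueryCount(List<String> queryKeys, int maxQueries) {
    if (queryKeys.size() > maxQueries) {
      throw new IllegalArgumentException(
          "Too many queries to combine: " + queryKeys.size() + " (limit " + maxQueries + ")");
    }
  }

  public static void main(String[] args) {
    checkQueryCount(List.of("lexical", "vector"), DEFAULT_MAX_QUERIES); // within the limit
    System.out.println("accepted");
  }
}
```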

    int combinedResultsLength = docIdToScore.size();
    int[] combinedResultsDocIds = new int[combinedResultsLength];
    float[] combinedResultScores = new float[combinedResultsLength];
How about early termination for non competitive iterators?
Early termination is not applicable in this context, as the complete set of documents is required for the RRF (Reciprocal Rank Fusion) algorithm to function correctly.
Really glad to see this work began by acknowledging the existing work and trying to address the pitfalls!
     * @return a map of shard documents, where the keys are the shard IDs as strings, and the values
     *     are the corresponding ShardDoc objects
     */
    protected Map<Object, ShardDoc> createShardResult(
Shouldn't the key be a String and not an Object?
It should be String but the ResponseBuilder has type Object so had to keep it Object.
        ShardFieldSortedHitQueue queue,
        Map<String, List<ShardDoc>> shardDocMap,
        SolrDocumentList responseDocs) {
      Map<Object, ShardDoc> resultIds = new HashMap<>();
See org.apache.solr.common.util.CollectionUtil#newHashMap and pre-size
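For context on why pre-sizing matters: a HashMap constructed with capacity n and the default load factor of 0.75 resizes once it holds more than 0.75 * n entries, so the capacity has to be larger than the expected entry count. The sketch below shows that arithmetic with plain JDK calls; PresizedMaps is a hypothetical helper, not the Solr utility mentioned in the comment.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMaps {
  // Smallest initial capacity that holds `expectedSize` entries without a resize,
  // assuming HashMap's default load factor of 0.75.
  static int capacityFor(int expectedSize) {
    return (int) Math.ceil(expectedSize / 0.75d);
  }

  static <K, V> Map<K, V> newPresizedHashMap(int expectedSize) {
    return new HashMap<>(capacityFor(expectedSize));
  }

  public static void main(String[] args) {
    Map<Object, String> resultIds = newPresizedHashMap(12);
    resultIds.put("doc-1", "shard-a");
    System.out.println(capacityFor(12) + " " + resultIds.size());
  }
}
```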
     * The QueryAndResponseCombiner class is an abstract base class for combining query results and
     * responses. It provides a framework for different algorithms to be implemented for merging ranked
results & responses -- seem synonymous to me.
Hi @ercsonusharma, thanks for resurrecting this. I didn't have time to dedicate to the feature in the last few months, so it's good to see some movement! In the next couple of weeks, I should be able to give it a go and review it!
https://issues.apache.org/jira/browse/SOLR-17319
Description
This feature aims to execute multiple queries of multiple kinds across multiple shards of a collection and combine their results using an algorithm (such as Reciprocal Rank Fusion). It also helps resolve the issues discussed in the previous PR, mainly around merging documents across shards. It provides more flexibility in querying by extending the JSON Query DSL, ultimately enabling hybrid search in a pure way and addressing those shortcomings.
This feature is currently not supported for non-distributed or grouping queries.
Solution
Tests