Enhancing Semantic Search (Text to Vector) in Solr

Introduction

With Solr 9.9.0, you get end-to-end semantic vector search built directly into Solr. By using a service like Hugging Face or Cohere, you can configure your model within Solr and get semantic search up and running quickly. Vectorization, for both indexing and search, happens entirely within Solr, so there's no need to handle vectors manually anymore. Although this is a big step forward, you are still limited to third-party language models hosted in the cloud by a few supported providers.

This may be suitable for getting up and running quickly, but it has a number of drawbacks:

Limited to models from supported providers
Privacy (your data is sent to the cloud provider)
Latency

I’ve expanded on what Solr provides out of the box, allowing you to easily connect your custom model running on any endpoint.

Additional Features

On top of that, I’ve added two new features that make vector search even more powerful.

Support for Multiple Fields

You can combine multiple text fields into a single vector representation. For example, product-name, product-description and marketing-text can all be combined into a single vector. Simply configure the additional fields you want to include in your vector.
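
The README does not spell out how the field values are joined before embedding. As a minimal sketch, assuming plain concatenation in the configured order with a space separator, the idea looks like this in Python. The helper name `combine_fields` and the sample document are hypothetical, and the actual processor may join the fields differently.

```python
from typing import Mapping, Sequence


def combine_fields(doc: Mapping[str, object], fields: Sequence[str], sep: str = " ") -> str:
    """Join the non-empty values of `fields` into one string to embed.

    Assumption: plain concatenation in the configured order; the real
    update processor may combine the fields in another way.
    """
    parts = []
    for name in fields:
        value = doc.get(name)
        if value is None:
            continue  # field not present on this document
        text = str(value).strip()
        if text:
            parts.append(text)
    return sep.join(parts)


# Hypothetical sample document using the field names from this README.
doc = {"name_s": "Solr Mug", "manu_s": "Apache", "description_s": "  Ceramic mug  "}
text = combine_fields(doc, ["name_s", "manu_s", "description_s"])
print(text)
```

Skipping missing or blank fields keeps the text sent to the model free of stray separators.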

Lazy Vectorization

When a document is updated in Solr, its vector is only regenerated if the content of a field used for the vector has changed. This is achieved by fetching the existing document during the update process and checking whether any field used in the vector embedding has actually changed. This minimizes the expensive creation of vectors.
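
The processor itself ships as Java classes. The following Python sketch only illustrates the decision described above and is not the repository's implementation. The function name `needs_new_vector` and the sample documents are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def needs_new_vector(
    existing: Optional[Mapping[str, object]],
    incoming: Mapping[str, object],
    vector_fields: Sequence[str],
) -> bool:
    """Return True if the embedding has to be (re)computed."""
    if existing is None:
        # New document: there is no stored vector that could be reused.
        return True
    # Only fields that feed the embedding are compared; other fields are ignored.
    return any(existing.get(f) != incoming.get(f) for f in vector_fields)


fields = ["name_s", "manu_s", "description_s"]
stored = {"id": "1", "name_s": "Solr Mug", "manu_s": "Apache", "price_f": 9.5}
price_update = {**stored, "price_f": 7.5}    # touches no embedding field
rename = {**stored, "name_s": "Solr Cup"}    # touches an embedding field

print(needs_new_vector(stored, price_update, fields))
print(needs_new_vector(stored, rename, fields))
```

In this sketch a price-only update reuses the stored vector, while renaming the product triggers a new call to the embedding service.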

Setup

Create Custom Service

Spin up your own service that provides an endpoint for vector embedding. Here we use a small Python script that downloads the model 'WhereIsAI/UAE-Large-V1' from Hugging Face:

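from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
# The model is downloaded from Hugging Face on first start and then cached locally.
model = SentenceTransformer("WhereIsAI/UAE-Large-V1")


class TextInput(BaseModel):
    # Request body: {"inputs": "<text to embed>"}
    inputs: str


@app.post("/embed")
def embed_text(text_input: TextInput):
    # encode() returns a NumPy array; convert it to a plain list so it serializes as JSON.
    embedding = model.encode(text_input.inputs).tolist()
    return embedding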
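
To check the service before wiring it into Solr, a small client can post a text and inspect the returned vector. This is a sketch under the assumption that the script above is being served at http://localhost:8000/embed (the URL used in the model configuration further down), for example via uvicorn. The helper names `build_request` and `embed` are hypothetical.

```python
import json
import urllib.request

# Assumption: the embedding service runs locally on port 8000.
EMBED_URL = "http://localhost:8000/embed"


def build_request(text: str, url: str = EMBED_URL) -> urllib.request.Request:
    """Build the POST request with the JSON body the service expects."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, url: str = EMBED_URL) -> list:
    """Send the text to the service and return the embedding as a list of floats."""
    with urllib.request.urlopen(build_request(text, url), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `len(embed("wireless headphones"))` should match the vectorDimension configured in the Solr schema (1024 in this README).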

Configure Solr for Vector Search

Configuration and Schema

Configure a vector field within your schema.xml:

  <fieldType name="vector_1024" class="solr.DenseVectorField" vectorDimension="1024"
             similarityFunction="cosine" knnAlgorithm="hnsw" hnswMaxConnections="16" hnswBeamWidth="100"/>
  <field name="vector_en" type="vector_1024" indexed="true" stored="true" multiValued="false" />
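
The field type above uses similarityFunction="cosine". As a rough illustration of what that measure computes between two embeddings, here is a plain-Python sketch. It is not how Solr evaluates it internally, and the score Solr reports may be scaled differently. The function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings here have 1024 dimensions.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Vectors pointing in the same direction score highest regardless of their length, which is why the dimension of the model output and the vectorDimension of the field have to agree but the magnitudes do not matter.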

Add the UpdateProcessor to solrconfig.xml and register it in your updateRequestProcessorChain. Here you can add the additional fields; in this example we add manu_s and description_s, so the vector embedding is created from name_s, manu_s and description_s.

  <updateProcessor name="textToVector" class="custom.solr.llm.textvectorisation.update.processor.LazyMultiFieldTextToVectorUpdateProcessorFactory">
   <str name="inputField">name_s</str>
   <str name="additionalInputField">manu_s,description_s</str>
   <str name="outputField">vector_en</str>
   <str name="model">customLocal</str>
  </updateProcessor>
  
  <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
           processor="uuid,textToVector,remove-blank,field-name-mutating,max-fields,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Configure Custom Model

Create your custom model configuration and push it to Solr:

{
  "class": "custom.solr.llm.textvectorisation.model.CustomEmbeddingModel",
  "name": "customLocal",
  "params": {
    "endpointUrl": "http://localhost:8000/embed",
    "fieldName": "inputs"
  }
}

Push the JSON above (saved here as custom.json) to the Solr model store:

curl -XPUT 'http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store' --data-binary "@custom.json" -H 'Content-type:application/json'

Deploy the jar with the custom classes, solrcustomeembeddingmodel.jar, to Solr. This jar contains just the following classes:
CustomModel.java
LazyMultiFieldTextToVectorUpdateProcessor.java
LazyMultiFieldTextToVectorUpdateProcessorFactory.java
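
Once the model is registered and documents are indexed, a semantic query can reference the model by name. The sketch below builds such a query URL, assuming the knn_text_to_vector query parser from Solr's text-to-vector support is available in your installation and that the techproducts collection from the curl command above is used. The helper name, the fl value and the sample query text are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: same host and collection as in the curl command above.
SOLR_SELECT_URL = "http://localhost:8983/solr/techproducts/select"


def build_semantic_query_url(
    text: str,
    model: str = "customLocal",
    field: str = "vector_en",
    top_k: int = 10,
) -> str:
    """Build a select URL that lets Solr vectorize `text` with the named model."""
    local_params = f"{{!knn_text_to_vector model={model} f={field} topK={top_k}}}"
    params = {"q": local_params + text, "fl": "id,name_s,score"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_semantic_query_url("noise cancelling headphones"))
```

Solr sends the query text to the configured endpoint, so the client never has to deal with the vector itself.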

renatoh/solrCustomEmbeddingModel

Folders and files

Latest commit

History

Repository files navigation

Enhancing Semantic Search (Text to Vector) in Solr

Introduction

Additional Features

Support for Multiple Fields

Lazy Vectorization

Setup

Create Custom Service

Configure Solr for Vector Search

Configuration and Schema

Configure Custom Model

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages