-
Notifications
You must be signed in to change notification settings - Fork 174
Salesforce and Google Drive mappings #3613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -0,0 +1 @@ | |||
3.11.11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh! I will delete this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just got this added to the .gitignore
a few days ago
@@ -0,0 +1,744 @@ | |||
{ | |||
"mappings": { | |||
"_meta": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to expound on this at all? Something like: it can contain support cases, sales opportunities...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short answer - yes.
Long answer:
I initially thought _meta
fields were beholden to the 50 char limit that field-level metadata has, but turns out they're not! Abhi's experiments have shown that long-form mapping descriptions do have benefit, so expanding on top-level _meta
fields is going to be easy.
The per-field char limit should probably be increased to match - Sean has a draft PR here that has been on my radar to possibly pick up and push through.
"properties": { | ||
"Id": { | ||
"type": "keyword" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird that we use different casing for these. But, I imagine you inherited this schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is the data shape that comes from our Salesforce connector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an instance where we've previously wanted to maintain the output of a given source so that it feels familiar to the customer (Maybe they know Account.Id
), but we can afford to be more opinionated now.
I'm not saying these shouldn't be capitalized - just don't feel beholden to what comes out of our Salesforce connector. We can rename fields with Ingest Pipelines.
} | ||
} | ||
}, | ||
"body": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why we use sparse embedding here and semantic text in google drive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, both are semantic_text
.
model_settings.task_type : sparse_embedding
is the a default setting for semantic_text
that gets set when you create an index and have not set it to something else manually. So even if we didn't specify this, I think Kibana will automatically populate it to this unless we manually specify something else.
So, a mapping that is written as...
{
"mappings": {
"properties": {
"field1": {
"type": "semantic_text"
}
}
}
}
...gets turned into the following when the index is created:
{
"mappings": {
"properties": {
"field1": {
"inference_id": ".elser-2-elasticsearch",
"model_settings": {
"service": "elasticsearch",
"task_type": "sparse_embedding"
},
"type": "semantic_text"
}
}
}
}
This is the behavior I've seen at any rate. It would be worth making sure the mappings are consistent in how semantic_text
field are defined, so it's a good callout regardless.
As always, it might also be worth researching to see if there is any fine-tuning we can do with the settings!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great progress! A few loosely held suggestions.
"mime_type": { | ||
"type": "keyword" | ||
}, | ||
"name": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might suggest title
, as I think that might be more common across data sources. We don't have to use a shared schema, but if we see opportunities to create overlap, I don't think it'll hurt.
"type": "text", | ||
"meta": { | ||
"description": "This is the name of a document." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grey area, but I'd suggest semantic_text
for file names/titles. Like a file might be "Earnings Report.pdf", but it would be great for it to show up in a search for "Quarterly profits"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though - maybe we want both? I could see someone also expecting exact-match to work.
}, | ||
"type": "text", | ||
"meta": { | ||
"description": "A document with this field is from a shared drive" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it sound like it should be a boolean. Is this the name of its shared drive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're missing meta.description values for a lot of the properties.
"properties": { | ||
"Id": { | ||
"type": "keyword" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an instance where we've previously wanted to maintain the output of a given source so that it feels familiar to the customer (Maybe they know Account.Id
), but we can afford to be more opinionated now.
I'm not saying these shouldn't be capitalized - just don't feel beholden to what comes out of our Salesforce connector. We can rename fields with Ingest Pipelines.
"inference_id": ".elser-2-elasticsearch", | ||
"model_settings": { | ||
"service": "elasticsearch", | ||
"task_type": "sparse_embedding" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd scrap all this. If we leave just "type": "semantic_text"
, then when the relevance team changes the default model/settings, new indices will pick up the new defaults. Less maintenance for us.
"ConvertedAccount": { | ||
"type": "object" | ||
}, | ||
"ConvertedContact": { | ||
"type": "object" | ||
}, | ||
"ConvertedOpportunity": { | ||
"type": "object" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we don't really know what these are. If we don't know what they are and don't know why we'd want to search them, let's not even index them. We can drop this stuff in an ingest pipeline.
"BccAddress": { | ||
"type": "text" | ||
}, | ||
"CcAddress": { | ||
"type": "text" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd think email addresses should be keyword
, since we probably only care about exact match when searching. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mapping is huge. I'd suggest that we try to pair down how much we keep for WorkChat. We want it to have plenty of context, and plenty of ways to filter its search results, but if we can't imagine a realistic way that we might leverage a field for search or for valuable result insights - let's remove it for now. Adding fields later will be much easier than deleting fields later.
Part of https://github.com/elastic/search-team/issues/10779 and https://github.com/elastic/search-team/issues/10780
This PR adds two explicit mappings for Salesforce and Google Drive connectors under the
agent
folder.This is preliminary work to eventually have agentless
connectors
create fresh indices using these mappings.Checklists
Pre-Review Checklist
config.yml.example
)