Skip to content

Salesforce and Google Drive mappings #3613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Conversation

mattnowzari
Copy link
Contributor

Part of https://github.com/elastic/search-team/issues/10779 and https://github.com/elastic/search-team/issues/10780

This PR adds two explicit mappings for Salesforce and Google Drive connectors under the agent folder.
This is preliminary work to eventually have agentless connectors create fresh indices using these mappings.

Checklists

Pre-Review Checklist

  • this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • this PR has a thorough description

@@ -0,0 +1 @@
3.11.11
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh! I will delete this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just got this added to the .gitignore a few days ago

@@ -0,0 +1,744 @@
{
"mappings": {
"_meta": {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expound on this at all? Something like: it can contain support cases, sales opportunities...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer - yes.

Long answer:
I initially thought _meta fields were beholden to the 50 char limit that field-level metadata has, but turns out they're not! Abhi's experiments have shown that long-form mapping descriptions do have benefit, so expanding on top-level _meta fields is going to be easy.

The per-field char limit should probably be increased to match - Sean has a draft PR here that has been on my radar to possibly pick up and push through.

"properties": {
"Id": {
"type": "keyword"
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird that we use different casing for these. But, I imagine you inherited this schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is the data shape that comes from our Salesforce connector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an instance where we've previously wanted to maintain the output of a given source so that it feels familiar to the customer (Maybe they know Account.Id), but we can afford to be more opinionated now.

I'm not saying these shouldn't be capitalized - just don't feel beholden to what comes out of our Salesforce connector. We can rename fields with Ingest Pipelines.

}
}
},
"body": {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we use sparse embedding here and semantic text in google drive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, both are semantic_text.

model_settings.task_type : sparse_embedding is the a default setting for semantic_text that gets set when you create an index and have not set it to something else manually. So even if we didn't specify this, I think Kibana will automatically populate it to this unless we manually specify something else.

So, a mapping that is written as...

{
  "mappings": {
    "properties": {
      "field1": {
        "type": "semantic_text"
      }
    }
  }
}

...gets turned into the following when the index is created:

{
  "mappings": {
    "properties": {
      "field1": {
        "inference_id": ".elser-2-elasticsearch",
        "model_settings": {
          "service": "elasticsearch",
          "task_type": "sparse_embedding"
        },
        "type": "semantic_text"
      }
    }
  }
}

This is the behavior I've seen at any rate. It would be worth making sure the mappings are consistent in how semantic_text field are defined, so it's a good callout regardless.

As always, it might also be worth researching to see if there is any fine-tuning we can do with the settings!

Copy link
Member

@seanstory seanstory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress! A few loosely held suggestions.

"mime_type": {
"type": "keyword"
},
"name": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest title, as I think that might be more common across data sources. We don't have to use a shared schema, but if we see opportunities to create overlap, I don't think it'll hurt.

Comment on lines +32 to +34
"type": "text",
"meta": {
"description": "This is the name of a document."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grey area, but I'd suggest semantic_text for file names/titles. Like a file might be "Earnings Report.pdf", but it would be great for it to show up in a search for "Quarterly profits"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though - maybe we want both? I could see someone also expecting exact-match to work.

},
"type": "text",
"meta": {
"description": "A document with this field is from a shared drive"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it sound like it should be a boolean. Is this the name of its shared drive?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're missing meta.description values for a lot of the properties.

"properties": {
"Id": {
"type": "keyword"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an instance where we've previously wanted to maintain the output of a given source so that it feels familiar to the customer (Maybe they know Account.Id), but we can afford to be more opinionated now.

I'm not saying these shouldn't be capitalized - just don't feel beholden to what comes out of our Salesforce connector. We can rename fields with Ingest Pipelines.

Comment on lines +188 to +192
"inference_id": ".elser-2-elasticsearch",
"model_settings": {
"service": "elasticsearch",
"task_type": "sparse_embedding"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd scrap all this. If we leave just "type": "semantic_text", then when the relevance team changes the default model/settings, new indices will pick up the new defaults. Less maintenance for us.

Comment on lines +236 to +244
"ConvertedAccount": {
"type": "object"
},
"ConvertedContact": {
"type": "object"
},
"ConvertedOpportunity": {
"type": "object"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't really know what these are. If we don't know what they are and don't know why we'd want to search them, let's not even index them. We can drop this stuff in an ingest pipeline.

Comment on lines +289 to +294
"BccAddress": {
"type": "text"
},
"CcAddress": {
"type": "text"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think email addresses should be keyword, since we probably only care about exact match when searching. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mapping is huge. I'd suggest that we try to pair down how much we keep for WorkChat. We want it to have plenty of context, and plenty of ways to filter its search results, but if we can't imagine a realistic way that we might leverage a field for search or for valuable result insights - let's remove it for now. Adding fields later will be much easier than deleting fields later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants