Tweaking ED pipeline

Hi,

Thanks again for the great work!

I am currently evaluating REL for entity disambiguation (ED) and comparing it against other ED techniques, chiefly [BLINK](https://github.com/facebookresearch/BLINK) from Facebook AI Research. Both take into account the context in which a mention occurs, are two-stage, and use neural approaches. BLINK does well, but it can be slow and requires a GPU to run, which is a limitation for me.

Although REL is fast and lightweight, I find that it often misses a few obvious cases. I am looking for guidance on how I can tweak REL's internals to get more accurate results.

The results below were obtained by running REL on a podcast description and a particular episode description, separated by a newline.

That is, in code:

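```python
import requests

# API_URL is the REL endpoint; podcast_summary and episode_summary are plain strings.
text_doc = podcast_summary + '\n' + episode_summary
# Empty "spans" lets REL run its own mention detection before disambiguation.
el_result = requests.post(API_URL, json={
    "text": text_doc,
    "spans": []
}).json()
```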

* For [this episode](https://podcasts.apple.com/us/podcast/304-shadi-hamid-the-problem-of-democracy/id1474687988?i=1000583791951), the mention `Shadi Hamid` is identified as `Brookings_Institution` with score `0.9991938769817352` and NER tag `PER`. This is particularly egregious: Shadi Hamid's own Wikipedia page is not returned as the first candidate.

* For [this episode](https://podcasts.apple.com/us/podcast/gossip-why-men-cheat/id1556185319?i=1000582450036), the mention `Lauren Bonner` from the podcast description is identified as `Lauren_Samuels` with score `0.9993583559989929`, even though the last names are quite different, while the mention `Ray J` is (correctly) identified as `Ray_J`, albeit with a lower score of `0.8136761486530304`.

* For [this episode](https://podcasts.apple.com/us/podcast/politricks-as-usual/id1585424238?i=1000583482355), the mention `Charlamagne Tha God` from the podcast description gets a score of only `0.7140538295110067`, even though words like `comedians, outspoken celebrities, and thought-leaders` appear in the context (which should make it easy to match his embedding, learned from his Wikipedia page, which contains similar words).

* For [this episode](https://podcasts.apple.com/us/podcast/our-unimpressive-elites/id833706616?i=1000583869886), the mention `Dave Smith` is always identified as `Dave_Smith_(engineer)` with very high confidence, even though the correct answer, `Dave_Smith_(comedian)`, appears in the candidate set and the context even contains words such as `government, foreign policy, and all things Libertarian`, which should have matched his description on [Wikipedia](https://en.wikipedia.org/wiki/Dave_Smith_(comedian)) more closely.

The last point is particularly important since `Dave Smith` is quite a common name and there are at least 4 `Dave Smith`s on Wikipedia, each with a very different description.
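To make the name-mismatch cases above concrete, here is a minimal sketch of a post-hoc sanity check that compares the mention's surface form with the predicted entity title. It is not part of REL: `name_tokens`, `surface_overlap`, and the `THRESHOLD` value are hypothetical names and choices made only for illustration, and the pairs are the (mention, predicted entity) results quoted in the list above.

```python
import re

def name_tokens(text):
    """Lowercase word tokens, dropping a Wikipedia disambiguator such as '(comedian)'."""
    text = re.sub(r"\(.*?\)", " ", text.replace("_", " "))
    return set(re.findall(r"\w+", text.lower()))

def surface_overlap(mention, entity_title):
    """Jaccard overlap between mention tokens and entity-title tokens."""
    m, e = name_tokens(mention), name_tokens(entity_title)
    if not m or not e:
        return 0.0
    return len(m & e) / len(m | e)

# (mention, predicted entity) pairs reported in this issue
pairs = [
    ("Shadi Hamid", "Brookings_Institution"),
    ("Lauren Bonner", "Lauren_Samuels"),
    ("Ray J", "Ray_J"),
    ("Dave Smith", "Dave_Smith_(engineer)"),
]

THRESHOLD = 0.5  # arbitrary cut-off, an assumption for this sketch
suspicious = [(m, e) for m, e in pairs if surface_overlap(m, e) < THRESHOLD]
print(suspicious)
```

With this cut-off, the check flags the `Shadi Hamid` and `Lauren Bonner` predictions but not `Dave Smith`, since both `Dave Smith` candidates share the same surface form. That case can only be resolved from context, which is why it matters most to me.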

Thanks!


Tweaking ED pipeline #121
