Issue with deletion in long words #8
Had to re-enable CI as it's been 6 months. Rebasing should help to have it run.
45a893e is an attempt to fix the issue. But it fails the |
Yeah, I don't think that many states is an acceptable solution because it will return way too many false positives 🤔
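I found a simpler way to do that, as the numerical range of all the states that result from words that exceed (or match) the index length is known ahead of time. We can therefore easily check if a state falls into that category and apply a zero-cost deletion for them. I am pretty certain that this is correct and that the original research paper missed handling this edge case. I added a comment to the testResultsMatchResearchPaper method that tries to explain why it has to be correct to also find the state 1869 in the given example. And I added the word "Multere" to the example to further clarify it. Are you in contact with the authors? Maybe we should inform them of our findings?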
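For illustration only, here is a minimal Python sketch of why such a range can be known ahead of time. It assumes a hypothetical state encoding (not confirmed by this thread) in which every character is mapped to a label between 1 and the alphabet size, and a state is built up as state * alphabet_size + label. The function names and parameters are made up for this sketch and are not the project's API.

```python
def first_state_of_length(alphabet_size: int, length: int) -> int:
    """Smallest state reachable after exactly `length` labels.

    Under the assumed encoding this is 1 + s + s^2 + ... + s^(length - 1).
    """
    return sum(alphabet_size ** i for i in range(length))


def state_of(labels: list[int], alphabet_size: int) -> int:
    """Fold a sequence of labels (each in 1..alphabet_size) into one state."""
    state = 0
    for label in labels:
        state = state * alphabet_size + label
    return state


def is_full_length_state(state: int, alphabet_size: int, index_length: int) -> bool:
    """True if `state` can only come from a word that matched or exceeded
    the index length, i.e. a word that was consumed up to `index_length`."""
    low = first_state_of_length(alphabet_size, index_length)
    high = first_state_of_length(alphabet_size, index_length + 1) - 1
    return low <= state <= high


# Hypothetical parameters: alphabet size 4, index length 6.
print(is_full_length_state(state_of([1, 2, 4, 4, 3, 1], 4), 4, 6))  # six labels
print(is_full_length_state(state_of([1, 2, 4, 4, 3], 4), 4, 6))     # five labels
```

With such a check in place, the zero-cost deletion would only be applied to states inside that range. How the deletion itself is wired into the search is not shown here.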
Wow, this makes sense and looks like a really simple fix to the problem! ❤️
I am pretty certain that this is correct and that the original research paper missed handling this edge case. I added a comment to the testResultsMatchResearchPaper method that tries to explain why it has to be correct to also find the state 1869 in the given example. And I added the word "Multere" to the example to further clarify it. Are you in contact with the authors? Maybe we should inform them of our findings?
No, I found this paper a while ago by chance. I don't know how we could reach them and also it's not "our findings", it's your findings 😉
Awesome work, thank you @ausi!
If a word in the index is longer than the index length, deletions do not work anymore (they incorrectly count as 1 deletion + 1 insertion).
As you can see in the test case that currently fails, the word Mustermann is indexed as Muster (index length 6) and the search for Mutermann is then calculated as if we searched for Muterm. So even though Mustermann and Mutermann only have a distance of 1, you only find them with a distance of 2 or higher because Muster and Muterm have a distance of 2.
This is an issue in the original paper itself as far as I can see. But I think it can be fixed.
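The distances in this example can be checked with a plain Levenshtein implementation. The sketch below is a generic textbook dynamic-programming version in Python, not the project's code. The `truncate` helper and the `INDEX_LENGTH` constant are hypothetical names used only to mimic cutting words at the index length.

```python
INDEX_LENGTH = 6  # index length used in the example above


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or match
            ))
        previous = current
    return previous[-1]


def truncate(word: str) -> str:
    """Hypothetical helper: keep only the first INDEX_LENGTH characters."""
    return word[:INDEX_LENGTH]


full = levenshtein("Mustermann", "Mutermann")
cut = levenshtein(truncate("Mustermann"), truncate("Mutermann"))
print(full, cut)  # prints "1 2"
```

The single deletion of "s" shifts the following characters to the left, so the truncated query gains an extra trailing "m" that the truncated index entry does not have. That is where the additional insertion comes from.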