Issue with deletion in long words #8

ausi · 2024-12-02T17:29:22Z

If a word in the index is longer than the index length, deletions do not work anymore (they are incorrectly counted as 1 deletion + 1 insertion).

As you can see in the test case that currently fails, the word Mustermann is indexed as Muster (index length 6) and the search for Mutermann is then calculated as if we searched for Muterm. So even though Mustermann and Mutermann only have a distance of 1, you only find them with a distance of 2 or higher because Muster and Muterm have a distance of 2.
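The distances mentioned above can be checked with a plain Levenshtein implementation. This is only an illustrative sketch in Python, not the library's PHP code; the helper `levenshtein` and the constant `INDEX_LENGTH` are names made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # drop a character of a
                curr[j - 1] + 1,           # add a character of b
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


INDEX_LENGTH = 6  # index length used in the failing test case (per the description)
indexed, query = "Mustermann", "Mutermann"

# Distance between the full words.
full_distance = levenshtein(indexed, query)

# Distance when both words are cut off at the index length: "Muster" vs "Muterm".
cut_distance = levenshtein(indexed[:INDEX_LENGTH], query[:INDEX_LENGTH])

# Best distance between the cut-off indexed word and any prefix of the query.
best_prefix_distance = min(
    levenshtein(indexed[:INDEX_LENGTH], query[:k]) for k in range(len(query) + 1)
)

print(full_distance, cut_distance, best_prefix_distance)  # 1 2 1
```

The last value suggests one way to read the problem: the cut-off word Muster is only one edit away from the query prefix Muter, and the second edit appears only because the query is forced to the same length as the index entry.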

This is an issue in the original paper itself as far as I can see. But I think it can be fixed.

Toflar · 2024-12-02T20:12:32Z

Had to re-enable CI as it's been 6 months. Rebasing should help to have it run.

ausi · 2024-12-07T14:08:50Z

45a893e is an attempt to fix the issue. But it fails the testResultsMatchResearchPaper test and ends up finding way too many states. I think we need to keep track of the number of deletions in $statesStarC to only allow zero-cost substitutions for these states and not all states.

Toflar · 2024-12-09T09:21:34Z

Yeah I don't think that many states is an acceptable solution because it will return way too many false-positives 🤔

ausi · 2025-08-10T18:41:40Z

> I think we need to keep track of the number of deletions in $statesStarC to only allow zero-cost substitutions for these states and not all states.

I found a simpler way to do that, as the numerical range of all the states that result from words that exceed (or match) the index length is known ahead of time. We can therefore easily check whether a state falls into that category and apply a zero-cost deletion to it.
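A rough sketch of such a range check, in Python rather than the library's PHP. It rests on an assumption that is not stated in this thread: that a state id is derived as `parent * alphabet_size + label` with labels in `1..alphabet_size` and the start state being 0. The function names and the parameter values below are hypothetical and chosen only for illustration.

```python
def depth_state_range(alphabet_size: int, depth: int) -> tuple[int, int]:
    """Inclusive range of state ids reachable after exactly `depth` characters.

    Assumes state = parent * alphabet_size + label, labels 1..alphabet_size,
    start state 0 (an assumption for this sketch).
    """
    lower = 0
    upper = 0
    for _ in range(depth):
        lower = lower * alphabet_size + 1              # always the smallest label
        upper = upper * alphabet_size + alphabet_size  # always the largest label
    return lower, upper


def is_cut_off_state(state: int, alphabet_size: int, index_length: int) -> bool:
    """True if `state` belongs to a word at least as long as the index length."""
    lower, upper = depth_state_range(alphabet_size, index_length)
    return lower <= state <= upper


# Hypothetical parameters: alphabet size 4, index length 6.
print(depth_state_range(4, 6))       # (1365, 5460)
print(is_cut_off_state(1364, 4, 6))  # False: largest id one level above
print(is_cut_off_state(1365, 4, 6))  # True
```

Under this numbering the ranges of consecutive depths are adjacent, and no state lies deeper than the index length, so comparing against the lower bound alone would be enough.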

I am pretty certain that this is correct and that the original research paper missed handling this edge case. I added a comment to the testResultsMatchResearchPaper method that tries to explain why it has to be correct to also find the state 1869 in the given example. And I added the word "Multere" to the example to further clarify it. Are you in contact with the authors? Maybe we should inform them of our findings?

Toflar

Wow, this makes sense and looks like a really simple fix to the problem! ❤️

> I am pretty certain that this is correct and that the original research paper missed handling this edge case. I added a comment to the testResultsMatchResearchPaper method that tries to explain why it has to be correct to also find the state 1869 in the given example. And I added the word "Multere" to the example to further clarify it. Are you in contact with the authors? Maybe we should inform them of our findings?

No, I found this paper a while ago by chance. I don't know how we could reach them and also it's not "our findings", it's your findings 😉

src/StateSetIndex.php

Toflar · 2025-08-12T07:19:07Z

Awesome work, thank you @ausi!

Test deletion in long words

75523f3

ausi force-pushed the fix/deletion-cut-off-words branch from b703bb8 to 75523f3 December 2, 2024 20:16

ausi added 2 commits December 7, 2024 14:32

Merge branch main into fix/deletion-cut-off-words

0aa64b8

Attempt to fix the cut off word issue

45a893e

ausi marked this pull request as draft December 7, 2024 14:05

ausi mentioned this pull request Jul 30, 2025

Something is wrong when adjusting index length and alphabet size #13

Closed

ausi added 3 commits August 10, 2025 19:59

Fix bug with cut-off words

3f0f670

Merge branch main into fix/deletion-cut-off-words

d414012

Remove unnecessary index check

88f6bca

ausi force-pushed the fix/deletion-cut-off-words branch from eda3d4e to 88f6bca August 10, 2025 18:11

ausi marked this pull request as ready for review August 10, 2025 18:14

Toflar reviewed Aug 11, 2025


src/StateSetIndex.php (outdated, resolved)

ausi added 2 commits August 11, 2025 23:26

Only calculate the lower bound if needed

2bd1a91

Merge branch main into fix/deletion-cut-off-words

8b57650

ausi requested a review from Toflar August 11, 2025 21:39

Toflar approved these changes Aug 12, 2025


Toflar merged commit 2a541bf into Toflar:main Aug 12, 2025
5 checks passed
