Skip to content

Conversation

ausi
Copy link
Contributor

@ausi ausi commented Dec 2, 2024

If a word in the index is longer than the index length, deletions dot not work anymore (they incorrectly count as 1 deletion + 1 insertion).

As you can see in the test case that currently fails, the word Mustermann is indexed as Muster (index length 6) and the search for Mutermann is then calculated as if we searched for Muterm. So even though Mustermann and Mutermann only have a distance of 1, you only find them with a distance of 2 or higher because Muster and Muterm have a distance of 2.

This is an issue in the original paper itself as far as I can see. But I think it can be fixed.

@Toflar
Copy link
Owner

Toflar commented Dec 2, 2024

Had to re-enable CI as it's been 6 months. Rebasing should help to have it run.

@ausi ausi force-pushed the fix/deletion-cut-off-words branch from b703bb8 to 75523f3 Compare December 2, 2024 20:16
@ausi ausi marked this pull request as draft December 7, 2024 14:05
@ausi
Copy link
Contributor Author

ausi commented Dec 7, 2024

45a893e is an attempt to fix the issue. But it fails the testResultsMatchResearchPaper test and ends up finding way too many states. I think we need to keep track of the number of deletions in $statesStarC to only allow zero-cost substitutions for these states and not all states.

@Toflar
Copy link
Owner

Toflar commented Dec 9, 2024

Yeah I don't think that many states is an acceptable solution because it will return way too many false-positives 🤔

@ausi ausi force-pushed the fix/deletion-cut-off-words branch from eda3d4e to 88f6bca Compare August 10, 2025 18:11
@ausi ausi marked this pull request as ready for review August 10, 2025 18:14
@ausi
Copy link
Contributor Author

ausi commented Aug 10, 2025

I think we need to keep track of the number of deletions in $statesStarC to only allow zero-cost substitutions for these states and not all states.

I found a simpler way to do that as the numerical range of all the states that result from words that exceed (or match) the index length is known ahead of time. We can therefore easily check if a state falls into that category and apply a zero cost deletion for them.

I am pretty certain that this is correct and that the original research paper missed handling this edge case. I added a comment to the testResultsMatchResearchPaper method that tries to explain why it has to be correct to also find the state 1869 in the given example. And I added the word "Multere" to the example to further clarify it. Are you in contact with the authors? Maybe we should inform them of our findings?

Copy link
Owner

@Toflar Toflar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this makes sense and looks like a really simple fix to the problem! ❤️

I am pretty certain that this is correct and that the original research paper missed handling this edge case. I added a comment to the testResultsMatchResearchPaper method that tries to explain why it has to be correct to also find the state 1869 in the given example. And I added the word "Multere" to the example to further clarify it. Are you in contact with the authors? Maybe we should inform them of our findings?

No, I found this paper a while ago by chance. I don't know how we could reach them and also it's not "our findings", it's your findings 😉

@ausi ausi requested a review from Toflar August 11, 2025 21:39
@Toflar Toflar merged commit 2a541bf into Toflar:main Aug 12, 2025
5 checks passed
@Toflar
Copy link
Owner

Toflar commented Aug 12, 2025

Awesome work, thank you @ausi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants