feat: comprehensive multilingual word error correction system #519
base: develop
10f5a31 to ffa3e14
Nice!
I was not aware the Intl API is available in node.
Yes, that's cool. So we don't need any external dependencies for that. It just feels a bit wrong to use brute force to find out which languages are supported by Intl.
Yeah, but it only runs locally in the JS engine, and only at build time. So 🫣
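For readers wondering what "brute force" means here, the following is a minimal sketch and not the code from this PR. It assumes the discovery simply tries every two-letter language code and asks `Intl.DateTimeFormat.supportedLocalesOf` whether the runtime knows it. The function name `discoverLocales` is made up for illustration.

```javascript
// Hypothetical helper (not taken from the PR): try all two-letter
// language codes and keep the ones the runtime's Intl data supports.
function discoverLocales() {
  const letters = 'abcdefghijklmnopqrstuvwxyz';
  const found = new Set();
  for (const a of letters) {
    for (const b of letters) {
      // supportedLocalesOf returns the (canonicalized) subset of the
      // requested tags that is supported without falling back to the
      // default locale; unsupported codes yield an empty array.
      for (const tag of Intl.DateTimeFormat.supportedLocalesOf([a + b])) {
        found.add(tag);
      }
    }
  }
  return [...found];
}

const locales = discoverLocales();
console.log(`${locales.length} locales discovered`);
```

The result depends on the ICU data shipped with the JS engine, which is why the count can differ between Node versions.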
…r_correction.js``
…endencies
- Replacement for removed non-functional concept script `gen_word_error_correction.js` with production-ready implementation
- Without external dependencies
- Use native `Intl.DateTimeFormat` for date/time formatting
- Use native `Intl.DisplayNames` for dynamic language name resolution
- Add dynamic locale discovery covering 140+ languages
- Implement ambiguous word detection with warning system
These entries are now automatically generated.
…ion system details
… and ambiguous words
- Increase supported languages from 146 to 244
- Maintain low conflict rate with only one additional ambiguous word detected
fe9b4c9 to 95b6166
This replaces the non-functioning concept script (`gen_word_error_correction.js`) with a working one (`gen_word_error_correction.mjs`) that is integrated into the build process 🥳

It retrieves all month and weekday names from the Intl API and packs them, together with the manual definitions (in `word_error_correction_manual.yaml`), into `word_error_correction.yaml`.

This means that we now support over 240 languages, compared to only a handful previously 🤯 And we don't even have to worry about maintaining the strings, as they are always queried dynamically during the build 😁
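To illustrate the idea, here is a minimal sketch of how weekday and month names can be pulled from `Intl.DateTimeFormat`. It is not the code from `gen_word_error_correction.mjs`; the helper names `weekdayNames` and `monthNames` are assumptions made for this example.

```javascript
// Hypothetical helpers (not taken from the PR).

// Returns the seven weekday names for a locale, Monday first.
function weekdayNames(locale, width = 'long') {
  const fmt = new Intl.DateTimeFormat(locale, { weekday: width, timeZone: 'UTC' });
  // 2024-01-01 was a Monday, so day offsets 0..6 give Monday..Sunday.
  return Array.from({ length: 7 }, (_, i) =>
    fmt.format(new Date(Date.UTC(2024, 0, 1 + i))));
}

// Returns the twelve month names for a locale, January first.
function monthNames(locale, width = 'long') {
  const fmt = new Intl.DateTimeFormat(locale, { month: width, timeZone: 'UTC' });
  return Array.from({ length: 12 }, (_, i) =>
    fmt.format(new Date(Date.UTC(2024, i, 1))));
}

console.log(weekdayNames('en'));          // starts with "Monday"
console.log(weekdayNames('ja'));          // Japanese names, if the runtime ships that locale data
console.log(monthNames('de', 'short'));
```

Fixing `timeZone` to `'UTC'` keeps the result independent of the machine the build runs on. Passing `'short'` as the width gives the abbreviated forms.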
Example
Before, the string `月曜日-金曜日 09:00-17:00` was not usable.
Before
Prettified: not possible
Warnings:
After
Prettified:
Mo-Fr 09:00-17:00
Warnings:
Short names
With my last commit, I also added short names. However, I had to filter out ambiguous names because there were too many of them. Without filtering them out, we would receive a large number of such warnings:
I'm really glad to be creating this PR now. It took me a lot of time 😴. Since the tests look good, I'd actually like to merge it right away. But since it's quite a significant change, I'll wait a little for your feedback @ypid 🙂
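For reference, the filtering of ambiguous short names mentioned above could look roughly like the sketch below. This is an illustration under assumptions and not the PR's implementation: the `{ token, meaning }` entry shape, the function name `splitAmbiguous`, and the tokens in the example are all made up.

```javascript
// Hypothetical sketch (not taken from the PR): a token is ambiguous if
// different locales map it to different canonical values.
function splitAmbiguous(entries) {
  const meaningsByToken = new Map();
  for (const { token, meaning } of entries) {
    const key = token.toLowerCase();
    if (!meaningsByToken.has(key)) meaningsByToken.set(key, new Set());
    meaningsByToken.get(key).add(meaning);
  }
  const unambiguous = {};
  const ambiguous = [];
  for (const [token, meanings] of meaningsByToken) {
    if (meanings.size === 1) {
      unambiguous[token] = [...meanings][0];
    } else {
      ambiguous.push(token); // candidates for a warning, not for correction
    }
  }
  return { unambiguous, ambiguous };
}

// Made-up tokens, only to show the behaviour.
const result = splitAmbiguous([
  { token: 'aa', meaning: 'Mo' },
  { token: 'AA', meaning: 'Tu' }, // same token, different meaning
  { token: 'bb', meaning: 'We' },
  { token: 'bb', meaning: 'We' }, // same meaning twice is fine
]);
console.log(result);
```

Only tokens with exactly one meaning end up in the correction table; the rest can be reported once at build time instead of producing a warning per input.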