Use Erlang string case folding. Add word mappings to normalize #111
Pull Request Overview
This pull request introduces a new module for comprehensive Unicode string normalization and refactors the existing case conversion functions to use Erlang's built-in Unicode-aware string functions.
Key Changes:
- Created `z_string_normalize` module with `normalize/1` function that performs lowercasing via `string:casefold/1`, sanitization, and transliteration to ASCII for multiple language scripts
- Refactored `z_string:to_lower/1` and `to_upper/1` to use `string:casefold/1` and `string:uppercase/1` respectively, replacing custom character-by-character conversion logic
- Added customizable word mapping system via CSV file that uses `persistent_term` for efficient lookups, enabling language-specific transliterations (e.g., Ukrainian city names)
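
The delegation described in the second bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the module name `casefold_sketch` is hypothetical, and the real `z_string` functions may accept more input types or do extra work.

```erlang
-module(casefold_sketch).
-export([to_lower/1, to_upper/1]).

%% Hypothetical stand-ins for z_string:to_lower/1 and z_string:to_upper/1.
%% Both simply delegate to the Unicode-aware functions in OTP's string module.

-spec to_lower(unicode:chardata()) -> unicode:chardata().
to_lower(S) ->
    %% casefold/1 is intended for caseless matching, so it can differ from
    %% string:lowercase/1: for example, it expands the sharp s to "ss".
    string:casefold(S).

-spec to_upper(unicode:chardata()) -> unicode:chardata().
to_upper(S) ->
    string:uppercase(S).
```

Because case folding targets comparison rather than display, callers that need a display-quality lowercase form may prefer `string:lowercase/1`; for normalization keys, folding is usually the better fit.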
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `src/z_string_normalize.erl` | New module implementing Unicode normalization with transliteration rules for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts, plus custom word mapping support |
| `src/z_string.erl` | Simplified `to_lower/1` and `to_upper/1` to delegate to built-in Erlang string functions; updated `normalize/1` to call new `z_string_normalize` module |
| `test/z_string_test.erl` | Added test case to verify word mapping functionality (Cyrillic "Одесса" → "odesa") |
| `priv/normalize-words-mapping.csv` | CSV file containing custom word mappings for city names in multiple languages |
| `src/z_mochinum.erl` | Minor test improvement: added explicit positive sign to floating-point zero literal for clarity |
Co-authored-by: Copilot <[email protected]>
This pull request introduces a new module, `z_string_normalize`, which provides comprehensive Unicode string normalization and transliteration to ASCII, supporting multiple languages and custom word mappings. It also adds a test to verify specific word normalization behavior. This module has been split from `z_string`.

In `z_string` the functions `to_lower/1` and `to_upper/1` now use `string:casefold/1` and `string:uppercase/1` instead of their own mappings.

String normalization and transliteration:
- Added `z_string_normalize` with a `normalize/1` function that lowercases, sanitizes, and transliterates Unicode strings to ASCII, including support for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts. The normalization also handles HTML entities and various accented characters.
- Added a customizable word mapping, loaded from a CSV file and stored in `persistent_term`, allowing for efficient and customizable normalization of specific words (e.g., language-specific transliterations).

Testing:
- Added `normalize_map_words_test` to ensure that the normalization correctly maps "Одесса" (in Cyrillic) to "odesa" using the custom word mapping.
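
The word-mapping lookup could look roughly like the sketch below. Everything here is an assumption for illustration: the module name `word_map_sketch`, the `persistent_term` key, and the map layout are hypothetical, and CSV parsing is left out, so the pairs are passed in directly rather than read from `priv/normalize-words-mapping.csv`.

```erlang
-module(word_map_sketch).
-export([load/1, map_word/1]).

%% Hypothetical key; the key used by the real module is not shown in the PR text.
-define(KEY, {?MODULE, words}).

%% Store {From, To} pairs as one map in persistent_term. Reads are cheap,
%% but updates are expensive, so this suits data loaded once at startup.
-spec load([{binary(), binary()}]) -> ok.
load(Pairs) ->
    persistent_term:put(?KEY, maps:from_list(Pairs)).

%% Look up an already case-folded word; return it unchanged when unmapped.
-spec map_word(binary()) -> binary().
map_word(Word) ->
    Map = persistent_term:get(?KEY, #{}),
    maps:get(Word, Map, Word).
```

Under these assumptions, after `word_map_sketch:load([{<<"одесса"/utf8>>, <<"odesa">>}])`, calling `word_map_sketch:map_word(string:casefold(<<"Одесса"/utf8>>))` yields `<<"odesa">>`, while a word without a mapping passes through untouched and would go on to the regular transliteration rules.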