Conversation

@mworrell
Member

@mworrell mworrell commented Nov 13, 2025

This pull request introduces a new module, z_string_normalize, which provides comprehensive Unicode string normalization and transliteration to ASCII, supporting multiple languages and custom word mappings. It also adds a test to verify specific word normalization behavior. This module has been split from z_string.

In z_string, the functions to_lower/1 and to_upper/1 now use string:casefold/1 and string:uppercase/1, respectively, instead of their own character mappings.
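As a rough illustration of that delegation, here is a minimal sketch. The module name case_sketch and its two functions are hypothetical stand-ins, not the actual z_string code; only string:casefold/1 and string:uppercase/1 are real OTP functions.

```erlang
-module(case_sketch).
-export([to_lower/1, to_upper/1]).

%% Hypothetical stand-in for z_string:to_lower/1: delegate to the
%% Unicode-aware case folding in the OTP string module.
-spec to_lower(unicode:chardata()) -> unicode:chardata().
to_lower(S) ->
    string:casefold(S).

%% Hypothetical stand-in for z_string:to_upper/1.
-spec to_upper(unicode:chardata()) -> unicode:chardata().
to_upper(S) ->
    string:uppercase(S).
```

Note that case folding is intended for caseless comparison, so for some characters its result may differ from plain lowercasing with string:lowercase/1.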

String normalization and transliteration:

  • Added the new module z_string_normalize with a normalize/1 function that lowercases, sanitizes, and transliterates Unicode strings to ASCII, including support for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts. The normalization also handles HTML entities and various accented characters.
  • Implemented a mechanism to load and cache custom word mappings from a CSV file using persistent_term, allowing for efficient and customizable normalization of specific words (e.g., language-specific transliterations).
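The caching approach in the second bullet could look roughly like the sketch below. Everything here is an assumption for illustration: the module name word_map_sketch, the persistent_term key, and the CSV layout (one "from,to" pair per line) are hypothetical and not taken from z_string_normalize.

```erlang
-module(word_map_sketch).
-export([load/1, parse/1, lookup/1]).

%% Hypothetical persistent_term key for the cached word map.
-define(KEY, {?MODULE, word_map}).

%% Read the CSV once and publish the parsed map. persistent_term reads
%% are cheap but writes are expensive, so call this rarely (e.g. at startup).
-spec load(file:filename_all()) -> ok.
load(File) ->
    {ok, Bin} = file:read_file(File),
    persistent_term:put(?KEY, parse(Bin)).

%% Assumed format: one "from,to" pair per line; other lines are skipped.
-spec parse(binary()) -> #{binary() => binary()}.
parse(Bin) ->
    Lines = binary:split(Bin, [<<"\r\n">>, <<"\n">>], [global]),
    lists:foldl(fun add_line/2, #{}, Lines).

add_line(Line, Acc) ->
    case binary:split(Line, <<",">>) of
        [From, To] when From =/= <<>>, To =/= <<>> ->
            %% Keys are stored case-folded so lookups are caseless.
            Acc#{fold(From) => To};
        _ ->
            Acc
    end.

%% Return the mapped word, or undefined when there is no mapping
%% (or when load/1 has not been called yet).
-spec lookup(unicode:chardata()) -> binary() | undefined.
lookup(Word) ->
    Map = persistent_term:get(?KEY, #{}),
    maps:get(fold(Word), Map, undefined).

fold(Word) ->
    unicode:characters_to_binary(string:casefold(Word)).
```

A caller would run word_map_sketch:load/1 once with the path to the mapping file and then use lookup/1 per word before falling back to character-level transliteration.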

Testing:

  • Added a test normalize_map_words_test to ensure that the normalization correctly maps "Одесса" (in Cyrillic) to "odesa" using the custom word mapping.

@mworrell mworrell requested a review from vkatsuba November 13, 2025 11:24
@mworrell mworrell self-assigned this Nov 13, 2025
@mworrell mworrell requested review from Copilot and mmzeeman November 13, 2025 11:38
Copilot finished reviewing on behalf of mworrell November 13, 2025 11:42
Contributor

Copilot AI left a comment

Pull Request Overview

This pull request introduces a new module for comprehensive Unicode string normalization and refactors the existing case conversion functions to use Erlang's built-in Unicode-aware string functions.

Key Changes:

  • Created z_string_normalize module with normalize/1 function that performs lowercasing via string:casefold/1, sanitization, and transliteration to ASCII for multiple language scripts
  • Refactored z_string:to_lower/1 and to_upper/1 to use string:casefold/1 and string:uppercase/1 respectively, replacing custom character-by-character conversion logic
  • Added customizable word mapping system via CSV file that uses persistent_term for efficient lookups, enabling language-specific transliterations (e.g., Ukrainian city names)

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

File | Description
--- | ---
src/z_string_normalize.erl | New module implementing Unicode normalization with transliteration rules for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts, plus custom word mapping support
src/z_string.erl | Simplified to_lower/1 and to_upper/1 to delegate to built-in Erlang string functions; updated normalize/1 to call new z_string_normalize module
test/z_string_test.erl | Added test case to verify word mapping functionality (Cyrillic "Одесса" → "odesa")
priv/normalize-words-mapping.csv | CSV file containing custom word mappings for city names in multiple languages
src/z_mochinum.erl | Minor test improvement: added explicit positive sign to floating-point zero literal for clarity

@mworrell mworrell merged commit 7a8142e into master Nov 13, 2025
3 checks passed
@mworrell mworrell deleted the translit-map-names branch November 13, 2025 12:10

3 participants