Use Erlang string case folding. Add word mappings to normalize #111

mworrell · 2025-11-13T11:23:54Z

This pull request introduces a new module, z_string_normalize, which provides comprehensive Unicode string normalization and transliteration to ASCII, supporting multiple languages and custom word mappings. It also adds a test to verify specific word normalization behavior. This module has been split from z_string.

In z_string the functions to_lower/1 and to_upper/1 now use string:casefold/1 and string:uppercase/1 instead of their own mappings.
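As a rough illustration of that delegation, the sketch below wraps the two OTP functions. The module name `casefold_sketch` and the choice to always return a binary are assumptions for the example, not the actual z_string code.

```erlang
-module(casefold_sketch).
-export([to_lower/1, to_upper/1]).

%% Hypothetical stand-ins for z_string:to_lower/1 and z_string:to_upper/1.
%% Returning a binary is an assumption made for this sketch.

-spec to_lower(unicode:chardata()) -> binary().
to_lower(S) ->
    %% string:casefold/1 is Unicode-aware and intended for caseless
    %% comparison; for example it folds the sharp s to "ss".
    unicode:characters_to_binary(string:casefold(S)).

-spec to_upper(unicode:chardata()) -> binary().
to_upper(S) ->
    unicode:characters_to_binary(string:uppercase(S)).
```

Note that case folding is not identical to lowercasing: `casefold_sketch:to_lower(<<"Straße"/utf8>>)` yields `<<"strasse">>`, which is usually what normalization for search and comparison wants.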

String normalization and transliteration:

- Added the new module z_string_normalize with a normalize/1 function that lowercases, sanitizes, and transliterates Unicode strings to ASCII, including support for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts. The normalization also handles HTML entities and various accented characters.
- Implemented a mechanism to load and cache custom word mappings from a CSV file using persistent_term, allowing for efficient and customizable normalization of specific words (e.g., language-specific transliterations).
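The caching idea can be sketched as follows. This is a minimal example under stated assumptions: the module name `word_map_sketch`, the persistent_term key, and the two-column `from,to` CSV layout are all hypothetical, and the real module may parse and store the mapping differently.

```erlang
-module(word_map_sketch).
-export([load/1, lookup/1]).

%% Hypothetical persistent_term key for the cached mapping.
-define(KEY, {?MODULE, word_map}).

%% Parse CSV data (assumed layout: one "from,to" pair per line) and
%% cache the resulting map. persistent_term reads are cheap, writes are
%% expensive, so this should run once, not per lookup.
-spec load(binary()) -> ok.
load(Csv) when is_binary(Csv) ->
    Lines = binary:split(Csv, [<<"\r\n">>, <<"\n">>], [global, trim_all]),
    Map = lists:foldl(fun add_line/2, #{}, Lines),
    persistent_term:put(?KEY, Map).

%% Return the mapped word, or the case-folded word if no mapping exists.
-spec lookup(binary()) -> binary().
lookup(Word) when is_binary(Word) ->
    Folded = fold(Word),
    maps:get(Folded, persistent_term:get(?KEY, #{}), Folded).

add_line(Line, Acc) ->
    case binary:split(Line, <<",">>) of
        [From, To] -> Acc#{ fold(From) => To };
        _Other -> Acc   % skip lines that are not a two-column pair
    end.

fold(Bin) ->
    unicode:characters_to_binary(string:casefold(Bin)).
```

With this sketch, `word_map_sketch:load(<<"одесса,odesa\n"/utf8>>)` followed by `word_map_sketch:lookup(<<"Одесса"/utf8>>)` returns `<<"odesa">>`, mirroring the mapping exercised by the test in this pull request.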

Testing:

Added a test normalize_map_words_test to ensure that the normalization correctly maps "Одесса" (in Cyrillic) to "odesa" using the custom word mapping.

Copilot

Pull Request Overview

This pull request introduces a new module for comprehensive Unicode string normalization and refactors the existing case conversion functions to use Erlang's built-in Unicode-aware string functions.

Key Changes:

- Created z_string_normalize module with normalize/1 function that performs lowercasing via string:casefold/1, sanitization, and transliteration to ASCII for multiple language scripts
- Refactored z_string:to_lower/1 and to_upper/1 to use string:casefold/1 and string:uppercase/1 respectively, replacing custom character-by-character conversion logic
- Added customizable word mapping system via CSV file that uses persistent_term for efficient lookups, enabling language-specific transliterations (e.g., Ukrainian city names)

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `src/z_string_normalize.erl` | New module implementing Unicode normalization with transliteration rules for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts, plus custom word mapping support |
| `src/z_string.erl` | Simplified `to_lower/1` and `to_upper/1` to delegate to built-in Erlang string functions; updated `normalize/1` to call new `z_string_normalize` module |
| `test/z_string_test.erl` | Added test case to verify word mapping functionality (Cyrillic "Одесса" → "odesa") |
| `priv/normalize-words-mapping.csv` | CSV file containing custom word mappings for city names in multiple languages |
| `src/z_mochinum.erl` | Minor test improvement: added explicit positive sign to floating-point zero literal for clarity |


src/z_string_normalize.erl

Co-authored-by: Copilot <[email protected]>

Use Erlang string case folding. Add word mappings to normalize

110d919

mworrell requested a review from vkatsuba November 13, 2025 11:24

mworrell self-assigned this Nov 13, 2025

mworrell added 4 commits November 13, 2025 12:28

Add alternative lookup of word mapping file, for CI

bca4145

Add missing priv file

785b6bc

Disambiguate 0.0

a62c353

Fix word split

e26e47b

mworrell requested review from Copilot and mmzeeman November 13, 2025 11:38

Copilot started reviewing on behalf of mworrell November 13, 2025 11:38

Copilot finished reviewing on behalf of mworrell November 13, 2025 11:42

vkatsuba approved these changes Nov 13, 2025


Copilot AI reviewed Nov 13, 2025


mworrell and others added 4 commits November 13, 2025 12:47

Update src/z_string_normalize.erl

78da3b2

Co-authored-by: Copilot <[email protected]>

Update src/z_string_normalize.erl

26374d6

Co-authored-by: Copilot <[email protected]>

Update src/z_string_normalize.erl

8c5b6dd

Co-authored-by: Copilot <[email protected]>

Update src/z_string_normalize.erl

92c7db5

Co-authored-by: Copilot <[email protected]>

mworrell mentioned this pull request Nov 13, 2025

Full text search: use websearch_to_tsquery and normalize texts zotonic/zotonic#4211

Open

3 tasks

mworrell merged commit 7a8142e into master Nov 13, 2025
3 checks passed

mworrell deleted the translit-map-names branch November 13, 2025 12:10


3 participants
