Commit 9b15d44

Merge pull request #61 from telefonicasc/fix/normalizer
issue #54 - add normalizer class
2 parents 7d36f5d + 8a0341d commit 9b15d44

4 files changed

+259
-2
lines changed

python-lib/tc_etl_lib/README.md

Lines changed: 37 additions & 2 deletions
@@ -176,9 +176,14 @@ except Exception as err:
 # send entities
 cbm: tc.cb.cbManager = tc.cb.cbManager(endpoint = 'http://<cb_endpoint>:<port>')

+# (optional) the normalizer is only needed if the data used to
+# build the entity id may contain characters forbidden by NGSI
+# (accents, parentheses, etc.)
+normalize = tc.normalizer()
+
 entities = [
     {
-        "id": "myEntity1",
+        "id": normalize("myEntity1"),
         "type": "myType",
         "description": {
             "value": "My first happy entity",
@@ -194,7 +199,7 @@ entities = [
         }
     },
     {
-        "id": "myEntity2",
+        "id": normalize("myEntity2"),
         "type": "myType",
         "description": {
             "value": "My second happy entity",
@@ -325,6 +330,34 @@ The library is built from different classes depending on the functionality
         - :raises [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError): Raised when an argument is missing or a variable of the cbManager object has not been initialized, so that authentication or data submission cannot be performed.
         - :raises FetchError: Raised when the Context Broker service responds with a specific error.

+- Class `normalizer`: This class normalizes unicode strings, replacing or removing any character that is not valid as part of an NGSI entity ID.
+    - `__init__`: constructor for objects of the class.
+        - :param optional `replacement`: sets the replacement character that will substitute every forbidden character (`&`, `?`, `/`, `#`, `<`, `>`, `"`, `'`, `=`, `;`, `(`, `)`). This list of characters is taken from https://github.com/telefonicaid/fiware-orion/blob/master/doc/manuals/orion-api.md#general-syntax-restrictions
+        - :param optional `override`: dictionary of "forbidden character": "replacement character" pairs, which allows a custom replacement for particular special characters. If `""` or `None` is used as the replacement character, the forbidden character is deleted instead of replaced.
+    - `__call__`: Function that performs the replacement of special characters.
+        - :param required `text`: Text string to normalize. The normalizer returns a new string with these changes:
+            - Converts accented characters (á, é, í, ó, ú) into their unaccented variants.
+            - Removes other unicode characters not available in ascii.
+            - Removes ascii control codes.
+            - Replaces forbidden characters with the replacement character (default `-`; it can be changed with the `replacement` parameter or the overrides given in the constructor)
+            - Replaces each run of consecutive whitespace with the replacement character.
+            - NOTE: This function does not truncate the returned string to 256 characters, because the caller may want to keep the whole string, for example to store it in some other attribute, before truncating it.
+
+Some usage examples for `normalizer`:
+
+```
+# Replace spaces with "+", "url encoding" style.
+# Replace the remaining special characters with the default
+# replacement character.
+norm = tc.normalizer(override={" ": "+"})
+norm("text (with spaces)") # returns "text+-with+spaces-"
+
+# Remove all special characters outright,
+# keeping only the spaces (replaced by "-")
+norm = tc.normalizer(replacement="", override={" ": "-"})
+norm("text (with spaces)") # returns "text-with-spaces"
+```
+
 The library also provides [context managers](https://docs.python.org/3/reference/datamodel.html#context-managers) to abstract writing entities in NGSIv2 format to different backends (`store`s). These are:

 - `orionStore`: Creates a store bound to a particular `cbManager` and `authManager` instance. All entities sent to this store are stored in the corresponding cbManager.
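The README note above leaves truncation to the caller. A minimal sketch of one way to do that, assuming the 256-character limit mentioned in the note. `split_entity_id` and `MAX_ID_LENGTH` are hypothetical names used only for illustration, not part of `tc_etl_lib`, and the sample string stands in for the output of a `normalizer` call:

```python
from typing import Tuple

# Assumption: 256 is the limit referred to in the README note.
MAX_ID_LENGTH = 256


def split_entity_id(normalized: str, max_length: int = MAX_ID_LENGTH) -> Tuple[str, str]:
    """Return (entity_id, full_text).

    entity_id is the normalized string cut to max_length characters;
    full_text is the untouched normalized string, so the caller can
    still store it in another attribute.
    """
    return normalized[:max_length], normalized


# Stand-in for a long normalized string.
entity_id, full_text = split_entity_id("a" * 300)
print(len(entity_id), len(full_text))
```

Keeping both values lets the entity ID respect the limit while the complete normalized text remains available.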
@@ -442,6 +475,8 @@ TOTAL 403 221 45%

 ## Changelog

+- Add: new class `normalizer` to clean up text strings to be used as NGSI entity IDs, by replacing or removing forbidden characters ([#54](https://github.com/telefonicasc/etl-framework/pull/54))
+
 0.8.0 (March 22nd, 2023)

 - Add: new optional parameter called `replace_id` in sqlFileStore context manager ([#58](https://github.com/telefonicasc/etl-framework/pull/58))

python-lib/tc_etl_lib/tc_etl_lib/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -22,3 +22,4 @@
 from .auth import authManager
 from .cb import FetchError, cbManager
 from .store import Store, orionStore, sqlFileStore
+from .normalizer import normalizer
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+# -*- coding: utf-8 -*-
+# Copyright 2023 Telefónica Soluciones de Informática y Comunicaciones de España, S.A.U.
+#
+# This file is part of tc_etl_lib
+#
+# tc_etl_lib is free software: you can redistribute it and/or
+# modify it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+# tc_etl_lib is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero
+# General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with IoT orchestrator. If not, see http://www.gnu.org/licenses/.
+#
+
+import unicodedata
+import re
+
+from typing import Mapping, Optional
+
+_whitespace_re = re.compile(r"\s+")
+
+class normalizer:
+    """
+    Normalizer is a class that will normalize unicode strings to
+    valid NGSI entity IDs. Normalization rules are at:
+
+    https://github.com/telefonicaid/fiware-orion/blob/master/doc/manuals/orion-api.md#general-syntax-restrictions
+
+    Normalizers have a __call__ function that takes an input string and:
+
+    - Turn accented characters (á, é, í, ó, ú) into unaccented variants.
+    - Remove any other unicode character not available in ascii
+    - Remove ascii control codes
+    - Replace forbidden characters '&', '?', '/', '#', '<', '>', '"', "'", '=', ';', '(', ')'
+      with the replacement character (default "-", can be changed in the constructor)
+    - Merges consecutive whitespace and replaces it with the replacement character
+
+    You can also set a different replacement character for a specific forbidden
+    character, by adding the translation to the `override` optional argument of the
+    constructor.
+
+    E.g. if you want to replace " " with "+", you can call:
+
+    ```
+    norm = normalizer(override={" ": "+"})
+    norm("text (with spaces)")
+    ```
+
+    And you will get `"text+-with+spaces-"`.
+
+    You can also remove a forbidden character altogether, by setting its value to
+    `None` in the `override` argument. E.g. if you want to remove parentheses,
+    you can call:
+
+    ```
+    norm = normalizer(override={"(": None, ")": None})
+    norm("text (with parenthesis)")
+    ```
+
+    If you want to remove ALL special characters (except whitespace):
+
+    ```
+    norm = normalizer(replacement="", override={ " ": "-" })
+    norm("text (with & special > characters)")
+    ```
+
+    And you will get `"text-with-special-characters"`
+
+    The function does not trim the string size to 256 characters, because
+    you might want the full normalized original string to store it somewhere
+    else before truncating.
+    """
+
+    def __init__(self, replacement: str = "-", override: Optional[Mapping[str, Optional[str]]] = None):
+        """Set the default replacement string and custom override mapping"""
+        if override is None:
+            override = {}
+        forbidden_chars = {
+            "&": replacement,
+            "?": replacement,
+            "/": replacement,
+            "#": replacement,
+            "<": replacement,
+            ">": replacement,
+            '"': replacement,
+            "'": replacement,
+            "=": replacement,
+            ";": replacement,
+            "(": replacement,
+            ")": replacement
+        }
+        source = []
+        target = []
+        remove = []
+        for key, val in forbidden_chars.items():
+            custom = override.get(key, val)
+            if custom is None or custom == "":
+                remove.append(key)
+            else:
+                if len(custom) > 1:
+                    raise ValueError(f"wrong override '{custom}' for char '{key}': must be a single character")
+                source.append(key)
+                target.append(custom)
+        self.space_replacement = override.get(" ", replacement) or ""
+        self.table = str.maketrans(
+            "".join(source), "".join(target), "".join(remove))
+
+    def __call__(self, text: str) -> str:
+        """Normalize text to NGSI entity ID"""
+        global _whitespace_re
+        ascii = unicodedata.normalize('NFD', text).encode('utf-8').decode('ascii', errors='ignore')
+        without_control_chars = "".join(ch for ch in ascii if unicodedata.category(ch)[0] != "C")
+        without_specials = without_control_chars.translate(self.table).strip()
+        return _whitespace_re.sub(self.space_replacement, without_specials)
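`__init__` above relies on the three-argument form of Python's built-in `str.maketrans`. The standalone sketch below is not part of the commit, and the sample characters are chosen only for illustration. It shows how that form maps some characters and deletes others in a single `translate` call:

```python
# Characters in the first string map one-to-one onto the second string;
# characters in the third string are deleted by translate().
table = str.maketrans("&?", "--", "()")

result = "a&b?(c)".translate(table)
print(result)  # a-b-c
```

This is why the constructor splits the forbidden characters into `source`/`target` pairs and a separate `remove` list: overrides set to `None` or `""` end up in the third argument.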
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+# -*- coding: utf-8 -*-
+#
+# Copyright 2023 Telefónica Soluciones de Informática y Comunicaciones de España, S.A.U.
+#
+# This file is part of tc_etl_lib
+#
+# tc_etl_lib is free software: you can redistribute it and/or
+# modify it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+# tc_etl_lib is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero
+# General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with IoT orchestrator. If not, see http://www.gnu.org/licenses/.
+
+'''
+Normalizer tests
+'''
+
+import unittest
+from tc_etl_lib import normalizer
+
+
+class TestNormalizer(unittest.TestCase):
+    '''Tests for normalizer'''
+
+    def do_test(self, replacement="-", override=None, input="", expected=""):
+        '''test Normalizer with the given options dict'''
+        norm = normalizer(replacement=replacement, override=override)
+        result = norm(input)
+        self.assertEqual(result, expected)
+
+    def test_default_behaviour(self):
+        self.do_test(
+            replacement="-",
+            override=None,
+            input="text (with & specials) > áéíóúñ",
+            expected="text--with---specials----aeioun"
+        )
+
+    def test_different_replacement(self):
+        self.do_test(
+            replacement="_",
+            override=None,
+            input="text (with & specials) > áéíóúñ",
+            expected="text__with___specials____aeioun"
+        )
+
+    def test_space_override(self):
+        self.do_test(
+            replacement="-",
+            override={" ": "+"},
+            input="text (with & specials) > áéíóúñ",
+            expected="text+-with+-+specials-+-+aeioun"
+        )
+
+    def test_forbidden_override(self):
+        self.do_test(
+            replacement="-",
+            override={">": "+"},
+            input="text (with & specials) > áéíóúñ",
+            expected="text--with---specials--+-aeioun"
+        )
+
+    def test_forbidden_remove_some(self):
+        self.do_test(
+            replacement="-",
+            override= {
+                "(": None,
+                ")": None,
+            },
+            input="text (with & specials) > áéíóúñ",
+            expected="text-with---specials---aeioun"
+        )
+
+    def test_space_remove(self):
+        self.do_test(
+            replacement="-",
+            override= {" ": None },
+            input="text (with & specials) > áéíóúñ",
+            expected="text-with-specials--aeioun"
+        )
+
+    def test_forbidden_remove_all(self):
+        self.do_test(
+            replacement="",
+            override={ " ": "-" },
+            input="text (with & specials) > áéíóúñ",
+            expected="text-with-specials-aeioun"
+        )
+
+    def test_remove_all(self):
+        self.do_test(
+            replacement="",
+            override=None,
+            input="text (with & specials) > áéíóúñ",
+            expected="textwithspecialsaeioun"
+        )
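The expected strings in these tests contain runs of `-` that can look surprising at first. The sketch below retraces the default case by hand, using only the standard library rather than the `normalizer` class itself. It is an illustrative approximation of the stages (the control-character filter is omitted because the sample has none), not the library's code:

```python
import re
import unicodedata

text = "text (with & specials) > áéíóúñ"

# 1. Decompose accented letters (NFD) and drop everything that is not ASCII,
#    which discards the combining accent marks.
decomposed = unicodedata.normalize("NFD", text)
ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")

# 2. Replace each forbidden character present in the sample with "-".
table = str.maketrans("&()>", "----")
translated = ascii_only.translate(table).strip()

# 3. Replace each run of whitespace with "-".
result = re.sub(r"\s+", "-", translated)

print(ascii_only)
print(translated)
print(result)
```

Forbidden characters and whitespace runs are replaced independently, so a sequence such as `" ("` yields two replacement characters in a row.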
