Merged
268 changes: 268 additions & 0 deletions examples/Text Normalization using NLPurify.ipynb
@@ -0,0 +1,268 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c4a37fef",
"metadata": {},
"source": [
"<h1 align = \"center\">Text Normalization</h1>\n",
"\n",
"---\n",
"In the world of Natural Language Processing (NLP), we work with human language. However, human language is inherently messy, varied, and full of nuances that can be confusing for computers. Text normalization is the foundational process of cleaning and standardizing raw text into a consistent, predictable format. Think of it as tidying up a chaotic room before you can find anything; we are tidying up language so a machine learning model can understand it.\n",
"\n",
"The primary goal is to reduce the randomness in text by grouping different variations of a word or phrase into a single, canonical form. For example, to a computer, the words \"run,\" \"Run,\" and \"running\" are three distinct items. Text normalization ensures these are all recognized as the same core concept, simplifying the data for NLP models. This preprocessing step is crucial for the success of almost all major NLP tasks, including search engines, sentiment analysis, and machine translation.\n",
"\n",
"**Why is it so Important?**\n",
"\n",
" * **Improved Model Performance:** Clean, standardized data helps models learn more effectively, which often improves accuracy.\n",
" * **Reduced Complexity:** It shrinks the vocabulary the model needs to learn, which reduces computational cost and memory usage.\n",
" * **Enhanced Feature Extraction:** When different forms of a word are treated as a single feature, that feature is observed more often, so statistics estimated from it are more reliable."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "526847b9",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:26:13.253923Z",
"start_time": "2025-10-23T09:26:13.230466Z"
}
},
"outputs": [],
"source": [
"import os # miscellaneous os interfaces\n",
"import sys # configuring python runtime environment"
]
},
{
"cell_type": "markdown",
"id": "e7157cc7",
"metadata": {},
"source": [
"## NLP Libraries\n",
"\n",
"Python offers a rich ecosystem of libraries for Natural Language Processing (NLP), catering to various needs from foundational tasks to advanced deep learning models. Here are some of the most prominent ones:\n",
"\n",
" 1. [NLTK](https://www.nltk.org/) Natural Language Toolkit - a comprehensive library for foundational NLP tasks like tokenization, stemming, lemmatization, etc.\n",
" 2. [spaCy](https://spacy.io/) Industrial-Strength NLP - designed for production-level applications, emphasizing speed and efficiency."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f6f328a9",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:26:13.269218Z",
"start_time": "2025-10-23T09:26:13.255910Z"
}
},
"outputs": [],
"source": [
"# import nltk"
]
},
{
"cell_type": "markdown",
"id": "4ab6f85c",
"metadata": {},
"source": [
"### NLPurify\n",
"\n",
"NLPurify is a text cleaning and extraction engine that combines traditional techniques, such as Unicode translation and regular-expression cleaning, with modern tools, such as natural language processing (NLP)\n",
"and large language models (LLMs), to detect and clean long texts and create word vectors. The library is designed as a one-stop solution that collects these utility functions in one place."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7afba9c9",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:26:20.019613Z",
"start_time": "2025-10-23T09:26:13.274467Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Current Version: v2.1.0.dev0\n"
]
}
],
"source": [
"import nlpurify as nlpu\n",
"\n",
"# general convention is to assign the short form ``nlpu`` to the library\n",
"# print the current version of the library - for debugging and documentation\n",
"print(f\"Current Version: {nlpu.__version__}\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4e3479e9",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:27:21.225941Z",
"start_time": "2025-10-23T09:27:21.218991Z"
}
},
"outputs": [],
"source": [
"text = '''\n",
" This is a uncLeaneD text with lots of\n",
" extra WHITE \n",
"space.\n",
"'''"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "e4d2fc66",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:37:55.741806Z",
"start_time": "2025-10-23T09:37:55.722247Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Normalized White Space: `This is a uncLeaneD text with lots of extra WHITE space.`\n"
]
}
],
"source": [
"model = nlpu.preprocessing.normalization.WhiteSpace()\n",
"print(f\"Normalized White Space: `{model.apply(text)}`\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "d405780a",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:42:40.080806Z",
"start_time": "2025-10-23T09:42:40.061140Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Uniform Case Folding: `\n",
" this is a uncleaned text with lots of\n",
" extra white \n",
"space.\n",
"`\n"
]
}
],
"source": [
"model = nlpu.preprocessing.normalization.CaseFolding()\n",
"print(f\"Uniform Case Folding: `{model.apply(text)}`\")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "9f4aa5e6",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:46:21.586061Z",
"start_time": "2025-10-23T09:46:21.572034Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Stop Words Removed: `This uncLeaneD text lots extra WHITE space .`\n"
]
}
],
"source": [
"model = nlpu.preprocessing.normalization.StopWords()\n",
"print(f\"Stop Words Removed: `{model.apply(text)}`\")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "ca00225a",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:46:22.408174Z",
"start_time": "2025-10-23T09:46:22.400677Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Word Tokens: `['This', 'is', 'a', 'uncLeaneD', 'text', 'with', 'lots', 'extra', 'WHITE']`\n"
]
}
],
"source": [
"model = nlpu.preprocessing.utils.WordTokenize(vanilla = True, tokenizer = False, vanilla_getalnum = True)\n",
"print(f\"Word Tokens: `{model.apply(text)}`\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "0a2d28cf",
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-23T09:46:24.121050Z",
"start_time": "2025-10-23T09:46:24.101463Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"UNCLEANED TEXT LOTS EXTRA WHITE SPACE .\n"
]
}
],
"source": [
"print(nlpu.preprocessing.normalization.normalize(text, upper = True, stopwords_in_uppercase = True))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "TensorFlow CPU (v2.12.0)",
"language": "python",
"name": "tensorflow"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
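For readers without NLPurify installed, the three steps the notebook demonstrates (whitespace normalization, case folding, stop-word removal) can be approximated with the standard library alone. This is a minimal sketch, not NLPurify's implementation: the function names and the tiny `STOP_WORDS` set are hypothetical and chosen only for illustration.

```python
# Hypothetical, deliberately tiny stop-word list (assumption, not NLPurify's list).
STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "with"})


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


def fold_case(text: str) -> str:
    """Return a case-folded copy; casefold() is a more aggressive lower()."""
    return text.casefold()


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Drop whitespace-separated tokens whose case-folded form is a stop word."""
    kept = [tok for tok in text.split() if tok.casefold() not in stop_words]
    return " ".join(kept)


# Same shape of input as the notebook's sample text.
text = "\n   This is a uncLeaneD text with lots of\n  extra     WHITE    \nspace.\n"

spaced = normalize_whitespace(text)
cleaned = remove_stop_words(fold_case(spaced))

print(spaced)
print(cleaned)

# Case folding also shrinks the vocabulary: three surface forms become one.
print(len({"run", "Run", "RUN"}), len({w.casefold() for w in ("run", "Run", "RUN")}))
```

Unlike the notebook's `StopWords` output, this sketch splits on whitespace only, so trailing punctuation stays attached to its word (`space.` rather than `space .`).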
12 changes: 4 additions & 8 deletions nlpurify/__init__.py
@@ -20,14 +20,10 @@
 __version__ = "v2.1.0.dev0"
 
 # init-time options registrations
+from nlpurify import preprocessing
+
 from nlpurify.scoring import fuzzy
 from nlpurify.scoring import regexp
 
-from nlpurify.feature import (
-    selection as feature_selection
-)
-
-from nlpurify.normalization import (
-    normalize,
-    strip_whitespace
-)
+from nlpurify.feature import selection as feature_selection
+from nlpurify.feature import extraction as feature_extraction
2 changes: 0 additions & 2 deletions nlpurify/feature/selection/__init__.py
@@ -3,5 +3,3 @@
 """
 Selection of Finite Set of Features/Tokens for Efficient Modelling
 """
-
-from nlpurify.feature.selection.nltk import * # noqa: F401, F403 # pyright: ignore[reportMissingImports]