tarushi2k2
Contributor

Adds Hindi ITN support for the Telephone semiotic class, mixed/exception fractions, quarterly measures, and century ordinals in the Date class. Includes all updates and test cases, and is rebased onto the latest upstream changes.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@tarushi2k2 tarushi2k2 marked this pull request as ready for review July 8, 2025 08:43
३० तीस
३१ इकतीस
३१ इकतिस
३१ इकत्तीस
३१ इकत्तिस
Collaborator

do we have the same term multiple times in this tsv? is this necessary?

Collaborator

also, are these mappings any different than cardinals?

Contributor Author

  1. The TSV file repeats the same number because of different spellings and character differences. I kept every variant on purpose: inverse text normalization is a many-to-one mapping, so each spelling has to be listed for it to be recognized and normalized to the same written form.

  2. I added the numbers used for dates in a separate file because the date semiotic class only needs numbers from 1 to 31. For cardinal numbers, we already have two separate files: one for single digits and another called teens and ties for numbers from 10 to 99. So it was easier and cleaner to create a new TSV file just for dates instead of using the existing cardinal number files.

Collaborator

is there no way to optimize (1) with rules instead of one long tsv file?

let's use the cardinal graph and restrict inputs to 1-31 for (2), that will be cleaner and easier to maintain in the future
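
The suggestion for (2) can be illustrated in plain Python. This is only a sketch of the idea, not the NeMo grammar: `CARDINAL_ROWS` and `restrict` are hypothetical names, and the rows are a small excerpt in the style of the TSV lines quoted in this thread. In the actual grammar the same restriction would be applied to the cardinal graph rather than to a list.

```python
# Hypothetical excerpt of one shared cardinal table: (written, spoken) pairs.
# Several spoken spellings may map to the same written number (many-to-one).
CARDINAL_ROWS = [
    ("१७", "सत्रह"),
    ("१७", "सतरह"),
    ("२०", "बीस"),
    ("३१", "इकतीस"),
    ("३१", "इकत्तीस"),
    ("९८", "निन्यानबे"),
]


def restrict(rows, low, high):
    """Return a spoken -> written mapping limited to values in [low, high].

    Python's int() accepts Devanagari digits, so int("३१") == 31.
    """
    return {
        spoken: written
        for written, spoken in rows
        if low <= int(written) <= high
    }


# One source of truth, two restricted views instead of two extra TSV files.
DATE_DAYS = restrict(CARDINAL_ROWS, 1, 31)
HOURS = restrict(CARDINAL_ROWS, 1, 24)
```

With this layout, a new spelling variant only has to be added to the shared table once, and every restricted view picks it up.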

Contributor Author

@tarushi2k2 tarushi2k2 Jul 29, 2025

  1. Numbers 0-99 have unique words in Hindi that cannot be represented by grammars.
  2. I’ve deleted the date_days.tsv file and updated it to use the cardinal graph instead. The inputs are now restricted to 1–31 as suggested. Will push it as soon as all comments on the PR are resolved.

@@ -0,0 +1,596 @@
११ one one
Collaborator

let's implement graphs to process digits instead of having a long tsv file
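
As a rough illustration of why a per-digit rule can replace an enumerated table, the plain-Python sketch below converts a spoken digit sequence one word at a time. It is not the NeMo implementation: `DIGIT_WORDS` and `spoken_digits_to_written` are hypothetical names, and the word list is an assumption rather than the contents of the repository's digit files. In a grammar, the same effect would come from repeating a single-digit graph instead of listing every code.

```python
# Assumed spoken forms for the ten digits, in Hindi and in English,
# each mapped to its Devanagari digit.
DIGIT_WORDS = {
    "शून्य": "०", "एक": "१", "दो": "२", "तीन": "३", "चार": "४",
    "पाँच": "५", "छह": "६", "सात": "७", "आठ": "८", "नौ": "९",
    "zero": "०", "one": "१", "two": "२", "three": "३", "four": "४",
    "five": "५", "six": "६", "seven": "७", "eight": "८", "nine": "९",
}


def spoken_digits_to_written(text):
    """Convert space-separated digit words to a Devanagari digit string.

    Raises KeyError if a word is not a known digit word.
    """
    return "".join(DIGIT_WORDS[word] for word in text.split())


# Twenty table entries cover codes of any length, where an enumerated
# file would need one row per code and per language.
std_code = spoken_digits_to_written("एक एक")
country_code = spoken_digits_to_written("nine one")
```

The trade-off is that a per-digit rule accepts any digit sequence, so a list of valid codes would still be needed if only real STD codes should match.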

Collaborator

is this completed? if so, let's remove this file

@@ -0,0 +1,2750 @@
११ एक एक
Collaborator

let's either try to use the cardinals graph or create a graph for this one as well

Collaborator

is this completed? if so, let's remove this file

@@ -0,0 +1 @@
९१ nine one
Collaborator

same comment as for std codes

@@ -0,0 +1 @@
९१ नौ एक
Collaborator

same comment as for std codes

@@ -0,0 +1,7 @@
२ दो
Collaborator

are these any different than cardinals?

Contributor Author

Yes, these are also different from cardinals because landline operator digits are specifically only 2, 3, 4, and 6. In India, landline numbers must start with one of these digits to be valid. The cardinal numbers have a single file, digits.tsv, which contains all numbers from 1 to 9. Since we do not need all these digits for landline operators, it made sense to create a separate TSV file specifically for the landline operator digits.

Collaborator

you can import the cardinal graph and restrict inputs -- this is necessary to optimize upkeep in the future

only create new files when the mapping is different

@@ -23,5 +23,6 @@
२० बीस
Collaborator

are these any different than cardinals?

Contributor Author

Yes, this TSV file for time is separate because time only requires numbers from 1 to 24. For cardinal numbers, we already have two separate files: one for single digits and another called teens and ties for numbers from 10 to 99. Since the cardinal numbers are split across two files, using one and then extracting only the numbers from 10 to 24 from the other seemed more complex. So it was simpler and cleaner to create a dedicated TSV file for time instead of reusing and modifying the existing cardinal files.

Collaborator

same comment as for dates

Contributor Author

I’ve deleted the hour.tsv file and updated it to use the cardinal graph instead. The inputs are now restricted to 1–24 as suggested. Will push it as soon as all comments on the PR are resolved.



def load_column_from_tsv(filepath, column_index=1):
    """
Collaborator

is this necessary? doesn't pynini have a function to get the inputs or outputs only?

Contributor Author

@tarushi2k2 tarushi2k2 Jul 15, 2025

I used load_column_from_tsv because it reads just one column from the TSV file and gives a list of STD codes. This list is needed to build the telephone FST with those codes and landline numbers.
Pynini has functions like string_file that load the whole TSV file as one big FST with all pairs together. You can get inputs or outputs by inverting the FST, but these work on FSTs that are already made, not directly on the file.
When I tried replacing load_column_from_tsv with string_file, I got an error. It was harder to get a list of codes from the big FST than just using load_column_from_tsv, which gives the list straight away.
So, using load_column_from_tsv was easier and simpler, especially with a big database where I needed the full list of STD codes. That’s why I decided to use load_column_from_tsv.

Collaborator

please reimplement with pynini.project as used here

era_names = pynini.project(era_words, "output")

९८ निन्यान्बे
९८ निन्यानबे
९८ निन्यानवे
९८ निन्यान्वे
Collaborator

let's also either leverage cardinal graph or optimize with rules

Contributor Author

@tarushi2k2 tarushi2k2 Jul 29, 2025

We can't use the cardinal graph here because the number mapping is completely different for this particular TSV file. Also, numbers 0-99 have unique words in Hindi that cannot be represented by grammars.

Collaborator

@mgrafu mgrafu Jul 29, 2025

are you saying that the "9" in 93 is a different word than the "9" in 94? what about the "4" in 34 vs the "4" in 74?

@@ -9,6 +9,7 @@
१७ सत्रह
१७ सतरह
Collaborator

let's also optimize with rules

Contributor Author

Numbers 0-99 have unique words in Hindi that cannot be represented by grammars.

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 20, 2025

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Aug 28, 2025