Add project_input support for English including tests #314
+1,312
−280
What does this PR do?
This PR adds a feature that projects the input onto the output string, so that each normalized span carries the input text that produced it.
E.g., when project_input is enabled, running ITN on the string "the road is one kilometer long" produces the output "the road is [1 km][one kilometer] long". Here the content in the left square brackets is the inverse normalized output, and the content in the right square brackets is the input that led to that output.
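A consumer of this output would typically need to separate the normalized text from the projected input spans. The following is a minimal sketch of that step, assuming the format is exactly as in the single example above (`[output][input]` pairs, with no square brackets inside either side). The names `PAIR` and `split_projection` are hypothetical and are not part of this PR or of the repo.

```python
import re

# Hypothetical pattern for one "[output][input]" pair, inferred from the
# example in the PR description. Assumes neither side contains brackets.
PAIR = re.compile(r"\[([^\[\]]*)\]\[([^\[\]]*)\]")


def split_projection(projected: str):
    """Return (normalized_text, spans) for a projected ITN string.

    spans is a list of (output, input) tuples, one per bracket pair.
    """
    spans = [(m.group(1), m.group(2)) for m in PAIR.finditer(projected)]
    # Keep only the left (output) side of every pair.
    normalized = PAIR.sub(lambda m: m.group(1), projected)
    return normalized, spans


normalized, spans = split_projection("the road is [1 km][one kilometer] long")
print(normalized)  # the road is 1 km long
print(spans)       # [('1 km', 'one kilometer')]
```

If the real output can contain literal square brackets, some escaping scheme would be needed; the PR description does not say how that case is handled.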
This is useful in, e.g., speech pipelines where ITN processes the output of an ASR model and the processed output needs correct word-level timestamps. While it is possible to align the input with the output using the fst_alignment script in the repo, this is not as robust as computing the input together with the output directly from the FST.
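To illustrate the timestamp use case, here is a rough sketch of how a downstream pipeline might carry ASR word timings over to the normalized tokens. It assumes the `[output][input]` format from the example in this description, that unchanged words appear outside brackets, and that the projected input words match the ASR words one-to-one and in order. `TOKEN` and `project_timestamps` are hypothetical names, not functions from this PR.

```python
import re

# Hypothetical token pattern: an "[output][input]" pair, or a plain word.
TOKEN = re.compile(r"\[([^\[\]]*)\]\[([^\[\]]*)\]|(\S+)")


def project_timestamps(projected, word_times):
    """Map ASR word timings onto the tokens of a projected ITN string.

    word_times: list of (word, start, end) for the ASR input, in order.
    Returns a list of (output_text, start, end).
    """
    result = []
    i = 0  # index of the next unconsumed ASR word
    for m in TOKEN.finditer(projected):
        if m.group(3) is not None:
            # Plain word: passes through unchanged and consumes one ASR word.
            out_text, n = m.group(3), 1
        else:
            # Bracket pair: consumes as many ASR words as the input side has.
            out_text, n = m.group(1), len(m.group(2).split())
        chunk = word_times[i:i + n]
        if n == 0 or len(chunk) < n:
            raise ValueError("projected input does not match the ASR words")
        # Span runs from the first word's start to the last word's end.
        result.append((out_text, chunk[0][1], chunk[-1][2]))
        i += n
    return result


asr_words = [
    ("the", 0.0, 0.2), ("road", 0.2, 0.5), ("is", 0.5, 0.6),
    ("one", 0.6, 0.8), ("kilometer", 0.8, 1.3), ("long", 1.3, 1.6),
]
timed = project_timestamps("the road is [1 km][one kilometer] long", asr_words)
print(timed)
```

In this sketch the span "1 km" inherits the start of "one" and the end of "kilometer", which is the kind of alignment that is harder to recover reliably after the fact.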
Currently, only English has been pushed, but I have it mostly working in all languages. I decided to keep the PR small initially to make it easier to review. A full PR with support for all languages will touch most of the files in the repo.
Not all tests are passing yet, but I am working on it. The primary purpose of this PR is to gauge whether this is of interest to the larger community, or whether I should just maintain a fork with project_input support.
The method currently relies on a custom input tag, which isn't supported by Sparrowhawk. I would like to add Sparrowhawk support, but I am not yet sure how.
Before your PR is "Ready for review"
Pre checks:
git commit -s to sign.
pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest and Sparrowhawk here.
__init__.py for every folder and subfolder, including data folder which has .TSV files?
"Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved." to all newly added Python files?
"Copyright 2015 and onwards Google, Inc." See an example here.
(try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.