Open
Labels
enhancement (New feature or request), pipeline 2: extract (Issue related to extracting parallel corpora), pipeline 3: preprocess (Issue related to preprocessing)
Description
There are some additional project characteristics and metrics that we find helpful to know for each new project. Gathering this information automatically as part of the existing process would save time:
- The number of lines in the verse extract file that are range lines.
- The number of characters in the verse extract file that are unknown (`<unk>`) to the NLLB tokenizer.
- Tokenization stats for the verse extract file.
This information is already available from the current tools by running the 'preprocess' command or the 'experiment' command, but it either requires an extra step (preprocess) or arrives too late (experiment) to guide the onboarding work. A streamlined method will reduce the effort of handling onboarding requests.
- Wildebeest reports for the verse extract file.
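To make the request concrete, here is a minimal sketch of the kind of summary that could be produced at extract time. It is not existing project code: the function name `extract_stats`, the `RANGE_MARKER` value, and the two callables (`tokenize`, `is_known_char`) are all hypothetical, and the sketch assumes the verse extract file holds one verse (or one range marker) per line.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, Iterable, List

# Assumption: a range line consists solely of this marker.
RANGE_MARKER = "<range>"


def extract_stats(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
    is_known_char: Callable[[str], bool],
) -> Dict[str, object]:
    """Summarize a verse extract: range lines, unknown characters, token counts."""
    range_lines = 0
    token_counts: List[int] = []
    unknown: Counter = Counter()

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == RANGE_MARKER:
            # Range lines carry no text, so they are counted but not tokenized.
            range_lines += 1
            continue
        token_counts.append(len(tokenize(line)))
        # Whitespace is ignored; every other character is checked individually.
        unknown.update(
            ch for ch in line if not ch.isspace() and not is_known_char(ch)
        )

    return {
        "lines": range_lines + len(token_counts),
        "range_lines": range_lines,
        "unknown_chars": sum(unknown.values()),
        "unknown_char_counts": dict(unknown),
        "max_tokens": max(token_counts, default=0),
        "mean_tokens": mean(token_counts) if token_counts else 0.0,
    }


if __name__ == "__main__":
    # Stand-ins for illustration only: whitespace splitting instead of the
    # NLLB tokenizer, and an ASCII check instead of a vocabulary lookup.
    sample = ["In the beginning\n", "<range>\n", "abc ✓ def\n"]
    print(extract_stats(sample, str.split, lambda ch: ch.isascii()))
```

In real use, both callables would be backed by the NLLB tokenizer, so that the token counts and the unknown-character tally reflect what the model actually sees.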