Move non-abbreviation tokens that should not be split from single_token_abbreviations_<LANG>.txt to single_tokens_<LANG>.txt and add cellular network generations (issue #32).
New feature: SoMaJo can output character offsets for tokens, allowing for stand-off tokenization. Pass character_offsets=True to the constructor or use the option --character-offsets on the command line to enable the feature. The character offsets are determined by aligning the tokenized output with the input, so enabling the feature noticeably increases processing time.
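To illustrate what aligning tokenized output with the input means, here is a minimal sketch in plain Python. It is not SoMaJo's implementation: the function name align_tokens is hypothetical, and the sketch assumes that every token appears verbatim in the input, which a real tokenizer cannot rely on once it normalizes or rewrites text.

```python
def align_tokens(text, tokens):
    """Return (start, end) character offsets for each token.

    Hypothetical helper for illustration only. It scans the input from
    left to right and assumes each token occurs verbatim in the text.
    """
    offsets = []
    cursor = 0  # position in the input after the previously matched token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "Das ist ein Test."
tokens = ["Das", "ist", "ein", "Test", "."]
offsets = align_tokens(text, tokens)

# Stand-off annotation: the input stays untouched, tokens are recovered by slicing.
recovered = [text[start:end] for start, end in offsets]
```

With offsets like these, annotations can refer to spans of the original text instead of a modified copy. The extra pass over the input also shows why offset computation adds to the processing time.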
Potentially breaking change: The somajo-tokenizer script is automatically created upon installation and bin/somajo-tokenizer is removed. For most users, this does not make a difference. If you used to run your own modified version of SoMaJo directly via bin/somajo-tokenizer, consider installing the project in editable mode (see Development section in README.md).
Switch from setup.py to pyproject.toml and restructure the project (source in src, tests in tests).
When creating a Token object, only known token classes can be passed.
Improvements to tokenization: Roman ordinals, abbreviation “Art.” preceding a number, certain units of measurement at the end of a sentence (e.g. km/h).