Releases · tsproisl/SoMaJo · GitHub

05 Aug 06:43

tsproisl

v2.4.3 Latest

Move non-abbreviation tokens that should not be split from single_token_abbreviations_<LANG>.txt to single_tokens_<LANG>.txt, and add cellular network generations (issue #32).

19 Feb 12:24

tsproisl

v2.4.2

Fix issues #28 and #29 (markdown links with trailing symbols after URL part).

09 Feb 08:52

tsproisl

v2.4.1

Fix issue #27 (URLs in angle brackets).

23 Dec 20:32

tsproisl

v2.4.0

New feature: SoMaJo can output character offsets for tokens, allowing for stand-off tokenization. To enable the feature, pass character_offsets=True to the constructor or use the option --character-offsets on the command line. The character offsets are determined by aligning the tokenized output with the input, so activating the feature noticeably increases processing time.
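
To illustrate what aligning tokenized output with the input means, here is a minimal sketch in plain Python. It is not SoMaJo's implementation: the function name `align_tokens` and the example sentence are assumptions made for illustration, and the sketch only works when every token appears verbatim in the input, a simplification a real aligner probably cannot make.

```python
def align_tokens(text, tokens):
    """Return (start, end) character offsets for each token.

    Hypothetical helper, not part of SoMaJo. It scans the input from
    left to right and assumes each token occurs verbatim in the text.
    """
    offsets = []
    position = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, position)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {position}")
        end = start + len(token)
        offsets.append((start, end))
        position = end
    return offsets


text = "Heute ist's schön, oder?"
tokens = ["Heute", "ist", "'s", "schön", ",", "oder", "?"]
offsets = align_tokens(text, tokens)

# Stand-off annotation: the original text stays untouched and each token
# is recovered from it by slicing with its offsets.
for token, (start, end) in zip(tokens, offsets):
    assert text[start:end] == token
```

Because the offsets point back into the unmodified input, annotations can be stored separately from the text. The extra pass over the input also hints at why enabling offsets costs additional processing time.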

23 Sep 09:10

tsproisl

v2.3.1

Fix issue #26 (markdown links that contain a URL in the link text).

14 Aug 18:56

tsproisl

v2.3.0

Potentially breaking change: The somajo-tokenizer script is automatically created upon installation and bin/somajo-tokenizer is removed. For most users, this does not make a difference. If you used to run your own modified version of SoMaJo directly via bin/somajo-tokenizer, consider installing the project in editable mode (see Development section in README.md).
Switch from setup.py to pyproject.toml and restructure the project (source in src, tests in tests).
When creating a Token object, only known token classes can be passed.
Fix issue #25 (dates at the end of sentences).

16 Jun 08:45

tsproisl

v2.2.4

Improvements to tokenization of words containing numbers (e.g. COVID-19-Pandemie, FFP2-Maske).

02 Feb 10:40

tsproisl

v2.2.3

Improvements to tokenization: Roman ordinals, abbreviation “Art.” preceding a number, certain units of measurement at the end of a sentence (e.g. km/h).

12 Sep 17:52

tsproisl

v2.2.2

Bugfix: Command-line option --sentence_tag implies option --split_sentences.

08 Mar 08:57

tsproisl

v2.2.1

Bugfix: Command-line option --strip-tags implies option --xml.
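
The "option A implies option B" behaviour in this fix (and in the v2.2.2 fix for --sentence_tag and --split_sentences) can be sketched with argparse. This is a hypothetical parser for illustration only, not SoMaJo's actual command-line code; the function name `parse_args` and the help texts are assumptions.

```python
import argparse


def parse_args(argv):
    """Parse a hypothetical tokenizer command line (not SoMaJo's real CLI)."""
    parser = argparse.ArgumentParser(description="Hypothetical tokenizer CLI")
    parser.add_argument("--xml", action="store_true",
                        help="treat the input as XML")
    parser.add_argument("--strip-tags", action="store_true",
                        help="suppress XML tags in the output")
    args = parser.parse_args(argv)
    # Stripping tags only makes sense for XML input, so the more specific
    # option switches on the more general one after parsing.
    if args.strip_tags:
        args.xml = True
    return args


args = parse_args(["--strip-tags"])
```

Resolving the implication once, right after parsing, means the rest of the program only has to check `args.xml`.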
