When case_markup is enabled in space or none mode, detokenization does not restore the original casing:
>>> pyonmttok.Tokenizer("none", case_markup=True).tokenize("你好世界,这是一个Test。")
... (['⦅mrk_case_modifier_C⦆', '你好世界,这是一个test。'], None)
>>> pyonmttok.Tokenizer("none", case_markup=True).detokenize(['⦅mrk_case_modifier_C⦆', '你好世界,这是一个test。'])
... '你好世界,这是一个test。'
As you can see, detokenize cannot rebuild the original text: "Test" comes back as "test". The same behavior occurs in space mode.
The conservative and aggressive modes do not suffer from this issue. However, their output is not consistent with the output produced without case_markup, because they split the text to insert the markup placeholder:
>>> pyonmttok.Tokenizer("conservative").tokenize("你好世界,这是一个Test。")
... (['你好世界', ',', '这是一个Test', '。'], None)
>>> pyonmttok.Tokenizer("conservative", case_markup=True).tokenize("你好世界,这是一个Test。")
... (['你好世界', ',', '这是一个', '⦅mrk_case_modifier_C⦆', 'test', '。'], None)
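One possible reading of the outputs above is that the capitalize marker only affects the first character of the token that follows it. The sketch below is a simplified model of that assumption, not pyonmttok's actual implementation: `apply_case_markers` is a hypothetical helper, and joining tokens without spaces is a simplification rather than how detokenize really joins them. It replays the token lists reported in this issue.

```python
# Simplified model of a "capitalize" case marker.
# Assumption: the marker uppercases only the first character of the next token.
# This is NOT pyonmttok's real code; apply_case_markers is a hypothetical helper.
CAPITALIZE_MARKER = "⦅mrk_case_modifier_C⦆"


def apply_case_markers(tokens):
    """Drop each capitalize marker and uppercase the first character of the token after it."""
    restored = []
    capitalize_next = False
    for token in tokens:
        if token == CAPITALIZE_MARKER:
            capitalize_next = True
            continue
        if capitalize_next and token:
            token = token[0].upper() + token[1:]
            capitalize_next = False
        restored.append(token)
    return restored


original = "你好世界,这是一个Test。"

# Token lists copied from the outputs reported in this issue.
none_mode = [CAPITALIZE_MARKER, "你好世界,这是一个test。"]
conservative = ["你好世界", ",", "这是一个", CAPITALIZE_MARKER, "test", "。"]

# In "none" mode the marker lands on "你", which has no uppercase form,
# so the capital "T" cannot be recovered.
none_restored = "".join(apply_case_markers(none_mode))

# In "conservative" mode the marker sits directly before "test".
conservative_restored = "".join(apply_case_markers(conservative))

print(none_restored == original)          # False
print(conservative_restored == original)  # True
```

Under this model, a single marker in front of an unsplit sentence cannot say which character was capitalized, which would explain why only the modes that split the text around "Test" round-trip correctly.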