Research default probability for adding missing characters to unigram tokenizers #348

Open


Labels

pipeline 3: preprocess, research

The current approach assigns every new character a very low probability ($\log(p) = -18$).

Alternatives to investigate:

- Random distribution with mean matching the existing tokenizer and standard deviation 0
- High probability relative to existing tokens
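
To make the options concrete, here is a minimal sketch of how each candidate default could be computed. It assumes the unigram vocabulary is available as a list of `(piece, log_prob)` pairs. The function names and strategy labels (`new_char_score`, `add_missing_chars`, `"fixed"`, `"mean"`, `"max"`) are hypothetical and not part of any existing tokenizer API. Because a distribution with standard deviation 0 is degenerate, the `"mean"` strategy simply assigns the mean of the existing scores, and `"max"` is only one possible reading of "high probability relative to existing tokens".

```python
import statistics

def new_char_score(existing_scores, strategy="fixed", fixed=-18.0):
    """Return a log-probability for a character missing from the vocabulary.

    Strategy names are illustrative assumptions, not an existing API.
    """
    if strategy == "fixed":
        # Current approach: a constant, very low log-probability.
        return fixed
    if strategy == "mean":
        # Mean of the existing scores (a distribution with std dev 0).
        return statistics.fmean(existing_scores)
    if strategy == "max":
        # One reading of "high relative to existing tokens".
        return max(existing_scores)
    raise ValueError(f"unknown strategy: {strategy}")

def add_missing_chars(vocab, missing_chars, strategy="fixed"):
    """Return a new vocab list with missing characters appended.

    Scores are not renormalized here, so the result is no longer a
    proper probability distribution.
    """
    known = {piece for piece, _ in vocab}
    score = new_char_score([s for _, s in vocab], strategy)
    return vocab + [(ch, score) for ch in missing_chars if ch not in known]

# Toy vocabulary, for illustration only.
vocab = [("▁the", -3.0), ("a", -4.0), ("b", -5.0)]
extended = add_missing_chars(vocab, ["a", "ŋ"], strategy="mean")
```

In this toy example only `ŋ` is appended, since `a` is already in the vocabulary. Swapping the `strategy` argument lets the options be compared on the same vocabulary.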

Metadata

Projects

SIL-NLP Research

Status

📋 Backlog
