Research default probability for adding missing characters to unigram tokenizers #348

Open
@isaac091

Description

The current approach assigns every new character a very low probability ($\log(p) = -18$). Alternatives to research:

  • Random distribution with mean matching the existing tokenizer and standard deviation 0
  • High probability relative to existing tokens
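The options above can be compared with a small sketch. It assumes the unigram vocabulary is available as a list of `(token, log_prob)` pairs; the function name `add_missing_chars` and the strategy labels `"fixed"`, `"mean"`, and `"high"` are hypothetical, chosen only for illustration. Reading "standard deviation 0" as "use the mean log-probability of the existing tokens" and "high probability" as "use the maximum existing log-probability" are also assumptions, not something the issue specifies.

```python
from statistics import mean


def add_missing_chars(vocab, chars, strategy="fixed", fixed_score=-18.0):
    """Return a copy of vocab with each missing character appended.

    vocab: list of (token, log_prob) pairs.
    chars: iterable of single-character strings to ensure are present.
    strategy (hypothetical labels):
      "fixed" -> every new character gets fixed_score (the current approach)
      "mean"  -> every new character gets the mean existing log-prob
      "high"  -> every new character gets the highest existing log-prob
    """
    existing = {token for token, _ in vocab}
    scores = [score for _, score in vocab]

    if strategy == "fixed":
        score = fixed_score
    elif strategy == "mean":
        score = mean(scores)
    elif strategy == "high":
        score = max(scores)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    new_vocab = list(vocab)
    for ch in chars:
        if ch not in existing:
            new_vocab.append((ch, score))
            existing.add(ch)  # guard against duplicates within chars
    # Note: the probabilities are not renormalized after the additions.
    return new_vocab


# Toy vocabulary; the scores are made up for the example.
toy_vocab = [("a", -2.0), ("b", -4.0), ("ab", -3.0)]
for name in ("fixed", "mean", "high"):
    print(name, add_missing_chars(toy_vocab, ["ɓ", "a"], strategy=name)[-1])
```

If the scores are natural-log probabilities, a score of -18 corresponds to a probability of about 1.5e-8, which gives a sense of how rarely such a character would be preferred over existing tokens. The sketch leaves the other scores untouched, so whether renormalization is needed after adding characters is a separate question.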

Status: 📋 Backlog
