RE2 does not support look-ahead #10

@wangkuiyi

Because RE2 does not support look-ahead, the C++ tokenizer produces slightly different output from the HuggingFace tokenizer when the input text contains two or more consecutive whitespace characters. To reproduce:

cmake --build /tmp/b
/tmp/b/bin/bpe_test > /tmp/c
python tool/t.py > /tmp/t
python tool/cmp.py /tmp/c /tmp/t /tmp/sample.txt

[Image: Screenshot 2023-02-10 at 12 18 02 PM]
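
A minimal sketch of how dropping a look-ahead can change the split on runs of whitespace. It assumes the look-ahead in question is the `\s+(?!\S)` alternative of the GPT-2 style pre-tokenization pattern, which this issue does not state explicitly. The character classes are stdlib approximations of `\p{L}` and `\p{N}`, and the names `WITH_LOOKAHEAD`, `WITHOUT_LOOKAHEAD`, and `pre_tokenize` are hypothetical, not taken from this repository.

```python
import re

# Assumption: stdlib `re` has no \p{L} / \p{N}, so these classes approximate them.
LETTER = r"[^\W\d_]"      # roughly \p{L}
OTHER = r"(?:[^\s\w]|_)"  # roughly [^\s\p{L}\p{N}]

# Alternatives shared by both variants of the pattern.
HEAD = rf"'s|'t|'re|'ve|'m|'ll|'d| ?{LETTER}+| ?\d+| ?{OTHER}+"

# With look-ahead: `\s+(?!\S)` leaves the last space of a run for the next word.
WITH_LOOKAHEAD = re.compile(HEAD + r"|\s+(?!\S)|\s+")
# Without look-ahead: what remains if that alternative is simply dropped.
WITHOUT_LOOKAHEAD = re.compile(HEAD + r"|\s+")


def pre_tokenize(pattern, text):
    """Return the list of pre-tokens that `pattern` finds in `text`."""
    return pattern.findall(text)


if __name__ == "__main__":
    for text in ("hello world", "hello   world"):
        print(repr(text))
        print("  with look-ahead:   ", pre_tokenize(WITH_LOOKAHEAD, text))
        print("  without look-ahead:", pre_tokenize(WITHOUT_LOOKAHEAD, text))
```

Under these assumptions the two variants agree on single spaces and diverge only when a run of two or more whitespace characters precedes a word, which matches the symptom described above.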
