RE2 does not support look-ahead

As a result, the C++ tokenizer produces slightly different output from the HuggingFace tokenizer when the input text contains two or more consecutive whitespace characters.

```bash
cmake --build /tmp/b
/tmp/b/bin/bpe_test > /tmp/c
python tool/t.py > /tmp/t
python tool/cmp.py /tmp/c /tmp/t /tmp/sample.txt
```

<img width="1399" alt="Screenshot 2023-02-10 at 12 18 02 PM" src="https://user-images.githubusercontent.com/1548775/218189519-851122fa-76a4-4f97-902e-d079f2f9b2e7.png">
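A minimal sketch of why consecutive whitespace is affected. It assumes the pattern in question is a GPT-2-style pre-tokenization regex, which contains the alternative `\s+(?!\S)`; RE2 rejects the `(?!...)` look-ahead syntax. The two patterns below are ASCII-only stand-ins written for Python's `re` module, and the function names are hypothetical, not taken from this repository.

```python
import re

# ASCII-only stand-in for a GPT-2-style pre-tokenization pattern.
# Assumption: the real pattern uses Unicode classes (\p{L}, \p{N}),
# which Python's stdlib `re` does not support.
WITH_LOOKAHEAD = re.compile(
    r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

# The same pattern with the look-ahead alternative dropped, which is one
# way to make it acceptable to an engine without look-ahead support.
WITHOUT_LOOKAHEAD = re.compile(
    r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"
)


def split_with_lookahead(text):
    """Split text into pieces using the pattern that keeps `\\s+(?!\\S)`."""
    return WITH_LOOKAHEAD.findall(text)


def split_without_lookahead(text):
    """Split text into pieces using the pattern without the look-ahead."""
    return WITHOUT_LOOKAHEAD.findall(text)


if __name__ == "__main__":
    for text in ("hello world", "hello  world"):
        print(repr(text))
        print("  with look-ahead:   ", split_with_lookahead(text))
        print("  without look-ahead:", split_without_lookahead(text))
```

With a single space both variants split identically. With two spaces, the look-ahead variant leaves the last space attached to the following word, while the variant without look-ahead consumes the whole whitespace run as one piece, so the pieces handed to BPE differ.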




RE2 does not support look-ahead #10
