So the C++ tokenizer produces slightly different output from the HuggingFace tokenizer when the input text contains two or more consecutive whitespace characters.
cmake --build /tmp/b
/tmp/b/bin/bpe_test > /tmp/c
python tool/t.py > /tmp/t
python tool/cmp.py /tmp/c /tmp/t /tmp/sample.txt
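
To narrow down which inputs trigger the mismatch, something like the following could help. It is only a sketch and not the contents of tool/cmp.py: the helper names are hypothetical, and it assumes that both output files hold one line of tokenizer output per line of /tmp/sample.txt.

```python
import re

# Two or more consecutive whitespace characters (spaces, tabs, ...).
MULTI_WS = re.compile(r"\s{2,}")


def lines_with_repeated_whitespace(text):
    """Return (line_number, line) pairs for lines containing a whitespace run."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        if MULTI_WS.search(line):
            hits.append((number, line))
    return hits


def mismatched_lines(cpp_lines, hf_lines):
    """Return 1-based line numbers where the two tokenizer outputs differ.

    Assumption: both lists are aligned line by line with the sample input.
    """
    return [
        number
        for number, (cpp, hf) in enumerate(zip(cpp_lines, hf_lines), start=1)
        if cpp != hf
    ]
```

If the observation above holds, the line numbers returned by mismatched_lines for /tmp/c and /tmp/t should all appear among those returned by lines_with_repeated_whitespace for /tmp/sample.txt.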
