LLM Tokenizers in C++ (1): Unicode Regexp

LLM Tokenizers in C++ (0): Port from Python

Regexp for Unicode

The first step of BPE tokenizing is to split the text into candidate tokens. The class GPT2Tokenizer uses Python’s regex package (a third-party alternative to the built-in re module) to do so.

The following regexp defines a candidate token.

self.pat = re.compile(
  r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

With the triple quotes, the authors do not have to escape the quote mark. The prefix r before the triple quotes makes this a raw string, so backslashes are passed to the regexp engine as-is instead of being interpreted as Python escape sequences.

The general structure of this regexp is a sequence of sub-regexp’s separated by the logical-or mark, |. The first few sub-regexp’s correspond to commonly-used subwords such as 've.

\p{L}, according to this StackOverflow answer, means a Unicode letter. Similarly, \p{N} is a numeric character. \s means a whitespace character, which could be a space, a tab, or a newline. \S means a non-whitespace character.

So,  ?\p{L}+ (note the space before the ?) means a sequence of one or more letters, optionally preceded by a single space. This represents a “word”. Similarly,  ?\p{N}+ represents a run of digits, without a sign or a decimal point.

[^\s\p{L}\p{N}] means a character that is not a whitespace character, a letter, or a number. This leaves us the punctuation marks and other symbols. Therefore,  ?[^\s\p{L}\p{N}]+ is a sequence of successive punctuation characters, optionally preceded by a single space.

\s+ means one or more whitespace characters.

\s+(?!\S): what is this?

Unicode Regexp in C++

[std::regex](https://en.cppreference.com/w/cpp/regex) does not accept the Unicode classes \p{L} or \p{N}. If we use them, the program aborts with error_escape and the message

the expression contains an invalid escaped character or a trailing escape

This seems to be a known problem, and the commonly suggested workaround is boost::wregex. Unfortunately, Boost is not part of the iOS SDK and I do not want to bring it in as a dependency of my Xcode project. Moreover, according to this answer,

The Boost.Regex documentation explicitly states that there's no support for Unicode-specific character classes when using boost::wregex. If you want this functionality, you'll need to build Boost.Regex with ICU support enabled then use the boost::u32regex type instead of boost::wregex.

No! I do not want to build Boost from source code because it is huge.

I then remembered that several years ago I read Russ Cox’s great notes about doing regular expressions the right way, and his work, RE2.

RE2 is small. It supports Unicode regexps. More importantly, I can build it using CMake for both macOS and iOS (Simulator)! The following commands build RE2 for the host.

cmake -B b -S . -DCMAKE_INSTALL_PREFIX=b/install
cmake --build b --target install

The following commands build RE2 for the iOS Simulator. You can change iphonesimulator into iphoneos to make it build for iOS devices.

cmake -B build-ios -S . \
      -DCMAKE_SYSTEM_NAME=iOS \
      -DCMAKE_OSX_SYSROOT="$(xcodebuild -version -sdk iphonesimulator  Path)" \
      -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0 \
      -DCMAKE_IOS_INSTALL_COMBINED=YES \
      -DCMAKE_INSTALL_PREFIX=build-ios/install

cmake --build build-ios --target install

Side-by-Side

The following Python code is copied and pasted from the Transformers’ tokenizer repository.

import regex as re

pat = re.compile(
  r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

def f(text):
  for token in re.findall(pat, text):
    print(f"token=\"{token}\"")

f("we'd  see   you say 世界你好真实好的很啊")

The output is as follows.

token="we"
token="'d"
token=" "
token=" see"
token="  "
token=" you"
token=" say"
token=" 世界你好真实好的很啊"

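The following is the corresponding C++ code. I removed the sub-regexp \s+(?!\S) because RE2 cannot parse it: (?!...) is a negative lookahead, which RE2 deliberately does not support. In the Python pattern it matches a run of whitespace that is not followed by a non-whitespace character, so the last space of a run stays available as the prefix of the next word.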

#include <cassert>
#include <iostream>
#include <string>
#include <re2/re2.h>
#include <re2/stringpiece.h>

int main() {
  std::string text = "we'd  see   you say 世界你好真实好的很啊";
  re2::StringPiece input(text);  // a view into text; FindAndConsume advances it

  // The whole pattern is wrapped in one capture group so that each match is
  // copied into w. The alternative \s+(?!\S) is omitted.
  RE2 re("('s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+)");
  assert(re.ok());  // compiled; if not, see re.error()

  std::string w;  // receives the text of the capture group
  while (RE2::FindAndConsume(&input, re, &w)) {
    std::cout << "token=\"" << w << "\"" << std::endl;
  }
}

The following command builds it and links the RE2 static library.

clang++ -std=c++20 b.cc \
 -I ~/w/re2/b/install/include \
 -L ~/w/re2/b/install/lib -lre2 -o b

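Because the \s+(?!\S) alternative was dropped, the output is close to, but not identical to, that of the Python code. The C++ program emits each run of whitespace as a single token ("  " followed by "see"), whereas the Python pattern leaves the last space of a run attached to the following word (" " followed by " see"). On text whose words are separated by single ASCII spaces, the two should agree.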
