LLM Tokenizers in C++ (0): Port from Python
The first step of BPE tokenizing is to split the text into candidate tokens. The class GPT2Tokenizer uses the Python package regex to do so.
The following regexp defines a candidate token.
self.pat = re.compile(
r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
With the triple quotes, the authors do not have to escape the quote mark. The prefix r before the triple quotes means the string is a raw string, so backslashes are kept as they are instead of starting escape sequences.
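C++11 has a comparable facility, the raw string literal `R"(...)"`. The C++ program later in this post writes the pattern with doubled backslashes; the minimal sketch below suggests that a raw string literal is an equivalent spelling. The names `kEscaped`, `kRaw`, and `same_pattern` are made up for this illustration, and the pattern is the lookahead-free variant used in that program.

```cpp
#include <string>

// The same pattern written twice: once with every backslash doubled, and
// once as a raw string literal, where backslashes are taken literally.
const std::string kEscaped =
    "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+";
const std::string kRaw =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+)";

// True when both spellings produce identical pattern text.
bool same_pattern() { return kEscaped == kRaw; }
```

The raw form can be compared character by character with the Python original, which makes copy-and-paste mistakes easier to spot.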
The general structure of this regexp is a sequence of sub-regexp’s separated by the logical-or mark, |. The first few sub-regexp’s correspond to commonly-used subwords such as 've.
\p{L}, according to this StackOverflow answer, means a Unicode letter. Similarly, \p{N} is a numeric character. \s means a whitespace character, which could be a space, a tab, or a newline. \S means a non-whitespace character.
So, ?\p{L}+ means a sequence of one or more letters, optionally preceded by a single space. This represents a “word”. Similarly, ?\p{N}+ represents a number without signs or the dot.
[^\s\p{L}\p{N}] means a character which is not a whitespace character, a letter, or a number. This leaves us with punctuation. Therefore, ?[^\s\p{L}\p{N}]+ is a sequence of successive punctuation characters, optionally preceded by a single space.
\s+ means one or more whitespace characters.
\s+(?!\S): what is this?
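For what it is worth, `(?!\S)` is a negative lookahead: it succeeds only when the next character is not a non-whitespace character, and it consumes nothing. So `\s+(?!\S)` matches a run of whitespace but, when a word follows, leaves the last whitespace character behind so that it can become the optional leading space of the next token. The sketch below tries to illustrate this with `std::regex`, whose default ECMAScript grammar supports lookahead. It is an ASCII-only approximation, not the real GPT-2 pattern: `[A-Za-z]` and `[0-9]` stand in for `\p{L}` and `\p{N}`, and the names `kWithLookahead`, `kWithoutLookahead`, and `split` are made up for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// ASCII-only stand-ins for the GPT-2 pattern: [A-Za-z] replaces \p{L} and
// [0-9] replaces \p{N}, because std::regex has no Unicode property classes.
const std::string kWithLookahead =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+)";
const std::string kWithoutLookahead =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+)";

// Collect every non-overlapping match of `pattern` in `text`, left to right.
std::vector<std::string> split(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);  // default grammar: ECMAScript
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

For the input `"a  b"` (two spaces), `split` with `kWithLookahead` should give `"a"`, `" "`, `" b"`, while `kWithoutLookahead` should give `"a"`, `"  "`, `"b"`. When words are separated by single spaces, the two patterns produce the same tokens.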
[std::regex](https://en.cppreference.com/w/cpp/regex) does not accept the Unicode character classes \p{L} or \p{N}. If we use them, the program will abort with the error code
error_escape and the message
the expression contains an invalid escaped character or a trailing escape
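The failure can be inspected without aborting by catching `std::regex_error`. The following is a minimal sketch, assuming C++17 for `std::optional`; the helper name `compile_error` is hypothetical. Whether `\p{L}` is rejected may depend on the standard library: the error above is what I see with the one that ships with Xcode, and other implementations may behave differently, for example by treating `\p` as a literal `p`.

```cpp
#include <optional>
#include <regex>
#include <string>

// Try to compile `pattern` with std::regex. Returns std::nullopt on success,
// or the error code carried by the thrown std::regex_error on failure.
std::optional<std::regex_constants::error_type> compile_error(const std::string& pattern) {
    try {
        std::regex re(pattern);  // the constructor throws if the pattern is invalid
        return std::nullopt;
    } catch (const std::regex_error& e) {
        return e.code();
    }
}
```

Calling `compile_error(R"( ?\p{L}+)")` and comparing the result with `std::regex_constants::error_escape` shows how a given toolchain treats the Unicode classes.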
This seems to be a known problem, and the suggested workaround is boost::wregex. Unfortunately, Boost is not part of the iOS SDK and I do not want to bring it in as a dependency of my Xcode project. Moreover, according to this answer,
The Boost.Regex documentation explicitly states that there's no support for Unicode-specific character classes when using boost::wregex. If you want this functionality, you'll need to build Boost.Regex with ICU support enabled then use the boost::u32regex type instead of boost::wregex.
No! I do not want to build Boost from source code because it is huge.
I then remembered that several years ago I read Russ Cox’s great notes about doing regular expressions the right way, and his work, RE2.
RE2 is small. It supports Unicode regexps. More importantly, I can build it using CMake for both macOS and iOS (Simulator)! The following commands build RE2 for the host.
cmake -B b -S . -DCMAKE_INSTALL_PREFIX=b/install
cmake --build b --target install
The following commands build RE2 for iOS Simulator. You can change iphonesimulator into iphoneos to make it build for iOS devices.
cmake -B build-ios -S . \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_SYSROOT="$(xcodebuild -version -sdk iphonesimulator Path)" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=11.0 \
-DCMAKE_IOS_INSTALL_COMBINED=YES \
-DCMAKE_INSTALL_PREFIX=build-ios/install
cmake --build build-ios --target install
The following Python code is copied and pasted from the Transformers tokenizer repository.
import regex as re
pat = re.compile(
r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
def f(text):
    for token in re.findall(pat, text):
        print(f"token=\"{token}\"")
f("we'd see you say 世界你好真实好的很啊")
The output is as follows.
token="we"
token="'d"
token=" "
token="see"
token=" "
token="you"
token=" say"
token=" 世界你好真实好的很啊"
The following is the corresponding C++ code. I removed the sub-regexp \s+(?!\S), because RE2 cannot parse it: RE2 does not support lookahead assertions such as (?!...).
#include <cassert>
#include <iostream>
#include <string>
#include <re2/re2.h>
#include <re2/stringpiece.h>
int main() {
  std::string w;
  std::string text = "we'd see you say 世界你好真实好的很啊";
  re2::StringPiece input(text);
  // The whole pattern is wrapped in one capture group so that
  // FindAndConsume can store each match in w.
  RE2 re("('s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+)");
  assert(re.ok());  // compiled; if not, see re.error()
  // Each call matches the next token and advances input past it.
  while (RE2::FindAndConsume(&input, re, &w)) {
    std::cout << "token=\"" << w << "\"" << std::endl;
  }
}
The following command builds it and links the RE2 static library.
clang++ -std=c++20 b.cc \
-I ~/w/re2/b/install/include \
-L ~/w/re2/b/install/lib -lre2 -o b
The output is identical to that of the Python code above.