How to add a new BERT tokenizer model
We assume the Bling Fire tools are already compiled and the PATH is set.
- Create a new directory under ldbsrc:
cd ldbsrc
mkdir bert_chinese
- Copy content of an existing model similar to yours into the new directory:
cp bert_base_tok/* bert_chinese
- Modify options.small to use a new output name for your bin file:
OUTPUT = bert_chinese.bin

USE_CHARMAP = 1

opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \
	$(tmpdir)/charmap.mmap.$(mode).dump \
If you don't want character normalization such as case folding and accent removal, remove the charmap.utf8 compilation from the options.small file and the charmap reference from ldb.conf.small:
OUTPUT = bert_chinese_no_normalization.bin

opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \

[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
# charmap 3
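If you need normalization such as case folding, accent removal, or something else, you can generate your own charmap.utf8 file. The format of the file is illustrated by the example below: lines starting with # are comments, and each mapping line gives a source character followed by its replacement, both written as \x-prefixed hexadecimal code points.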
Example:
# A --> a
\x0041 \x0061
# B --> b
\x0042 \x0062
# C --> c
\x0043 \x0063
# D --> d
\x0044 \x0064
# E --> e
\x0045 \x0065
# F --> f
\x0046 \x0066
# G --> g
\x0047 \x0067
# H --> h
\x0048 \x0068
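To make the format concrete, here is a minimal Python sketch that reads charmap lines like the ones in the example and applies them to a string. It is not part of the Bling Fire tools; the function names `parse_charmap` and `apply_charmap` are hypothetical, and the sketch assumes the first `\x` token on a line is the source character and the remaining tokens form its replacement.

```python
import re

# Matches one \xHHHH token (a hexadecimal code point), as in the example.
_TOKEN = re.compile(r"\\x([0-9A-Fa-f]+)")


def parse_charmap(text):
    """Return a dict mapping each source character to its replacement."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chars = [chr(int(h, 16)) for h in _TOKEN.findall(line)]
        if len(chars) < 2:
            continue  # not a mapping line
        # Assumption: first token is the source, the rest is the replacement.
        mapping[chars[0]] = "".join(chars[1:])
    return mapping


def apply_charmap(mapping, s):
    """Replace every mapped character in s; leave the others untouched."""
    return "".join(mapping.get(ch, ch) for ch in s)


SAMPLE = """\
# A --> a
\\x0041 \\x0061
# B --> b
\\x0042 \\x0062
"""

if __name__ == "__main__":
    print(apply_charmap(parse_charmap(SAMPLE), "ABC"))
```

With only A and B mapped, the character C passes through unchanged, which is how an unmapped character is treated in this sketch.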
It is easy to use a script to generate a charmap you need. For BERT casefolded models we use this command line:
python gen_charmap.py > charmap.utf8
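The contents of gen_charmap.py are not shown here. As a rough illustration only, a script of this kind could look like the sketch below, which lowercases each character and strips combining accents using Python's standard `unicodedata` module. The code point range, the helper names, and the choice to emit only single-character replacements are all assumptions of this sketch, not the behavior of the real script.

```python
import sys
import unicodedata


def normalize_char(ch):
    """Lowercase ch and drop combining marks (hypothetical normalization)."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def charmap_lines(first=0x0041, last=0x024F):
    """Yield comment and mapping lines for code points in [first, last]."""
    for cp in range(first, last + 1):
        src = chr(cp)
        dst = normalize_char(src)
        # Assumption: only emit mappings with a single-character replacement
        # that actually differs from the source.
        if len(dst) != 1 or dst == src:
            continue
        yield "# {} --> {}".format(src, dst)
        yield "\\x{:04X} \\x{:04X}".format(cp, ord(dst))


if __name__ == "__main__":
    # Write UTF-8 bytes explicitly so the output does not depend on the locale.
    sys.stdout.buffer.write(("\n".join(charmap_lines()) + "\n").encode("utf-8"))
```

Run the same way as the command above, with the output redirected to charmap.utf8. Widening the range or changing `normalize_char` is where a different normalization policy would go.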
After charmap.utf8 is created, make sure options.small and ldb.conf.small contain the compilation options and the resource reference for charmap.utf8, as before (see the bert_base_tok directory).
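A quick way to double-check this step is a small script that looks for the charmap-related entries shown earlier (`USE_CHARMAP = 1`, `opt_pack_charmap`, the `charmap.mmap` resource, and the `charmap` line in the ldb configuration). The sketch below is a hypothetical helper, not a Bling Fire tool; it only inspects text and treats lines starting with # as disabled.

```python
def active_lines(text):
    """Return stripped lines that are neither blank nor commented out."""
    lines = (raw.strip() for raw in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def charmap_checks(options_text, ldb_conf_text):
    """Return a dict of named checks, each True or False."""
    opts = active_lines(options_text)
    conf = active_lines(ldb_conf_text)
    return {
        "USE_CHARMAP": any("".join(l.split()) == "USE_CHARMAP=1" for l in opts),
        "opt_pack_charmap": any(l.startswith("opt_pack_charmap") for l in opts),
        "charmap resource": any("charmap.mmap" in l for l in opts),
        "ldb.conf charmap": any(l.split()[0] == "charmap" for l in conf),
    }


if __name__ == "__main__":
    # File names follow the text above; adjust the paths to your model directory.
    with open("options.small") as f_opts, open("ldb.conf.small") as f_conf:
        for name, ok in charmap_checks(f_opts.read(), f_conf.read()).items():
            print("{:20s} {}".format(name, "ok" if ok else "MISSING"))
```

Any check reported as missing points at the entry to restore from the bert_base_tok files.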