How to add a new BERT tokenizer model

We assume the Bling Fire tools are already compiled and the PATH is set.

Initial Steps

Create a new directory under ldbsrc:

cd ldbsrc
mkdir bert_chinese

Copy content of an existing model similar to yours into the new directory:

cp bert_base_tok/* bert_chinese

Modify options.small to use a new output name for your bin file:

OUTPUT = bert_chinese.bin

USE_CHARMAP = 1

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \
	$(tmpdir)/charmap.mmap.$(mode).dump \
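
The initial steps above can also be scripted. The following is a minimal sketch, not part of Bling Fire: the helper name create_model_dir is hypothetical, and it assumes options.small contains exactly one OUTPUT line, as in the listing above.

```python
import re
import shutil
from pathlib import Path


def create_model_dir(ldbsrc: Path, template: str, new_name: str) -> Path:
    """Copy a template model directory and point OUTPUT at a new .bin name.

    Hypothetical helper: mirrors the manual mkdir / cp / edit steps.
    """
    src = ldbsrc / template
    dst = ldbsrc / new_name
    dst.mkdir(exist_ok=False)  # fail loudly rather than overwrite a model

    # Rough equivalent of `cp bert_base_tok/* bert_chinese`:
    # top-level, non-hidden files only.
    for item in src.iterdir():
        if item.is_file() and not item.name.startswith("."):
            shutil.copy2(item, dst / item.name)

    # Rewrite the OUTPUT line in the copied options.small.
    options = dst / "options.small"
    text = options.read_text(encoding="utf-8")
    text, count = re.subn(
        r"^OUTPUT\s*=.*$",
        lambda _match: f"OUTPUT = {new_name}.bin",
        text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError(f"expected exactly one OUTPUT line, found {count}")
    options.write_text(text, encoding="utf-8")
    return dst
```

For example, create_model_dir(Path("ldbsrc"), "bert_base_tok", "bert_chinese") would reproduce the manual steps for the directory names used on this page.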

Disable Normalization

If you do not want character normalization, such as case folding and accent removal, remove the charmap.utf8 compilation from both options.small and ldb.conf.small:

options.small

OUTPUT = bert_chinese_no_normalization.bin

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \

ldb.conf.small

[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
# charmap 3
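
If you prefer to make these two edits with a script, the sketch below shows one way to do it. It is an illustration only: disable_charmap is a hypothetical helper, and it assumes that every charmap-related line in options.small contains the word "charmap" (in any case) and that ldb.conf.small references the charmap on a line starting with "charmap". It does not rename OUTPUT, so change that by hand if you want a distinct bin name.

```python
from pathlib import Path


def disable_charmap(model_dir: Path) -> None:
    """Remove charmap settings from options.small and ldb.conf.small.

    Hypothetical helper: a text-level edit, not a Bling Fire tool.
    """
    # options.small: drop USE_CHARMAP, opt_pack_charmap and the
    # charmap resource line (anything mentioning "charmap").
    options = model_dir / "options.small"
    kept = [
        line
        for line in options.read_text(encoding="utf-8").splitlines()
        if "charmap" not in line.lower()
    ]
    options.write_text("\n".join(kept) + "\n", encoding="utf-8")

    # ldb.conf.small: comment out the active charmap entry.
    conf = model_dir / "ldb.conf.small"
    out = []
    for line in conf.read_text(encoding="utf-8").splitlines():
        if line.split()[:1] == ["charmap"]:
            line = "# " + line
        out.append(line)
    conf.write_text("\n".join(out) + "\n", encoding="utf-8")
```

Review the resulting files against the listings above before building, since a blunt text filter like this can remove more than intended in a customized model directory.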

Enable Normalization

If you need normalization, such as case folding, accent removal, or something else, you can generate your own charmap.utf8 file. Each line of the file has the form: source character, space, replacement string. The replacement string has length 0 or more, usually 1. If there is no entry for a character, it remains unchanged. If the replacement string has length 0 (empty string), the character is deleted.

Example:

# A --> a
\x0041 \x0061

# B --> b
\x0042 \x0062

# C --> c
\x0043 \x0063

# D --> d
\x0044 \x0064

# E --> e
\x0045 \x0065

# F --> f
\x0046 \x0066

# G --> g
\x0047 \x0067

# H --> h
\x0048 \x0068
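
To make the mapping rules concrete, here is a small sketch that reads charmap lines and applies them to a string. It only models the rules described above and is not how Bling Fire implements them. The function names are hypothetical, only the \xHHHH escapes shown in the example are handled, and two details are assumptions: that a longer replacement is written as several consecutive escapes, and that a deletion is a line with a source character and nothing after it.

```python
import re

# Matches escapes such as \x0041 (hex code point).
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]+)")


def parse_charmap(text: str) -> dict[str, str]:
    """Build a {source character: replacement string} table."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chars = [chr(int(h, 16)) for h in _ESCAPE.findall(line)]
        if not chars:
            continue  # no escapes on this line; ignore it
        # First escape is the source; the rest (possibly none)
        # form the replacement. This layout is an assumption.
        mapping[chars[0]] = "".join(chars[1:])
    return mapping


def apply_charmap(mapping: dict[str, str], s: str) -> str:
    """Replace mapped characters; unmapped ones stay unchanged."""
    return "".join(mapping.get(ch, ch) for ch in s)
```

With the eight entries above loaded, apply_charmap would turn "ABC" into "abc" and leave a character such as "Z" untouched, because it has no entry.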

A script is the easiest way to generate the charmap you need. For case-folded BERT models we use this command:

python gen_charmap.py > charmap.utf8
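
The contents of gen_charmap.py are not shown on this page. As a rough idea of what such a generator could look like, the sketch below lowercases each character, decomposes it with NFD, and drops combining marks (Unicode category Mn). The function names and the default code point range are assumptions chosen for illustration, and only one-to-one mappings are emitted.

```python
import unicodedata


def fold_char(ch: str) -> str:
    """Lowercase ch and strip combining accents after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def charmap_lines(first: int = 0x0041, last: int = 0x024F):
    """Yield charmap.utf8 lines for a code point range (inclusive).

    The default range is an arbitrary choice for this sketch.
    """
    for code in range(first, last + 1):
        ch = chr(code)
        folded = fold_char(ch)
        # Emit only one-to-one mappings that actually change the character.
        if len(folded) == 1 and folded != ch:
            yield f"\\x{code:04X} \\x{ord(folded):04X}"


if __name__ == "__main__":
    for line in charmap_lines():
        print(line)
```

Run as a script with the output redirected to charmap.utf8, this produces lines in the same format as the example above; widen the range or add explicit entries for whatever other characters your model needs.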

After charmap.utf8 is created, make sure options.small and ldb.conf.small contain the compilation options and the resource reference for charmap.utf8, as before (see the bert_base_tok directory).
