How to add a new BERT tokenizer model
SergeiAlonichau edited this page Aug 16, 2019 · 21 revisions
We assume the Bling Fire tools are already compiled and the PATH is set.
- Create a new directory under ldbsrc:
cd ldbsrc
mkdir bert_chinese
- Copy the contents of an existing model similar to yours into the new directory:
cp bert_base_tok/* bert_chinese
- Modify options.small to use the new output name for your bin file:
```
OUTPUT = bert_chinese.bin
USE_CHARMAP = 1
opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap
resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \
	$(tmpdir)/charmap.mmap.$(mode).dump
```
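The steps so far (create the directory, copy an existing model, rename the output file) can be sketched as one shell function. This is only an illustration: `new_bert_model` is a hypothetical helper, not part of Bling Fire, and it assumes the copied model keeps its `OUTPUT = ...` setting on a single line in options.small and that the new model name contains no `/` or `&` characters.

```shell
#!/bin/sh
# Sketch only: new_bert_model is a hypothetical helper, not a Bling Fire tool.
# Usage: new_bert_model <ldbsrc-dir> <existing-model> <new-model>
#   e.g. new_bert_model ldbsrc bert_base_tok bert_chinese
new_bert_model() {
    ldbsrc=$1
    src=$2
    dst=$3

    # Refuse to overwrite an existing model directory.
    if [ -e "$ldbsrc/$dst" ]; then
        echo "error: $ldbsrc/$dst already exists" >&2
        return 1
    fi
    # The model we copy from must exist.
    if [ ! -d "$ldbsrc/$src" ]; then
        echo "error: $ldbsrc/$src not found" >&2
        return 1
    fi

    # Same two commands as in the steps above: mkdir, then cp.
    mkdir "$ldbsrc/$dst" || return 1
    cp "$ldbsrc/$src"/* "$ldbsrc/$dst" || return 1

    # Rewrite only the OUTPUT line; every other option is left untouched.
    # A temporary file is used instead of "sed -i", which is not portable.
    opts="$ldbsrc/$dst/options.small"
    sed "s/^OUTPUT[[:space:]]*=.*/OUTPUT = $dst.bin/" "$opts" > "$opts.tmp" &&
        mv "$opts.tmp" "$opts"
}
```

The function stops early rather than overwriting an existing directory, so a half-edited model is not silently replaced. Any further changes to options.small still have to be made by hand.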
- If you don't want to use character normalization such as case folding and accent removal, then you need to remove the charmap.utf8 compilation