
How to add a new BERT tokenizer model

SergeiAlonichau edited this page Aug 16, 2019 · 21 revisions

We assume the Bling Fire tools are already compiled and the PATH is set.

  1. Create a new directory under ldbsrc:

cd ldbsrc
mkdir bert_chinese

  2. Copy the contents of an existing model similar to yours into the new directory:

cp bert_base_tok/* bert_chinese

  3. Modify options.small to use the new output name for your bin file:

```
# Compilation options

OUTPUT = bert_chinese.bin

USE_CHARMAP = 1

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

USE_TEST_WTBT_DICT = 1

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \
	$(tmpdir)/charmap.mmap.$(mode).dump
```
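
The first three steps can also be scripted. The following is a minimal sketch, not part of Bling Fire: `new_bert_model` is a hypothetical helper name, and it assumes the template directory contains an options.small file with a single `OUTPUT = ...` line, as in the listing above.

```shell
# Sketch only: new_bert_model is a hypothetical helper, not a Bling Fire tool.
# Usage: new_bert_model <ldbsrc dir> <template model> <new model>
new_bert_model() {
    ldbsrc="$1"; template="$2"; model="$3"

    # Refuse to run if the template is missing or the target already exists.
    [ -d "$ldbsrc/$template" ] || { echo "no template: $ldbsrc/$template" >&2; return 1; }
    [ ! -e "$ldbsrc/$model" ] || { echo "already exists: $ldbsrc/$model" >&2; return 1; }

    # Steps 1 and 2: create the new directory and copy the template files.
    mkdir "$ldbsrc/$model" || return 1
    cp "$ldbsrc/$template"/* "$ldbsrc/$model"/

    # Step 3: point OUTPUT at the new bin file. The file is rewritten through
    # a temporary copy so the sketch does not rely on a particular `sed -i`.
    opts="$ldbsrc/$model/options.small"
    [ -f "$opts" ] || { echo "no options.small in template" >&2; return 1; }
    sed "s|^OUTPUT[[:space:]]*=.*|OUTPUT = $model.bin|" "$opts" > "$opts.tmp" &&
        mv "$opts.tmp" "$opts"
}
```

From the repository root, `new_bert_model ldbsrc bert_base_tok bert_chinese` would then reproduce the manual steps; the template directory itself is left unchanged.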

  4. If you don't want character normalization such as case folding and accent removal, remove the charmap.utf8 compilation.
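
One way to approach the charmap removal in options.small is sketched below. This is an assumption, not the documented procedure: it treats every line of options.small that mentions `charmap` (in the listing above, `USE_CHARMAP`, `opt_pack_charmap`, and the `charmap.mmap` resource entry) as part of the charmap compilation. `strip_charmap` is a hypothetical helper name.

```shell
# Sketch only: strip_charmap is a hypothetical helper, not a Bling Fire tool.
# Assumption: every line of options.small that mentions "charmap" belongs to
# the charmap compilation and can be dropped.
strip_charmap() {
    opts="$1"
    [ -f "$opts" ] || { echo "no such file: $opts" >&2; return 1; }

    # Keep a backup, then write back only the lines that do not mention charmap.
    cp "$opts" "$opts.bak" || return 1
    grep -v -i 'charmap' "$opts.bak" > "$opts"
}
```

After running `strip_charmap ldbsrc/bert_chinese/options.small`, review the result by hand, in particular that the `resources` list still ends cleanly. Other files copied from the template may also refer to the charmap and would need the same check.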
