How to train better language detection with Bling Fire and FastText
In this small tutorial we illustrate the importance of a good tokenization model for multi-lingual text.
First things first: following the steps here, let's install FastText, download the training data, create valid.txt and train.txt, and train our baseline model.
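For reference, here is a minimal sketch of how such a split could be produced. The input file name all.txt is an assumption (not taken from the linked tutorial); the lines are assumed to already be in FastText's __label__xx <text> format, and the 10,000-line validation set matches the counts shown below.

import random

# Illustrative split of a labeled corpus (assumed name: all.txt) into
# train.txt and valid.txt; every line is "__label__xx <text>".
random.seed(0)
with open("all.txt", encoding="utf-8") as f:
    lines = f.readlines()
random.shuffle(lines)
with open("valid.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[:10000])   # 10,000 held-out validation lines
with open("train.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[10000:])   # the rest is used for training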
~/fastext$ wc -l train.txt valid.txt
8379309 train.txt
10000 valid.txt
8389309 total
~/fastext$ fasttext supervised -input train.txt -output langdetect -dim 16
~/fastext$ fasttext test langdetect.bin valid.txt
N 10000
P@1 0.96
R@1 0.96
We got 96% Precision@1 (the top predicted class is correct 96% of the time).
As suggested in the tutorial, let's enable character n-grams and train again.
~/fastext$ fasttext supervised -input train.txt -output langdetect24cgr -dim 16 -minn 2 -maxn 4
~/fastext$ fasttext test langdetect24cgr.bin valid.txt
N 10000
P@1 0.985
R@1 0.985
We got 98.5% Precision@1.
The model sizes are as follows:
~/fastext$ ls -lh langdetect.bin langdetect24cgr.bin
-rw-rw-r-- 1 sergei sergei 403M Jul 7 18:20 langdetect24cgr.bin
-rw-rw-r-- 1 sergei sergei 281M Jul 7 17:51 langdetect.bin
Now let's add Bling Fire into the mix. Instead of training the model on raw tokens, we will tokenize the text with the laser100k.bin model and use only the token IDs produced by Bling Fire for FastText training. laser100k.bin is a Unigram LM tokenization model with a 100K-token vocabulary, learned from a plain text corpus balanced by language. See https://github.com/microsoft/BlingFire/tree/master/ldbsrc/laser100k for details.
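As a rough sketch, the preprocessing could look like this in Python. It uses load_model and text_to_ids from the blingfire package; the output file names, the path to laser100k.bin, the maximum sequence length of 128, and the unknown-token id of 3 are illustrative assumptions, not values taken from the repository.

from blingfire import load_model, text_to_ids, free_model

# Load the Unigram LM tokenization model (path is illustrative).
h = load_model("/path/to/laser100k.bin")

def to_id_line(line):
    # Each input line is assumed to be "__label__xx <text>".
    label, _, text = line.rstrip("\n").partition(" ")
    # 128 is an assumed maximum sequence length, 3 an assumed <unk> id;
    # text_to_ids pads the output to the requested length with zeros,
    # which we drop here.
    ids = text_to_ids(h, text, 128, 3)
    return label + " " + " ".join(str(i) for i in ids if i != 0) + "\n"

for src_name, dst_name in [("train.txt", "train.ids.txt"),
                           ("valid.txt", "valid.ids.txt")]:
    with open(src_name, encoding="utf-8") as src, \
         open(dst_name, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(to_id_line(line))

free_model(h)

The resulting train.ids.txt and valid.ids.txt can then be fed to the same fasttext supervised and fasttext test commands as before, with the IDs acting as the tokens.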