How to train better language detection with Bling Fire and FastText
In this small tutorial we illustrate the importance of a good tokenization model for multi-lingual text.
First things first: following the steps here, let's install FastText, download the training data, create valid.txt and train.txt, and train our baseline model.
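For reference, here is a minimal sketch of how such a split could be produced. The input file name all.txt is an assumption (not taken from the linked tutorial); the lines are assumed to already be in FastText's __label__xx <text> format, and the 10,000-line validation set matches the counts shown below.

import random

# Illustrative split of a labeled corpus (assumed name: all.txt) into
# train.txt and valid.txt; every line is "__label__xx <text>".
random.seed(0)
with open("all.txt", encoding="utf-8") as f:
    lines = f.readlines()
random.shuffle(lines)
with open("valid.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[:10000])   # 10,000 held-out validation lines
with open("train.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[10000:])   # the rest is used for training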
~/fastext$ wc -l train.txt valid.txt
8379309 train.txt
10000 valid.txt
8389309 total
~/fastext$ fasttext supervised -input train.txt -output langdetect -dim 16
~/fastext$ fasttext test langdetect.bin valid.txt
N 10000
P@1 0.96
R@1 0.96
We got 96% Precision@1 (the top predicted class is correct 96% of the time).
As suggested in the tutorial, let's enable character n-grams and train again.
~/fastext$ fasttext supervised -input train.txt -output langdetect24cgr -dim 16 -minn 2 -maxn 4
~/fastext$ fasttext test langdetect24cgr.bin valid.txt
N 10000
P@1 0.985
R@1 0.985
We got 98.5% Precision@1.
The model sizes are as follows:
~/fastext$ ls -lh langdetect.bin langdetect24cgr.bin
-rw-rw-r-- 1 sergei sergei 403M Jul 7 18:20 langdetect24cgr.bin
-rw-rw-r-- 1 sergei sergei 281M Jul 7 17:51 langdetect.bin
Now let's add Bling Fire into the mix. Instead of training the model on raw tokens, we will tokenize the text with the laser100k.bin model and use only the token IDs produced by Bling Fire for FastText training. laser100k.bin is a Unigram LM tokenization model with a 100K-token vocabulary, learned from a plain text corpus balanced by language. See https://github.com/microsoft/BlingFire/tree/master/ldbsrc/laser100k for details.
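As a rough sketch, the preprocessing could look like this in Python. It uses load_model and text_to_ids from the blingfire package; the output file names, the path to laser100k.bin, the maximum sequence length of 128, and the unknown-token id of 3 are illustrative assumptions, not values taken from the repository.

from blingfire import load_model, text_to_ids, free_model

# Load the Unigram LM tokenization model (path is illustrative).
h = load_model("/path/to/laser100k.bin")

def to_id_line(line):
    # Each input line is assumed to be "__label__xx <text>".
    label, _, text = line.rstrip("\n").partition(" ")
    # 128 is an assumed maximum sequence length, 3 an assumed <unk> id;
    # text_to_ids pads the output to the requested length with zeros,
    # which we drop here.
    ids = text_to_ids(h, text, 128, 3)
    return label + " " + " ".join(str(i) for i in ids if i != 0) + "\n"

for src_name, dst_name in [("train.txt", "train.ids.txt"),
                           ("valid.txt", "valid.ids.txt")]:
    with open(src_name, encoding="utf-8") as src, \
         open(dst_name, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(to_id_line(line))

free_model(h)

The resulting train.ids.txt and valid.ids.txt can then be fed to the same fasttext supervised and fasttext test commands as before, with the IDs acting as the tokens.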