This repository was archived by the owner on Mar 19, 2024. It is now read-only.

Why does fastText's algorithm get worse when the number of notes increases? #702

@ssoheilmn

Description

I'm using this command to train a model:
$ ./fasttext skipgram -input train.txt -output model
Once the model is trained, I use a Python script to find the top 50 most relevant words for a given word. With the default settings, I get reasonable results when training on only 10,000 documents. As I increase the corpus to 100,000, 1,000,000, and 50,000,000 documents, the results get worse and worse (the suggested words become increasingly irrelevant).
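
For context, the lookup script is nothing special; a minimal sketch of it, assuming the official fastText Python bindings (pip install fasttext) and the model.bin produced by the training command above ("heart" is only a placeholder query word):

import fasttext

# Load the binary model written by `./fasttext skipgram -input train.txt -output model`
model = fasttext.load_model("model.bin")

# Top 50 nearest neighbors of the query word, ranked by cosine similarity
for score, word in model.get_nearest_neighbors("heart", k=50):
    print(f"{score:.4f}\t{word}")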

I tried running the algorithm with the following changes, but the results still become irrelevant as the corpus size grows (a combined command with all of these settings is shown after the list):

  • Normalizing the input text by removing punctuation, lowercasing, removing stopwords, etc.
  • Setting -minn to 3 (minimum length of char n-gram = 3)
  • Setting -maxn to 4 (maximum length of char n-gram = 4)
  • Setting -minCount to 40 (minimum number of word occurrences = 40)
  • Setting -dim to 300 (size of word vectors = 300)

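Combined into a single run, the invocation would look like this (flag names as documented in the fastText command line; the values are the settings listed above):

$ ./fasttext skipgram -input train.txt -output model -minn 3 -maxn 4 -minCount 40 -dim 300
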
I know the paper reports that results get slightly worse as the corpus size increases (Figure 1), but I'm wondering: 1) why does this happen in general; 2) why is it happening on such a large scale for me; and 3) is there any remedy for it?
