I'm using this command to train a model: $ ./fasttext skipgram -input train.txt -output model
Once the model is trained, I use a Python script to find the top 50 most relevant words for a given word. With the default settings and only 10,000 documents for training, the results are reasonable. As I increase the number of documents to 100,000, 1,000,000, and 50,000,000, the results get progressively worse (the suggested words become irrelevant).
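For reference, the lookup is roughly the following (a minimal sketch assuming the official `fasttext` Python bindings; `model.bin` is the file produced by the training command above, and the query word "example" is just a placeholder):

```python
import fasttext

# model.bin is produced by the ./fasttext skipgram command above
model = fasttext.load_model("model.bin")

# get_nearest_neighbors returns (cosine similarity, word) pairs,
# sorted by similarity; k=50 requests the top 50 neighbors.
for score, word in model.get_nearest_neighbors("example", k=50):
    print(f"{score:.4f}\t{word}")
```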
I tried running the algorithm with the following changes (the combined command is shown after this list), but the results still degrade as the corpus size grows:
- Normalizing the input text (punctuation removal, case folding, stopword removal, etc.)
- Setting -minn to 3 (minimum character n-gram length = 3)
- Setting -maxn to 4 (maximum character n-gram length = 4)
- Setting -minCount to 40 (minimum number of word occurrences = 40)
- Setting -dim to 300 (word vector dimensionality = 300)
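Concretely, the run with all of these flags together looks like this (same input and output paths as my original command):

$ ./fasttext skipgram -input train.txt -output model -minn 3 -maxn 4 -minCount 40 -dim 300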
I know the paper reports that results get slightly worse as the corpus size increases (Figure 1), but I'm wondering: 1) why does this happen in general; 2) why is it so severe in my case; and 3) is there any remedy?