This repository was archived by the owner on Mar 19, 2024. It is now read-only.

Why does fastText's algorithm get worse when the number of notes increases? #702

@ssoheilmn

Description

I'm using this command to train a model:
$ ./fasttext skipgram -input train.txt -output model
Once the model is trained, I use a Python script to find the top 50 most relevant words for a given word. With the default settings, I get reasonable results when training on only 10,000 documents. As I increase the corpus to 100,000, 1,000,000, and 50,000,000 documents, the results get worse and worse (the suggested words become increasingly irrelevant).
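
For context, the lookup script is nothing special; a minimal sketch of it, assuming the official fastText Python bindings (pip install fasttext) and the model.bin produced by the training command above ("heart" is only a placeholder query word):

import fasttext

# Load the binary model written by `./fasttext skipgram -input train.txt -output model`
model = fasttext.load_model("model.bin")

# Top 50 nearest neighbors of the query word, ranked by cosine similarity
for score, word in model.get_nearest_neighbors("heart", k=50):
    print(f"{score:.4f}\t{word}")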

I tried running the algorithm with the following changes, but the results still become irrelevant as the corpus size grows (a combined command with all of these settings is shown after the list):

  • Normalizing the input text by removing punctuation, lowercasing, removing stopwords, etc.
  • Setting -minn to 3 (minimum length of char n-gram = 3)
  • Setting -maxn to 4 (maximum length of char n-gram = 4)
  • Setting -minCount to 40 (minimum number of word occurrences = 40)
  • Setting -dim to 300 (size of word vectors = 300)

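Combined into a single run, the invocation would look like this (flag names as documented in the fastText command line; the values are the settings listed above):

$ ./fasttext skipgram -input train.txt -output model -minn 3 -maxn 4 -minCount 40 -dim 300
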
I know the paper reports that results get slightly worse as the corpus size increases (Figure 1), but I'm wondering: 1) why does this happen in general; 2) why is it happening on such a large scale for me; and 3) is there any remedy for it?
