This program takes an English-y word and returns the dictionary word(s) with the most similar pronunciation. It catches a subset of misspellings invisible to traditional spellcheckers, namely those that “look different, but sound the same”.
python spellcheck.py
Enter ‘q’ to quit.
Enter a word: akshulie
Did you mean: actually
Enter a word: kuzz-zyn
Did you mean: cousin
Enter a word: zookeeny
Did you mean: zucchini
Enter a word: q
The idea is simple: map lexical sequences (words) to phonemic sequences (pronunciations), then look up the nearest phonemic sequence in the dictionary.
We map words to pronunciations with a sequence-to-sequence transformer trained on the CMU Pronouncing Dictionary. Sequences are compared with Levenshtein distance.
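The lookup step can be illustrated with a minimal sketch. It assumes the model has already produced a phoneme sequence, and it uses a pure-Python Levenshtein distance in place of the editdistance package so that it runs without dependencies. The toy dictionary, the predicted phonemes, and the names levenshtein and nearest_words are illustrative assumptions, not the program's actual code.

```python
from typing import Dict, List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def nearest_words(phonemes: Sequence[str],
                  pronunciations: Dict[str, List[str]]) -> List[str]:
    """Exhaustively find the word(s) whose pronunciation is closest."""
    distances = {word: levenshtein(phonemes, pron)
                 for word, pron in pronunciations.items()}
    best = min(distances.values())
    return [word for word, d in distances.items() if d == best]


# Toy dictionary with ARPAbet-style phonemes (illustrative, stress stripped).
TOY_PRONUNCIATIONS = {
    "cousin": ["K", "AH", "Z", "AH", "N"],
    "cushion": ["K", "UH", "SH", "AH", "N"],
    "zucchini": ["Z", "UW", "K", "IY", "N", "IY"],
}

# Hypothetical model output for the input "kuzz-zyn".
predicted = ["K", "AH", "Z", "IH", "N"]
print(nearest_words(predicted, TOY_PRONUNCIATIONS))  # prints ['cousin']
```

Returning every word tied at the minimum distance matches the "dictionary word(s)" behaviour described above, since homophones share a pronunciation.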
The following packages need to be installed:
- nltk: for the words corpus.
- editdistance: fast C implementation of Levenshtein distance.
- torch: neural network library. Version 1.6.0 or higher.
- torchtext: used to process data. Note: the Example and Field classes will soon be deprecated.
There is lots of room for experimentation and improvement.
- Sequence translation is greedy — could be improved with beam search.
- Edit distance search is exhaustive and likely too slow for some applications (see https://norvig.com/spell-correct.html for alternatives).
- Almost no hyperparameter search conducted while training the model. Doubling dimensionality (from 128 to 256) yields about 2% accuracy improvement but at the cost of quadrupling size. The bundled seq2seq.pt file is ~5 MB and >20 MB seemed excessive.
- Vanilla transformer model, no fancy modifications.
- Syllabic information stripped away for simplicity — could be incorporated.
- Label smoothing may also improve accuracy.
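As a rough illustration of the first item, here is a generic beam search sketch. It is not tied to the bundled model: step_fn is a hypothetical callback that returns next-token log-probabilities for a given prefix, and the toy distribution below is invented purely to show a case where greedy decoding picks a worse sequence.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, cumulative log-probability)


def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                beam_width: int = 3,
                max_len: int = 20,
                eos: str = "<eos>") -> Hypothesis:
    """Return the highest-scoring sequence found with a fixed-width beam.

    Assumes step_fn always returns at least one token. Scores are raw
    log-probabilities with no length normalization.
    """
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates = [(prefix + [token], score + logp)
                      for prefix, score in beams
                      for token, logp in step_fn(prefix).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Hypotheses ending in eos are complete and leave the beam.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])


# Toy next-token probabilities (invented for illustration only).
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"<eos>": 0.55, "C": 0.45},
    ("B",): {"<eos>": 0.9, "C": 0.1},
    ("A", "C"): {"<eos>": 1.0},
    ("B", "C"): {"<eos>": 1.0},
}


def toy_step(prefix: List[str]) -> Dict[str, float]:
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(prefix)].items()}


print(beam_search(toy_step, beam_width=1)[0])  # greedy: ['A', '<eos>']
print(beam_search(toy_step, beam_width=2)[0])  # wider:  ['B', '<eos>']
```

With a width of 1 the search commits to "A" (probability 0.6) and ends with total probability 0.33, while a width of 2 keeps "B" alive long enough to find the sequence with probability 0.36.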
To incorporate this work into a larger system, simply instantiate a Recommender and call recommend(), which takes a word and returns a list of matching words.
Disclaimer: There are few checks on malformed input. Also, the .py files are not organized as importable modules.
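One way a larger system might wrap the recommender is sketched below. The only thing taken from this project is the recommend() method, which takes a word and returns a list of words. Everything else is an assumption for illustration: the correct_words helper, the vocabulary set, and StubRecommender, which stands in for the real Recommender because its constructor arguments are not described here.

```python
from typing import Iterable, List, Protocol, Set


class SupportsRecommend(Protocol):
    """Anything with recommend(word) -> list of words (Python 3.8+)."""

    def recommend(self, word: str) -> List[str]: ...


def correct_words(words: Iterable[str],
                  recommender: SupportsRecommend,
                  vocabulary: Set[str]) -> List[str]:
    """Replace out-of-vocabulary words with the first recommendation."""
    corrected = []
    for word in words:
        if word.lower() in vocabulary:
            corrected.append(word)  # known word: skip the model entirely
            continue
        suggestions = recommender.recommend(word)
        # Keep the original word when nothing is suggested.
        corrected.append(suggestions[0] if suggestions else word)
    return corrected


class StubRecommender:
    """Hypothetical stand-in for the real Recommender, for demonstration."""

    _table = {"akshulie": ["actually"], "zookeeny": ["zucchini"]}

    def recommend(self, word: str) -> List[str]:
        return self._table.get(word, [])


print(correct_words(["i", "akshulie", "like", "zookeeny"],
                    StubRecommender(), {"i", "like"}))
# prints ['i', 'actually', 'like', 'zucchini']
```

Checking the vocabulary first avoids running the model and the exhaustive distance search on words that are already spelled correctly.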
This small program would not be possible without the decades-long work that has gone into the CMU Pronouncing Dictionary.
I would also like to thank Ben Trevett and Aladdin Persson, whose transformer-from-scratch code I copied extensively.