
Phonetic spellchecker

This program takes an English-y word and returns the dictionary word(s) with the most similar pronunciation. It catches a subset of misspellings that are invisible to traditional spellcheckers: those that “look different, but sound the same”.

Usage

python spellcheck.py

Example run

Enter ‘q’ to quit.

Enter a word: akshulie
Did you mean: actually

Enter a word: kuzz-zyn
Did you mean: cousin

Enter a word: zookeeny
Did you mean: zucchini

Enter a word: q

How it works

The idea is simple: map lexical sequences (words) to phonemic sequences (pronunciations), then look up the nearest phonemic sequence in the dictionary.

We map words to pronunciations with a sequence-to-sequence transformer trained on the CMU Pronouncing Dictionary. Sequences are compared with Levenshtein distance.
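
As a rough illustration of the lookup step, here is a minimal sketch. It assumes the transformer has already turned the input word into phonemes. The dictionary entries, the function names, and the predicted phoneme sequence are illustrative stand-ins, not the repository's actual code, and a pure-Python Levenshtein function is used in place of the editdistance package so that the snippet runs on its own.

```python
# Toy pronunciation dictionary: word -> ARPAbet phonemes (stress markers removed).
# Illustrative entries only; the real program draws on the CMU Pronouncing Dictionary.
PRONUNCIATIONS = {
    "actually": ["AE", "K", "CH", "AH", "W", "AH", "L", "IY"],
    "cousin": ["K", "AH", "Z", "AH", "N"],
    "zucchini": ["Z", "UW", "K", "IY", "N", "IY"],
}


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or free if x == y
            ))
        prev = curr
    return prev[-1]


def nearest_words(phonemes, dictionary=PRONUNCIATIONS):
    """Return every dictionary word whose pronunciation is closest to `phonemes`."""
    distances = {w: levenshtein(phonemes, p) for w, p in dictionary.items()}
    best = min(distances.values())
    return sorted(w for w, d in distances.items() if d == best)


# Hypothetical model output for "kuzz-zyn"; the real prediction may differ.
predicted = ["K", "AH", "Z", "IH", "N"]
print(nearest_words(predicted))  # ['cousin']
```

The comparison operates on phoneme tokens rather than characters, so a two-letter phoneme such as "AH" counts as a single symbol when computing the distance.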

Dependencies

The following packages need to be installed:

nltk — for the words corpus.
editdistance — fast C implementation of Levenshtein distance.
torch — neural network library. Version 1.6.0 or higher.
torchtext — used to process data. Note: the Example and Field classes will soon be deprecated.

Optimizations

There is lots of room for experimentation and improvement.

Sequence translation is greedy — could be improved with beam search.
Edit distance search is exhaustive and likely too slow for some applications (see https://norvig.com/spell-correct.html for alternatives).
Almost no hyperparameter search was conducted while training the model. Doubling the model dimensionality (from 128 to 256) improves accuracy by about 2%, but roughly quadruples the model size: the bundled seq2seq.pt file is ~5 MB, and >20 MB seemed excessive.
Vanilla transformer model — no fancy modifications.
Syllabic information stripped away for simplicity — could be incorporated.
Label smoothing may also improve accuracy.
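
To make the first item concrete, here is a small sketch of beam search next to greedy decoding. It does not use the bundled model: TOY_MODEL is a hypothetical hand-written table of next-token probabilities, chosen so that the greedy choice at the first step leads to a worse overall sequence. All names here are assumptions made for illustration.

```python
import math

# Hypothetical stand-in for the decoder: maps a prefix of tokens to a
# probability distribution over the next token. Unknown prefixes end the sequence.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"X": 0.5, "Y": 0.5},
    ("B",): {"Z": 0.9, "X": 0.1},
}


def next_log_probs(prefix):
    """Log-probability of each possible next token given `prefix`."""
    probs = TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(beam_width, max_len=10):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [(0.0, [])]  # (cumulative log-probability, tokens so far)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((score, tokens))  # finished; carry forward
                continue
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((score + lp, tokens + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == "<eos>" for _, tokens in beams):
            break
    best = beams[0][1]
    # Drop the end marker if the best sequence finished within max_len.
    return best[:-1] if best and best[-1] == "<eos>" else best


print(beam_search(beam_width=1))  # greedy: ['A', 'X'], probability 0.30
print(beam_search(beam_width=2))  # beam:   ['B', 'Z'], probability 0.36
```

With a beam width of 1 the procedure reduces to greedy decoding. A wider beam costs proportionally more decoder calls per step, so the width would need to be tuned against the accuracy gain.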

For your own use

To incorporate this work into a larger system, simply instantiate a Recommender and call recommend(), which takes a word and returns a list of matching words.

Disclaimer: There are few checks on malformed input. Also, the .py files are not organized as importable modules.
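
Because input checking is minimal, a caller may want to validate words before handing them to recommend(). The sketch below shows one possible wrapper. StubRecommender is a hypothetical stand-in that imitates only the interface described above (recommend() takes a word and returns a list), so the snippet runs outside the repository. The validation rules, including lowercasing the input, are assumptions and not something the project prescribes.

```python
class StubRecommender:
    """Hypothetical stand-in so this sketch runs outside the repository.

    It imitates only the described interface: recommend(word) -> list of words.
    """

    def recommend(self, word):
        return {"zookeeny": ["zucchini"]}.get(word, [])


def safe_recommend(recommender, text):
    """Validate `text` before passing it to recommender.recommend()."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    word = text.strip().lower()  # assumption: lowercase input is acceptable
    # Accept letters, plus hyphens and apostrophes (as in "kuzz-zyn").
    if not word or not all(ch.isalpha() or ch in "-'" for ch in word):
        return []
    return recommender.recommend(word)


print(safe_recommend(StubRecommender(), "  Zookeeny "))  # ['zucchini']
print(safe_recommend(StubRecommender(), "1234"))         # []
```

Inside the repository, the stub would presumably be replaced by an actual Recommender instance; how that class is constructed is not documented here.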

Acknowledgements

This small program would not be possible without the decades-long work that has gone into the CMU Pronouncing Dictionary.

I would also like to thank Ben Trevett and Aladdin Persson, whose transformer-from-scratch code I copied extensively.

makarr/spellcheck

Folders and files

Latest commit

History

Repository files navigation

Phonetic spellchecker

Usage

Example run

How it works

Dependencies

Optimizations

For your own use

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages