Spell checking and correction is one of the most common NLP tasks we need in our projects. From simple keyboard apps to text dataset pre-processing, automatic speech recognition, web/database search, OCR, language modeling, and text normalization, it is useful in many domains. There are hard-coded, algorithm-based spell checking and correction libraries, but they usually work by comparing character-level similarity between words and may lack contextual understanding of natural language. This model, on the other hand, can perform the spell correction task better because correction is done at the sentence level with an understanding of the language context.
This is an open-source spell correction language model based on the Transformer. You can benefit from its context-understanding capability to improve correction performance.
We train this model on a text corpus scraped from the internet and provide a pre-trained model. You can fine-tune it on your specific tasks by following our train/fine-tune documentation.
Here are some use cases of this language model (but it is not limited to these):
- Keyboard/input method apps: correct the user's typing.
- ASR (automatic speech recognition): ASR models may perform much worse without spell correction.
- OCR (optical character recognition): without spell correction, OCR systems may produce poor results, especially for handwritten text recognition. B vs 8? I vs 1? l vs I? 2 vs Z? This model can help you correct the text using the overall sentence context.
- Standalone spell correction applications.
- Dataset pre-processing for other language modeling tasks, e.g. large language models, machine translation models, and even TTS models.
- Search (web search or database search): we cannot find something in a database unless the search key matches the stored data exactly, so correcting the query first helps.
- Content management systems.
Anyone who needs it. Whether open source or closed source, free or commercial, personal or organizational, anyone can use this model in their own projects.
We use the standard encoder/decoder Transformer architecture from the PyTorch machine learning library.
- Encoder: we use character-level tokenization for the encoder input to avoid the OOV (out-of-vocabulary) issue.
- Decoder: BPE (byte pair encoding) tokenization for the decoder input, for performance and fast training. The decoder is used autoregressively for better output quality (see the sketch below).
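A minimal sketch of this encoder/decoder setup using PyTorch's built-in `nn.Transformer`. The vocabulary sizes, model width, and layer counts below are placeholder assumptions rather than the released model's hyperparameters, and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

# Placeholder assumptions: vocabulary sizes, model width, and layer counts
# are illustrative, not the released model's hyperparameters.
CHAR_VOCAB_SIZE = 256    # character-level vocabulary for the encoder
BPE_VOCAB_SIZE = 8000    # BPE vocabulary for the decoder
D_MODEL = 512

class SpellCorrector(nn.Module):
    """Character-level encoder, BPE-level autoregressive decoder."""
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(CHAR_VOCAB_SIZE, D_MODEL)  # character tokens in
        self.tgt_embed = nn.Embedding(BPE_VOCAB_SIZE, D_MODEL)   # BPE tokens out
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.lm_head = nn.Linear(D_MODEL, BPE_VOCAB_SIZE)

    def forward(self, src_chars: torch.Tensor, tgt_bpe: torch.Tensor) -> torch.Tensor:
        # Causal mask keeps the decoder autoregressive during training.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_bpe.size(1))
        hidden = self.transformer(
            self.src_embed(src_chars),   # (batch, src_len, d_model)
            self.tgt_embed(tgt_bpe),     # (batch, tgt_len, d_model)
            tgt_mask=tgt_mask,
        )
        return self.lm_head(hidden)      # logits over the BPE vocabulary
```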
- Preparing the dataset: we scraped around 600 MB of raw text data from the internet and cleaned it.
- Now we have a large number of sentences, which we assume to be correct. We then corrupt 15%~20% of the characters in each sentence by adding, modifying, or removing random characters, inspired by Google BERT and other masked language models (see the sketch after this list).
- The model learns to recover the corrupted, broken sentence, e.g. recover `غۇنچەم مەكتەپتىن قايتىپ كەلدى` from `غسنچەھ مەكتزپتىن ق ايىپ كەلى`.
- We can improve model performance by augmenting the dataset with common spelling mistakes. For example, `ى` is often typed instead of `ې` (e.g. `ئىيىق كەلدى` instead of `ئېيىق كەلدى`), and one of `ئو، ئۇ، ئۆ، ئۈ` is written in place of another (e.g. `ئۇرۇقلاش مەھسۇلاتى` instead of `ئورۇقلاش مەھسۇلاتى`).
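A rough sketch of this corruption and augmentation step. The character set, corruption rate, and confusion pairs here are assumptions for illustration; the pipeline used for the released model may differ:

```python
import random

# Assumed Uyghur character inventory used only for random replacements.
UYGHUR_CHARS = list("ئابپتجچخدرزژسشغفقكگڭلمنھوۇۆۈۋېىي")

def corrupt(sentence: str, rate: float = 0.18) -> str:
    """Randomly add, modify, or remove roughly 15%-20% of characters."""
    out = []
    for ch in sentence:
        if random.random() < rate:
            op = random.choice(["add", "modify", "remove"])
            if op == "add":
                out.append(ch)
                out.append(random.choice(UYGHUR_CHARS))
            elif op == "modify":
                out.append(random.choice(UYGHUR_CHARS))
            # "remove": skip the character entirely
        else:
            out.append(ch)
    return "".join(out)

def add_common_mistakes(sentence: str, rate: float = 0.3) -> str:
    """Swap in frequent real-world confusions, e.g. ې written as ى (assumed pairs)."""
    confusions = {"ې": "ى", "ۆ": "ۇ"}
    return "".join(
        confusions[ch] if ch in confusions and random.random() < rate else ch
        for ch in sentence
    )

# Training pair: corrupted sentence as input, original sentence as target.
clean = "غۇنچەم مەكتەپتىن قايتىپ كەلدى"
pair = (corrupt(add_common_mistakes(clean)), clean)
```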
- [TODO] Pre-train our model.
- [TODO] Publish our pre-trained model on HuggingFace.
- [TODO] Write instructions for training/fine-tuning on a custom dataset or training from scratch.
- [TODO] Provide an ONNX-exported model file.
- [TODO] Standalone spell checking desktop/web app using the Vulkan/Metal/CUDA/CPU/WebGPU backends provided by ONNX Runtime.
- Install the `uv` Python package manager by running `pip install uv`.
- Install the project dependencies with the uv command: `uv sync`

