Myanmar (Burmese) text intelligence library — 14-strategy checking pipeline, dictionary building, and AI model training, from O(1) SymSpell lookups to ONNX-powered inference.
mySpellChecker is a comprehensive text intelligence library built specifically for the Myanmar language. It covers three domains: a 14-strategy checking pipeline (from rule-based validation through grammar checking, N-gram context, hidden-compound and syllable-window detection, confusable and homophone detection, to ONNX-powered AI inference), a dictionary building pipeline (corpus ingestion, segmentation, N-gram frequency, SQLite packaging), and AI model training (semantic MLM fine-tuning with ONNX export). Since Myanmar script is written as a continuous stream without spaces between words, the library uses a multi-layer validation approach — starting with fast syllable-level checks and progressively applying deeper analysis including POS tagging, 8 grammar checkers, and context-aware semantic validation.
Note: v1.0 supports Standard Burmese (Myanmar) only. Other Myanmar-script languages (Shan, Karen, Mon, etc.) and extended Unicode ranges are planned for future releases.
- 14-Strategy Validation Pipeline: Composable strategies from fast rule checks (sub-10ms) to AI inference, each layer building on the previous.
- Syllable-First Architecture: Validates most errors at the syllable level before assembling into words for deeper analysis.
- SymSpell Algorithm: Custom O(1) symmetric delete implementation with Myanmar-specific variant generation for fast correction suggestions.
- N-gram Context Checking: Bigram/Trigram probabilities detect real-word errors (correct spelling, wrong context).
- Hidden Compound Detection: Recovers multi-token compound typos that the segmenter over-splits into individually valid syllables (e.g. ခုန်ကျစရိတ် → ကုန်ကျစရိတ်). Walks curated-vocabulary bigram/trigram windows and verifies against high-frequency dictionary compounds.
- Syllable-Window OOV Detection (opt-in): Complementary structural-phase detector that enumerates syllable windows across adjacent words and consults SymSpell for high-frequency near-matches.
- Homophone Detection: Bidirectional N-gram analysis catches sound-alike word errors with frequency-aware guards.
- Confusable Detection: Multi-layer valid-word confusion detection — statistical bigram, MLP classifier, and MLM semantic analysis.
- Meta-Classifier Post-Filter: Logistic regression model (41 features) replaces manual per-strategy confidence thresholds; compound-aware features in the v2 model drive FPR reduction.
- Grammar Checking: 8 specialized checkers — Aspect, Classifier, Compound, MergedWord, Negation, Particle, TenseAgreement, Register.
- POS Tagging: Pluggable backends — Rule-Based (fast), Viterbi HMM (balanced), Transformer (93% accuracy).
- Joint Segmentation: Simultaneous word segmentation and POS tagging in a single pass.
- Suffix-Aware Re-Segmentation: DefaultSegmenter post-processes oversized tokens and colloquial-locative merges (e.g. ရန်ကုန်မာ → [ရန်, ကုန်, မာ]) for cleaner downstream validation.
- Compound & Morpheme Handling: DP-based compound resolution, ternary compound splits in morpheme correction, productive reduplication validation.
- AI Semantic Checking (Optional): ONNX masked language model for context-aware validation.
- Named Entity Recognition: Heuristic and Transformer-based NER to reduce false positives on names and places.
- Multi-Format Corpus Ingestion: Build dictionaries from .txt, .csv, .tsv, .json, .jsonl, and .parquet files.
- Incremental Builds: Resume corpus processing without reprocessing completed files.
- Pluggable Storage: SQLite (default, disk-based) or MemoryProvider (RAM-based) with thread-safe connection pooling.
- Semantic Model Training: Train masked language models with word-boundary BPE, whole-word masking, and denoising objectives.
- ONNX Export & Quantization: Convert trained models to ONNX with quantization for production deployment.
- Text Normalization: Unified service — zero-width character removal, NFC/NFD normalization, Zawgyi conversion.
- Zawgyi Detection: Built-in detection and warning for legacy Zawgyi encoded text.
- Phonetic & Colloquial Handling: Phonetic hashing, colloquial variant detection (e.g., ကျနော် → ကျွန်တော်), configurable strictness.
- Tone Processing: Tone mark validation, disambiguation, and context-based correction.
- Bilingual Error Messages: Error reporting in English and Myanmar (Burmese).
- Cython/C++ Extensions: 11 performance-critical paths compiled to C++ with OpenMP parallelization.
- Streaming & Batch APIs: Process large documents with streaming, batch (check_batch), and async (check_async) APIs.
- Configurable: Pre-defined profiles (production, fast, accurate, development, testing), environment/file-based config loading, and DI container for advanced wiring.
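The symmetric-delete idea behind the SymSpell feature above can be sketched in a few lines of Python. This is an illustrative toy (ASCII words, no Myanmar-specific variant generation, and no final edit-distance verification of candidates), not the library's actual implementation:

```python
from collections import defaultdict

def deletes(word: str, max_dist: int) -> set[str]:
    """All strings reachable from `word` by deleting up to `max_dist` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary: list[str], max_dist: int = 1) -> dict[str, set[str]]:
    """Precompute delete-variant -> original-word mapping (the SymSpell trick)."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in dictionary:
        for d in deletes(word, max_dist):
            index[d].add(word)
    return index

def lookup(term: str, index: dict[str, set[str]], max_dist: int = 1) -> set[str]:
    """Candidate corrections: intersect deletes of the input with the index.

    A real implementation would then verify each candidate's true edit
    distance and rank by frequency; this sketch returns raw candidates.
    """
    candidates: set[str] = set()
    for d in deletes(term, max_dist):
        candidates |= index.get(d, set())
    return candidates

index = build_index(["hello", "help", "world"], max_dist=1)
print(lookup("helo", index))   # candidates sharing a delete variant with "helo"
```

Because all delete variants are precomputed, each lookup is a handful of hash-table probes rather than a scan of the dictionary, which is what makes the approach effectively O(1) per query.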
Full documentation is available at docs.myspellchecker.com.
What's new in v1.6.0? See the Release Notes for new validation strategies (mined-confusable, pre-segmenter raw probe), the compound-split confusable boost, the skip-rule confidence gate, consonant-gated Tall-AA normalization, the flat-AA dictionary migration, and spelling-first benchmark labeling.
- Introduction: Overview of the library and its architecture.
- Installation: Installation options and system requirements.
- Quick Start: Get up and running in 5 minutes.
- Configuration Guide: All configuration options and profiles.
- Overview: 14-strategy text checking pipeline.
- Syllable Validation: Core validation layer.
- Word Validation: Dictionary + SymSpell suggestions.
- Context Checking: N-gram probability analysis.
- Hidden Compound Detection: Recover compound typos hidden by segmenter over-splitting.
- Syllable-Window OOV: Multi-syllable OOV detection via SymSpell windows (opt-in).
- Confusable Detection: Multi-layer confusable word detection.
- Homophone Detection: Sound-alike error detection.
- Grammar Checking: Syntactic validation.
- Grammar Checkers: 8 specialized checkers.
- Grammar Engine: Rule engine internals.
- Named Entity Recognition: NER with 3 implementations + gazetteer.
- Loan Word Variants: Transliteration variant handling for English, Pali/Sanskrit loan words.
- POS Tagging: Pluggable tagging (Rule-Based, Viterbi, Transformer).
- Morphology Analysis: Word structure analysis.
- Compound Resolution: Compound word and reduplication validation.
- Segmenters: Word segmentation engines.
- Semantic Checking: AI-powered MLM validation.
- Validation Strategies: 14 composable strategies.
- Training Models: Train custom semantic models.
- Text Normalization: Unified normalization service.
- Text Utilities: Stemmer, Phonetic, Tone, Zawgyi.
- Text Validation: Input text validation.
- Streaming: Large document processing.
- Batch Processing: High-throughput parallel processing.
- Async API: Non-blocking spell check operations.
- Performance Tuning: Optimization strategies.
- Connection Pooling: Database connection management.
- Customization Guide: Extending and customizing behavior.
- Custom Dictionaries: Build and customize dictionaries.
- Custom Grammar Rules: Write YAML grammar rules.
- Caching: Algorithm and result caching.
- Resource Caching: Model and resource caching.
- Logging: Centralized logging configuration.
- Integration Guide: Integrate with web apps and APIs.
- Docker: Container deployment guide.
- Zawgyi Support: Legacy encoding handling.
- Pipeline Overview: Dictionary building pipeline.
- Corpus Format: Supported input formats.
- Ingestion: Corpus ingestion details.
- Building Dictionaries: Step-by-step build guide.
- Optimization: Performance tuning for large corpora.
- API Reference: Full API documentation.
- SpellChecker API: Main SpellChecker class reference.
- Configuration API: Configuration class reference.
- Provider Capabilities: Dictionary provider interface.
- Tokenizers: Tokenizer API reference.
- CLI Reference: Command-line interface guide.
- Core Overview: Core package internals.
- Syllable Validation: Syllable validator internals.
- Word Validation: Word validator internals.
- Training Internals: ML training pipeline internals.
- Algorithm Factory: Algorithm instantiation patterns.
- I/O Utilities: File I/O utilities reference.
- Algorithms Overview: Algorithm catalog.
- SymSpell: O(1) suggestion algorithm.
- Edit Distance: Myanmar-aware Levenshtein distance.
- Suggestion Ranking: Multi-signal ranking pipeline.
- Neural Reranker: ONNX-based MLP/GBT suggestion reranker.
- Suggestion Strategy: Strategy pattern for suggestions.
- Morpheme Suggestions: Morpheme-level and medial swap corrections.
- N-gram Context: Bigram/Trigram probability models.
- Context-Aware Checking: N-gram and syntactic rules.
- Semantic Algorithm: AI/ML inference internals.
- Grammar Rules Engine: Grammar rule processing.
- Tone Disambiguation: Tone mark resolution.
- NER Algorithm: NER implementation details.
- Segmentation Overview: Segmentation algorithm catalog.
- Syllable Segmentation: Syllable-level segmentation.
- Normalization Algorithm: Text normalization internals.
- Phonetic Algorithm: Phonetic hashing and similarity.
- Viterbi POS Tagger: HMM-based POS tagging.
- POS Disambiguator: POS disambiguation logic.
- Joint Segmentation: Combined segmentation + tagging.
- Architecture Overview: Multi-layer validation pipeline.
- System Design: Component architecture.
- Validation Pipeline: Pipeline execution flow.
- Component Diagram: Visual component map.
- Data Flow: Data flow through the system.
- Dependency Injection: Component management system.
- Extension Points: How to extend the library.
- Reference Overview: Technical reference index.
- Error Types: Error classification reference.
- Error Codes: Complete error code listing.
- Rules System: YAML configuration files.
- Constants: Myanmar Unicode constants and character sets.
- Glossary: Terms and definitions.
- Phonetic Data: Phonetic groups and similarity mappings.
- Pipeline Core: Data pipeline core module.
- Database Schema: SQLite schema reference.
- Schema Management: Schema versioning and migrations.
- Providers: Data source providers.
- Processing: Text processing stages.
- POS Inference: POS tagging during build.
- Segmentation Repair: Segmentation error correction.
- Pipeline Reporter: Build progress reporting.
- FAQ: Frequently asked questions.
- Troubleshooting: Common issues and solutions.
- Comparisons: How mySpellChecker compares to other tools.
- Development Guide: Development overview.
- Setup: Development environment setup.
- Contributing: Contribution guidelines.
- Naming Conventions: Code naming standards.
- Testing: Test suite and coverage.
- Benchmarks: Benchmark suite and scoring methodology.
- Cython Dev Guide: Working with Cython extensions.
- Cython Reference: Cython patterns and optimization.
- CLI Formatting: CLI output formatting internals.
Prerequisites:
- Python 3.10+
- C++ Compiler (GCC/Clang/MSVC) for building Cython extensions.
Standard (Recommended):
pip install myspellchecker
With Transformer POS Tagging (Optional):
# Enables transformer-based POS tagging for 93% accuracy
pip install "myspellchecker[transformers]"
Full (with all features):
pip install "myspellchecker[ai,build,train,transformers]"
The library requires a dictionary database. You can build a sample one or use your own corpus.
# Install build dependencies (pyarrow, duckdb, etc.)
pip install "myspellchecker[build]"
# Build a sample database for testing
myspellchecker build --sample
# Build from your own text corpus
myspellchecker build --input corpus.txt --output mySpellChecker.db
Python:
from myspellchecker.core import SpellCheckerBuilder, ConfigPresets, ValidationLevel
# 1. Initialize with Builder (Recommended)
checker = (
SpellCheckerBuilder()
.with_config(ConfigPresets.DEFAULT)
.with_phonetic(True)
.build()
)
# 2. Simple Syllable Check (Fastest)
text = "မြနမ်ာနိုင်ငံ"
result = checker.check(text)
print(f"Corrected: {result.corrected_text}")
# Output: မြန်မာနိုင်ငံ
# 3. Context-Aware Check (Slower, more accurate)
# Detects that 'နီ' (Red) is wrong in this context, suggests 'နေ' (Stay/Ing)
text = "မင်းဘာလုပ်နီလဲ"
result = checker.check(text, level=ValidationLevel.WORD)
print(f"Corrected: {result.corrected_text}")
# Output: မင်းဘာလုပ်နေလဲ
CLI:
See the CLI Reference for full details.
# Check a string
echo "မင်္ဂလာပါ" | myspellchecker
# Check a file with rich output
myspellchecker check input.txt --format rich
# Segment text with POS tags
echo "မြန်မာနိုင်ငံ" | myspellchecker segment --tag
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers.sqlite import SQLiteProvider
# Configure with custom settings
config = SpellCheckerConfig(
max_edit_distance=2,
max_suggestions=5,
use_context_checker=True,
use_phonetic=True,
use_ner=True
)
checker = SpellChecker(
config=config,
provider=SQLiteProvider(database_path="mySpellChecker.db")
)
See the Configuration Guide for all options.
Configure logging globally at the start of your application:
from myspellchecker.utils.logging_utils import configure_logging
# Enable verbose debug logging
configure_logging(level="DEBUG")
# Or use structured JSON logging for production
configure_logging(level="INFO", json_output=True)
See the Logging Guide for details.
Eight specialized grammar checkers for Myanmar:
from myspellchecker.grammar.checkers.register import RegisterChecker
checker = RegisterChecker()
errors = checker.validate_sequence(["သူ", "သည်", "စာအုပ်", "ဖတ်", "တယ်"])
# Detects mixed register (formal "သည်" + colloquial "တယ်")
See the Grammar Checkers Guide for details.
Reduce false positives by identifying names and places:
from myspellchecker.core.config import SpellCheckerConfig
config = SpellCheckerConfig(use_ner=True)
See the NER Guide for details.
mySpellChecker supports advanced linguistic analysis:
- Pluggable POS Tagging: Rule-Based (fastest), Viterbi (balanced), or Transformer (most accurate).
- Joint Segmentation: Combine word breaking and tagging in a single pass.
from myspellchecker.core.config import SpellCheckerConfig, JointConfig
config = SpellCheckerConfig(
joint=JointConfig(enabled=True)
)
checker = SpellChecker(config=config)
words, tags = checker.segment_and_tag("မြန်မာနိုင်ငံ")
See the POS Tagging Guide for details.
Composable validation pipeline with 14 strategies:
| Strategy | Priority | Purpose |
|---|---|---|
| ToneValidation | 10 | Tone mark disambiguation |
| Orthography | 15 | Medial order and compatibility |
| SyntacticRule | 20 | Grammar rule checking |
| SyllableWindowOOV | 22 | Multi-syllable OOV detection via SymSpell windows (opt-in) |
| HiddenCompoundTypo | 23 | Compound typos hidden by segmenter over-splitting |
| StatisticalConfusable | 24 | Bigram-based confusable detection |
| BrokenCompound | 25 | Broken compound word detection |
| POSSequence | 30 | POS sequence validation |
| Question | 40 | Question structure |
| Homophone | 45 | Sound-alike detection |
| ConfusableCompoundClassifier | 47 | MLP-based confusable/compound detection |
| ConfusableSemantic | 48 | MLM-enhanced confusable detection |
| NgramContext | 50 | N-gram probability |
| Semantic | 70 | AI-powered validation (ONNX) |
See the Validation Strategies Guide for details.
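Conceptually, the priority column above drives a run-in-ascending-order loop over composable strategies. A minimal sketch of that pattern (toy strategies with hypothetical check logic, not the library's actual API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Strategy:
    name: str
    priority: int                      # lower values run first, as in the table
    run: Callable[[str], list[str]]    # returns a list of detected issues

def run_pipeline(text: str, strategies: list[Strategy]) -> dict[str, list[str]]:
    """Execute strategies in ascending priority order, collecting issues per strategy."""
    report: dict[str, list[str]] = {}
    for s in sorted(strategies, key=lambda s: s.priority):
        issues = s.run(text)
        if issues:
            report[s.name] = issues
    return report

# Toy stand-ins: real strategies inspect syllables, n-grams, POS tags, etc.
tone = Strategy("ToneValidation", 10, lambda t: ["tone issue"] if "x" in t else [])
ngram = Strategy("NgramContext", 50, lambda t: ["odd bigram"] if "zz" in t else [])

print(run_pipeline("zzx", [ngram, tone]))
```

Registration order does not matter; the sort on `priority` ensures cheap structural checks run before expensive contextual and AI-backed ones, which is the layering the table describes.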
Tested on a 1,304-sentence benchmark suite (641 clean, 663 with errors, 670 in-scope error spans; scope=spelling excludes 131 out-of-scope annotations) covering 3 difficulty tiers and 7 domains. The dictionary database and semantic model are not bundled with the library — users build or provide their own.
Test environment:
- Dictionary: Production SQLite database (577 MB, 601K words, 2.2M bigrams, enrichment tables)
- Semantic model: Custom RoBERTa MLM (6L/768H, ONNX quantized, 71 MB)
- Hardware: Apple Silicon, Python 3.14
- Benchmark: benchmarks/myspellchecker_benchmark.yaml (1,304 sentences)
| Metric | Value | vs v1.4.0 |
|---|---|---|
| F1 Score | 77.1% | +6.0 pts |
| Precision | 82.6% | +8.5 pts |
| Recall | 72.2% | +4.0 pts |
| FPR (clean sentences) | 10.8% | −7.8 pts |
| Top-1 Suggestion Accuracy | 70.5% | +0.8 pts |
| MRR | 0.7569 | +0.010 |
| p95 latency | 409ms | — |
For environments that don't ship the semantic model, the structural/contextual pipeline alone gives strong results at much lower latency:
| Metric | Value |
|---|---|
| F1 Score | 75.6% |
| Precision | 81.2% |
| Recall | 70.8% |
| FPR (clean sentences) | 11.1% |
| Top-1 Suggestion Accuracy | 73.2% |
| MRR | 0.7794 |
| p95 latency | 97ms |
| Composite score | 0.7899 |
The benchmark covers 14 validation strategies across conversational, news, technical, academic, religious, literary, and general domains with sentences ranging from simple syllable errors to hard context-dependent confusables.
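The N-gram real-word check behind those context-dependent confusables can be illustrated with a toy smoothed bigram model. English tokens stand in for Myanmar words here, and the counts, threshold, and function names are invented for illustration:

```python
from collections import Counter

# Toy counts; a real model derives these from corpus statistics.
bigrams = Counter({("stay", "home"): 90, ("red", "car"): 50, ("red", "home"): 1})
unigrams = Counter({"stay": 100, "red": 60, "home": 95, "car": 55})

def bigram_prob(w1: str, w2: str, alpha: float = 1.0, vocab: int = 4) -> float:
    """Add-alpha smoothed conditional probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)

def flag_real_word_error(w1: str, w2: str, alternative: str,
                         threshold: float = 5.0) -> bool:
    """Flag w1 if a sound-alike alternative makes the bigram far more probable."""
    return bigram_prob(alternative, w2) / bigram_prob(w1, w2) > threshold

# "red home" is suspicious: "stay home" is much more likely in this toy model,
# mirroring how နီ (red) vs. နေ (stay) is resolved by context.
print(flag_real_word_error("red", "home", "stay"))
```

The same ratio test run in both directions (preceding and following context) is the essence of the bidirectional homophone guard described in the features list.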
git clone https://github.com/thettwe/myspellchecker.git
cd myspellchecker
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
The test suite has 4,940 tests with 75% code coverage, organized into unit, integration, e2e, and stress tiers with auto-applied pytest markers.
# Run default test suite (~5 min, skips slow tests)
pytest tests/
# Run by category
pytest tests/ -m integration # 307 integration tests
pytest tests/ -m e2e # 10 end-to-end CLI tests
pytest tests/ -m slow # 39 slow tests (property-based, stress, DB builds)
# Run with coverage
pytest tests/ --cov=src/myspellchecker --cov-fail-under=65
# Formatting and linting
ruff format .
ruff check .
mypy src/myspellchecker
See the Development Guide for contributing guidelines and the Testing Guide for test suite details.
mySpellChecker integrates tools and research from the Myanmar NLP community:
| Resource | Author | Description | Link |
|---|---|---|---|
| Myanmar POS Model | Chuu Htet Naing | XLM-RoBERTa-based POS tagger (93.37% accuracy) | HuggingFace |
| Myanmar NER Model | Chuu Htet Naing | Transformer-based named entity recognition | HuggingFace |
| Myanmar Text Segmentation Model | Chuu Htet Naing | Transformer-based word segmenter | HuggingFace |
| myWord Segmentation | Ye Kyaw Thu | Viterbi-based Myanmar word segmentation | GitHub |
| myPOS | Ye Kyaw Thu | POS corpus used for CRF training | GitHub |
| myNER | Ye Kyaw Thu et al. | NER corpus with 7-tag annotation scheme, joint POS training | arXiv |
| myG2P | Ye Kyaw Thu | Myanmar grapheme-to-phoneme conversion dictionary | GitHub |
| CRF Word Segmenter | Ye Kyaw Thu | CRF-based syllable-to-word segmentation model | GitHub |
| myanmartools | Google | Zawgyi detection and conversion | GitHub |
| Library | Purpose | License |
|---|---|---|
| pycrfsuite | CRF model inference | MIT |
| transformers | Transformer model inference | Apache 2.0 |
| Algorithm | Author | Description | Link |
|---|---|---|---|
| SymSpell | Wolf Garbe | Symmetric delete spelling correction algorithm. mySpellChecker includes a custom implementation with Myanmar-specific variant generation. | GitHub |
| SymSpell4Burmese | Hlaing Myat Nwe et al. | Foundational research on adapting SymSpell for Burmese | IEEE |
If you use mySpellChecker in your research, please cite the relevant works:
@misc{chuuhtetnaing-myanmar-pos,
author = {Chuu Htet Naing},
title = {Myanmar POS Model},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/chuuhtetnaing/myanmar-pos-model}
}
@misc{yekyawthu-myword,
author = {Ye Kyaw Thu},
title = {myWord: Word Segmentation Tool for Burmese},
year = {2017},
publisher = {GitHub},
url = {https://github.com/ye-kyaw-thu/myWord}
}
@misc{garbe-symspell,
author = {Wolf Garbe},
title = {SymSpell: Symmetric Delete Spelling Correction Algorithm},
year = {2012},
publisher = {GitHub},
url = {https://github.com/wolfgarbe/SymSpell}
}
@inproceedings{symspell4burmese,
title = {SymSpell4Burmese: Symmetric Delete Spelling Correction Algorithm for Burmese},
author = {Hlaing Myat Nwe and others},
year = {2021},
booktitle = {IEEE Conference},
url = {https://ieeexplore.ieee.org/document/9678171/}
}
@misc{yekyawthu-mypos,
author = {Ye Kyaw Thu},
title = {myPOS: POS Corpus for Myanmar Language},
publisher = {GitHub},
url = {https://github.com/ye-kyaw-thu/myPOS}
}
@misc{chuuhtetnaing-myanmar-segmentation,
author = {Chuu Htet Naing},
title = {Myanmar Text Segmentation Model},
publisher = {Hugging Face},
url = {https://huggingface.co/chuuhtetnaing/myanmar-text-segmentation-model}
}
@misc{chuuhtetnaing-myanmar-ner,
author = {Chuu Htet Naing},
title = {Myanmar NER Model},
publisher = {Hugging Face},
url = {https://huggingface.co/chuuhtetnaing/myanmar-ner-model}
}
@inproceedings{myner-2025,
title = {myNER: Contextualized Burmese Named Entity Recognition with Bidirectional LSTM and fastText Embeddings via Joint Training with POS Tagging},
author = {Kaung Lwin Thant and Kwankamol Nongpong and Ye Kyaw Thu and Thura Aung and Khaing Hsu Wai and Thazin Myint Oo},
year = {2025},
booktitle = {4th International Conference on Cybernetics and Innovations (ICCI 2025)},
note = {Best Presentation Award},
url = {https://arxiv.org/abs/2504.04038}
}
@misc{yekyawthu-myg2p,
author = {Ye Kyaw Thu},
title = {myG2P: Myanmar Grapheme to Phoneme Conversion Dictionary},
publisher = {GitHub},
url = {https://github.com/ye-kyaw-thu/myG2P}
}Thanks to these researchers and developers for making their work publicly available, enabling high-quality Myanmar language processing.
This project is licensed under the MIT License.