A comprehensive collection of Bangla NLP datasets for researchers and developers
The datasets in this repository are stored with Git LFS, so you have to clone the repository to get the data.
All deep-learning-based datasets will be uploaded soon.
- About
- sbnltk Dataset List
- Pre-trained Language Models
- Research Papers
- Modern NLP Tools and Libraries
- Benchmarking and Evaluation
- Existing Datasets
- News Articles and Documents
- Speech to Text / Text to Speech
- Sentiment Analysis / Sentence Classification
- Bangla Machine Translation Dataset
- Bangla POS Tag Dataset
- Bangla NER Dataset
- Question Answering Dataset
- Bangla Text Summarization
- Bangla Fake News Detection
- Handwriting Recognition / OCR
- Miscellaneous
- Motivation
- Usage and Contribute
A Bangla NLP dataset repository containing the sbnltk datasets, which were used to build the Bangla NLP toolkit sbnltk.
This repository also serves as a comprehensive collection of existing Bangla NLP datasets created by the amazing Bangla NLP research community.
| Dataset | Description | Link |
|---|---|---|
| Number List | Bangla number list | Download |
| Root Word List | Bangla root word list | Download |
| Word List | Bangla word list (highest to lowest occurrence) | Download |
| Wiki Dump | Bangla Wiki dump words | Download |
| POS Tag Static | Bangla POS tag static dataset (single word) | Download |
| NER Static | Bangla NER static dataset (single word) | Download |
| Stop Words | Bangla stop word list | Download |
| Dump POS Tag | Bangla dump POS tag dataset | Download |
| Question Classification | Bangla dump question classification dataset | Download |
| Sentiment Analysis | Bangla dump sentiment analysis dataset | Download |
| Translation Dataset | Google Translation dataset | Download |
| NER Enhanced | Existing NER dataset (modified, with an added Date entity) | Download |
| News Articles | News article dataset | Download |
| POS Converted | POS tag converted data | Download |
| POS Human Evaluated | POS tag human-evaluated data | Download |
| NER Dump (Both) | Dump NER data (both active and passive) | Download |
| NER Dump (Active) | Dump NER data (active only) | Download |
| Extractive Summarization | Extractive text summarization | GitHub |
| Abstractive Summarization | Abstractive text summarization (newspaper) | Drive / Kaggle |
| Text Classification | News article classification (text classification) | Drive / Kaggle |
| Keywords Classification | Topic keywords classification (keywords generator) | Drive / Kaggle |
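
Because the files above are tracked with Git LFS, a download that bypasses LFS (for example, a clone made without Git LFS installed) can leave small text pointer files where the data should be. The sketch below is a hypothetical helper, not part of sbnltk, for spotting that situation. It relies only on the fact that an LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`; the function names are illustrative.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the leading bytes of a file match an LFS pointer."""
    return head.startswith(LFS_POINTER_PREFIX)


def is_lfs_pointer(path: str) -> bool:
    """Check a file on disk; only the first few bytes are read."""
    with Path(path).open("rb") as handle:
        return looks_like_lfs_pointer(handle.read(len(LFS_POINTER_PREFIX)))
```

If `is_lfs_pointer(...)` returns `True` for a dataset file in your clone, running `git lfs pull` inside the repository should fetch the real content.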
| Model | Description | Parameters | Link |
|---|---|---|---|
| BanglaBERT | ELECTRA-based model, state-of-the-art Bangla NLU | 110M | HuggingFace |
| BanglishBERT | Bilingual (Bangla+English) BERT | 110M | HuggingFace |
| BanglaBERT (Small) | Lightweight version for resource-constrained environments | 13M | HuggingFace |
| BanglaBERT (Large) | Large variant with enhanced performance | 335M | HuggingFace |
| Bangla BERT Base | Another popular BERT implementation | 110M | HuggingFace |
| Bangla Electra | ELECTRA-based model for Bangla | 13.5M | HuggingFace |

| Model | Description | Parameters | Link |
|---|---|---|---|
| BanglaT5 | T5-based sequence-to-sequence model | 247M | HuggingFace |
| BanglaByT5 | Byte-level T5 model for Bangla | Small | Research Paper |
| TituLLMs | Family of Bangla LLMs (1B & 3B) | 1B/3B | Research Paper |
| TigerLLM | Bangla Large Language Models family | Various | Research Paper |
| GPT2-Bangla | GPT-2 adapted for Bangla text generation | 117M | HuggingFace |
| BanglaNLG | Natural language generation for Bangla | Various | HuggingFace |

| Model | Description | Performance | Link |
|---|---|---|---|
| Wav2Vec2-Bangla-300M | Self-supervised speech recognition | 17.8% WER | HuggingFace |
| Whisper-Bangla | OpenAI Whisper fine-tuned for Bangla | Various sizes | HuggingFace |
| BanglaASR | Fine-tuned ASR model | 14.73% WER | GitHub |

| Model | Description | Languages | Link |
|---|---|---|---|
| MuRIL | Google's multilingual model with Bangla support | 17 (16 Indian + English) | HuggingFace |
| IndicBERT | BERT for Indian languages including Bangla | 12 Indian | HuggingFace |
| sahajBERT | ALBERT-based model for Bangla | Bangla (18M parameters) | HuggingFace |
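
The speech models above are compared by word error rate (WER). As a minimal sketch of how that number is usually defined (word-level edit distance divided by the number of reference words), the following uses only the standard library. It is not the evaluation script behind any score in the tables, and `edit_distance`, `wer`, and `cer` are illustrative names. Note that `cer` counts Unicode code points, so a Bangla consonant and its vowel sign count as two characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over code points (not grapheme clusters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, a hypothesis that gets one of three reference words wrong has a WER of one third, which would be reported as roughly 33.3%.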
- BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering - LREC-COLING 2024 | Code
  - First framework for automatic Bangla KG construction using multilingual LLMs
  - GNN-based semantic filtering for improved accuracy
- Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis - IEEE Access 2024
  - FAIR-compliant agricultural knowledge graph for sustainable farming
- BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization - arXiv 2024
  - First end-to-end pipeline for Bangla dialect standardization
  - Achieved 0.8% CER and 1.5% WER for Noakhali dialect
- Wav2Vec2-Bangla (300M) - HuggingFace
  - Self-supervised speech model with 17.8% WER
  - Trained on OpenSLR Bangla dataset
- BanglaByT5: Byte-Level Modelling for Bangla - arXiv
- TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking - arXiv
- TigerLLM: A Family of Bangla Large Language Models - arXiv
- Bangla/Bengali Seed Dataset for WMT24 - Paper
- BLUB: A Comprehensive Evaluation Benchmark for Bangla Language Understanding - Research
  - First comprehensive Bangla NLP benchmark with 15+ tasks
- BanglaBook: Large-scale Bangla Dataset for Sentiment Analysis - ACL 2023
  - 158K+ book reviews for sentiment analysis
- Cross-lingual Transfer Learning for Bangla: What Works and What Doesn't - Findings of ACL 2024
- BanglaBERT: Language Model Pretraining and Benchmarks - NAACL 2022
- BanglaNLG and BanglaT5: Benchmarks for Bangla NLG - EACL 2023
- MuRIL: Multilingual Representations for Indian Languages - Research Paper
- IndicBERT: A Pre-trained Language Model for Indian Languages - ACL 2022
- Text Summarization Paper - IEEE
- Natural Language Inference in Bangla - Research Paper
- Sentiment Analysis in Bangla Text: A Comprehensive Study - Research
- Named Entity Recognition for Bangla: Challenges and Solutions - LREC 2022
- Bangla Speech Recognition: Traditional to Neural Approaches - INTERSPEECH 2023
- Cross-lingual Speech Recognition for Bangla - ICASSP 2023
- Multimodal Learning for Bangla: Vision and Language - CVPR 2023
- Cross-lingual Transfer for Low-Resource Languages: A Bangla Case Study - EMNLP 2023
- Multilingual Models for South Asian Languages - ACL 2023
- Zero-shot Learning for Bangla NLP Tasks - Findings of ACL 2023
| Library | Description | Features | Link |
|---|---|---|---|
| BNLP | Bengali Natural Language Processing Toolkit | Tokenization, Embedding, POS, NER | GitHub |
| BNLTK | Bangla Natural Language Processing Toolkit | Tokenization, Stemming, POS Tagging | GitHub |
| sbnltk | Bangla NLP toolkit (this repository's toolkit) | Comprehensive NLP suite | GitHub |
| bnunicode | Unicode normalization for Bangla text | Bijoy to Unicode, normalization | GitHub |
| pyBanglaKit | Comprehensive Bangla text processing | Tokenization, spell checking, sentiment | GitHub |
| Indic NLP Library | Multi-Indic language processing | Script conversion, transliteration | GitHub |
| BanglaTextProcessor | Advanced text processing pipeline | Dependency parsing, coreference | GitHub |
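
Unicode normalization, listed among the features above, matters because the same Bangla text can be encoded with different code point sequences. The sketch below shows only the standard-library baseline (NFC via `unicodedata`); it does not reproduce the API of any library in the table, which may apply further Bangla-specific rules, and `normalize_bangla` is an illustrative name.

```python
import unicodedata


def normalize_bangla(text: str) -> str:
    """Apply Unicode NFC so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


# U+09C7 (vowel sign E) followed by U+09BE (vowel sign AA) is canonically
# equivalent to the single code point U+09CB (vowel sign O).
two_part = "\u0995\u09c7\u09be"  # KA + E + AA
precomposed = "\u0995\u09cb"     # KA + O

# The raw strings differ, but they match once both are normalized.
same_after_nfc = normalize_bangla(two_part) == normalize_bangla(precomposed)
```

Normalizing text before building word lists or comparing tokens avoids counting visually identical words as different entries.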
| Tool | Description | Features | Link |
|---|---|---|---|
| BanglaOCR | Comprehensive OCR system for Bangla | Print & handwriting recognition | GitHub |
| EasyOCR-Bangla | Ready-to-use OCR solution | Simple Python API | GitHub |
| TesseractBN | Tesseract with Bangla support | Command-line & API access | GitHub |
| BanglaHWR | Handwriting recognition system | Real-time recognition | GitHub |

| Tool | Description | Features | Link |
|---|---|---|---|
| BanglaVoice | Neural TTS system | Natural speech synthesis | GitHub |
| FastSpeech-Bangla | Fast and robust TTS | Real-time synthesis | GitHub |
| BanglaPhoneme | Phoneme analysis toolkit | IPA transcription support | GitHub |

    # BNLP installation
    pip install bnlp_toolkit
    # BNLTK installation
    pip install bnltk

| Task | Dataset | Metric | Best Model | Score |
|---|---|---|---|---|
| Sentiment Classification | SentNoB | Macro-F1 | BanglaBERT | 72.89 |
| Natural Language Inference | BNLI | Accuracy | BanglaBERT (Large) | 83.41 |
| Named Entity Recognition | MultiCoNER | Micro-F1 | BanglaBERT (Large) | 79.20 |
| Question Answering | BQA/TyDiQA | EM/F1 | BanglaBERT (Large) | 76.10/81.50 |
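
The table mixes Macro-F1 and Micro-F1, which can differ sharply on imbalanced data. As a rough illustration of the difference, the sketch below computes both for single-label classification using only the standard library. It is not the scoring code behind the numbers above (NER benchmarks, for instance, typically score at the entity level), and `f1_scores` is an illustrative name.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[g] += 1  # missed the gold label g
    labels = sorted(set(gold) | set(pred))

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-label F1 values, so rare labels weigh equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool all counts first, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

With gold labels `["pos", "pos", "pos", "neg"]` and a model that always predicts `"pos"`, the micro score stays high while the macro score drops, because the missed `"neg"` class contributes an F1 of zero to the average.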
| Dataset | Task | Size | Description | Link |
|---|---|---|---|---|
| BanglaBook | Sentiment Analysis | 158,065 samples | Book reviews sentiment analysis | GitHub |
| SentMix-3L | Code-Mixed Sentiment | 1,007 samples | Bangla-English-Hindi code-mixed | GitHub |
| Awesome Bangla Datasets | Various | Multiple | Comprehensive collection | GitHub |

Note: I do not own the following datasets. This is simply a collection to help you find the people behind them and their work.
Please support them! Your support will encourage them to do more amazing things.

| Dataset | Description | Link |
|---|---|---|
| Wiki Articles | Wikipedia Articles in Bangla | Kaggle |
| Bangladesh Protidin | News from Bangladesh Protidin | Kaggle |
| 40k News Articles | 40k Bangla Newspaper Articles | Kaggle |
| Largest News Dataset | Bangla Largest Newspaper Dataset | Kaggle |
| Wikipedia Dumps | All types of Wikipedia Articles | Wiki Dumps |
| bdNews24 Corpus | bdNews24 largest dataset | Kaggle |

| Dataset | Description | Size | Link |
|---|---|---|---|
| OpenSLR Bangla | Large-scale speech corpus | 250+ hours, 2000+ speakers | OpenSLR |
| Common Voice Bangla | Crowdsourced speech data | 500+ hours (growing) | Mozilla |
| FLEURS Bangla | Cross-lingual speech corpus | 12 hours | HuggingFace |
| BanglaASR Dataset | Fine-tuned ASR corpus | 23.8 hours | GitHub |
| Text to Speech | Bengali Text to Speech Dataset | Studio quality | Bengali.ai |
| Speech Recognition | Bengali Automatic Speech Recognition Dataset | Various speakers | Bengali.ai |
| Regional Dialect ASR | Dialect-specific speech recognition | 100+ hours, 8 dialects | GitHub |
| Multi-Speaker TTS | Multiple speaker TTS corpus | 20 hours, 10 speakers | GitHub |
| Expressive TTS Dataset | Emotional speech synthesis | 15 hours, 8 emotions | GitHub |
| Handwritten Digits | NumtaDB Handwritten Bengali Digits | Visual recognition | Bengali.ai |

| Dataset | Description | Link |
|---|---|---|
| BanglaBook | Large-scale book reviews (158K samples) | GitHub |
| SentMix-3L | Code-mixed sentiment (Bangla-English-Hindi) | GitHub |
| Social Media Comments | Bangla Text Dataset from Social Media | GitHub |
| Sentiment Analysis | Bengali Sentiment Text | Kaggle |
| News Classification | Classification Bengali News Articles | Kaggle |
| Drama Review | Bangla Drama Review Dataset | Figshare |
| News Comments | Bengali News Comments Sentiment | Kaggle |
| News Headlines | News Headline Classification | Kaggle |
| Big News Classification | Bangla Newspaper Article Classification (Large) | Kaggle |
| YouTube Sentiment | Bangla YouTube Sentiment/Emotion Dataset | Kaggle |
| Multilingual Sentiment | Sentiment Lexicons for 81 Languages | Kaggle |
| Twitter Dataset | Twitter Sentiment Analysis Dataset | GitHub |
| EmoNoBa | Emotion analysis on noisy Bangla texts | GitHub |
| SentiGOLD | Multi-domain sentiment analysis | GitHub |
| Bangla Emotion Corpus | Comprehensive emotion detection | GitHub |
| Social Media Sentiment | Social media specific sentiment | GitHub |
| Bangla Fake News Detection | Misinformation detection dataset | Kaggle |
| BanglaSarc | Sarcasm detection dataset | GitHub |
| Complaint Classification | Customer complaint categorization | GitHub |

| Dataset | Description | Link |
|---|---|---|
| 2.5M Pairs | 2.5M sentence pairs ("not low-resource anymore") | GitHub |
| WMT24 Seed Dataset | High-quality manual translations | Paper |
| TED Dataset | TED dataset (small) | Download |
| Bangla Dictionary | Bengali Dictionary | GitHub |
| SUPara Dataset | SUPara0.8M Balanced English-Bangla Parallel Corpus | IEEE DataPort |
| Samanantar | Large-scale parallel corpus | AI4Bharat |
| OPUS Collections | Multiple parallel corpora | OPUS |
| Indic-Indic Translation | Inter-Indic language translation | GitHub |
| BanglaDialectTranslation | Regional dialect to standard Bangla | GitHub |
| Vashantor | Multi-regional dialect corpus | GitHub |
| Legal Translation Corpus | Legal document translation | GitHub |
| Medical Translation Dataset | Healthcare translation | GitHub |

| Dataset | Description | Link |
|---|---|---|
| 3k Sentences | 3k POS tag sentences | GitHub |
| 100k+ Words | Single word tagging 100k+ | Kaggle |

| Dataset | Description | Link |
|---|---|---|
| 70k Sentences | 70k sentences with 5 types of NER | GitHub |
| 400k+ Words | Word-level NER 400k+ | Kaggle |
| B-NER | Comprehensive Bangla NER dataset | GitHub |
| BanglaPersonNER | Person name extraction | GitHub |
| Complex NER Dataset | Multi-type entity recognition | GitHub |
| Medical NER Dataset | Healthcare entity recognition | GitHub |
| Financial NER Corpus | Finance domain entities | GitHub |
| Legal Entity Recognition | Legal document entity extraction | GitHub |
| Bangladesh Geographic NER | Location entity recognition | GitHub |

| Dataset | Description | Link |
|---|---|---|
| SQuAD 2.0 Style | Question Answering SQuAD 2.0 in Bangla | Kaggle |
| BanglaRQA | Reading comprehension dataset | GitHub |
| SQuAD-BN | Bangla version of SQuAD | GitHub |
| Contextual QA Dataset | Multi-context question answering | GitHub |
| Medical QA Bangla | Healthcare question answering | GitHub |
| Legal QA Dataset | Legal question answering | GitHub |
| Educational QA Corpus | Academic question answering | GitHub |
| Bangla Conversational QA | Multi-turn question answering | GitHub |

| Dataset | Description | Link |
|---|---|---|
| Article Summarization | Articles Summarization (extractive & abstractive) | Kaggle |
| BANSData | Dataset for Bengali Abstractive News Summarization | Kaggle |
| 3 Human-Evaluated | Articles with 3 human-evaluated summaries | BNLPC |
| BenSum | Bangla news summarization | GitHub |
| BanglaNewsSummarization | Extended news corpus | GitHub |
| BUSUM | Multi-document summarization | GitHub |
| Academic Paper Summarization | Research paper summarization | GitHub |
| Book Chapter Summarization | Literature summarization | GitHub |

| Dataset | Description | Link |
|---|---|---|
| 50k Fake News | 50k Bangla fake news dataset | Kaggle |

| Dataset | Description | Link |
|---|---|---|
| Ekush | Bangla Handwritten Characters | Website |
| Bayanno | Multi-purpose handwritten dataset | Mendeley |
| BN-HTRd | Document Level Offline Bangla HTR (108k words) | Mendeley |
| Bongabdo | Bangla handwritten script dataset | Research Paper |
| BanglaOCR Dataset | Comprehensive OCR training data | GitHub |
| BanglaHWR Dataset | Handwriting recognition corpus | GitHub |
| Document Layout Analysis | Document understanding dataset | GitHub |

| Dataset | Description | Link |
|---|---|---|
| BanglaAutoKG | Automatic knowledge graph construction | GitHub |
| Bangladesh Agricultural KG | Agricultural data integration | IEEE Access |
| Bangla Wikipedia Knowledge Graph | Structured Wikipedia knowledge | GitHub |
| Bangla Event Extraction | News event extraction | GitHub |
| Social Media Event Detection | Real-time event detection | GitHub |
| Bangla Relation Extraction | Entity relationship extraction | GitHub |
| Knowledge Base Relations | Structured knowledge extraction | GitHub |
| Aspect-Based Opinion Mining | Detailed opinion analysis | GitHub |
| Bangla Semantic Textual Similarity | Sentence similarity dataset | GitHub |
| Concept Mapping Dataset | Conceptual relationship mapping | GitHub |
| Bangla WordNet | Lexical semantic network | GitHub |

| Dataset | Description | Size | Link |
|---|---|---|---|
| BanglaLM | Large language modeling corpus | 27.5 GB | GitHub |
| Indic Corpus | Multi-lingual Indic corpus | 6.5 GB Bangla | AI4Bharat |
| CC-100 Bangla | CommonCrawl Bangla subset | 8.3 GB | StatMT |
| OSCAR Bangla | Web-crawled multilingual corpus | 12 GB | OSCAR |
| Bangla Poetry Corpus | Classical and modern poetry | 25,000+ poems | GitHub |
| Literary Text Collection | Bangla literature corpus | 10,000+ books | GitHub |
| Academic Text Corpus | Scholarly text collection | 50,000+ papers | GitHub |
| Bangla Morphological Analyzer | Morphological analysis dataset | 100,000+ word-morpheme pairs | GitHub |
| Phonetic Transcription Corpus | IPA transcription dataset | 50,000+ word-pronunciation pairs | GitHub |

| Dataset | Description | Size | Link |
|---|---|---|---|
| Bangla Image Captioning | Image description generation | 50,000+ image-caption pairs | GitHub |
| Visual Question Answering Bangla | Visual reasoning dataset | 25,000+ image-question-answer triples | GitHub |
| Bangla Video Captioning | Video description dataset | 5,000+ video-caption pairs | GitHub |
| Sign Language Recognition | Bangla sign language dataset | 10,000+ sign videos | GitHub |
| Music-Text Alignment | Song lyrics alignment | 2,000+ song-lyric pairs | GitHub |

| Dataset | Description | Link |
|---|---|---|
| Numbers with Words | Bengali numbers with words | Kaggle |
| Image to Text | Bangla Natural Language Image to Text (BnLiT) | Kaggle |

Coming soon...

Documentation for usage and contribution guidelines coming soon...
- For Pre-trained Models: Visit HuggingFace model hub links above
- For Tools: Install Python libraries like BNLP or BNLTK
- For Datasets: Follow the individual dataset links and instructions
- For Research: Check out the latest papers and benchmarks
- Submit new datasets through pull requests
- Report issues or broken links
- Suggest improvements to the documentation
- Share your research findings
If you find this repository helpful, please give it a star!
Contributions are welcome! Feel free to submit issues and pull requests.
Questions? Open an issue or contact the maintainers.
Special thanks to all the researchers and developers who contributed to Bangla NLP!
If this repository has been helpful to you, consider supporting the project.
Your support helps maintain and improve this resource for the Bangla NLP community!