🇧🇩 Bangla NLP Dataset

A comprehensive collection of Bangla NLP datasets for researchers and developers

⚠️ IMPORTANT NOTICES ⚠️

🔄 The datasets are stored with Git LFS, so you must clone the repository (with Git LFS installed) to get the data files.

🚀 All deep-learning-based datasets will be uploaded soon.
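
When a repository that uses Git LFS is cloned without Git LFS installed, the data files can end up as small text pointers instead of the real data. The sketch below is one way to detect that. It assumes the standard Git LFS pointer format, whose first line starts with `version https://git-lfs.github.com/spec/v1`. The names `is_lfs_pointer` and `find_lfs_pointers` are illustrative and not part of this repository.

```python
from pathlib import Path

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer.

    Pointer files are tiny text stubs that stand in for the real data
    when the LFS content has not been downloaded.
    """
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """List files under `root` that are still LFS pointers (hypothetical helper)."""
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and is_lfs_pointer(p)
    )
```

If `find_lfs_pointers(".")` returns any paths inside a clone, running `git lfs pull` there should replace the pointers with the actual files.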

📑 Table of Contents

📖 About
🎯 sbnltk Dataset List
🤖 Pre-trained Language Models
📄 Research Papers
🔧 Modern NLP Tools and Libraries
📊 Benchmarking and Evaluation
🌟 Existing Datasets
💡 Motivation
🤝 Usage and Contribute

📖 About

This repository contains the datasets used by sbnltk, a Bangla NLP toolkit.

This repository also serves as a comprehensive collection of existing Bangla NLP datasets created by the amazing Bangla NLP research community.

🎯 sbnltk Dataset List (DUMP & HUMAN Evaluated) (sbnltk Dataset)

Dataset	Description	Link
Number List	Bangla number list	📥 Download
Root Word List	Bangla root word list	📥 Download
Word List	Bangla word list (highest to lowest occurrence)	📥 Download
Wiki Dump	Bangla Wiki dump words	📥 Download
POS Tag Static	Bangla POS tag static dataset (single word)	📥 Download
NER Static	Bangla NER static dataset (single word)	📥 Download
Stop Words	Bangla stop word list	📥 Download
Dump POS Tag	Bangla dump POS tag dataset	📥 Download
Question Classification	Bangla dump question classification dataset	📥 Download
Sentiment Analysis	Bangla dump sentiment analysis dataset	📥 Download
Translation Dataset	Google translation dataset	📥 Download
NER Enhanced	Existing NER dataset (modified, with an added Date entity)	📥 Download
News Articles	News article dataset	📥 Download
POS Converted	POS tag converted data	📥 Download
POS Human Evaluated	POS tag human-evaluated data	📥 Download
NER Dump (Both)	Dump NER data (both active and passive)	📥 Download
NER Dump (Active)	Dump NER data (active only)	📥 Download
Extractive Summarization	Extractive text summarization	🔗 GitHub
Abstractive Summarization	Abstractive text summarization (newspaper)	📥 Drive | 📊 Kaggle
Text Classification	News article classification (text classification)	📥 Drive | 📊 Kaggle
Keywords Classification	Topic keywords classification (keywords generator)	📥 Drive | 📊 Kaggle
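
As an illustration of how a list such as the stop word list might be used, here is a minimal sketch. It assumes each list file is UTF-8 text with one entry per line, which this README does not state, so check the downloaded files first. The sample words are inline stand-ins, not the actual dataset contents, and `load_word_list` and `remove_stop_words` are hypothetical helpers.

```python
def load_word_list(text):
    """Parse a word list, assuming one entry per line (an assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word collection."""
    stop_set = set(stop_words)  # set lookup keeps filtering fast
    return [tok for tok in tokens if tok not in stop_set]


# Hypothetical sample standing in for the downloaded stop word file.
sample_stop_words = load_word_list("এবং\nও\nএই\n")

# Naive whitespace tokenization, used here only for illustration.
tokens = "আমি এবং তুমি এই বই কিনি".split()
filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)
```

For real text, a proper Bangla tokenizer from one of the toolkits listed in this README would be a better choice than `str.split`.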

🤖 Pre-trained Language Models

BERT-based Models

Model	Description	Parameters	Link
BanglaBERT	ELECTRA-based model, state-of-the-art Bangla NLU	110M	🤗 HuggingFace
BanglishBERT	Bilingual (Bangla+English) BERT	110M	🤗 HuggingFace
BanglaBERT (Small)	Lightweight version for resource-constrained environments	13M	🤗 HuggingFace
BanglaBERT (Large)	Large variant with enhanced performance	335M	🤗 HuggingFace
Bangla BERT Base	Another popular BERT implementation	110M	🤗 HuggingFace
Bangla Electra	ELECTRA-based model for Bangla	13.5M	🤗 HuggingFace

Generative Models (T5/GPT-style)

Model	Description	Parameters	Link
BanglaT5	T5-based sequence-to-sequence model	247M	🤗 HuggingFace
BanglaByT5	Byte-level T5 model for Bangla	Small	📄 Research Paper
TituLLMs	Family of Bangla LLMs (1B & 3B)	1B/3B	📄 Research Paper
TigerLLM	Bangla Large Language Models family	Various	📄 Research Paper
GPT2-Bangla	GPT-2 adapted for Bangla text generation	117M	🤗 HuggingFace
BanglaNLG	Natural language generation for Bangla	Various	🤗 HuggingFace

Speech Models

Model	Description	Performance	Link
Wav2Vec2-Bangla-300M	Self-supervised speech recognition	17.8% WER	🤗 HuggingFace
Whisper-Bangla	OpenAI Whisper fine-tuned for Bangla	Various sizes	🤗 HuggingFace
BanglaASR	Fine-tuned ASR model	14.73% WER	🔗 GitHub
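
The WER figures in the table above are word error rates: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below shows that calculation in plain Python. It is a generic illustration, not the evaluation script used by any of the listed models. The `cer` helper counts Unicode code points, which is an assumption; some evaluations may segment Bangla text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points (an assumption, see above)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One of three reference words is missing from the hypothesis.
print(wer("আমি ভাত খাই", "আমি খাই"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.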

Multilingual Models with Strong Bangla Support

Model	Description	Languages	Link
MuRIL	Google's multilingual model with Bangla support	17 Indian	🤗 HuggingFace
IndicBERT	BERT for Indian languages including Bangla	12 Indian	🤗 HuggingFace
sahajBERT	ALBERT-based model for Bangla	Bangla only (18M parameters)	🤗 HuggingFace

📄 Research Papers

Latest Research (2024-2025)

🧠 Knowledge Graphs and Semantic Analysis

BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering - 📖 LREC-COLING 2024 | 💻 Code
- First framework for automatic Bangla KG construction using multilingual LLMs
- GNN-based semantic filtering for improved accuracy
Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis - 📖 IEEE Access 2024
- FAIR-compliant agricultural knowledge graph for sustainable farming

🗣️ Speech and Multimodal Processing

BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization - 📖 arXiv 2024
- First end-to-end pipeline for Bangla dialect standardization
- Achieved 0.8% CER and 1.5% WER for Noakhali dialect
Wav2Vec2-Bangla (300M) - 🤗 HuggingFace
- Self-supervised speech model with 17.8% WER
- Trained on OpenSLR Bangla dataset

🌐 Large Language Models and Generation

BanglaByT5: Byte-Level Modelling for Bangla - 📖 arXiv
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking - 📖 arXiv
TigerLLM: A Family of Bangla Large Language Models - 📖 arXiv
Bangla/Bengali Seed Dataset for WMT24 - 📖 Paper

📊 Evaluation and Benchmarking

BLUB: A Comprehensive Evaluation Benchmark for Bangla Language Understanding - 📖 Research
- First comprehensive Bangla NLP benchmark with 15+ tasks
BanglaBook: Large-scale Bangla Dataset for Sentiment Analysis - 📖 Findings of ACL 2023
- 158K+ book reviews for sentiment analysis
Cross-lingual Transfer Learning for Bangla: What Works and What Doesn't - 📖 Findings of ACL 2024

Foundational Papers

🏗️ Language Models and Pretraining

BanglaBERT: Language Model Pretraining and Benchmarks - 📖 Findings of NAACL 2022
BanglaNLG and BanglaT5: Benchmarks for Bangla NLG - 📖 Findings of EACL 2023
MuRIL: Multilingual Representations for Indian Languages - 📖 Research Paper
IndicBERT (introduced with IndicNLPSuite) - 📖 Findings of EMNLP 2020

📊 Task-Specific Research

Text Summarization Paper - 📖 IEEE
Natural Language Inference in Bangla - 📖 Research Paper
Sentiment Analysis in Bangla Text: A Comprehensive Study - 📖 Research
Named Entity Recognition for Bangla: Challenges and Solutions - 📖 LREC 2022

🗣️ Speech and Multimodal

Bangla Speech Recognition: Traditional to Neural Approaches - 📖 INTERSPEECH 2023
Cross-lingual Speech Recognition for Bangla - 📖 ICASSP 2023
Multimodal Learning for Bangla: Vision and Language - 📖 CVPR 2023

🌐 Cross-lingual and Multilingual Studies

Cross-lingual Transfer for Low-Resource Languages: A Bangla Case Study - 📖 EMNLP 2023
Multilingual Models for South Asian Languages - 📖 ACL 2023
Zero-shot Learning for Bangla NLP Tasks - 📖 Findings of ACL 2023

🔧 Modern NLP Tools and Libraries

Python Libraries

Library	Description	Features	Link
BNLP	Bengali Natural Language Processing Toolkit	Tokenization, Embedding, POS, NER	🔗 GitHub
BNLTK	Bangla Natural Language Processing Toolkit	Tokenization, Stemming, POS Tagging	🔗 GitHub
sbnltk	Bangla NLP toolkit (this repository's toolkit)	Comprehensive NLP suite	🔗 GitHub
bnunicode	Unicode normalization for Bangla text	Bijoy to Unicode, normalization	🔗 GitHub
pyBanglaKit	Comprehensive Bangla text processing	Tokenization, spell checking, sentiment	🔗 GitHub
Indic NLP Library	Multi-Indic language processing	Script conversion, transliteration	🔗 GitHub
BanglaTextProcessor	Advanced text processing pipeline	Dependency parsing, coreference	🔗 GitHub
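
Unicode normalization matters for Bangla because the same visible text can be stored as different code point sequences. The sketch below uses only Python's standard `unicodedata` module, not the API of any library in the table above, to show the vowel sign O written as one code point (U+09CB) or as two (U+09C7 followed by U+09BE). The helper name `normalize_bangla` is illustrative.

```python
import unicodedata


def normalize_bangla(text):
    """Apply Unicode NFC so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


composed = "\u0995\u09cb"          # KA + VOWEL SIGN O (one code point)
decomposed = "\u0995\u09c7\u09be"  # KA + VOWEL SIGN E + VOWEL SIGN AA

print(composed == decomposed)                                      # False
print(normalize_bangla(composed) == normalize_bangla(decomposed))  # True
```

Normalizing text before building vocabularies or comparing strings avoids treating such variants as different words. NFC covers canonical equivalence only; dedicated Bangla normalizers typically handle further cases.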

OCR and Vision Tools

Tool	Description	Features	Link
BanglaOCR	Comprehensive OCR system for Bangla	Print & handwriting recognition	🔗 GitHub
EasyOCR-Bangla	Ready-to-use OCR solution	Simple Python API	🔗 GitHub
TesseractBN	Tesseract with Bangla support	Command-line & API access	🔗 GitHub
BanglaHWR	Handwriting recognition system	Real-time recognition	🔗 GitHub

Speech Processing Tools

Tool	Description	Features	Link
BanglaVoice	Neural TTS system	Natural speech synthesis	🔗 GitHub
FastSpeech-Bangla	Fast and robust TTS	Real-time synthesis	🔗 GitHub
BanglaPhoneme	Phoneme analysis toolkit	IPA transcription support	🔗 GitHub

Installation Examples

# BNLP installation
pip install bnlp_toolkit

# BNLTK installation  
pip install bnltk

📊 Benchmarking and Evaluation

Bangla Language Understanding Benchmark (BLUB)

Task	Dataset	Metric	Best Model	Score
Sentiment Classification	SentNoB	Macro-F1	BanglaBERT	72.89
Natural Language Inference	BNLI	Accuracy	BanglaBERT (Large)	83.41
Named Entity Recognition	MultiCoNER	Micro-F1	BanglaBERT (Large)	79.20
Question Answering	BQA/TyDiQA	EM/F1	BanglaBERT (Large)	76.10/81.50
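
The table above reports both Macro-F1 and Micro-F1. The sketch below shows how the two differ for a simple single-label classification task: macro-F1 averages the per-class F1 scores, so rare classes weigh as much as common ones, while micro-F1 pools the counts across classes. It is a generic illustration, not the benchmark's evaluation code. For NER, micro-F1 is normally computed over entity spans rather than the per-example labels used here. The labels in the example are made up.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    macro = sum(per_class) / len(labels)
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    return macro, micro


# A classifier that always predicts the majority class.
macro, micro = f1_scores(
    ["pos", "pos", "pos", "neg"],
    ["pos", "pos", "pos", "pos"],
)
print(round(macro, 4), round(micro, 4))
```

In this example the missed minority class pulls macro-F1 well below micro-F1, which is why macro-F1 is often preferred for imbalanced sentiment datasets.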

Recent Datasets for Benchmarking

Dataset	Task	Size	Description	Link
BanglaBook	Sentiment Analysis	158,065 samples	Book reviews sentiment analysis	🔗 GitHub
SentMix-3L	Code-Mixed Sentiment	1,007 samples	Bangla-English-Hindi code-mixed	🔗 GitHub
Awesome Bangla Datasets	Various	Multiple	Comprehensive collection	🔗 GitHub

🌟 Existing Datasets

📝 Note: I do not own the following datasets. This is simply a collection to help you find amazing people and their work.
🙏 Please give them support! Your support will encourage them to do more amazing things.

🔗 Awesome Dataset Sources

📰 News Articles and Documents

Dataset	Description	Link
Wiki Articles	Wikipedia Articles in Bangla	📊 Kaggle
Bangladesh Protidin	News from Bangladesh Protidin	📊 Kaggle
40k News Articles	40k Bangla Newspaper Articles	📊 Kaggle
Largest News Dataset	Bangla Largest Newspaper Dataset	📊 Kaggle
Wikipedia Dumps	All types of Wikipedia Articles	🔗 Wiki Dumps
bdNews24 Corpus	bdNews24 largest dataset	📊 Kaggle

🎤 Speech to Text / Text to Speech

Dataset	Description	Size	Link
OpenSLR Bangla	Large-scale speech corpus	250+ hours, 2000+ speakers	🔗 OpenSLR
Common Voice Bangla	Crowdsourced speech data	500+ hours (growing)	🔗 Mozilla
FLEURS Bangla	Cross-lingual speech corpus	12 hours	🤗 HuggingFace
BanglaASR Dataset	Fine-tuned ASR corpus	23.8 hours	🔗 GitHub
Text to Speech	Bengali Text to Speech Dataset	Studio quality	🔗 Bengali.ai
Speech Recognition	Bengali Automatic Speech Recognition Dataset	Various speakers	🔗 Bengali.ai
Regional Dialect ASR	Dialect-specific speech recognition	100+ hours, 8 dialects	🔗 GitHub
Multi-Speaker TTS	Multiple speaker TTS corpus	20 hours, 10 speakers	🔗 GitHub
Expressive TTS Dataset	Emotional speech synthesis	15 hours, 8 emotions	🔗 GitHub
Handwritten Digits	Numta Handwritten Bengali Digits	Visual recognition	🔗 Bengali.ai

😊 Sentiment Analysis / Sentence Classification

Dataset	Description	Link
BanglaBook	Large-scale book reviews (158K samples)	🔗 GitHub
SentMix-3L	Code-mixed sentiment (Bangla-English-Hindi)	🔗 GitHub
Social Media Comments	Bangla Text Dataset from Social Media	🔗 GitHub
Sentiment Analysis	Bengali Sentiment Text	📊 Kaggle
News Classification	Classification Bengali News Articles	📊 Kaggle
Drama Review	Bangla Drama Review Dataset	📊 Figshare
News Comments	Bengali News Comments Sentiment	📊 Kaggle
News Headlines	News Headline Classification	📊 Kaggle
Big News Classification	Bangla Newspaper Article Classification (Large)	📊 Kaggle
YouTube Sentiment	Bangla YouTube Sentiment/Emotion Dataset	📊 Kaggle
Multilingual Sentiment	Sentiment Lexicons for 81 Languages	📊 Kaggle
Twitter Dataset	Twitter Sentiment Analysis Dataset	🔗 GitHub
EmoNoBa	Emotion analysis on noisy Bangla texts	🔗 GitHub
SentiGOLD	Multi-domain sentiment analysis	🔗 GitHub
Bangla Emotion Corpus	Comprehensive emotion detection	🔗 GitHub
Social Media Sentiment	Social media specific sentiment	🔗 GitHub
Bangla Fake News Detection	Misinformation detection dataset	📊 Kaggle
BanglaSarc	Sarcasm detection dataset	🔗 GitHub
Complaint Classification	Customer complaint categorization	🔗 GitHub

🔄 Bangla Machine Translation Dataset

Dataset	Description	Link
2.5M Pairs	2.5M sentence pairs ("not low-resource anymore")	🔗 GitHub
WMT24 Seed Dataset	High-quality manual translations	📖 Paper
TED Dataset	TED dataset (small)	📥 Download
Bangla Dictionary	Bengali Dictionary	🔗 GitHub
SUPara Dataset	SUPara0.8M Balanced English-Bangla Parallel Corpus	📊 IEEE DataPort
Samanantar	Large-scale parallel corpus	🔗 AI4Bharat
OPUS Collections	Multiple parallel corpora	🔗 OPUS
Indic-Indic Translation	Inter-Indic language translation	🔗 GitHub
BanglaDialectTranslation	Regional dialect to standard Bangla	🔗 GitHub
Vashantor	Multi-regional dialect corpus	🔗 GitHub
Legal Translation Corpus	Legal document translation	🔗 GitHub
Medical Translation Dataset	Healthcare translation	🔗 GitHub

🏷️ Bangla POS Tag Dataset

Dataset	Description	Link
3k Sentences	3k POS tag sentences	🔗 GitHub
100k+ Words	Single word tagging 100k+	📊 Kaggle

🏷️ Bangla NER Dataset

Dataset	Description	Link
70k Sentences	70k sentences with 5 types of NER	🔗 GitHub
400k+ Words	Word-level NER 400k+	📊 Kaggle
B-NER	Comprehensive Bangla NER dataset	🔗 GitHub
BanglaPersonNER	Person name extraction	🔗 GitHub
Complex NER Dataset	Multi-type entity recognition	🔗 GitHub
Medical NER Dataset	Healthcare entity recognition	🔗 GitHub
Financial NER Corpus	Finance domain entities	🔗 GitHub
Legal Entity Recognition	Legal document entity extraction	🔗 GitHub
Bangladesh Geographic NER	Location entity recognition	🔗 GitHub

❓ Question Answering Dataset

Dataset	Description	Link
SQuAD 2.0 Style	SQuAD 2.0-style question answering in Bangla	📊 Kaggle
BanglaRQA	Reading comprehension dataset	🔗 GitHub
SQuAD-BN	Bangla version of SQuAD	🔗 GitHub
Contextual QA Dataset	Multi-context question answering	🔗 GitHub
Medical QA Bangla	Healthcare question answering	🔗 GitHub
Legal QA Dataset	Legal question answering	🔗 GitHub
Educational QA Corpus	Academic question answering	🔗 GitHub
Bangla Conversational QA	Multi-turn question answering	🔗 GitHub

📝 Bangla Text Summarization

Dataset	Description	Link
Article Summarization	Articles Summarization (extractive & abstractive)	📊 Kaggle
BANSData	Dataset for Bengali Abstractive News Summarization	📊 Kaggle
3 Human Evaluated	Articles with 3 human evaluated summaries	🔗 BNLPC
BenSum	Bangla news summarization	🔗 GitHub
BanglaNewsSummarization	Extended news corpus	🔗 GitHub
BUSUM	Multi-document summarization	🔗 GitHub
Academic Paper Summarization	Research paper summarization	🔗 GitHub
Book Chapter Summarization	Literature summarization	🔗 GitHub

🕵️ Bangla Fake News Detection

Dataset	Description	Link
50k Fake News	50k Bangla fake news dataset	📊 Kaggle

🖊️ Handwriting Recognition / OCR

Dataset	Description	Link
Ekush	Bangla Handwritten Characters	🔗 Website
Bayanno	Multi-purpose handwritten dataset	📊 Mendeley
BN-HTRd	Document Level Offline Bangla HTR (108k words)	📊 Mendeley
Bongabdo	Bangla handwritten script dataset	📄 Research Paper
BanglaOCR Dataset	Comprehensive OCR training data	🔗 GitHub
BanglaHWR Dataset	Handwriting recognition corpus	🔗 GitHub
Document Layout Analysis	Document understanding dataset	🔗 GitHub

🌐 Knowledge Graphs and Information Extraction

Dataset	Description	Link
BanglaAutoKG	Automatic knowledge graph construction	🔗 GitHub
Bangladesh Agricultural KG	Agricultural data integration	📄 IEEE Access
Bangla Wikipedia Knowledge Graph	Structured Wikipedia knowledge	🔗 GitHub
Bangla Event Extraction	News event extraction	🔗 GitHub
Social Media Event Detection	Real-time event detection	🔗 GitHub
Bangla Relation Extraction	Entity relationship extraction	🔗 GitHub
Knowledge Base Relations	Structured knowledge extraction	🔗 GitHub
Aspect-Based Opinion Mining	Detailed opinion analysis	🔗 GitHub
Bangla Semantic Textual Similarity	Sentence similarity dataset	🔗 GitHub
Concept Mapping Dataset	Conceptual relationship mapping	🔗 GitHub
Bangla WordNet	Lexical semantic network	🔗 GitHub

📚 Corpus and Language Modeling

Dataset	Description	Size	Link
BanglaLM	Large language modeling corpus	27.5 GB	🔗 GitHub
Indic Corpus	Multi-lingual Indic corpus	6.5 GB Bangla	🔗 AI4Bharat
CC-100 Bangla	CommonCrawl Bangla subset	8.3 GB	🔗 StatMT
OSCAR Bangla	Web-crawled multilingual corpus	12 GB	🔗 OSCAR
Bangla Poetry Corpus	Classical and modern poetry	25,000+ poems	🔗 GitHub
Literary Text Collection	Bangla literature corpus	10,000+ books	🔗 GitHub
Academic Text Corpus	Scholarly text collection	50,000+ papers	🔗 GitHub
Bangla Morphological Analyzer	Morphological analysis dataset	100,000+ word-morpheme pairs	🔗 GitHub
Phonetic Transcription Corpus	IPA transcription dataset	50,000+ word-pronunciation pairs	🔗 GitHub

🖼️ Multimodal Datasets

Dataset	Description	Size	Link
Bangla Image Captioning	Image description generation	50,000+ image-caption pairs	🔗 GitHub
Visual Question Answering Bangla	Visual reasoning dataset	25,000+ image-question-answer	🔗 GitHub
Bangla Video Captioning	Video description dataset	5,000+ video-caption pairs	🔗 GitHub
Sign Language Recognition	Bangla sign language dataset	10,000+ sign videos	🔗 GitHub
Music-Text Alignment	Song lyrics alignment	2,000+ song-lyric pairs	🔗 GitHub

🔧 Miscellaneous

Dataset	Description	Link
Numbers with Words	Bengali numbers with words	📊 Kaggle
Image to Text	Bangla Natural Language Image to Text (BnLiT)	📊 Kaggle

💡 Motivation

Coming soon...

🤝 Usage and Contribute

Documentation for usage and contribution guidelines coming soon...

How to Get Started

For Pre-trained Models: Visit HuggingFace model hub links above
For Tools: Install Python libraries like BNLP or BNLTK
For Datasets: Follow the individual dataset links and instructions
For Research: Check out the latest papers and benchmarks

Contributing Guidelines

📝 Submit new datasets through pull requests
🐛 Report issues or broken links
💡 Suggest improvements to the documentation
🔬 Share your research findings

⭐ If you find this repository helpful, please give it a star! ⭐

🤝 Contributions are welcome! Feel free to submit issues and pull requests.

📬 Questions? Open an issue or contact the maintainers.

🌟 Special thanks to all the researchers and developers who contributed to Bangla NLP!

☕ Support This Project

If this repository has been helpful to you, consider supporting the project.

Your support helps maintain and improve this resource for the Bangla NLP community! 💚

Foysal87/Bangla-NLP-Dataset

Folders and files

Latest commit

History

Repository files navigation