A curated list of multilingual and/or non-English benchmarks for Large Language Models (LLMs) or NLP models and tools in general.

NaiveNeuron/awesome-multilingual-llm-benchmarks

awesome-multilingual-llm-benchmarks

Language-specific Benchmarks

| Language | Date | Title | Tasks | Links |
| --- | --- | --- | --- | --- |
| Basque 🇪🇸🇫🇷 | 2022-06 | BasqueGLUE: A Natural Language Understanding Benchmark for Basque | NER, Intent Classification, Slot Filling, Topic Classification, Sentiment Analysis, Stance Detection, QA/NLI, WiC, Coreference Resolution | [paper] [data] |
| Bulgarian 🇧🇬 | 2023-07 | bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark | NER, POS Tagging, Sentiment, Check-Worthiness, Humor Detection, NLI, Multi-Choice QA, Factuality Classification | [paper] [code] [data] |
| Cantonese 🇭🇰🇨🇳 | 2024-08 | How Far Can Cantonese NLP Go? Benchmarking Cantonese Capabilities of Large Language Models | Yue-TruthfulQA, Yue-GSM8K, Yue-ARC-C, Yue-MMLU, Yue-TRANS | [paper] |
| Catalan 🇪🇸 | 2021-12 | The Catalan Language CLUB | NER, POS Tagging, NLI, Document Classification, QA, STS | [paper] [data] |
| Chinese 🇨🇳 | 2024-09 | CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data | Multi-Choice QA, Bool QA, Fill-in-the-Blank QA, Analysis QA | [paper] |
| Chinese 🇨🇳 | 2020-04 | CLUE: A Chinese Language Understanding Evaluation Benchmark | Short / Long Text Classification, Coreference Resolution, Semantic Similarity, Keyword Recognition, NLI, Machine Reading Comprehension | [paper] |
| Danish 🇩🇰 | 2024-05 | Towards a Danish Semantic Reasoning Benchmark | Inference, Entailment, Synonymy, Similarity, Relatedness, Word Sense Disambiguation (WiC) | [paper] |
| Dutch 🇳🇱 | 2023-12 | DUMB: A Benchmark for Smart Evaluation of Dutch Models | POS Tagging, NER, Word Sense Disambiguation, Pronoun Resolution, Causal Reasoning, NLI, Sentiment Analysis, Document Classification, Question Answering | [paper] |
| Finnish 🇫🇮 | 2020-10 | Towards Fully Bilingual Deep Language Modeling | POS Tagging, NER, Dependency Parsing, Document Classification | [paper] |
| German 🇩🇪 | 2024-06 | SuperGLEBer: German Language Understanding Evaluation Benchmark | NER, Document Classification, STS, QA | [paper] |
| Hungarian 🇭🇺 | 2024-05 | HuLU: Hungarian Language Understanding Benchmark Kit | CoPA, RTE, SST, WNLI, CommitmentBank, ReCoRD QA | [paper] |
| Italian 🇮🇹 | 2023-07 | UINAUIL: A Unified Benchmark for Italian Natural Language Understanding | Textual Entailment, Event Detection & Classification (EVENTI), Factuality Classification (FactA), Sentiment Analysis (SENTIPOLC), Irony Detection (IronITA), Hate Speech Detection (HaSpeeDe) | [paper] |
| Italian 🇮🇹 | 2024-06 | The Invalsi Benchmarks: Measuring Linguistic and Mathematical Understanding of Large Language Models in Italian | Locate and Identify Information, Reconstruct Meaning, Reflect on Content/Form, Word Formation, Lexicon and Semantics, Morphology, Spelling, Syntax, Textuality and Pragmatics, Cloze (Fill-in-the-Blank), Multiple Choice (MC), Multiple Complex Choice (MCC), Unique Response (RU), Short Response (RB) | [paper] |
| Korean 🇰🇷 | 2024-06 | KMMLU: Measuring Massive Multitask Language Understanding in Korean | Multichoice QA across 45 subjects, including STEM, Humanities, Applied Sciences | [paper] |
| Norwegian 🇳🇴 | 2023-05 | NorBench -- A Benchmark for Norwegian Language Models | Morpho-syntactic tasks (POS Tagging, Lemmatization, Dependency Parsing), NER, Sentiment Analysis (Document-level, Sentence-level, Targeted), Linguistic Acceptability, Question Answering, Machine Translation, Diagnostics of Harmful Predictions (Gender Bias, Harmfulness) | [paper] [code] |
| Polish 🇵🇱 | 2020-05 | KLEJ: Comprehensive Benchmark for Polish Language Understanding | NER, Sentence Relatedness, Textual Entailment, Cyberbullying Detection, Sentiment Analysis (In-Domain & Out-of-Domain), Question Answering, Paraphrase Detection, Sentiment Analysis (Allegro Reviews) | [paper] |
| Polish 🇵🇱 | 2022-12 | This is the way: Designing and Compiling LEPISZCZE, a Comprehensive NLP Benchmark for Polish | Sentiment Analysis, Abusive Clauses Detection, Political Advertising Detection, NLI, NER, POS Tagging, Paraphrase Classification, Punctuation Restoration, Dialogue Acts Classification | [paper] |
| Portuguese 🇵🇹🇧🇷 | 2024-04 | PORTULAN ExtraGLUE Datasets and Models | SST-2, MRPC, STS-B, MNLI, QNLI, RTE, WNLI, BoolQ, MultiRC, CoPA | [paper] |
| Romanian 🇷🇴 | 2021-12 | LiRo: Benchmark and Leaderboard for Romanian Language Tasks | Document Classification, NER, Machine Translation, Sentiment Analysis, POS Tagging, Dependency Parsing, Language Modeling, QA, STS, Gender Debiasing | [paper] [web] |
| Russian 🇷🇺 | 2024-01 | MERA: A Comprehensive LLM Evaluation in Russian | MathLogicQA, MultiQ, PARus, RCB, ruModAr, ruMultiAr, ruOpenBookQA, ruTiE, ruWorldTree, RWSD, SimpleAr, BPS, CheGeKa, LCS, ruHumanEval, ruMMLU, USE, ruDetox, ruEthics, ruHateSpeech, ruHHH | [paper] [web] |
| Slovenian 🇸🇮 | 2022-02 | Slovene SuperGLUE Benchmark: Translation and Evaluation | BoolQ, CB, COPA, MultiRC, RTE, WSC | [paper] |
| Swedish 🇸🇪 | 2023-12 | Superlim: A Swedish Language Understanding Evaluation Benchmark | Absabank-Imm, Argumentation Sentences, DaLAJ-GED, SweParaphrase, SweDN, SweFAQ, SweNLI, SweWiC, SweWinograd, SuperSim, Swedish Analogy, SweSAT, SweDiagnostics, SweWinogender | [paper] |
| Vietnamese 🇻🇳 | 2024-06 | ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models | MNLI, QNLI, RTE, VNRTE, WNLI, SST2, VSFC, VSMEC, MRPC, QQP, CoLA, VToC | [paper] [code] [data] |
