A curated list of multilingual and/or non-English benchmarks for Large Language Models (LLMs) or NLP models and tools in general.
Language | Date | Title | Tasks | Links |
---|---|---|---|---|
Basque 🇪🇸🇫🇷 | 2022-06 | BasqueGLUE: A Natural Language Understanding Benchmark for Basque | NER, Intent Classification, Slot Filling, Topic Classification, Sentiment Analysis, Stance Detection, QA/NLI, WiC, Coreference Resolution | [paper] [data] |
Bulgarian 🇧🇬 | 2023-07 | bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark | NER, POS Tagging, Sentiment, Check-Worthiness, Humor Detection, NLI, Multi-Choice QA, Factuality Classification | [paper] [code] [data] |
Cantonese 🇭🇰🇨🇳 | 2024-08 | How Far Can Cantonese NLP Go? Benchmarking Cantonese Capabilities of Large Language Models | Yue-TruthfulQA, Yue-GSM8K, Yue-ARC-C, Yue-MMLU, Yue-TRANS | [paper] |
Catalan 🇪🇸 | 2021-12 | The Catalan Language CLUB | NER, POS Tagging, NLI, Document Classification, QA, STS | [paper] [data] |
Chinese 🇨🇳 | 2024-09 | CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data | Multi-Choice QA, Bool QA, Fill-in-the-Blank QA, Analysis QA | [paper] |
Chinese 🇨🇳 | 2020-04 | CLUE: A Chinese Language Understanding Evaluation Benchmark | Short / Long Text Classification, Coreference Resolution, Semantic Similarity, Keyword Recognition, NLI, Machine Reading Comprehension | [paper] |
Danish 🇩🇰 | 2024-05 | Towards a Danish Semantic Reasoning Benchmark | Inference, Entailment, Synonymy, Similarity, Relatedness, Word Sense Disambiguation (WiC) | [paper] |
Dutch 🇳🇱 | 2023-12 | DUMB: A Benchmark for Smart Evaluation of Dutch Models | POS Tagging, NER, Word Sense Disambiguation, Pronoun Resolution, Causal Reasoning, NLI, Sentiment Analysis, Document Classification, Question Answering | [paper] |
Finnish 🇫🇮 | 2020-10 | Towards Fully Bilingual Deep Language Modeling | POS Tagging, NER, Dependency Parsing, Document Classification | [paper] |
German 🇩🇪 | 2024-06 | SuperGLEBer: German Language Understanding Evaluation Benchmark | NER, Document Classification, STS, QA | [paper] |
Hungarian 🇭🇺 | 2024-05 | HuLU: Hungarian Language Understanding Benchmark Kit | CoPA, RTE, SST, WNLI, CommitmentBank, ReCoRD QA | [paper] |
Italian 🇮🇹 | 2023-07 | UINAUIL: A Unified Benchmark for Italian Natural Language Understanding | Textual Entailment, Event Detection & Classification (EVENTI), Factuality Classification (FactA), Sentiment Analysis (SENTIPOLC), Irony Detection (IronITA), Hate Speech Detection (HaSpeeDe) | [paper] |
Italian 🇮🇹 | 2024-06 | The Invalsi Benchmarks: Measuring Linguistic and Mathematical Understanding of Large Language Models in Italian | Locate and Identify Information, Reconstruct Meaning, Reflect on Content/Form, Word Formation, Lexicon and Semantics, Morphology, Spelling, Syntax, Textuality and Pragmatics, Cloze (Fill-in-the-Blank), Multiple Choice (MC), Multiple Complex Choice (MCC), Unique Response (RU), Short Response (RB) | [paper] |
Korean 🇰🇷 | 2024-06 | KMMLU: Measuring Massive Multitask Language Understanding in Korean | Multi-Choice QA across 45 subjects, including STEM, Humanities, Applied Sciences | [paper] |
Norwegian 🇳🇴 | 2023-05 | NorBench -- A Benchmark for Norwegian Language Models | Morpho-syntactic tasks (POS Tagging, Lemmatization, Dependency Parsing), NER, Sentiment Analysis (Document-level, Sentence-level, Targeted), Linguistic Acceptability, Question Answering, Machine Translation, Diagnostics of Harmful Predictions (Gender Bias, Harmfulness) | [paper] [code] |
Polish 🇵🇱 | 2020-05 | KLEJ: Comprehensive Benchmark for Polish Language Understanding | NER, Sentence Relatedness, Textual Entailment, Cyberbullying Detection, Sentiment Analysis (In-Domain & Out-of-Domain), Question Answering, Paraphrase Detection, Sentiment Analysis (Allegro Reviews) | [paper] |
Polish 🇵🇱 | 2022-12 | This is the way: Designing and Compiling LEPISZCZE, a Comprehensive NLP Benchmark for Polish | Sentiment Analysis, Abusive Clauses Detection, Political Advertising Detection, NLI, NER, POS Tagging, Paraphrase Classification, Punctuation Restoration, Dialogue Acts Classification | [paper] |
Portuguese 🇵🇹🇧🇷 | 2024-04 | PORTULAN ExtraGLUE Datasets and Models | SST-2, MRPC, STS-B, MNLI, QNLI, RTE, WNLI, BoolQ, MultiRC, CoPA | [paper] |
Romanian 🇷🇴 | 2021-12 | LiRo: Benchmark and Leaderboard for Romanian Language Tasks | Document Classification, NER, Machine Translation, Sentiment Analysis, POS Tagging, Dependency Parsing, Language Modeling, QA, STS, Gender Debiasing | [paper] [web] |
Russian 🇷🇺 | 2024-01 | MERA: A Comprehensive LLM Evaluation in Russian | MathLogicQA, MultiQ, PARus, RCB, ruModAr, ruMultiAr, ruOpenBookQA, ruTiE, ruWorldTree, RWSD, SimpleAr, BPS, CheGeKa, LCS, ruHumanEval, ruMMLU, USE, ruDetox, ruEthics, ruHateSpeech, ruHHH | [paper] [web] |
Slovenian 🇸🇮 | 2022-02 | Slovene SuperGLUE Benchmark: Translation and Evaluation | BoolQ, CB, COPA, MultiRC, RTE, WSC | [paper] |
Swedish 🇸🇪 | 2023-12 | Superlim: A Swedish Language Understanding Evaluation Benchmark | Absabank-Imm, Argumentation Sentences, DaLAJ-GED, SweParaphrase, SweDN, SweFAQ, SweNLI, SweWiC, SweWinograd, SuperSim, Swedish Analogy, SweSAT, SweDiagnostics, SweWinogender | [paper] |
Vietnamese 🇻🇳 | 2024-06 | ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models | MNLI, QNLI, RTE, VNRTE, WNLI, SST2, VSFC, VSMEC, MRPC, QQP, CoLA, VToC | [paper] [code] [data] |