Create dataset loader for Leipzig Corpora Collection #339

@SamuelCahyawijaya

Description

Dataloader name: leipzig_corpora/leipzig_corpora.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?leipzig_copora

Dataset: leipzig_copora
Description: A collection of corpora in different languages, all built by randomly selecting sentences from web and newspaper sources. Each language has its own directory of .txt files that list the words and sentences in the corpus, map words or sentences to their sources, and record word co-occurrences. The 2017 Community version of the collection contains text crawled from various websites and covers 20 SEA languages.
Subsets: -
Languages: ban, bjn, bew, bcl, mya, ceb, hil, ind, khm, lao, zsm, min, pam, pag, ksw, tgl, tha, vie, war, jav, mad
Tasks: Language Modeling
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://wortschatz.uni-leipzig.de/en/download
HF URL: -
Paper URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf
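
Since the description mentions per-language .txt files listing sentences, here is a rough sketch of what the parsing step of the dataloader might look like. It assumes each line of a sentences file is a sentence ID and the sentence text separated by a tab; that layout is an assumption, not something confirmed in this issue, and `iter_sentences` is a hypothetical helper name rather than part of any existing API.

```python
import io
from typing import Dict, Iterator, TextIO


def iter_sentences(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one record per well-formed line of a sentences file.

    Assumed layout (not confirmed by the issue): <id> TAB <sentence>.
    Blank lines and lines without a tab are skipped.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        sent_id, sep, text = line.partition("\t")
        if not sep:
            # No tab found, so the line does not match the assumed layout.
            continue
        yield {"id": sent_id, "text": text}


# Tiny in-memory sample standing in for a real sentences file.
sample = io.StringIO("1\tSelamat pagi.\n\nmalformed line\n2\tApa kabar?\n")
records = list(iter_sentences(sample))
```

In an actual dataloader, the same generator could be fed an opened file from each language's directory, with the language code attached to every record.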

Metadata

Labels

bonus +3, pr-ready (a PR that closes this issue is ready to be reviewed)
