Closes #339 | Update dataloader for Leipzig #483
TysonYu wants to merge 13 commits into SEACrowd:master
…donesian_madurese_bible_translation.py
SamuelCahyawijaya
left a comment
Hi @TysonYu, thank you for your contribution! The dataset looks good. Nonetheless, there are two things that need to be updated:
- We made a small typo in the issue name: the data loader name should be leipzig_corpora instead of leipzig_copora. Could you change the folder and file name accordingly?
- Could you please add a subset for each language, so that the dataloader can be used to download only specific-language data?
Thank you!
SOURCE_VERSION = datasets.Version(_SOURCE_VERSION)
SEACROWD_VERSION = datasets.Version(_SEACROWD_VERSION)

BUILDER_CONFIGS = [
Can you add per-language subsets so that it can be useful as a source of monolingual pretraining data?
How do I add a subset? Could you give an example?
Hi @TysonYu, sorry for the late reply. I think it should be similar to how we define the monolingual subsets in cc100.py, where we have the combined source and seacrowd_ssp subsets as well as the per-language subsets:
seacrowd-datahub/seacrowd/sea_datasets/cc100/cc100.py
Lines 164 to 199 in a19097e
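As a rough illustration of the pattern described above (combined subsets plus one source and one seacrowd_ssp subset per language), a minimal sketch might look like the following. It is not the content of cc100.py: the `SubsetConfig` dataclass is a stand-in for `SEACrowdConfig` so the snippet runs on its own, and the language codes, version strings, and the `build_configs` helper are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List


# Stand-in for SEACrowdConfig so this sketch is self-contained (assumption:
# the real config carries the same five fields).
@dataclass
class SubsetConfig:
    name: str
    version: str
    description: str
    schema: str
    subset_id: str


_DATASETNAME = "leipzig_corpora"
_SOURCE_VERSION = "1.0.0"    # placeholder value, not taken from the PR
_SEACROWD_VERSION = "1.0.0"  # placeholder value, not taken from the PR
_LANGUAGES = ["ind", "vie", "tha"]  # hypothetical language codes for illustration


def build_configs(languages: List[str]) -> List[SubsetConfig]:
    """Build combined subsets first, then a source/seacrowd_ssp pair per language."""
    configs = [
        SubsetConfig(
            name=f"{_DATASETNAME}_source",
            version=_SOURCE_VERSION,
            description=f"{_DATASETNAME} source schema (all languages)",
            schema="source",
            subset_id=_DATASETNAME,
        ),
        SubsetConfig(
            name=f"{_DATASETNAME}_seacrowd_ssp",
            version=_SEACROWD_VERSION,
            description=f"{_DATASETNAME} SEACrowd schema (all languages)",
            schema="seacrowd_ssp",
            subset_id=_DATASETNAME,
        ),
    ]
    for lang in languages:
        subset_id = f"{_DATASETNAME}_{lang}"
        configs.append(
            SubsetConfig(
                name=f"{subset_id}_source",
                version=_SOURCE_VERSION,
                description=f"{_DATASETNAME} source schema for {lang}",
                schema="source",
                subset_id=subset_id,
            )
        )
        configs.append(
            SubsetConfig(
                name=f"{subset_id}_seacrowd_ssp",
                version=_SEACROWD_VERSION,
                description=f"{_DATASETNAME} SEACrowd schema for {lang}",
                schema="seacrowd_ssp",
                subset_id=subset_id,
            )
        )
    return configs


BUILDER_CONFIGS = build_configs(_LANGUAGES)
```

With configs shaped like this, `_split_generators()` could inspect `self.config.subset_id` to decide which language files to download, which is what makes language-specific loading possible.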
A friendly reminder to follow up, @TysonYu @raileymontalan.
raileymontalan
left a comment
Hi @TysonYu, could you please fix the folder name to leipzig_corpora (i.e. seacrowd/sea_datasets/leipzig_corpora/leipzig_corpora.py) and provide per-language subsets?
Other than that, the code LGTM. Thanks!
raileymontalan
left a comment
The _DATASETNAME and DEFAULT_CONFIG_NAME variables were reverted back to copora again. Please change them to corpora again. Thanks.
Hi @raileymontalan and @SamuelCahyawijaya, I changed the "copora" to "corpora". Please feel free to let @TysonYu know if other changes are required.
Hi @TysonYu, are you working on creating subsets per language, as per @SamuelCahyawijaya's request?
Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.
Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
Hi @TysonYu, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again!
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Checkbox
- Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in dataloader script.
- Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
- Confirm the dataloader script works with the datasets.load_dataset function.
- Confirm that the dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.