Give Feedback: DSFSI Resource Feedback Form
The "Mafoko: South African Terminology, Lexicon, and Glossary Project" is dedicated to the comprehensive collection, meticulous cleaning, and transformative processing of South African language terminology lists, lexicons, and glossaries. This initiative is an integral part of the broader mission of the Data Science for Social Impact (DSFSI) lab/group, which aims to liberate and openly share as many language resources as possible.
The quality and accuracy of each resource are maintained by the original authors, ensuring the integrity and authenticity of the linguistic data. For any questions or clarifications regarding the content, users are encouraged to directly contact the original authors. By making these linguistic assets readily accessible, the project seeks to enhance language preservation, support linguistic research, and foster educational opportunities across South Africa's diverse linguistic landscape.
Each resource is provided on an "as is" basis, without representations, warranties or conditions of any kind, either express or implied including, without limitation, any warranties or conditions of title, non-infringement, merchantability or fitness for a particular purpose. We shall not have any liability for any form or type of damages (including without limitation lost profits), however caused and on any theory of liability, whether in contract, strict liability, or delict (including negligence or otherwise) arising in any way out of any of the resources, even if advised of the possibility of such damages. Likewise, to the full extent permitted by law, we shall not have any liability whatsoever for any mistakes in the source data or for any disputed translations.
Where any user finds technical mistakes or errors in the files, they may submit a request for fixes via GitHub.
| Database | Description | Documentation | CSV | JSONL |
|---|---|---|---|---|
| DSAC | The Department of Sports, Arts and Culture (DSAC) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. | README | data/dsac/combined_dsac.csv, view on datasette | data/dsac/combined_dsac.jsonl, view on datasette |
| StatsSA | The Multilingual Statistical Terminology Project by Stats SA develops statistical terminology in South Africa's 11 official languages to enhance access to vital data for all citizens, ensuring a deeper understanding and connection to the information that affects their lives. | README | data/statssa/statssa_multilingual_statistical_terminology.csv, view on datasette | data/statssa/statssa_multilingual_statistical_terminology.jsonl, view on datasette |
| UNISA Multilingual | The South African Multilingual Linguistic Terminology (SAMLT) Project is a comprehensive multilingual termbank containing 500 linguistic terms translated across nine South African languages. Each term includes translations by field experts, accompanied by concise definitions and usage examples to clarify technical linguistic concepts for classroom and academic use. This resource addresses the critical need for standardized linguistic terminology in African languages, supporting linguistics education and research across South Africa's diverse linguistic landscape. | README | data/unisa_multilingual/unisa_multilingual_linguistic_terminology.csv, view on datasette | data/unisa_multilingual/unisa_multilingual_linguistic_terminology.jsonl, view on datasette |
| UNISA Robotics | The UNISA Multilingual Robotics Glossary is a comprehensive collection of approximately 100 robotics and engineering terminology entries translated across South Africa's 11 official languages. This glossary was developed by the University of South Africa (UNISA) through its Inspired towards Science, Engineering and Technology (I-SET) program, in collaboration with the Department of Linguistics and Modern Languages and the Department of African Languages. This resource aims to make robotics education accessible in mother-tongue languages throughout South Africa, supporting STEM education and bridging the gap between technical terminology and linguistic diversity. | README | data/unisa_robotics/unisa_robotics_multilingual_glossary.csv, view on datasette | data/unisa_robotics/unisa_robotics_multilingual_glossary.jsonl, view on datasette |
| UP Glossary | The University of Pretoria Multilingual Academic Glossaries project promotes access to academic terminology in Afrikaans, English, and Northern Sotho to support multilingual teaching and learning, fostering inclusivity and linguistic diversity in higher education. | README | data/up_glossary/combined/combined_up_glossary.csv, view on datasette | data/up_glossary/combined/combined_up_glossary.jsonl, view on datasette |
| OERTB | The Open Resource Term Bank (OERTB) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. | TBA | TBA | TBA |
- All datasets, unless otherwise stated, are licensed under the Nwulite Obodo Open Data License - Version 1.0 or CC-BY-NC-SA 2.5 ZA.
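The CSV and JSONL files listed in the table can be read with the Python standard library alone. The following is a minimal sketch, not the project's own tooling: the helper names `load_jsonl`, `load_csv`, and `find_term` are hypothetical, and the column name `eng` is an assumption, since each dataset defines its own columns (check the dataset README or inspect the first record before filtering).

```python
import csv
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def load_csv(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with Path(path).open(encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def find_term(records, column, term):
    """Return records whose `column` value equals `term`, ignoring case."""
    wanted = term.casefold()
    return [r for r in records if str(r.get(column, "")).casefold() == wanted]


if __name__ == "__main__":
    # Path taken from the table above; run from the repository root.
    path = Path("data/statssa/statssa_multilingual_statistical_terminology.jsonl")
    if path.exists():
        records = load_jsonl(path)
        print(f"Loaded {len(records)} records")
        # "eng" is a hypothetical column name; inspect records[0].keys() first.
        print(find_term(records, "eng", "census"))
```

Both loaders return plain lists of dictionaries, so the same `find_term` call works whichever file format is used.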
This repository is authored by the team members below.
Apart from the PI and team lead, all members are listed alphabetically by surname.
- Written by: Vukosi Marivate (PI)*, Fiskani Banda, Richard Lastrucci, Mohlatlego Nakeng, Kayode Olalaye, Thapelo Sindane
- Contact details: [email protected]
- We ask that you reference the individual datasets you use as well as this project.
- For individual datasets, please refer to the READMEs in their dataset folders.
For the overall project, the citation should be:
@dataset{dsfsi-mafoko,
date = {2023},
title = {Mafoko: South African Terminology, Lexicon and Glossary Project},
url = {https://github.com/dsfsi/za-mafoko/},
author = {Vukosi Marivate and Fiskani Banda and Richard Lastrucci and Matome Ledwaba and Keabetswe Madumo and Mohlatlego Nakeng and Kayode Olalaye and Thapelo Sindane and DSFSI}
}
@article{marivate2025mafokostructuringbuildingopen,
title={Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP},
author={Vukosi Marivate and Isheanesu Dzingirai and Fiskani Banda and Richard Lastrucci and Thapelo Sindane and Keabetswe Madumo and Kayode Olaleye and Abiodun Modupe and Unarine Netshifhefhe and Herkulaas Combrink and Mohlatlego Nakeng and Matome Ledwaba},
year={2025},
eprint={2508.03529},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.03529},
}