dsfsi/za-mafoko

Mafoko: South African Terminology, Lexicon and Glossary Project

Give Feedback 📑: DSFSI Resource Feedback Form

DOI arXiv

Table of contents

  1. Project Description
  2. Getting Started
  3. Authors
  4. Attribution

Project Description


The "Mafoko: South African Terminology, Lexicon, and Glossary Project" is dedicated to the comprehensive collection, meticulous cleaning, and transformative processing of South African language terminology lists, lexicons, and glossaries. This initiative is an integral part of the broader mission of the Data Science for Social Impact (DSFSI) lab/group, which aims to liberate and openly share as many language resources as possible.

The quality and accuracy of each resource are maintained by the original authors, ensuring the integrity and authenticity of the linguistic data. For any questions or clarifications regarding the content, users are encouraged to directly contact the original authors. By making these linguistic assets readily accessible, the project seeks to enhance language preservation, support linguistic research, and foster educational opportunities across South Africa's diverse linguistic landscape.

Disclaimer

Each resource is provided on an “as is” basis, without representations, warranties or conditions of any kind, either express or implied including, without limitation, any warranties or conditions of title, non-infringement, merchantability or fitness for a particular purpose. We shall not have any liability for any form or type of damages (including without limitation lost profits), however caused and on any theory of liability, whether in contract, strict liability, or delict (including negligence or otherwise) arising in any way out of any of the resources, even if advised of the possibility of such damages. Likewise, to the full extent permitted by law, we shall not have any liability whatsoever for any mistakes in the source data or for any disputed translations.

Users who find technical mistakes or errors in the files may submit a request for fixes via GitHub.

Databases

Database Description Documentation CSV JSONL
DSAC The Department of Sports, Arts and Culture (DSAC) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. README data/dsac/combined_dsac.csv, view on datasette data/dsac/combined_dsac.jsonl, view on datasette
StatsSA The Multilingual Statistical Terminology Project by Stats SA develops statistical terminology in South Africa's 11 official languages to enhance access to vital data for all citizens, ensuring a deeper understanding and connection to the information that affects their lives. README data/statssa/statssa_multilingual_statistical_terminology.csv, view on datasette data/statssa/statssa_multilingual_statistical_terminology.jsonl, view on datasette
UNISA Multilingual The South African Multilingual Linguistic Terminology (SAMLT) Project is a comprehensive multilingual termbank containing 500 linguistic terms translated across nine South African languages. Each term includes translations by field experts, accompanied by concise definitions and usage examples to clarify technical linguistic concepts for classroom and academic use. This resource addresses the critical need for standardized linguistic terminology in African languages, supporting linguistics education and research across South Africa's diverse linguistic landscape. README data/unisa_multilingual/unisa_multilingual_linguistic_terminology.csv, view on datasette data/unisa_multilingual/unisa_multilingual_linguistic_terminology.jsonl, view on datasette
UNISA Robotics The UNISA Multilingual Robotics Glossary is a comprehensive collection of approximately 100 robotics and engineering terminology entries translated across South Africa's 11 official languages. This glossary was developed by the University of South Africa (UNISA) through its Inspired towards Science, Engineering and Technology (I-SET) program, in collaboration with the Department of Linguistics and Modern Languages and the Department of African Languages. This resource aims to make robotics education accessible in mother-tongue languages throughout South Africa, supporting STEM education and bridging the gap between technical terminology and linguistic diversity. README data/unisa_robotics/unisa_robotics_multilingual_glossary.csv, view on datasette data/unisa_robotics/unisa_robotics_multilingual_glossary.jsonl, view on datasette
UP Glossary The University of Pretoria Multilingual Academic Glossaries project promotes access to academic terminology in Afrikaans, English, and Northern Sotho to support multilingual teaching and learning, fostering inclusivity and linguistic diversity in higher education. README data/up_glossary/combined/combined_up_glossary.csv, view on datasette data/up_glossary/combined/combined_up_glossary.jsonl, view on datasette
OERTB The Open Resource Term Bank (OERTB) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. TBA TBA TBA
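
The column layout of each CSV file is not described in the table above, so a reasonable first step is to inspect the header and a few rows. The sketch below uses only the Python standard library. The helper name `preview_csv` is hypothetical, and UTF-8 encoding is an assumption about the files rather than something this README states.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=3):
    """Return (column names, first n_rows rows as dicts) for a CSV file.

    Assumes the file is UTF-8 encoded and that its first line is a header.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n_rows))
        # fieldnames is None for an empty file, so fall back to an empty list.
        return list(reader.fieldnames or []), rows


# Example call, assuming the repository has been cloned and this runs from its root:
# columns, rows = preview_csv("data/dsac/combined_dsac.csv")
```

Only the requested number of rows is read, so the preview stays cheap even for the larger combined files.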

Licence

Getting Started


This section provides the information needed to run the code locally.

Usage
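
As a minimal sketch of working with the JSONL files listed under Databases, the example below streams a file line by line and searches its records for a term. The function names `iter_jsonl` and `find_records` are hypothetical. It assumes each non-empty line holds one JSON object and that the file is UTF-8 encoded. No field names are assumed, because the record schema is documented in each dataset's own README.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_records(path, needle):
    """Return records in which any string value contains `needle`.

    Matching is case-insensitive. Assumes every record is a JSON object.
    """
    needle = needle.lower()
    return [
        record
        for record in iter_jsonl(path)
        if any(isinstance(v, str) and needle in v.lower() for v in record.values())
    ]


# Example call, assuming the repository has been cloned and this runs from its root:
# matches = find_records("data/dsac/combined_dsac.jsonl", "ballot")
```

Searching across all string values avoids hard-coding a column per language, at the cost of scanning every field of every record.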

Authors


The team members below authored this repository.

Apart from the PI and team lead, all members are listed alphabetically by surname.

  • Written by: Vukosi Marivate (PI)*, Fiskani Banda, Richard Lastrucci, Mohlatlego Nakeng, Kayode Olaleye, Thapelo Sindane
  • Contact details : [email protected]

Contributions

This optional section describes what each developer contributed and how.

Attribution

  1. We ask that you reference the individual datasets you use as well as this project.
  2. For individual datasets, please refer to their READMEs in their dataset folders.

For the overall project, the citation should be:

@dataset{dsfsi-mafoko,
  date = {2023},
  title = {Mafoko: South African Terminology, Lexicon and Glossary Project},
  url = {https://github.com/dsfsi/za-mafoko/},
  author = {Vukosi Marivate and Fiskani Banda and Richard Lastrucci and Matome Ledwaba and Keabetswe Madumo and Mohlatlego Nakeng and Kayode Olaleye and Thapelo Sindane and DSFSI}
}
@article{marivate2025mafokostructuringbuildingopen,
  title={Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP},
  author={Vukosi Marivate and Isheanesu Dzingirai and Fiskani Banda and Richard Lastrucci and Thapelo Sindane and Keabetswe Madumo and Kayode Olaleye and Abiodun Modupe and Unarine Netshifhefhe and Herkulaas Combrink and Mohlatlego Nakeng and Matome Ledwaba},
  year={2025},
  eprint={2508.03529},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.03529},
}

DSAC Attribution

Attribution Name Dataset Link
DSAC Election Terminology Attribution CSV Dataset
DSAC Life Orientation Terminology Attribution CSV Dataset
DSAC Arts & Culture Terminology – Intermediate Phase Attribution CSV Dataset
DSAC Engineering & Construction Terminology Attribution CSV Dataset
DSAC Financial Terminology Attribution CSV Dataset
DSAC HIV and AIDS Terminology Attribution CSV Dataset
DSAC Human, Social, Economic & Management Sciences Terminology Attribution CSV Dataset
DSAC ICT Dictionary Attribution CSV Dataset
DSAC Mathematics Dictionary (Grades R–6) Attribution CSV Dataset
DSAC Parliamentary Dictionary Attribution CSV Dataset
DSAC Soccer Terminology Attribution CSV Dataset
DSAC Natural Science & Technology Term List – Nguni Languages Attribution CSV Dataset
DSAC Natural Sciences & Technology Dictionary – Sotho Attribution CSV Dataset
DSAC Natural Sciences & Technology (Grades 4–6) Attribution CSV Dataset
DSAC Pharmacy Terminology – First Edition Attribution CSV Dataset
DSAC Pharmacy Terminology – Second Edition Attribution CSV Dataset