A research project that tests whether large language models' translations from English into other South African languages, such as Tshivenda, can be improved by using Retrieval-Augmented Generation (RAG). The approach leverages a domain-specific terminology vector database, such as one focused on South African elections, to enhance the models' translation accuracy.
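To make the idea concrete, the sketch below shows one way retrieved terminology could be injected into a translation prompt. It is a conceptual illustration only: the prompt wording, function name, and terminology field names (`english`, `translation`) are placeholders, not the project's actual code or data schema.

```python
# Conceptual sketch of term-aware RAG translation (not the project's code):
# retrieved terminology entries are added to the prompt before the LLM
# translates the sentence. All names here are placeholders.
def build_translation_prompt(sentence: str, terminology: list[dict], target_language: str) -> str:
    glossary = "\n".join(f"- {t['english']} -> {t['translation']}" for t in terminology)
    return (
        f"Translate the following English sentence into {target_language}.\n"
        f"Use this domain terminology where applicable:\n{glossary}\n\n"
        f"Sentence: {sentence}"
    )
```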
Configure the project by setting environment variables in a `.env` file. The file must be placed in the root of the project:
| Variable | Description | Accepted Values | Default |
|---|---|---|---|
| `LLM_PROVIDER` | Large language model provider | `OPENAI`, `OLLAMA` | Required |
| `LLM_NAME` | Model name | str | Required |
| `LLM_KEY` | API key for the LLM provider | str | Required |
| `LLM_TEMPERATURE` | Sampling temperature for generation | float | `0.1` (optional) |
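As a quick illustration (not the repository's actual loader), the settings above could be read with `python-dotenv` as follows; only the variable names and defaults come from the table, everything else is an assumed sketch:

```python
# Minimal sketch (assumed helper, not part of the repo): read the LLM settings
# from the .env file in the project root using python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # loads variables from .env into the process environment

llm_provider = os.environ["LLM_PROVIDER"]                     # OPENAI or OLLAMA
llm_name = os.environ["LLM_NAME"]                             # e.g. gpt-4o-mini
llm_key = os.environ["LLM_KEY"]                               # provider API key
llm_temperature = float(os.getenv("LLM_TEMPERATURE", "0.1"))  # optional, defaults to 0.1
```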
| Variable | Description | Accepted Values | Default |
|---|---|---|---|
| `EMBEDDING_MODEL_PROVIDER` | Embedding model provider | `OPENAI`, `HUGGING_FACE` (inferred by provider) | Required |
| `EMBEDDING_MODEL_NAME` | Embedding model name | str | Required |
| `EMBEDDING_MODEL_KEY` | API key for the embedding model provider | str | Required |

| Variable | Description | Accepted Values | Default |
|---|---|---|---|
| `VECTOR_DB_TYPE` | Vector database type | `FAISS`, `Chroma` | `FAISS` |
| `TOP_K` | Number of top results retrieved | int | `10` |
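For intuition, the sketch below shows how a FAISS index could be queried for the `TOP_K` nearest terminology entries. The function and variable names are illustrative assumptions, not the project's retriever API; `index` is assumed to be a loaded `faiss.Index` and `query_vector` an embedding produced by the configured embedding model.

```python
# Illustrative sketch only: query a FAISS index for the TOP_K nearest
# terminology embeddings. Not the repository's retriever implementation.
import os
import numpy as np
import faiss

TOP_K = int(os.getenv("TOP_K", "10"))

def retrieve_nearest(index: faiss.Index, query_vector: np.ndarray, k: int = TOP_K):
    query = query_vector.reshape(1, -1).astype("float32")  # FAISS expects 2-D float32 input
    distances, ids = index.search(query, k)                 # nearest-neighbour search
    return list(zip(ids[0].tolist(), distances[0].tolist()))
```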
The project supports and tests two main strategies for extracting terms from text:
This strategy focuses on extracting contextually meaningful terms using spaCy together with frequency and distribution filtering.
| Variable | Description | Default |
|---|---|---|
| `SPACY_MODEL_NAME` | English spaCy model name | `en_core_web_sm` |
| `WORD_FREQUENCY_CUTOFF` | Minimum word-frequency cutoff for term selection | `1e-5` |
| `DISTRIBUTION_CUTOFF` | Distribution threshold for contextually meaningful filtering | `2` |
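A rough sketch of the idea behind this strategy is shown below. It assumes the `wordfreq` package for corpus frequencies and keeps only noun phrases containing a word rarer than `WORD_FREQUENCY_CUTOFF`; the actual implementation (including the `DISTRIBUTION_CUTOFF` filtering, omitted here) lives in `src/extraction/impl/` and may differ.

```python
# Hypothetical sketch of spaCy-based term extraction with a frequency filter.
# The real strategy in src/extraction/impl/ may use different heuristics.
import spacy
from wordfreq import word_frequency

nlp = spacy.load("en_core_web_sm")  # SPACY_MODEL_NAME

def semantic_candidate_terms(text: str, frequency_cutoff: float = 1e-5) -> list[str]:
    doc = nlp(text)
    terms = []
    for chunk in doc.noun_chunks:  # candidate noun phrases
        # Keep the phrase if its rarest word falls below the frequency cutoff.
        if any(word_frequency(tok.text.lower(), "en") < frequency_cutoff for tok in chunk):
            terms.append(chunk.text.lower().strip())
    return terms
```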
This strategy targets extraction of rare terms based on frequency and distribution thresholds, and applies deduplication to avoid redundancy.
| Variable | Description | Default |
|---|---|---|
| `WORD_FREQUENCY_CUTOFF` | Minimum word-frequency cutoff for rare-term extraction | `1e-5` |
| `DISTRIBUTION_CUTOFF` | Distribution cutoff threshold for rare-term extraction | `2` |
| `DEDUPLICATION_THRESHOLD` | Similarity threshold (0-100) for deduplicating terms | `80` |
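Because the threshold is on a 0-100 scale, the deduplication step presumably relies on a fuzzy string ratio. The sketch below uses `rapidfuzz` purely for illustration and is not the repository's implementation:

```python
# Illustrative deduplication sketch (assumes rapidfuzz; the real logic is in
# src/extraction/impl/): drop terms that are near-duplicates of terms already kept.
from rapidfuzz import fuzz

def deduplicate_terms(terms: list[str], threshold: int = 80) -> list[str]:
    kept: list[str] = []
    for term in terms:
        # Keep the term only if it is not too similar to any term already kept.
        if all(fuzz.ratio(term, existing) < threshold for existing in kept):
            kept.append(term)
    return kept
```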
To add a new term extraction strategy, follow these steps:
- Create your strategy class in the `src/extraction/impl/` directory and inherit from `TermExtractionStrategy`:

  ```python
  from src.extraction.term_extraction_strategy import TermExtractionStrategy

  class MyCustomStrategy(TermExtractionStrategy):
      def extract_terms(self, text: str) -> list[str]:
          # Implement your extraction logic here
          return []
  ```

- Register your strategy name in `src/enums/query_strategy_enum.py`:

  ```python
  from enum import Enum

  class QueryStrategyEnum(str, Enum):
      SEMANTIC_QUERY = 'SEMANTIC_QUERY'
      RARE_QUERY = 'RARE_QUERY'
      MY_CUSTOM_QUERY = 'MY_CUSTOM_QUERY'  # Add this line
  ```

- Extend the strategy factory in `src/extraction/extraction_factory.py`:

  ```python
  from typing import Type

  from src.extraction.impl.my_custom_strategy import MyCustomStrategy
  # Existing imports for TermExtractionStrategy, SemanticTermExtractionStrategy
  # and RareTermExtractionStrategy are omitted here.

  STRATEGY_REGISTRY: dict[str, Type[TermExtractionStrategy]] = {
      'SEMANTIC_QUERY': SemanticTermExtractionStrategy,
      'RARE_QUERY': RareTermExtractionStrategy,
      'MY_CUSTOM_QUERY': MyCustomStrategy  # Add this line
  }
  ```
Note: Term extraction uses the Strategy and Factory design patterns. They automate the creation and testing of all term extraction strategies and keep the code open for extension.
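For reference, a hypothetical use of the registry might look like the following. The repository may expose a dedicated factory function instead; only the module paths and enum values above are taken from this README, and the no-argument constructor is an assumption.

```python
# Hypothetical usage sketch: look up a strategy class in the registry and use it.
from src.enums.query_strategy_enum import QueryStrategyEnum
from src.extraction.extraction_factory import STRATEGY_REGISTRY

strategy = STRATEGY_REGISTRY[QueryStrategyEnum.SEMANTIC_QUERY.value]()
terms = strategy.extract_terms("Voters must register before the municipal election.")
print(terms)
```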
| Variable | Description | Example |
|---|---|---|
| `TERMINOLOGY_FILENAME` | JSONL terminology dictionary filename without extension; the file must be placed in the `data/test/terminology` directory | `dsac_election_terminology_dictionary` |
| `FILE_CATEGORY` | Domain or category of the file | `ELECTION`, `MATHEMATICS` |
| `TARGET_LANGUAGE` | Target language for translation | `Tshivenda`, `Afrikaans`, `English`, `isiNdebele`, `isiXhosa`, `isiZulu`, `Sesotho`, `Setswana`, `siSwati`, `Xitsonga`, `Sepedi` |
To extend `FILE_CATEGORY`, go to `src/enums/file_category_enum.py` and add the new category:

```python
from enum import Enum

class FileCategoryEnum(Enum):
    ELECTION = 'election'
    MATHEMATICS = 'mathematics'
    CUSTOM_CATEGORY = 'custom_category'  # Add this line
```

A complete example `.env` file:

```
TERMINOLOGY_FILENAME=dsac_election_terminology_dictionary
FILE_CATEGORY=ELECTION
TARGET_LANGUAGE=Tshivenda
LLM_PROVIDER=OPENAI
LLM_NAME=gpt-4o-mini
LLM_KEY=your_openai_api_key
LLM_TEMPERATURE=0.1
EMBEDDING_MODEL_PROVIDER=OPENAI
EMBEDDING_MODEL_NAME=text-embedding-3-large
EMBEDDING_MODEL_KEY=your_openai_embedding_key
VECTOR_DB_TYPE=FAISS
TOP_K=10
SPACY_MODEL_NAME=en_core_web_sm
WORD_FREQUENCY_CUTOFF=1e-5
DISTRIBUTION_CUTOFF=2
DEDUPLICATION_THRESHOLD=80
```

| Directory | Description |
|---|---|
| `data/{{file_category_type}}-{{vectorstore_type}}-index/` | Contains the `{{vectorstore_type}}` index and metadata for the RAG retriever. |
| `data/test/datasets/` | Stores the test datasets used for translation and evaluation: parallel corpora of English and another South African language `{{file_category_type}}` sentences, e.g. `eng_ven_election_dataset.jsonl`. |
| `data/test/results/` | All outputs from translation and evaluation runs. |
| `data/test/results/plots/` | Visual comparison plots for metrics such as BLEU, CHRF, CHRF++, and BERTScore, organized by LLM name. |
| `data/test/results/translation_with_llm/` | Results from baseline LLM-only translation (no retrieval), organized per LLM. |
| `data/test/results/translation_with_rag/` | Results from LLM+RAG translation. Subdirectories are organized by LLM (e.g. `gpt-4o-mini`) and strategy (e.g. `semantic_query`, `rare_query`). |
| `data/test/terminology/` | Contains the terminology dictionary used in term-aware RAG translation, e.g. `dsac_election_terminology_dictionary.jsonl`. |
| `data/tracking/` | Tracks evaluation files or test sets that have been processed and stored in the vector database, e.g. `election_files.txt`, which lists all election terminology files. |
- Create a `venv` or `conda` environment, activate it, and install the required dependencies by running the following command in the project root: `pip install -r requirements.txt`

- To run the tests and evaluations, use one of the following commands:

  - Windows / PowerShell:

    - `.\.conda_environment_name\bin\python.exe .\src\main.py`
    - `.\venv\Scripts\python.exe .\src\main.py`

  - WSL / Linux / macOS:

    - `./.conda_environment_name/bin/python ./src/main.py`
    - `./venv/bin/python ./src/main.py`
The output of the translation evaluation program is a set of metrics and visualizations that assess and compare the quality of translations generated by:
- Standard LLM translation (without RAG)
- RAG-enhanced translation using different term extraction strategies (`SEMANTIC_QUERY`, `RARE_QUERY`, etc.)
The metrics are stored in a JSONL file where each line contains evaluation results for a single sentence translation:
| Key | Description |
|---|---|
| `model_translation` | The translation generated by the LLM or RAG pipeline. |
| `reference_translation` | The closest matching reference translation from the dataset, selected by highest BERTScore F1. |
| `bleu_score` | BLEU score measuring n-gram precision overlap with the reference. |
| `chrf_score` | Character F-score (CHRF), better suited to morphologically rich languages. |
| `chrf_plus_plus_score` | CHRF++ with a word order of 2. |
| `bert_score` | `P`: precision, `R`: recall, `F1`: harmonic mean of P and R. |
| `index` (optional) | The sentence index in the dataset. |
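As a rough guide to how such per-sentence metrics can be computed, the sketch below uses `sacrebleu` and `bert-score`. It is not the project's evaluation code, and the multilingual BERTScore model choice is an assumption:

```python
# Illustrative per-sentence scoring sketch (not the repository's evaluator).
from sacrebleu.metrics import BLEU, CHRF
from bert_score import score as bert_score

bleu = BLEU(effective_order=True)   # sentence-level BLEU
chrf = CHRF()                       # CHRF
chrf_pp = CHRF(word_order=2)        # CHRF++ (word order of 2)

def evaluate_sentence(model_translation: str, reference_translation: str) -> dict:
    # Assumed multilingual BERTScore model; the project may use a different one.
    P, R, F1 = bert_score([model_translation], [reference_translation],
                          model_type="bert-base-multilingual-cased")
    return {
        "bleu_score": bleu.sentence_score(model_translation, [reference_translation]).score,
        "chrf_score": chrf.sentence_score(model_translation, [reference_translation]).score,
        "chrf_plus_plus_score": chrf_pp.sentence_score(model_translation, [reference_translation]).score,
        "bert_score": {"P": P.item(), "R": R.item(), "F1": F1.item()},
    }
```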
The results currently in the `data/test/results` directory were computed using OpenAI's `text-embedding-3-large` embedding model and two LLMs, Meta's `llama3:8b` and OpenAI's `gpt-4o-mini`, with the default parameters.
From the experiments, it is evident that RAG with different term extraction techniques can improve LLM translation accuracy for low-resource languages.