A research project that tests whether large language models' translations from English into other South African languages, such as Tshivenda, can be improved by using Retrieval-Augmented Generation (RAG). The approach leverages a domain-specific terminology vector database, such as one focused on South African elections, to enhance the models' translation accuracy.
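To make the idea concrete, the sketch below shows one way retrieved terminology could be injected into a translation prompt. It is a conceptual illustration only: the prompt wording, function name, and terminology field names (`english`, `translation`) are placeholders, not the project's actual code or data schema.

```python
# Conceptual sketch of term-aware RAG translation (not the project's code):
# retrieved terminology entries are added to the prompt before the LLM
# translates the sentence. All names here are placeholders.
def build_translation_prompt(sentence: str, terminology: list[dict], target_language: str) -> str:
    glossary = "\n".join(f"- {t['english']} -> {t['translation']}" for t in terminology)
    return (
        f"Translate the following English sentence into {target_language}.\n"
        f"Use this domain terminology where applicable:\n{glossary}\n\n"
        f"Sentence: {sentence}"
    )
```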
Configure the project by setting environment variables in a `.env` file. The file must be placed in the root of the project:
| Variable | Description | Accepted Values | Default |
|---|---|---|---|
| `LLM_PROVIDER` | Large language model provider | `OPENAI`, `OLLAMA` | Required |
| `LLM_NAME` | Model name | str | Required |
| `LLM_KEY` | API key for the LLM provider | str | Required |
| `LLM_TEMPERATURE` | Sampling temperature for generation | float | `0.1` (optional) |
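As a quick illustration (not the repository's actual loader), the settings above could be read with `python-dotenv` as follows; only the variable names and defaults come from the table, everything else is an assumed sketch:

```python
# Minimal sketch (assumed helper, not part of the repo): read the LLM settings
# from the .env file in the project root using python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # loads variables from .env into the process environment

llm_provider = os.environ["LLM_PROVIDER"]                     # OPENAI or OLLAMA
llm_name = os.environ["LLM_NAME"]                             # e.g. gpt-4o-mini
llm_key = os.environ["LLM_KEY"]                               # provider API key
llm_temperature = float(os.getenv("LLM_TEMPERATURE", "0.1"))  # optional, defaults to 0.1
```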
| Variable | Description | Accepted Values | Default |
|---|---|---|---|
| `EMBEDDING_MODEL_PROVIDER` | Embedding model provider | `OPENAI`, `HUGGING_FACE` (inferred by provider) | Required |
| `EMBEDDING_MODEL_NAME` | Embedding model name | str | Required |
| `EMBEDDING_MODEL_KEY` | API key for the embedding model provider | str | Required |

| Variable | Description | Accepted Values | Default |
|---|---|---|---|
| `VECTOR_DB_TYPE` | Vector database type | `FAISS`, `Chroma` | `FAISS` |
| `TOP_K` | Number of top results retrieved | int | `10` |
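For intuition, the sketch below shows how a FAISS index could be queried for the `TOP_K` nearest terminology entries. The function and variable names are illustrative assumptions, not the project's retriever API; `index` is assumed to be a loaded `faiss.Index` and `query_vector` an embedding produced by the configured embedding model.

```python
# Illustrative sketch only: query a FAISS index for the TOP_K nearest
# terminology embeddings. Not the repository's retriever implementation.
import os
import numpy as np
import faiss

TOP_K = int(os.getenv("TOP_K", "10"))

def retrieve_nearest(index: faiss.Index, query_vector: np.ndarray, k: int = TOP_K):
    query = query_vector.reshape(1, -1).astype("float32")  # FAISS expects 2-D float32 input
    distances, ids = index.search(query, k)                 # nearest-neighbour search
    return list(zip(ids[0].tolist(), distances[0].tolist()))
```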
The project supports and tests two main strategies for extracting terms from text:
This strategy focuses on extracting contextually meaningful terms using spaCy together with frequency and distribution filtering.
| Variable | Description | Default |
|---|---|---|
| `SPACY_MODEL_NAME` | English spaCy model name | `en_core_web_sm` |
| `WORD_FREQUENCY_CUTOFF` | Minimum word-frequency cutoff for term selection | `1e-5` |
| `DISTRIBUTION_CUTOFF` | Distribution threshold for contextually meaningful filtering | `2` |
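A rough sketch of the idea behind this strategy is shown below. It assumes the `wordfreq` package for corpus frequencies and keeps only noun phrases containing a word rarer than `WORD_FREQUENCY_CUTOFF`; the actual implementation (including the `DISTRIBUTION_CUTOFF` filtering, omitted here) lives in `src/extraction/impl/` and may differ.

```python
# Hypothetical sketch of spaCy-based term extraction with a frequency filter.
# The real strategy in src/extraction/impl/ may use different heuristics.
import spacy
from wordfreq import word_frequency

nlp = spacy.load("en_core_web_sm")  # SPACY_MODEL_NAME

def semantic_candidate_terms(text: str, frequency_cutoff: float = 1e-5) -> list[str]:
    doc = nlp(text)
    terms = []
    for chunk in doc.noun_chunks:  # candidate noun phrases
        # Keep the phrase if its rarest word falls below the frequency cutoff.
        if any(word_frequency(tok.text.lower(), "en") < frequency_cutoff for tok in chunk):
            terms.append(chunk.text.lower().strip())
    return terms
```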
This strategy targets extraction of rare terms based on frequency and distribution thresholds, and applies deduplication to avoid redundancy.
| Variable | Description | Default |
|---|---|---|
| `WORD_FREQUENCY_CUTOFF` | Minimum word-frequency cutoff for rare-term extraction | `1e-5` |
| `DISTRIBUTION_CUTOFF` | Distribution cutoff threshold for rare-term extraction | `2` |
| `DEDUPLICATION_THRESHOLD` | Similarity threshold (0-100) for deduplicating terms | `80` |
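Because the threshold is on a 0-100 scale, the deduplication step presumably relies on a fuzzy string ratio. The sketch below uses `rapidfuzz` purely for illustration and is not the repository's implementation:

```python
# Illustrative deduplication sketch (assumes rapidfuzz; the real logic is in
# src/extraction/impl/): drop terms that are near-duplicates of terms already kept.
from rapidfuzz import fuzz

def deduplicate_terms(terms: list[str], threshold: int = 80) -> list[str]:
    kept: list[str] = []
    for term in terms:
        # Keep the term only if it is not too similar to any term already kept.
        if all(fuzz.ratio(term, existing) < threshold for existing in kept):
            kept.append(term)
    return kept
```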
To add a new term extraction strategy, follow these steps:
- Create your strategy class in the `src/extraction/impl/` directory and inherit from `TermExtractionStrategy`:

  ```python
  from src.extraction.term_extraction_strategy import TermExtractionStrategy

  class MyCustomStrategy(TermExtractionStrategy):
      def extract_terms(self, text: str) -> list[str]:
          # Implement your extraction logic here
          return []
  ```

- Register your strategy name in `src/enums/query_strategy_enum.py`:

  ```python
  from enum import Enum

  class QueryStrategyEnum(str, Enum):
      SEMANTIC_QUERY = 'SEMANTIC_QUERY'
      RARE_QUERY = 'RARE_QUERY'
      MY_CUSTOM_QUERY = 'MY_CUSTOM_QUERY'  # Add this line
  ```

- Extend the strategy factory in `src/extraction/extraction_factory.py`:

  ```python
  from typing import Type

  from src.extraction.impl.my_custom_strategy import MyCustomStrategy
  # Existing imports for TermExtractionStrategy, SemanticTermExtractionStrategy
  # and RareTermExtractionStrategy are omitted here.

  STRATEGY_REGISTRY: dict[str, Type[TermExtractionStrategy]] = {
      'SEMANTIC_QUERY': SemanticTermExtractionStrategy,
      'RARE_QUERY': RareTermExtractionStrategy,
      'MY_CUSTOM_QUERY': MyCustomStrategy  # Add this line
  }
  ```
Note: Term extraction uses the Strategy and Factory design patterns. They automate the creation and testing of all term extraction strategies and keep the code open for extension.
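For reference, a hypothetical use of the registry might look like the following. The repository may expose a dedicated factory function instead; only the module paths and enum values above are taken from this README, and the no-argument constructor is an assumption.

```python
# Hypothetical usage sketch: look up a strategy class in the registry and use it.
from src.enums.query_strategy_enum import QueryStrategyEnum
from src.extraction.extraction_factory import STRATEGY_REGISTRY

strategy = STRATEGY_REGISTRY[QueryStrategyEnum.SEMANTIC_QUERY.value]()
terms = strategy.extract_terms("Voters must register before the municipal election.")
print(terms)
```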
| Variable | Description | Example |
|---|---|---|
| `TERMINOLOGY_FILENAME` | JSONL terminology dictionary filename without extension; the file must be placed in the `data/test/terminology` directory | `dsac_election_terminology_dictionary` |
| `FILE_CATEGORY` | Domain or category of the file | `ELECTION`, `MATHEMATICS` |
| `TARGET_LANGUAGE` | Target language for translation | `Tshivenda`, `Afrikaans`, `English`, `isiNdebele`, `isiXhosa`, `isiZulu`, `Sesotho`, `Setswana`, `siSwati`, `Xitsonga`, `Sepedi` |
To extend `FILE_CATEGORY`, go to `src/enums/file_category_enum.py` and add the new category:

```python
from enum import Enum

class FileCategoryEnum(Enum):
    ELECTION = 'election'
    MATHEMATICS = 'mathematics'
    CUSTOM_CATEGORY = 'custom_category'  # Add this line
```

A complete example `.env` file:

```
TERMINOLOGY_FILENAME=dsac_election_terminology_dictionary
FILE_CATEGORY=ELECTION
TARGET_LANGUAGE=Tshivenda
LLM_PROVIDER=OPENAI
LLM_NAME=gpt-4o-mini
LLM_KEY=your_openai_api_key
LLM_TEMPERATURE=0.1
EMBEDDING_MODEL_PROVIDER=OPENAI
EMBEDDING_MODEL_NAME=text-embedding-3-large
EMBEDDING_MODEL_KEY=your_openai_embedding_key
VECTOR_DB_TYPE=FAISS
TOP_K=10
SPACY_MODEL_NAME=en_core_web_sm
WORD_FREQUENCY_CUTOFF=1e-5
DISTRIBUTION_CUTOFF=2
DEDUPLICATION_THRESHOLD=80
```

| Directory | Description |
|---|---|
| `data/{{file_category_type}}-{{vectorstore_type}}-index/` | Contains the `{{vectorstore_type}}` index and metadata for the RAG retriever. |
| `data/test/datasets/` | Stores the test datasets used for translation and evaluation: parallel corpora of English and another South African language `{{file_category_type}}` sentences, e.g. `eng_ven_election_dataset.jsonl`. |
| `data/test/results/` | All outputs from translation and evaluation runs. |
| `data/test/results/plots/` | Visual comparison plots for metrics such as BLEU, CHRF, CHRF++, and BERTScore, organized by LLM name. |
| `data/test/results/translation_with_llm/` | Results from baseline LLM-only translation (no retrieval), organized per LLM. |
| `data/test/results/translation_with_rag/` | Results from LLM+RAG translation. Subdirectories are organized by LLM (e.g. `gpt-4o-mini`) and strategy (e.g. `semantic_query`, `rare_query`). |
| `data/test/terminology/` | Contains the terminology dictionary used in term-aware RAG translation, e.g. `dsac_election_terminology_dictionary.jsonl`. |
| `data/tracking/` | Tracks evaluation files or test sets that have been processed and stored in the vector database, e.g. `election_files.txt`, which lists all election terminology files. |
- Create a `venv` or `conda` environment, activate it, and install the required dependencies by running the following command in the project root: `pip install -r requirements.txt`

- To run the tests and evaluations, use one of the following commands:

  - Windows / PowerShell:

    - `.\.conda_environment_name\bin\python.exe .\src\main.py`
    - `.\venv\Scripts\python.exe .\src\main.py`

  - WSL / Linux / macOS:

    - `./.conda_environment_name/bin/python ./src/main.py`
    - `./venv/bin/python ./src/main.py`
The output of the translation evaluation program is a set of metrics and visualizations that assess and compare the quality of translations generated by:
- Standard LLM translation (without RAG)
- RAG-enhanced translation using different term extraction strategies (`SEMANTIC_QUERY`, `RARE_QUERY`, etc.)
The metrics are stored in a JSONL file where each line contains evaluation results for a single sentence translation:
| Key | Description |
|---|---|
| `model_translation` | The translation generated by the LLM or RAG pipeline. |
| `reference_translation` | The closest matching reference translation from the dataset, selected by highest BERTScore F1. |
| `bleu_score` | BLEU score measuring n-gram precision overlap with the reference. |
| `chrf_score` | Character F-score (CHRF), better suited to morphologically rich languages. |
| `chrf_plus_plus_score` | CHRF++ with a word order of 2. |
| `bert_score` | `P`: precision, `R`: recall, `F1`: harmonic mean of P and R. |
| `index` (optional) | The sentence index in the dataset. |
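As a rough guide to how such per-sentence metrics can be computed, the sketch below uses `sacrebleu` and `bert-score`. It is not the project's evaluation code, and the multilingual BERTScore model choice is an assumption:

```python
# Illustrative per-sentence scoring sketch (not the repository's evaluator).
from sacrebleu.metrics import BLEU, CHRF
from bert_score import score as bert_score

bleu = BLEU(effective_order=True)   # sentence-level BLEU
chrf = CHRF()                       # CHRF
chrf_pp = CHRF(word_order=2)        # CHRF++ (word order of 2)

def evaluate_sentence(model_translation: str, reference_translation: str) -> dict:
    # Assumed multilingual BERTScore model; the project may use a different one.
    P, R, F1 = bert_score([model_translation], [reference_translation],
                          model_type="bert-base-multilingual-cased")
    return {
        "bleu_score": bleu.sentence_score(model_translation, [reference_translation]).score,
        "chrf_score": chrf.sentence_score(model_translation, [reference_translation]).score,
        "chrf_plus_plus_score": chrf_pp.sentence_score(model_translation, [reference_translation]).score,
        "bert_score": {"P": P.item(), "R": R.item(), "F1": F1.item()},
    }
```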
The results currently in the `data/test/results` directory were computed using OpenAI's `text-embedding-3-large` embedding model and two LLMs, Meta's `llama3:8b` and OpenAI's `gpt-4o-mini`, with the default parameters.
From the experiments, it is evident that RAG with different term extraction techniques can improve LLM translation accuracy for low-resource languages.