# za-mafoko-translation

Research project that tests whether large language models' translations from English into South African languages such as Tshivenda can be improved with Retrieval-Augmented Generation (RAG). The approach leverages a domain-specific terminology vector database, such as one focused on South African elections, to improve translation accuracy.

## Environment Configuration

Configure the project by setting environment variables in a `.env` file placed in the root of the project:

### LLM Configuration

| Variable | Description | Accepted Values | Default |
| --- | --- | --- | --- |
| `LLM_PROVIDER` | Large language model provider | `OPENAI`, `OLLAMA` | Required |
| `LLM_NAME` | Model name | `str` | Required |
| `LLM_KEY` | API key for the LLM provider | `str` | Required |
| `LLM_TEMPERATURE` | Sampling temperature for generation | `float` | `0.1` (optional) |

### Embedding Model Configuration

| Variable | Description | Accepted Values | Default |
| --- | --- | --- | --- |
| `EMBEDDING_MODEL_PROVIDER` | Embedding model provider | `OPENAI`, `HUGGING_FACE` (inferred by provider) | Required |
| `EMBEDDING_MODEL_NAME` | Embedding model name | `str` | Required |
| `EMBEDDING_MODEL_KEY` | API key for the embedding model provider | `str` | Required |

### Vector Store Configuration

| Variable | Description | Accepted Values | Default |
| --- | --- | --- | --- |
| `VECTOR_DB_TYPE` | Vector database type | `FAISS`, `Chroma` | `FAISS` |
| `TOP_K` | Number of top results retrieved | `int` | `10` |
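Conceptually, `TOP_K` bounds how many terminology entries the retriever returns for each query. A minimal sketch of top-k cosine-similarity retrieval in plain NumPy (illustrative only; the project delegates this to FAISS or Chroma, and `retrieve_top_k` is a hypothetical name):

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, term_vecs: np.ndarray, top_k: int = 10) -> list[int]:
    """Return indices of the top_k stored vectors most similar to the query."""
    # Normalize so that dot products become cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    t = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    sims = t @ q
    # Sort by descending similarity and keep the first top_k indices.
    return np.argsort(-sims)[:top_k].tolist()
```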

## Term Extraction Strategies

The project supports and tests two main strategies for extracting terms from text:

### 1. Semantic Term Extraction

This strategy extracts contextually meaningful terminology using spaCy together with frequency- and distribution-based filtering.

| Variable | Description | Default |
| --- | --- | --- |
| `SPACY_MODEL_NAME` | English spaCy model name | `en_core_web_sm` |
| `WORD_FREQUENCY_CUTOFF` | Minimum word frequency cutoff for term selection | `1e-5` |
| `DISTRIBUTION_CUTOFF` | Distribution threshold for contextually meaningful filtering | `2` |
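In outline, a term passes this filter when it is rare in general English (below `WORD_FREQUENCY_CUTOFF`) yet appears widely enough in the input (at least `DISTRIBUTION_CUTOFF`). A pure-Python sketch of that filtering step; the stubbed frequency table and the exact semantics of the two cutoffs are assumptions, and the project obtains candidates and frequencies via spaCy and corpus data:

```python
# Stubbed general-English word frequencies; an assumption standing in for real corpus data.
WORD_FREQ = {'the': 5e-2, 'voter': 8e-6, 'ballot': 6e-6, 'roll': 3e-5}

def filter_terms(candidates: list[str],
                 doc_counts: dict[str, int],
                 freq_cutoff: float = 1e-5,
                 distribution_cutoff: int = 2) -> list[str]:
    """Keep candidates that are rare in general language but appear in at
    least distribution_cutoff of the input documents."""
    kept = []
    for term in candidates:
        rare = WORD_FREQ.get(term, 0.0) < freq_cutoff
        distributed = doc_counts.get(term, 0) >= distribution_cutoff
        if rare and distributed:
            kept.append(term)
    return kept
```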

### 2. Rare Term Extraction

This strategy extracts rare terms based on frequency and distribution thresholds, and applies deduplication to avoid redundancy.

| Variable | Description | Default |
| --- | --- | --- |
| `WORD_FREQUENCY_CUTOFF` | Minimum word frequency cutoff for rare term extraction | `1e-5` |
| `DISTRIBUTION_CUTOFF` | Distribution cutoff threshold for rare term extraction | `2` |
| `DEDUPLICATION_THRESHOLD` | Similarity threshold (0-100) for deduplicating terms | `80` |
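The 0-100 range of `DEDUPLICATION_THRESHOLD` suggests a fuzzy string-similarity ratio. A minimal sketch of threshold-based deduplication using the standard library's `difflib` (the project may well use a dedicated fuzzy-matching library, but the idea is the same):

```python
from difflib import SequenceMatcher

def deduplicate_terms(terms: list[str], threshold: int = 80) -> list[str]:
    """Drop each term whose similarity (0-100) to an already-kept term
    meets or exceeds the threshold."""
    kept: list[str] = []
    for term in terms:
        best = max((SequenceMatcher(None, term.lower(), k.lower()).ratio() * 100
                    for k in kept), default=0.0)
        if best < threshold:
            kept.append(term)
    return kept
```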

To add a new term extraction strategy, follow these steps:

  1. Create your strategy class in the src/extraction/impl/ directory and inherit from TermExtractionStrategy:

    from src.extraction.term_extraction_strategy import TermExtractionStrategy
    
    class MyCustomStrategy(TermExtractionStrategy):
        def extract_terms(self, text: str) -> list[str]:
            # Implement your extraction logic here
            return []
  2. Register your strategy name in src/enums/query_strategy_enum.py:

    from enum import Enum
    
    class QueryStrategyEnum(str, Enum):
        SEMANTIC_QUERY = 'SEMANTIC_QUERY'
        RARE_QUERY = 'RARE_QUERY'
        MY_CUSTOM_QUERY = 'MY_CUSTOM_QUERY'     # Add this line
  3. Extend the strategy factory in src/extraction/extraction_factory.py:

    from src.extraction.impl.my_custom_strategy import MyCustomStrategy
    
    STRATEGY_REGISTRY: dict[str, Type[TermExtractionStrategy]] = {
        'SEMANTIC_QUERY': SemanticTermExtractionStrategy,
        'RARE_QUERY': RareTermExtractionStrategy,
        'MY_CUSTOM_QUERY': MyCustomStrategy             # Add this line
    }

Note: Term extraction uses the strategy and factory design patterns. They automate the creation and testing of all term extraction strategies and make the system easy to extend.
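Together, the enum and the registry let the rest of the pipeline instantiate strategies by name. A self-contained sketch of that lookup, with stand-in classes and a hypothetical `create_strategy` helper rather than the project's exact code:

```python
class TermExtractionStrategy:
    """Strategy interface: each implementation pulls terms from raw text."""
    def extract_terms(self, text: str) -> list[str]:
        raise NotImplementedError

class RareTermExtractionStrategy(TermExtractionStrategy):
    def extract_terms(self, text: str) -> list[str]:
        # Stand-in logic: treat long words as "rare" terms.
        return [w for w in text.split() if len(w) > 8]

STRATEGY_REGISTRY: dict[str, type[TermExtractionStrategy]] = {
    'RARE_QUERY': RareTermExtractionStrategy,
}

def create_strategy(name: str) -> TermExtractionStrategy:
    """Factory: instantiate the registered strategy, failing loudly on unknown names."""
    try:
        return STRATEGY_REGISTRY[name]()
    except KeyError:
        raise ValueError(f'Unknown strategy: {name}') from None
```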

## Metadata

| Variable | Description | Example |
| --- | --- | --- |
| `TERMINOLOGY_FILENAME` | JSONL terminology dictionary filename without the extension; the file must be placed in the data/test/terminology directory | dsac_election_terminology_dictionary |
| `FILE_CATEGORY` | Domain or category of the file | ELECTION, MATHEMATICS |
| `TARGET_LANGUAGE` | Target language for translation | Tshivenda, Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho, Setswana, siSwati, Xitsonga, Sepedi |

To extend `FILE_CATEGORY`, go to `src/enums/file_category_enum.py` and add the new category:

    from enum import Enum

    class FileCategoryEnum(Enum):
        ELECTION = 'election'
        MATHEMATICS = 'mathematics'
        CUSTOM_CATEGORY = 'custom_category'     # Add this line

## Example .env File

    TERMINOLOGY_FILENAME=dsac_election_terminology_dictionary
    FILE_CATEGORY=ELECTION
    TARGET_LANGUAGE=Tshivenda

    LLM_PROVIDER=OPENAI
    LLM_NAME=gpt-4o-mini
    LLM_KEY=your_openai_api_key
    LLM_TEMPERATURE=0.1

    EMBEDDING_MODEL_PROVIDER=OPENAI
    EMBEDDING_MODEL_NAME=text-embedding-3-large
    EMBEDDING_MODEL_KEY=your_openai_embedding_key

    VECTOR_DB_TYPE=FAISS
    TOP_K=10

    SPACY_MODEL_NAME=en_core_web_sm
    WORD_FREQUENCY_CUTOFF=1e-5
    DISTRIBUTION_CUTOFF=2
    DEDUPLICATION_THRESHOLD=80

## Detailed Description of Each Data Directory

| Directory | Description |
| --- | --- |
| data/{{file_category_type}}-{{vectorstore_type}}-index/ | Contains the {{vectorstore_type}} index and metadata for the RAG retriever. |
| data/test/datasets/ | Stores the test datasets used for translation and evaluation: parallel corpora of English and another South African language for {{file_category_type}} sentences, e.g. eng_ven_election_dataset.jsonl. |
| data/test/results/ | All outputs from translation and evaluation runs. |
| data/test/results/plots/ | Visual comparison plots for metrics such as BLEU, CHRF, CHRF++, and BERTScore, organized by LLM name. |
| data/test/results/translation_with_llm/ | Results from baseline LLM-only translation (no retrieval), organized per LLM. |
| data/test/results/translation_with_rag/ | Results from LLM+RAG translation; subdirectories are organized by LLM (e.g. gpt-4o-mini) and strategy (e.g. semantic_query, rare_query). |
| data/test/terminology/ | Contains the terminology dictionary used in term-aware RAG translation, e.g. dsac_election_terminology_dictionary.jsonl. |
| data/tracking/ | Tracks evaluation files or test sets that have been processed and stored in the vector database, e.g. election_files.txt, which lists all election terminology files. |

## How to Run the Program

1. Create a venv or conda environment, activate it, and install the required dependencies from the project root: `pip install -r requirements.txt`

2. To run the tests and evaluations, use one of the following commands:

   • Windows / PowerShell:

     .\.conda_environment_name\bin\python.exe .\src\main.py
     .\venv\Scripts\python.exe .\src\main.py

   • WSL / Linux / macOS:

     ./.conda_environment_name/bin/python ./src/main.py
     ./venv/bin/python ./src/main.py

## Output of the Program

The output of the translation evaluation program is a set of metrics and visualizations that assess and compare the quality of translations generated by:

1. Standard LLM translation (without RAG)

2. RAG-enhanced translation using different term extraction strategies (SEMANTIC_QUERY, RARE_QUERY, etc.)

The metrics are stored in a JSONL file where each line contains evaluation results for a single sentence translation:

| Key | Description |
| --- | --- |
| model_translation | The translation generated by the LLM or RAG pipeline. |
| reference_translation | The closest matching reference sentence from the dataset, selected by highest BERTScore F1. |
| bleu_score | BLEU score measuring n-gram precision overlap with the reference. |
| chrf_score | Character F-score (CHRF), better suited to morphologically rich languages. |
| chrf_plus_plus_score | CHRF++ with word order of 2. |
| bert_score | BERTScore: P (precision), R (recall), and F1 (harmonic mean of P and R). |
| index (optional) | The sentence index in the dataset. |

The results currently in the data/test/results directory were computed using OpenAI's text-embedding-3-large embedding model and the llama3:8b and OpenAI gpt-4o-mini LLMs, with the default parameters.

From the experiments, it is evident that RAG with different term extraction techniques can improve LLM translation accuracy for low-resource languages.
