jordimas/SoftTransEval-CAT
Introduction

The goal of this repository is to evaluate and develop prompts for LLMs that can be used to review English-to-Catalan translations in the software localization domain.

This work is used in real-world scenarios in the context of reviewing open-source translations.

This repository includes three main components:

  • Evaluation dataset
  • Tool to evaluate LLMs and prompts against the dataset
  • Collection of winning prompts for the task

This work has been used to review the Catalan translations of the GNOME project.

Dataset

The dataset is located at dataset/dataset.tmx.

The dataset has the following characteristics:

  • English–Catalan only
  • Contains 1000 translations from the GNOME UI and documentation projects
  • Approximately 5% of the translations contain errors (the dataset is imbalanced); the errors have been reviewed and corrected by humans

Evaluation of different models and prompts

The prompts are in the config/ directory.

Discarded models

| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 | 1 | This is the baseline | 37 | 14 | 438 | 511 | 0.08 | 0.73 | 0.14 | 55105 |
| mistral | 1 | This is the baseline | 50 | 1 | 787 | 162 | 0.06 | 0.98 | 0.11 | 64164 |

Selected model

| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3 | 1 | This is the baseline | 16 | 35 | 33 | 916 | 0.33 | 0.31 | 0.32 | 3476 |
| gemma3 | 2 | Pure instructions prompt | 18 | 33 | 50 | 899 | 0.26 | 0.35 | 0.3 | 3732 |
| gemma3 | 2_1 | Pure instructions prompt v2.1 | 19 | 32 | 56 | 893 | 0.25 | 0.37 | 0.3 | 3701 |
| gemma3 | 3 | Prompt with samples | 14 | 37 | 41 | 908 | 0.25 | 0.27 | 0.26 | 4156 |
| gemma3 | 3_1 | Prompt with samples v3.1 | 8 | 43 | 12 | 937 | 0.4 | 0.16 | 0.23 | 3100 |
| gemma3 | 4 | Super simple prompt | 34 | 17 | 210 | 739 | 0.14 | 0.67 | 0.23 | 8995 |
| gemma3 | 5 | Categorization prompt | 33 | 18 | 214 | 735 | 0.13 | 0.65 | 0.22 | 5974 |

Notes:

  • Gemma 3 refers to the Gemma 3 27B model quantized to 8 bits
  • The evaluation is done over 1000 strings, of which 5.10% contain errors and 94.90% are correct

For reference, these are the metrics of some commercial models:

| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-5 | 1 | Default prompt description | 11 | 40 | 11 | 938 | 0.5 | 0.22 | 0.3 | 5424 |
| gpt-5-mini | 1 | Default prompt description | 13 | 38 | 22 | 927 | 0.37 | 0.25 | 0.3 | 10528 |
| gemini-2.5-flash | 1 | Default prompt description | 13 | 38 | 18 | 931 | 0.42 | 0.25 | 0.32 | 5573 |
| gemini-2.5-pro | 1 | Default prompt description | 14 | 37 | 17 | 932 | 0.45 | 0.27 | 0.34 | 11902 |

Legend:

  • version: version of the prompt
  • comment: comment that describes the prompt
  • tp: true positives
  • fn: false negatives
  • fp: false positives
  • tn: true negatives

If you are not familiar with these concepts, see the confusion matrix article on Wikipedia.
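For reference, this is how the precision, recall, and f1 columns follow from the tp/fn/fp/tn counts, treating a translation error as the positive class. This is a minimal sketch; metrics is an illustrative helper, not code from this repository:

```python
# Derive the precision / recall / f1 columns in the tables above
# from the confusion-matrix counts ("error" is the positive class).
def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# gemma3 version 1 from the table above:
print(metrics(tp=16, fn=35, fp=33, tn=916))  # (0.33, 0.31, 0.32)
```

Note how the low precision values reflect the 5% class imbalance: even a model with a modest false-positive rate flags many correct strings, because correct strings outnumber errors roughly 19 to 1.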

Using the system to review your translation

Our current recommendation is Gemma 3 27B with prompt version 1.

If you have a file in PO format that you want to review, follow these instructions.

  1. Install the necessary dependencies:
     pip install -r evaluator/requirements-evaluator.txt
  2. Download the model:
     make download-models
  3. Run it on your own PO file:
     python evaluator/inference.py --input FILE.po

The output is a FILE.txt file containing all the detected errors. Expect the system to generate a large number of false positives, but the true positives it finds are very useful.
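If you want to locate the flagged strings programmatically, here is a minimal sketch of extracting msgid/msgstr pairs from a simple PO file. It is illustrative only: it ignores plural forms, multi-line strings, and obsolete entries, and po_pairs is not part of this repository:

```python
# Minimal sketch for pulling msgid/msgstr pairs out of a PO file so that
# strings flagged in FILE.txt can be located. Handles only the simple
# single-line case (no plurals, no multi-line strings, no obsolete entries).
import re

PAIR = re.compile(r'^msgid "(.*)"\n^msgstr "(.*)"$', re.MULTILINE)

def po_pairs(text):
    """Return (msgid, msgstr) pairs, skipping the empty PO header entry."""
    return [(i, s) for i, s in PAIR.findall(text) if i]
```

For real-world PO files (which routinely use multi-line strings and plural forms), a dedicated parser is a better choice than this sketch.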
