The goal of this repository is to evaluate and develop prompts for LLMs that can be used to review English-to-Catalan translations in the software localization domain.
This work is used in real-world scenarios to review open-source translations.
This repository includes three main components:
- Evaluation dataset
- Tool to evaluate LLMs and prompts against the dataset
- Collection of winning prompts for the task
This work has been used to review the Catalan translations of the GNOME project.
The dataset is located at dataset/dataset.tmx (a sketch for loading it is shown after the list below).
The dataset has the following characteristics:
- English to Catalan only
- Contains 1000 translations from the GNOME UI and documentation projects
- Contains roughly 5% translation errors (the dataset is imbalanced), which have been reviewed and corrected by humans
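For quick inspection of the dataset, here is a minimal sketch for loading the English-Catalan pairs from the TMX file. It assumes a standard TMX layout (one tu element per translation unit, with tuv elements carrying xml:lang attributes); only the file path comes from this repository, everything else is a generic TMX assumption.

```python
# Minimal sketch: load English-Catalan pairs from the TMX dataset.
# Assumes a standard TMX layout (<tu>/<tuv xml:lang=...>/<seg>); adjust if the
# actual file uses different language codes or attribute names.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def load_tmx_pairs(path="dataset/dataset.tmx"):
    pairs = []
    root = ET.parse(path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the base language code (e.g. "en" from "en-US").
                segs[lang.split("-")[0]] = "".join(seg.itertext())
        if "en" in segs and "ca" in segs:
            pairs.append((segs["en"], segs["ca"]))
    return pairs

if __name__ == "__main__":
    pairs = load_tmx_pairs()
    print(f"Loaded {len(pairs)} translation pairs")
```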
The prompts are in the config/ directory.
Models discarded
| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
|---|---|---|---|---|---|---|---|---|---|---|
| qwen3 | 1 | This is the baseline | 37 | 14 | 438 | 511 | 0.08 | 0.73 | 0.14 | 55105 |
| mistral | 1 | This is the baseline | 50 | 1 | 787 | 162 | 0.06 | 0.98 | 0.11 | 64164 |
Model selected
| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma3 | 1 | This is the baseline | 16 | 35 | 33 | 916 | 0.33 | 0.31 | 0.32 | 3476 |
| gemma3 | 2 | Pure instructions prompt | 18 | 33 | 50 | 899 | 0.26 | 0.35 | 0.3 | 3732 |
| gemma3 | 2_1 | Pure instructions prompt v2.1 | 19 | 32 | 56 | 893 | 0.25 | 0.37 | 0.3 | 3701 |
| gemma3 | 3 | Prompt with samples | 14 | 37 | 41 | 908 | 0.25 | 0.27 | 0.26 | 4156 |
| gemma3 | 3_1 | Prompt with samples v3.1 | 8 | 43 | 12 | 937 | 0.4 | 0.16 | 0.23 | 3100 |
| gemma3 | 4 | Super simple prompt | 34 | 17 | 210 | 739 | 0.14 | 0.67 | 0.23 | 8995 |
| gemma3 | 5 | Categorization prompt | 33 | 18 | 214 | 735 | 0.13 | 0.65 | 0.22 | 5974 |
Notes:
- Gemma 3 refers to the Gemma 3 27B model quantized to 8 bits
- The evaluation is done over 1000 strings, of which 5.10% contain errors and 94.90% are correct
For reference, these are the metrics of some commercial models:
| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-5 | 1 | Default prompt description | 11 | 40 | 11 | 938 | 0.5 | 0.22 | 0.3 | 5424 |
| gpt-5-mini | 1 | Default prompt description | 13 | 38 | 22 | 927 | 0.37 | 0.25 | 0.3 | 10528 |
| gemini-2.5-flash | 1 | Default prompt description | 13 | 38 | 18 | 931 | 0.42 | 0.25 | 0.32 | 5573 |
| gemini-2.5-pro | 1 | Default prompt description | 14 | 37 | 17 | 932 | 0.45 | 0.27 | 0.34 | 11902 |
Legend:
- version: version of the prompt
- comment: comment that describes the prompt
- tp: true positive
- fn: false negative
- fp: false positive
- tn: true negative
If you are not familiar with these concepts, see the confusion matrix article on Wikipedia.
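As a sanity check, the precision, recall, and F1 figures in the tables can be reproduced from the tp/fn/fp counts. A minimal sketch, using the values from the gemma3 version 1 row above:

```python
# Reproduce precision/recall/F1 from confusion-matrix counts.
# The counts below are taken from the gemma3 version 1 row (tp=16, fn=35, fp=33, tn=916).
def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)   # fraction of flagged strings that are real errors
    recall = tp / (tp + fn)      # fraction of real errors that get flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = metrics(tp=16, fn=35, fp=33, tn=916)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# -> precision=0.33 recall=0.31 f1=0.32, matching the table above
```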
Our current recommendation is Gemma 3 27B with prompt version 1.
If you have a file in PO format that you want to review, follow these instructions.
- Install the necessary dependencies:

```
pip install -r evaluator/requirements-evaluator.txt
```

- Download the model:

```
make download-models
```

- Run it on your own PO file:

```
python evaluator/inference.py --input FILE.po
```

The output is a FILE.txt file with all the detected errors. Expect the system to generate a large number of false positives, but the true positives are very useful.
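The repository's own scripts handle the PO file for you, but if you want to see which English-Catalan pairs will be reviewed, a small sketch using the polib library can list the translated entries. This is only an illustration: polib is an assumption here and is not necessarily what evaluator/inference.py uses internally.

```python
# Sketch: list the translated English-Catalan pairs in a PO file using polib.
# polib is assumed for illustration only; install it with "pip install polib".
import polib

po = polib.pofile("FILE.po")
for entry in po.translated_entries():
    print(f"EN: {entry.msgid}")
    print(f"CA: {entry.msgstr}")
```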