The goal of this repository is to evaluate and develop prompts for LLMs that can be used to review English-to-Catalan translations in the software localization domain.
This work is used in real-world scenarios to review open-source translations.
This repository includes three main components:
- Evaluation dataset
- Tool to evaluate LLMs and prompts against the dataset
- Collection of winning prompts for the task
This work has been used to review the Catalan translations of the GNOME project.
The dataset is located at dataset/dataset.tmx (a sketch for loading it is shown after the list below).
The dataset has the following characteristics:
- English to Catalan only
- Contains 1000 translations from the GNOME UI and documentation projects
- Contains roughly 5% translation errors (the dataset is imbalanced), which have been reviewed and corrected by humans
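For quick inspection of the dataset, here is a minimal sketch for loading the English-Catalan pairs from the TMX file. It assumes a standard TMX layout (one tu element per translation unit, with tuv elements carrying xml:lang attributes); only the file path comes from this repository, everything else is a generic TMX assumption.

```python
# Minimal sketch: load English-Catalan pairs from the TMX dataset.
# Assumes a standard TMX layout (<tu>/<tuv xml:lang=...>/<seg>); adjust if the
# actual file uses different language codes or attribute names.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def load_tmx_pairs(path="dataset/dataset.tmx"):
    pairs = []
    root = ET.parse(path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the base language code (e.g. "en" from "en-US").
                segs[lang.split("-")[0]] = "".join(seg.itertext())
        if "en" in segs and "ca" in segs:
            pairs.append((segs["en"], segs["ca"]))
    return pairs

if __name__ == "__main__":
    pairs = load_tmx_pairs()
    print(f"Loaded {len(pairs)} translation pairs")
```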
The prompts are in the config/ directory.
Models discarded
| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
|---|---|---|---|---|---|---|---|---|---|---|
| qwen3 | 1 | This is the baseline | 37 | 14 | 438 | 511 | 0.08 | 0.73 | 0.14 | 55105 |
| mistral | 1 | This is the baseline | 50 | 1 | 787 | 162 | 0.06 | 0.98 | 0.11 | 64164 |
Model selected
| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma3 | 1 | This is the baseline | 16 | 35 | 33 | 916 | 0.33 | 0.31 | 0.32 | 3476 |
| gemma3 | 2 | Pure instructions prompt | 18 | 33 | 50 | 899 | 0.26 | 0.35 | 0.3 | 3732 |
| gemma3 | 2_1 | Pure instructions prompt v2.1 | 19 | 32 | 56 | 893 | 0.25 | 0.37 | 0.3 | 3701 |
| gemma3 | 3 | Prompt with samples | 14 | 37 | 41 | 908 | 0.25 | 0.27 | 0.26 | 4156 |
| gemma3 | 3_1 | Prompt with samples v3.1 | 8 | 43 | 12 | 937 | 0.4 | 0.16 | 0.23 | 3100 |
| gemma3 | 4 | Super simple prompt | 34 | 17 | 210 | 739 | 0.14 | 0.67 | 0.23 | 8995 |
| gemma3 | 5 | Categorization prompt | 33 | 18 | 214 | 735 | 0.13 | 0.65 | 0.22 | 5974 |
Notes:
- Gemma 3 refers to the Gemma 3 27B model quantized to 8 bits
- The evaluation is done over 1000 strings, of which 5.10% contain errors and 94.90% are correct
For reference, these are the metrics of some commercial models:
| model | version | comment | tp | fn | fp | tn | precision | recall | f1 | time |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-5 | 1 | Default prompt description | 11 | 40 | 11 | 938 | 0.5 | 0.22 | 0.3 | 5424 |
| gpt-5-mini | 1 | Default prompt description | 13 | 38 | 22 | 927 | 0.37 | 0.25 | 0.3 | 10528 |
| gemini-2.5-flash | 1 | Default prompt description | 13 | 38 | 18 | 931 | 0.42 | 0.25 | 0.32 | 5573 |
| gemini-2.5-pro | 1 | Default prompt description | 14 | 37 | 17 | 932 | 0.45 | 0.27 | 0.34 | 11902 |
Legend:
- version: version of the prompt
- comment: comment that describes the prompt
- tp: true positive
- fn: false negative
- fp: false positive
- tn: true negative
If you are not familiar with these concepts, see the confusion matrix article on Wikipedia.
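As a sanity check, the precision, recall, and F1 figures in the tables can be reproduced from the tp/fn/fp counts. A minimal sketch, using the values from the gemma3 version 1 row above:

```python
# Reproduce precision/recall/F1 from confusion-matrix counts.
# The counts below are taken from the gemma3 version 1 row (tp=16, fn=35, fp=33, tn=916).
def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)   # fraction of flagged strings that are real errors
    recall = tp / (tp + fn)      # fraction of real errors that get flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = metrics(tp=16, fn=35, fp=33, tn=916)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# -> precision=0.33 recall=0.31 f1=0.32, matching the table above
```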
Our current recommendation is Gemma 3 27B with prompt version 1.
If you have a file in PO format that you want to review, follow these instructions.
- Install the necessary dependencies:

```
pip install -r evaluator/requirements-evaluator.txt
```

- Download the model:

```
make download-models
```

- Run it on your own PO file:

```
python evaluator/inference.py --input FILE.po
```

The output is a FILE.txt file with all the detected errors. Expect the system to generate a large number of false positives, but the true positives are very useful.
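The repository's own scripts handle the PO file for you, but if you want to see which English-Catalan pairs will be reviewed, a small sketch using the polib library can list the translated entries. This is only an illustration: polib is an assumption here and is not necessarily what evaluator/inference.py uses internally.

```python
# Sketch: list the translated English-Catalan pairs in a PO file using polib.
# polib is assumed for illustration only; install it with "pip install polib".
import polib

po = polib.pofile("FILE.po")
for entry in po.translated_entries():
    print(f"EN: {entry.msgid}")
    print(f"CA: {entry.msgstr}")
```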