Dynamic similarity threshold based on string length for Translation Memory lookups #18698
Closed
Labels
- Area: Automatic translation (issues related to automatic translations, automatic suggestions, fuzzy matching etc.)
- Waiting for: Release (the issue is fixed and waiting to be released)
Description
Describe the problem
When using Weblate with a large Translation Memory database, TM lookups are very slow for short strings, even on a powerful server.
Server specs:
- 27 vCPU, 39 GB RAM, NVMe storage
- PostgreSQL 16, shared_buffers = 12.5 GB, work_mem = 128 MB
- Weblate bleeding (latest)
Translation Memory: 3,392,954 records
String length distribution:
| Length | Count | % of total |
|---|---|---|
| < 30 chars | 1,492,878 | 44% |
| 30–100 chars | 1,016,353 | 30% |
| > 100 chars | 883,731 | 26% |
Average string length: 85 characters
EXPLAIN ANALYZE results (pg_trgm.similarity_threshold = 0.5):
| String | Length | Trigram candidates | Execution time |
|---|---|---|---|
| "test string" | 11 chars | 260,406 | 7,766 ms |
| "You have completed the quest" | 28 chars | 157,210 | 3,046 ms |
| "Aldmeri Dominion" | 16 chars | 24,298 | 189 ms |
Short strings can be about 40× slower than other lookups: the 11-character query takes 7,766 ms, against 189 ms for the fastest query in the table. At a 50% similarity threshold, a 10-character string needs very few matching trigrams to qualify, so hundreds of thousands of false positives are rechecked against the heap.
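To illustrate why short strings qualify so easily, here is a rough Python sketch of trigram similarity. It only approximates the rules described in the pg_trgm documentation (lowercase, split on non-alphanumeric characters, pad each word with two leading spaces and one trailing space); it is not the extension's actual code, and the helper names `trigrams`, `similarity` and `min_shared_trigrams` are made up for this example.

```python
import math
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm trigram extraction (illustrative, not the real C code)."""
    grams: set[str] = set()
    # Assumption: words are runs of ASCII letters/digits; real pg_trgm is locale-aware.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by the size of the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def min_shared_trigrams(query: str, threshold: float) -> int:
    """Lower bound on shared trigrams a candidate needs to reach the threshold."""
    return math.ceil(threshold * len(trigrams(query)))


query = "test string"
print(len(trigrams(query)))              # distinct trigrams in the query
print(min_shared_trigrams(query, 0.5))   # shared trigrams needed at 50%
print(min_shared_trigrams(query, 0.85))  # shared trigrams needed at 85%
print(similarity(query, "string"))       # a single shared word already passes 0.5
```

Under these assumptions "test string" yields only 12 trigrams, so any entry sharing 6 of them can pass a 0.5 threshold, while a 0.85 threshold would require at least 11.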
Workarounds attempted with no improvement:
work_memincreased to 128 MBREINDEX CONCURRENTLY memory_source_trgmVACUUM ANALYZE memory_memoryeffective_io_concurrency = 200
Solution brainstorm
Allow configuring similarity threshold as a function of source string length in settings.py:
Option A — length-based tiers:
SIMILARITY_THRESHOLD_SHORT = 0.85 # strings < 30 chars
SIMILARITY_THRESHOLD_MEDIUM = 0.65 # strings 30–100 chars
SIMILARITY_THRESHOLD_LONG = 0.50 # strings > 100 chars
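A minimal sketch of how the Option A tiers could be resolved at lookup time. The three setting names come from the proposal above; the function `tiered_threshold` and the exact boundary handling (30 and 100 characters both fall into the medium tier) are assumptions for illustration, not existing Weblate code.

```python
# Proposed settings from Option A (values taken from this issue).
SIMILARITY_THRESHOLD_SHORT = 0.85   # strings < 30 chars
SIMILARITY_THRESHOLD_MEDIUM = 0.65  # strings 30-100 chars
SIMILARITY_THRESHOLD_LONG = 0.50    # strings > 100 chars


def tiered_threshold(source: str) -> float:
    """Pick a similarity threshold from the length of the source string.

    Hypothetical helper: the name and signature are not part of Weblate.
    """
    length = len(source)
    if length < 30:
        return SIMILARITY_THRESHOLD_SHORT
    if length <= 100:
        return SIMILARITY_THRESHOLD_MEDIUM
    return SIMILARITY_THRESHOLD_LONG


print(tiered_threshold("test string"))
print(tiered_threshold("x" * 60))
print(tiered_threshold("x" * 250))
```

The returned value would then replace the fixed 0.5 used for the trigram lookup of that particular string.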
Option B — formula-based:
# threshold = max(MIN_SIMILARITY, 0.95 - (length / 300))
This way:
- Short strings (UI labels, "Yes"/"No", button names) → high threshold (80–90%), only near-exact matches are useful anyway
- Long strings (quest descriptions, dialogue) → lower threshold (50–60%), useful to catch sentences with a few changed words
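The Option B formula can be sketched the same way. `MIN_SIMILARITY = 0.5` is an assumed floor chosen to match the threshold used in the measurements above, and `formula_threshold` is a hypothetical name; the constants 0.95 and 300 are the ones from the proposal.

```python
MIN_SIMILARITY = 0.5  # assumed floor, matching the 0.5 threshold measured above


def formula_threshold(length: int, min_similarity: float = MIN_SIMILARITY) -> float:
    """threshold = max(MIN_SIMILARITY, 0.95 - (length / 300)), as proposed in Option B."""
    return max(min_similarity, 0.95 - (length / 300))


# Show how the threshold decays with string length.
for length in (10, 30, 100, 135, 300):
    print(length, round(formula_threshold(length), 3))
```

With these constants the threshold starts near 0.92 for a 10-character string, reaches 0.85 at 30 characters, and settles at the 0.5 floor from about 135 characters onward, which roughly matches the tiers of Option A without hard boundaries.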
Describe alternatives you have considered
- Raising the global MINIMUM_SIMILARITY threshold uniformly: degrades quality for long strings, where lower similarity is genuinely useful (catching rephrased sentences)
- PostgreSQL partial indexes per length range: unreliable, because the PostgreSQL optimizer does not guarantee which index is chosen
- Disabling shared Translation Memory: not acceptable, we intentionally share TM across game localization projects that share significant vocabulary
Screenshots
No response
Additional context
No response