Dynamic similarity threshold based on string length for Translation Memory lookups #18698

@tomkolp

Describe the problem

When using Weblate with a large Translation Memory database, TM lookups are very slow for short strings, even on a powerful server.

Server specs:

  • 27 vCPU, 39 GB RAM, NVMe storage
  • PostgreSQL 16, shared_buffers = 12.5 GB, work_mem = 128 MB
  • Weblate bleeding (latest)

Translation Memory: 3,392,954 records

String length distribution:

Length          Count        % of total
< 30 chars      1,492,878    44%
30–100 chars    1,016,353    30%
> 100 chars       883,731    26%

Average string length: 85 characters

EXPLAIN ANALYZE results (pg_trgm.similarity_threshold = 0.5):

String                            Length      Trigram candidates    Execution time
"test string"                     11 chars    260,406               7,766 ms
"You have completed the quest"    28 chars    157,210               3,046 ms
"Aldmeri Dominion"                16 chars    24,298                189 ms

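Among these short strings, the slowest lookup ("test string", 7,766 ms) is about 40× slower than the fastest ("Aldmeri Dominion", 189 ms), and execution time grows with the number of trigram candidates. At a 50% similarity threshold, a 10-character string needs very few matching trigrams to qualify, resulting in hundreds of thousands of false positives rechecked against the heap.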
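
To illustrate why a short query needs so few matching trigrams, here is a rough Python sketch of pg_trgm-style trigram extraction. It approximates the documented behavior (lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space). The function names are hypothetical, and this is neither Weblate nor PostgreSQL code.

```python
import math
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction (an illustration, not the real C code)."""
    # Split into alphanumeric words; underscores and punctuation act as separators.
    words = re.findall(r"[^\W_]+", text.lower())
    result = set()
    for word in words:
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def min_shared_trigrams(query, threshold):
    """Fewest shared trigrams a candidate needs to possibly reach `threshold`.

    similarity <= shared / len(query trigrams), so shared >= threshold * n
    is a necessary condition. round() guards against float noise before ceil().
    """
    n = len(trigrams(query))
    return math.ceil(round(threshold * n, 9))


if __name__ == "__main__":
    query = "test string"
    print(len(trigrams(query)))              # number of distinct query trigrams
    print(min_shared_trigrams(query, 0.50))  # bound at the current threshold
    print(min_shared_trigrams(query, 0.85))  # bound at a stricter threshold
```

Under this approximation, "test string" yields 12 distinct trigrams, so at a threshold of 0.5 a row only has to share 6 of them to remain a candidate, while at 0.85 it would have to share 11.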

Workarounds attempted with no improvement:

  • work_mem increased to 128 MB
  • REINDEX CONCURRENTLY memory_source_trgm
  • VACUUM ANALYZE memory_memory
  • effective_io_concurrency = 200

Solution brainstorm

Allow configuring similarity threshold as a function of source string length in settings.py:

Option A (length-based tiers):

SIMILARITY_THRESHOLD_SHORT = 0.85   # strings < 30 chars
SIMILARITY_THRESHOLD_MEDIUM = 0.65  # strings 30–100 chars
SIMILARITY_THRESHOLD_LONG = 0.50    # strings > 100 chars

Option B (formula-based):

# threshold = max(MIN_SIMILARITY, 0.95 - (length / 300))

This way:

  • Short strings (UI labels, "Yes"/"No", button names) → high threshold (80–90%); only near-exact matches are useful anyway
  • Long strings (quest descriptions, dialogue) → lower threshold (50–60%), useful for catching sentences with a few changed words
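
Both options can be sketched as plain Python functions to compare the thresholds they produce. This is only an illustration: the function names are hypothetical, the tier boundaries follow the comments in Option A, and the default floor of 0.50 for MIN_SIMILARITY is an assumption taken from the pg_trgm threshold used in the measurements, not an existing Weblate setting.

```python
def tiered_threshold(length, short=0.85, medium=0.65, long=0.50):
    """Option A: pick a fixed threshold per length tier."""
    if length < 30:
        return short
    if length <= 100:
        return medium
    return long


def formula_threshold(length, min_similarity=0.50):
    """Option B: threshold decreases linearly with length, down to a floor.

    min_similarity=0.50 is an assumed default for illustration.
    """
    return max(min_similarity, 0.95 - length / 300)


if __name__ == "__main__":
    # Compare both options for a few representative source lengths.
    for length in (11, 28, 85, 135, 300):
        print(length, tiered_threshold(length), round(formula_threshold(length), 3))
```

For the 11-character "test string", Option A gives 0.85 and Option B about 0.91. With the assumed floor of 0.50, the formula reaches that floor at 135 characters and stays there for longer strings.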

Describe alternatives you have considered

  • Raising the global MINIMUM_SIMILARITY threshold uniformly: degrades quality for long strings, where lower similarity is genuinely useful (catching rephrased sentences)
  • PostgreSQL partial indexes per length range: the PostgreSQL optimizer does not guarantee which index is chosen, so this is unreliable
  • Disabling shared Translation Memory: not acceptable, since we intentionally share TM across game localization projects that share significant vocabulary

Labels:

  • Area: Automatic translation (issues related to automatic translations, automatic suggestions, fuzzy matching, etc.)
  • Waiting for: Release (the issue is fixed and waiting to be released)
