Dynamic similarity threshold based on string length for Translation Memory lookups #18698

### Describe the problem

When using Weblate with a large Translation Memory database, TM lookups are very slow for short strings, even on a powerful server.

**Server specs:**
- 27 vCPU, 39 GB RAM, NVMe storage
- PostgreSQL 16, `shared_buffers = 12.5 GB`, `work_mem = 128 MB`
- Weblate `bleeding` (latest)

**Translation Memory: 3,392,954 records**

String length distribution:

| Length | Count | % of total |
|--------|-------|------------|
| < 30 chars | 1,492,878 | 44% |
| 30–100 chars | 1,016,353 | 30% |
| > 100 chars | 883,731 | 26% |

Average string length: **85 characters**

**EXPLAIN ANALYZE results** (`pg_trgm.similarity_threshold = 0.5`):

| String | Length | Trigram candidates | Execution time |
|--------|--------|--------------------|----------------|
| "test string" | 11 chars | 260,406 | 7,766 ms |
| "You have completed the quest" | 28 chars | 157,210 | 3,046 ms |
| "Aldmeri Dominion" | 16 chars | 24,298 | 189 ms |

Short strings are up to **40× slower** than the best case (7,766 ms vs. 189 ms above), and execution time tracks the number of trigram candidates. At a 50% similarity threshold, a string of about 10 characters needs very few matching trigrams to qualify, so hundreds of thousands of false positives are rechecked against the heap.
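To illustrate how few trigrams a short string has, here is a minimal Python sketch that approximates the trigram extraction and similarity described in the `pg_trgm` documentation (lowercase, split on non-alphanumeric characters, pad each word with two leading spaces and one trailing space). The helper names `trigrams` and `similarity` are made up for this example, and the real extension may differ in details such as multibyte handling.

```python
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm's trigram set for a string (illustrative only)."""
    result: set[str] = set()
    # Words are runs of alphanumeric characters; everything else separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    for sample in ("test string", "You have completed the quest", "Aldmeri Dominion"):
        print(f"{sample!r}: {len(trigrams(sample))} distinct trigrams")
```

With so few distinct trigrams per query, common ones such as ` te` or `ing` appear in a very large share of the 3.4M rows, which is consistent with the candidate counts in the table.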

Workarounds attempted with no improvement:
- `work_mem` increased to 128 MB
- `REINDEX CONCURRENTLY memory_source_trgm`
- `VACUUM ANALYZE memory_memory`
- `effective_io_concurrency = 200`

### Solution brainstorm

Allow configuring similarity threshold as a function of source string length in `settings.py`:

**Option A: length-based tiers**
```python
SIMILARITY_THRESHOLD_SHORT = 0.85   # strings < 30 chars
SIMILARITY_THRESHOLD_MEDIUM = 0.65  # strings 30–100 chars
SIMILARITY_THRESHOLD_LONG = 0.50    # strings > 100 chars
```

**Option B: formula-based**
```python
# threshold = max(MIN_SIMILARITY, 0.95 - (length / 300))
```

This way:
- Short strings (UI labels, "Yes"/"No", button names) → high threshold (80–90%); only near-exact matches are useful anyway
- Long strings (quest descriptions, dialogue) → lower threshold (50–60%); useful to catch sentences with a few changed words
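As a rough sketch of how the two options above could be evaluated per lookup, the following Python functions map a source string length to a threshold. The setting names are the ones proposed in this issue, not existing Weblate settings, and the function names and the `MIN_SIMILARITY = 0.50` floor are assumptions made for illustration.

```python
# Proposed settings from Option A (hypothetical, not existing Weblate settings).
SIMILARITY_THRESHOLD_SHORT = 0.85   # strings < 30 chars
SIMILARITY_THRESHOLD_MEDIUM = 0.65  # strings 30–100 chars
SIMILARITY_THRESHOLD_LONG = 0.50    # strings > 100 chars

# Assumed floor for Option B; the issue does not fix a value.
MIN_SIMILARITY = 0.50


def tiered_threshold(length: int) -> float:
    """Option A: pick a threshold from three length tiers."""
    if length < 30:
        return SIMILARITY_THRESHOLD_SHORT
    if length <= 100:
        return SIMILARITY_THRESHOLD_MEDIUM
    return SIMILARITY_THRESHOLD_LONG


def formula_threshold(length: int, min_similarity: float = MIN_SIMILARITY) -> float:
    """Option B: threshold falls linearly with length, down to a floor."""
    return max(min_similarity, 0.95 - length / 300)


if __name__ == "__main__":
    for sample in ("test string", "You have completed the quest", "Aldmeri Dominion"):
        n = len(sample)
        print(f"{n:>3} chars: tiers={tiered_threshold(n):.2f} formula={formula_threshold(n):.2f}")
```

With these numbers, the formula reaches the 0.50 floor at 135 characters, so Option B behaves like the long tier of Option A for anything beyond that length.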
### Describe alternatives you have considered

- Raising the global `MINIMUM_SIMILARITY` threshold uniformly: degrades quality for long strings, where lower similarity is genuinely useful (catching rephrased sentences).
- PostgreSQL partial indexes per length range: unreliable, because the PostgreSQL optimizer does not guarantee which index is chosen.
- Disabling shared Translation Memory: not acceptable; we intentionally share TM across game localization projects that share significant vocabulary.

### Screenshots

_No response_

### Additional context

_No response_
