feat: infer stream language from TMDB translations #482

worldInColors · 2025-11-10T17:01:40Z

Compare unknown language streams against TMDB title translations
Prioritize explicit language tags in filenames
Handle multi-language and ambiguous titles appropriately
Resolves feature request: Infer Stream Language from Filename #436

Summary by CodeRabbit

New Features
- Metadata now includes language-tagged title information for improved multilingual content handling.
- Automatic language detection system that infers stream languages from alternative titles when primary language data is unavailable, enhancing accuracy for international content.
Improvements
- Extended metadata architecture to support language-aware title associations throughout the system.

coderabbitai · 2025-11-10T17:01:52Z

Walkthrough

This PR extends the metadata pipeline to carry language-tagged titles from TMDB and introduces language inference logic that uses these titles to populate missing language information for streams whose filenames match known translations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Type definitions `packages/core/src/metadata/utils.ts` | Added new `TitleWithLanguage` interface with title, ISO 639-1, optional ISO 3166-1, and english_name fields; extended `Metadata` interface to include optional `titlesWithLanguages` property. |
| TMDB metadata fetching `packages/core/src/metadata/tmdb.ts` | Updated `fetchAlternativeTitles` and `fetchTranslatedTitles` to return objects containing both plain titles and `titlesWithLanguages` arrays; extended `getMetadata` to aggregate and propagate language-tagged titles into returned metadata. |
| Language inference engine `packages/core/src/utils/languages.ts` | Introduced new exported `inferLanguageFromTitle` function that matches filenames against TMDB title-language mappings, handling single-language matches, multi-language collisions (returns "Multi"), and regional variants. |
| Stream language population `packages/core/src/streams/filterer.ts` | Added language inference pass that populates stream languages for streams with empty or Unknown language using TMDB-derived `titlesWithLanguages` and the new inference function; logs per-stream and aggregated results. |
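
To make the table above concrete, here is a minimal sketch of how a language-tagged title list could drive inference. It is not the PR's implementation: the field names `iso_639_1` and `iso_3166_1` are assumed from the summary, `inferLanguageSketch` is a hypothetical stand-in for `inferLanguageFromTitle`, and plain substring matching is an assumption about how filenames are compared.

```typescript
// Assumed shape; only `title` and `english_name` are named verbatim in the summary.
interface TitleWithLanguage {
  title: string;
  iso_639_1: string; // assumed field name for the ISO 639-1 code
  iso_3166_1?: string; // assumed field name for the optional ISO 3166-1 code
  english_name: string;
}

// Lowercase, strip accents and punctuation, collapse whitespace.
function normalizeTitle(title: string): string {
  return title
    .normalize('NFD')
    .replace(/\p{M}/gu, '')
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical stand-in for inferLanguageFromTitle.
function inferLanguageSketch(
  filename: string,
  titles: TitleWithLanguage[],
): string | undefined {
  const haystack = normalizeTitle(filename);
  const matched = new Set<string>();
  for (const t of titles) {
    const needle = normalizeTitle(t.title);
    // Keying on english_name collapses regional variants (e.g. fr-FR and fr-CA).
    if (needle.length > 0 && haystack.includes(needle)) {
      matched.add(t.english_name);
    }
  }
  if (matched.size === 0) return undefined; // no match: caller keeps existing value
  if (matched.size === 1) return Array.from(matched)[0];
  return 'Multi'; // titles from more than one language matched
}

const sampleTitles: TitleWithLanguage[] = [
  { title: "Le Fabuleux Destin d'Am\u00e9lie Poulain", iso_639_1: 'fr', iso_3166_1: 'FR', english_name: 'French' },
  { title: "Le Fabuleux Destin d'Am\u00e9lie Poulain", iso_639_1: 'fr', iso_3166_1: 'CA', english_name: 'French' },
  { title: 'Die fabelhafte Welt der Am\u00e9lie', iso_639_1: 'de', english_name: 'German' },
];

const singleMatch = inferLanguageSketch(
  'Le.Fabuleux.Destin.d.Amelie.Poulain.2001.1080p.mkv',
  sampleTitles,
);
const noMatch = inferLanguageSketch('Some.Other.Movie.2001.mkv', sampleTitles);
```

Substring matching of this kind can produce false positives for very short titles, so a real implementation would likely need a minimum length or word-boundary check.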

Sequence Diagram

sequenceDiagram
    participant Filterer as Stream Filterer
    participant TMDB as TMDB Metadata
    participant Infer as Language Inferrer
    participant Stream as Stream Object

    Filterer->>TMDB: getMetadata() with titlesWithLanguages
    TMDB-->>Filterer: Metadata with titlesWithLanguages[]
    
    loop For each stream with Unknown/empty language
        Filterer->>Stream: Check parsedFile.languages
        alt Language is Unknown or empty
            Filterer->>Infer: inferLanguageFromTitle(filename, titlesWithLanguages)
            Infer->>Infer: Normalize filename and titles
            Infer->>Infer: Match filename against titles
            alt Single language match
                Infer-->>Filterer: Language name (mapped from ISO code)
            else Multiple language matches
                Infer-->>Filterer: "Multi"
            else No match
                Infer-->>Filterer: Existing language or Unknown
            end
            Filterer->>Stream: Update parsedFile.languages with inferred value
            Filterer->>Filterer: Increment inferred count
        end
    end
    
    Filterer->>Filterer: Log aggregated inference results
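
The loop in the diagram above can be sketched roughly as follows. The `StreamSketch` shape, the `Inferrer` callback, and `populateMissingLanguages` are hypothetical names for illustration; the actual filterer code is not shown in this conversation, and skipping streams when nothing is inferred is an assumption.

```typescript
// Hypothetical minimal stream shape, modelled on the parsedFile.languages field in the diagram.
interface StreamSketch {
  filename?: string;
  parsedFile?: { languages: string[] };
}

// Stand-in for a call such as inferLanguageFromTitle(filename, titlesWithLanguages).
type Inferrer = (filename: string) => string | undefined;

// Fills in languages only for streams whose language is empty or Unknown.
// Returns the number of streams that were updated, for aggregated logging.
function populateMissingLanguages(streams: StreamSketch[], infer: Inferrer): number {
  let inferredCount = 0;
  for (const stream of streams) {
    const langs = stream.parsedFile?.languages ?? [];
    const isUnknown = langs.length === 0 || langs.every((l) => l === 'Unknown');
    if (!isUnknown || !stream.filename) continue; // explicit language tags win

    const inferred = infer(stream.filename);
    if (!inferred) continue; // no match: leave the existing value untouched

    if (!stream.parsedFile) stream.parsedFile = { languages: [] };
    stream.parsedFile.languages = [inferred];
    inferredCount++;
  }
  return inferredCount;
}

const demoStreams: StreamSketch[] = [
  { filename: 'a.mkv', parsedFile: { languages: ['Unknown'] } },
  { filename: 'b.mkv', parsedFile: { languages: ['English'] } },
  { filename: 'c.mkv', parsedFile: { languages: [] } },
];
const demoCount = populateMissingLanguages(demoStreams, (f) =>
  f === 'a.mkv' ? 'French' : undefined,
);
```

Passing the inferrer as a callback keeps the sketch independent of the metadata types; in the PR the filterer calls the inference function directly.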

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Language inference logic (languages.ts): Requires careful review of filename/title normalisation, matching logic, and handling of edge cases (multi-language, regional variants, ambiguous titles).
TMDB integration (tmdb.ts): Verify return type consistency across fetchAlternativeTitles and fetchTranslatedTitles, and aggregation logic in getMetadata.
Stream filterer application (filterer.ts): Confirm inference is only applied when language is Unknown/empty and that logging accurately reflects operations.

Poem

🐰 A rabbit hops through translated tongues,
Matching filenames to TMDB's songs,
French titles bloom, Spanish arise—
Unknown streams now meet knowing eyes!
Multi-lingual dreams, no more in disguise! ✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarises the main feature: inferring stream language from TMDB translations, which is the core change across all modified files. |
| Linked Issues check | ✅ Passed | The pull request implements all primary coding requirements from issue #436: language inference from titles, fallback handling, multi-language detection, regional variant consolidation, and ambiguous title filtering. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the language inference feature: data structures in utils.ts, TMDB metadata enrichment in tmdb.ts, filterer integration, and language utility functions. No unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8184442 and 2e7fbf3.

📒 Files selected for processing (4)

packages/core/src/metadata/tmdb.ts (6 hunks)
packages/core/src/metadata/utils.ts (1 hunks)
packages/core/src/streams/filterer.ts (1 hunks)
packages/core/src/utils/languages.ts (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

packages/core/src/streams/filterer.ts (1)

packages/core/src/utils/languages.ts (1)

inferLanguageFromTitle (2018-2103)

packages/core/src/metadata/tmdb.ts (1)

packages/core/src/metadata/utils.ts (1)

TitleWithLanguage (1-6)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: test

coderabbitai · 2025-11-10T17:07:01Z

packages/core/src/utils/languages.ts

+  return title
+    .toLowerCase()
+    .replace(/[^\p{L}\p{N}]/gu, ' ')
+    .replace(/\s+/g, ' ')
+    .trim();


⚠️ Potential issue | 🟠 Major

Strip accents when normalising titles

normalizeTitle lowercases and strips punctuation, but it keeps diacritic marks. As a result, a filename such as Amelie.2001.mkv normalises to amelie 2001 mkv, while the TMDB translation Amélie normalises to amélie, so the subsequent includes test never matches and the language is not inferred. This affects many non-English titles (French, Spanish, Portuguese, etc.) and undermines the feature. Please strip combining marks before the comparison.

Apply this diff to fix it:

 function normalizeTitle(title: string): string {
-  return title
-    .toLowerCase()
+  return title
+    .normalize('NFD')
+    .replace(/[\u0300-\u036f]/g, '')
+    .toLowerCase()
     .replace(/[^\p{L}\p{N}]/gu, ' ')
     .replace(/\s+/g, ' ')
     .trim();
 }

🤖 Prompt for AI Agents

In packages/core/src/utils/languages.ts around lines 2003 to 2007, normalizeTitle currently lowercases and removes punctuation but leaves diacritic marks; update the normalization chain to first apply Unicode NFD decomposition and then strip all combining marks (remove \p{M} characters) before continuing with the existing punctuation/whitespace replaces so accents are removed (e.g. use .normalize('NFD') followed by .replace(/\p{M}/gu, '') then the current .replace(/[^\p{L}\p{N}]/gu, ' ').replace(/\s+/g, ' ').trim()).
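
The effect of the proposed change can be checked with a small side-by-side sketch. The two function names below are hypothetical; the first mirrors the chain in the reviewed excerpt and the second adds the NFD and `\p{M}` steps described in the prompt above.

```typescript
// Mirrors the normalisation chain from the reviewed code (accents are kept).
function normalizeWithoutAccentStripping(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Adds canonical decomposition and removal of combining marks first.
function normalizeWithAccentStripping(title: string): string {
  return title
    .normalize('NFD') // split "é" into "e" + U+0301
    .replace(/\p{M}/gu, '') // drop all combining marks
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

const filename = 'Amelie.2001.mkv';
const tmdbTitle = 'Am\u00e9lie'; // "Amélie" with a precomposed é

// Without stripping, the accented title is not found in the filename.
const matchesBefore = normalizeWithoutAccentStripping(filename).includes(
  normalizeWithoutAccentStripping(tmdbTitle),
);

// With stripping, both sides reduce to plain ASCII letters and the match succeeds.
const matchesAfter = normalizeWithAccentStripping(filename).includes(
  normalizeWithAccentStripping(tmdbTitle),
);
```

Note that NFD only helps for characters with a canonical decomposition; letters such as ø or ł have none, so they pass through unchanged and would need a separate mapping if that matters.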

feat: infer stream language from TMDB translations

2e7fbf3
