feat: infer stream language from TMDB translations #482
Walkthrough
This PR extends the metadata pipeline to carry language-tagged titles from TMDB and introduces language inference logic that uses these titles to populate missing language information for streams whose filenames match known translations.
Sequence Diagram
sequenceDiagram
participant Filterer as Stream Filterer
participant TMDB as TMDB Metadata
participant Infer as Language Inferrer
participant Stream as Stream Object
Filterer->>TMDB: getMetadata() with titlesWithLanguages
TMDB-->>Filterer: Metadata with titlesWithLanguages[]
loop For each stream with Unknown/empty language
Filterer->>Stream: Check parsedFile.languages
alt Language is Unknown or empty
Filterer->>Infer: inferLanguageFromTitle(filename, titlesWithLanguages)
Infer->>Infer: Normalize filename and titles
Infer->>Infer: Match filename against titles
alt Single language match
Infer-->>Filterer: Language name (mapped from ISO code)
else Multiple language matches
Infer-->>Filterer: "Multi"
else No match
Infer-->>Filterer: Existing language or Unknown
end
Filterer->>Stream: Update parsedFile.languages with inferred value
Filterer->>Filterer: Increment inferred count
end
end
Filterer->>Filterer: Log aggregated inference results
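The matching branch of the diagram above can be sketched roughly as follows. This is an illustrative sketch, not the PR's implementation: the `TitleWithLanguageSketch` shape, the `LANGUAGE_NAMES` map, and the function names `normalizeSketch` and `inferLanguageSketch` are all assumptions made for the example, and the real `inferLanguageFromTitle` in packages/core/src/utils/languages.ts may differ in signature and matching rules.

```typescript
// Assumed shape; the real TitleWithLanguage in metadata/utils.ts may differ.
interface TitleWithLanguageSketch {
  title: string;
  language: string; // assumed to be an ISO 639-1 code such as "fr"
}

// Hypothetical ISO-code-to-name map, for illustration only.
const LANGUAGE_NAMES: Record<string, string> = {
  en: 'English',
  fr: 'French',
  es: 'Spanish',
};

// Lowercase, replace anything that is not a letter or digit with a space,
// then collapse runs of whitespace.
function normalizeSketch(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

function inferLanguageSketch(
  filename: string,
  titles: TitleWithLanguageSketch[],
  existing: string = 'Unknown'
): string {
  const normalizedFilename = normalizeSketch(filename);
  const matched = new Set<string>();

  // Collect every language whose translated title appears in the filename.
  for (const { title, language } of titles) {
    const normalizedTitle = normalizeSketch(title);
    if (normalizedTitle && normalizedFilename.includes(normalizedTitle)) {
      matched.add(language);
    }
  }

  if (matched.size === 0) return existing; // no match: keep what we had
  if (matched.size > 1) return 'Multi';    // several languages matched
  const code = Array.from(matched)[0] as string;
  return LANGUAGE_NAMES[code] ?? existing; // single match: map code to name
}
```

Under these assumptions, a filename that contains only the French title resolves to `French`, while one that contains titles from two languages resolves to `Multi`. Because the sketch uses plain substring matching, a short translated title contained in a longer one counts as a match for both languages.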
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- packages/core/src/metadata/tmdb.ts (6 hunks)
- packages/core/src/metadata/utils.ts (1 hunks)
- packages/core/src/streams/filterer.ts (1 hunks)
- packages/core/src/utils/languages.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
- packages/core/src/streams/filterer.ts (1)
  - packages/core/src/utils/languages.ts (1): inferLanguageFromTitle (2018-2103)
- packages/core/src/metadata/tmdb.ts (1)
  - packages/core/src/metadata/utils.ts (1): TitleWithLanguage (1-6)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim();
Strip accents when normalising titles
normalizeTitle lowercases and removes punctuation, but it keeps diacritic marks. As a result a filename such as Amelie.2001.mkv normalises to amelie, while the TMDB translation Amélie normalises to amélie, so the subsequent includes test never matches and we fail to infer the language. This hits a large share of non-English titles (French, Spanish, Portuguese, etc.), undermining the feature. Please strip combining marks before the comparison.
Apply this diff to fix it:
 function normalizeTitle(title: string): string {
-  return title
-    .toLowerCase()
+  return title
+    .normalize('NFD')
+    .replace(/[\u0300-\u036f]/g, '')
+    .toLowerCase()
     .replace(/[^\p{L}\p{N}]/gu, ' ')
     .replace(/\s+/g, ' ')
     .trim();
 }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-  return title
-    .toLowerCase()
-    .replace(/[^\p{L}\p{N}]/gu, ' ')
-    .replace(/\s+/g, ' ')
-    .trim();
+  return title
+    .normalize('NFD')
+    .replace(/[\u0300-\u036f]/g, '')
+    .toLowerCase()
+    .replace(/[^\p{L}\p{N}]/gu, ' ')
+    .replace(/\s+/g, ' ')
+    .trim();
🤖 Prompt for AI Agents
In packages/core/src/utils/languages.ts around lines 2003 to 2007,
normalizeTitle currently lowercases and removes punctuation but leaves diacritic
marks; update the normalization chain to first apply Unicode NFD decomposition
and then strip all combining marks (remove \p{M} characters) before continuing
with the existing punctuation/whitespace replaces so accents are removed (e.g.
use .normalize('NFD') followed by .replace(/\p{M}/gu, '') then the current
.replace(/[^\p{L}\p{N}]/gu, ' ').replace(/\s+/g, ' ').trim()).
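To make the mismatch concrete, here is a small self-contained sketch comparing the normalisation before and after accent stripping. The function names `normalizeKeepingAccents` and `normalizeStrippingAccents` are hypothetical and exist only for this example; the second follows the `\p{M}` variant described in the prompt above rather than the `[\u0300-\u036f]` range used in the diff.

```typescript
// Behaviour before the fix: lowercase and strip punctuation, keep accents.
function normalizeKeepingAccents(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Behaviour after the fix: decompose to NFD, drop combining marks,
// then apply the same lowercase/punctuation/whitespace steps.
function normalizeStrippingAccents(title: string): string {
  return title
    .normalize('NFD')
    .replace(/\p{M}/gu, '')
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// "\u00e9" is the precomposed "é", written as an escape so the example
// does not depend on how this file itself is encoded.
const tmdbTitle = 'Am\u00e9lie';
const filename = 'Amelie.2001.mkv';

// Without stripping, the accented title is not a substring of the filename.
const matchesBefore = normalizeKeepingAccents(filename).includes(
  normalizeKeepingAccents(tmdbTitle)
); // false

// With stripping, both sides reduce to plain "amelie".
const matchesAfter = normalizeStrippingAccents(filename).includes(
  normalizeStrippingAccents(tmdbTitle)
); // true
```

One caveat on this approach: NFD only separates marks that have a canonical decomposition, so letters such as "ø" pass through unchanged and would still need separate handling if filenames transliterate them.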