A repository of aligned subtitles from OpenSubtitles for use as development and test data in machine translation. Subtitles from 2024 have been reserved as held-out data for the development and test sets.
The essential resources are the following:
- Bilingual test sets: OpenSubtitles2024-testset.zip
- Bilingual development sets: OpenSubtitles2024-devset.zip
- Multilingual test set: ar-bg-cs-da-de-el-en-es-et-fa-fi-fr-he-hi-hr-hu-id-it-ko-lt-lv-ms-nl-no-pl-pt-pt_BR-ro-ru-sk-sl-sr-sv-ta-te-tr-uk-vi-zh_CN-zh_TW.zip
Note: The zipfiles of the aligned plain text files are password-protected to prevent crawlers from including the testsets (at least in this aligned form) in potential training data. The password is the file name without the .zip extension. For the multilingual test set, the password is OpenSubtitles2024-multiset.
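The password rule in the note can be sketched in Python as follows. The names `zip_password` and `extract_protected` are illustrative and not part of this repository. The sketch also assumes that the archives use the legacy ZIP encryption that Python's `zipfile` module can decrypt; if they use a different scheme, an external unzip tool is needed instead.

```python
import zipfile
from pathlib import Path

# Password for the multilingual test set, as stated in the note above.
MULTISET_PASSWORD = "OpenSubtitles2024-multiset"


def zip_password(zip_name: str, multilingual: bool = False) -> bytes:
    """Return the archive password following the rule in the note."""
    if multilingual:
        return MULTISET_PASSWORD.encode("utf-8")
    # Bilingual archives: the file name without the .zip extension.
    return Path(zip_name).stem.encode("utf-8")


def extract_protected(zip_path: str, out_dir: str, multilingual: bool = False) -> None:
    """Extract a password-protected archive into out_dir (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir, pwd=zip_password(zip_path, multilingual))
```

For example, `extract_protected("OpenSubtitles2024-testset.zip", "testset")` would try to unpack the bilingual test sets into a local `testset` directory.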
Some more details about the data sets and how they have been created are given below.
Alignment scores are computed for each pair of subtitles from a movie/series-episode. This score gives the proportion of non-empty alignments, assuming that subtitles that align without gaps are better aligned than subtitle pairs with many empty sentence alignments (i.e. text that does not have a corresponding translation with an overlapping time slot). This score is used to select high-quality subtitles for the test sets below.
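A minimal sketch of such a score, assuming each alignment unit is represented as a pair of source and target sentence ID lists and that an alignment counts as empty when either side has no sentences. The representation, the function name `alignment_score`, and the value returned for an empty input are assumptions for illustration; the scoring actually used to build the data sets may differ in detail.

```python
from typing import Sequence, Tuple

# One alignment unit: (source sentence IDs, target sentence IDs).
Alignment = Tuple[Sequence[str], Sequence[str]]


def alignment_score(alignments: Sequence[Alignment]) -> float:
    """Proportion of alignment units that have text on both sides."""
    if not alignments:
        return 0.0  # assumption: no alignments means no evidence of quality
    non_empty = sum(1 for src, tgt in alignments if src and tgt)
    return non_empty / len(alignments)


# Two of the four units below have an empty side, so the score is 0.5.
example = [(["1"], ["1"]), (["2", "3"], ["2"]), (["4"], []), ([], ["3"])]
print(alignment_score(example))  # 0.5
```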
Bilingual testsets are not multiparallel (i.e. they do not cover the same movies for each language pair) and have been extracted to include at least one and at most 5 movies/series-episodes per language pair. Alignment scores need to be above 0.8, and the movies with the best alignment scores are selected.
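The selection rule for one language pair could look roughly like the sketch below. The function name `select_test_movies`, the dictionary input, and the tie-breaking by movie ID are assumptions; how a language pair with no subtitle pair above the threshold is handled is not specified here, so the sketch simply returns an empty list in that case.

```python
from typing import Dict, List


def select_test_movies(
    scores: Dict[str, float], threshold: float = 0.8, max_movies: int = 5
) -> List[str]:
    """Pick up to max_movies movie IDs with the best scores above threshold."""
    eligible = [(score, movie) for movie, score in scores.items() if score > threshold]
    # Best score first; ties are broken by movie ID (an arbitrary choice).
    eligible.sort(key=lambda item: (-item[0], item[1]))
    return [movie for _, movie in eligible[:max_movies]]


# Hypothetical movie IDs and scores for a single language pair.
scores = {"a": 0.95, "b": 0.70, "c": 0.85, "d": 0.99,
          "e": 0.81, "f": 0.90, "g": 0.88, "h": 0.80}
print(select_test_movies(scores))  # ['d', 'a', 'f', 'g', 'c']
```

All remaining subtitle pairs would then be candidates for the development data.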
- Testsets in aligned plain text format: Zipfile of all aligned plain text files with sentences on corresponding lines (Moses format).
- Testset Sentence Alignment in XML: Sentence alignments as standoff annotation in XCES Align format (xx-yy.xml.gz files with xx and yy being source and target language IDs; xx-yy.xml.gz.scores list alignment scores for the selected subtitle pairs)
- Subtitle XML files (untokenized): Subtitle files in XML format (xx.zip with xx being a language ID). Files can be downloaded from https://object.pouta.csc.fi/OPUS-OpenSubtitle-devtest/devtest-raw/xx.zip (replacing xx with the language ID of interest)
- Subtitle XML files (tokenized): Tokenized subtitle files in XML format (xx.zip with xx being a language ID). Files can be downloaded from https://object.pouta.csc.fi/OPUS-OpenSubtitle-devtest/devtest-xml/xx.zip (replacing xx with the language ID of interest)
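The download URLs for the subtitle XML archives follow a fixed pattern, which the following Python sketch builds from a language ID. The helper names `subtitle_zip_url` and `download_subtitles` are illustrative; only the URL pattern itself is taken from the list above.

```python
import urllib.request

BASE_URL = "https://object.pouta.csc.fi/OPUS-OpenSubtitle-devtest"


def subtitle_zip_url(lang: str, tokenized: bool = False) -> str:
    """Build the download URL for the subtitle XML archive of one language."""
    subdir = "devtest-xml" if tokenized else "devtest-raw"
    return f"{BASE_URL}/{subdir}/{lang}.zip"


def download_subtitles(lang: str, tokenized: bool = False) -> str:
    """Download the archive to the current directory and return its file name."""
    target = f"{lang}.zip"
    urllib.request.urlretrieve(subtitle_zip_url(lang, tokenized), target)
    return target
```

For example, `subtitle_zip_url("de", tokenized=True)` yields the URL of the tokenized German subtitle archive.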
Non-selected subtitle files are available as aligned development data:
- Devsets in aligned plain text format: Zipfile of all aligned plain text files with sentences on corresponding lines (Moses format).
- Devset Sentence Alignment in XML: Sentence alignments as standoff annotation in XCES Align format (xx-yy.xml.gz files with xx and yy being source and target language IDs; xx-yy.xml.gz.scores list alignment scores for the selected subtitle pairs)
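The standoff alignment files can be read with a few lines of Python. The sketch below assumes the XCES Align layout commonly used in OPUS, in which `linkGrp` elements carry `fromDoc` and `toDoc` attributes and each `link` element has an `xtargets` attribute with space-separated source and target sentence IDs divided by a semicolon. The function names and the document paths in the sample are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from typing import IO, Iterator, List, Tuple

# (source document, target document, source sentence IDs, target sentence IDs)
Link = Tuple[str, str, List[str], List[str]]


def read_links(stream: IO[bytes]) -> Iterator[Link]:
    """Yield one tuple per sentence alignment in an XCES Align document."""
    tree = ET.parse(stream)
    for group in tree.getroot().iter("linkGrp"):
        from_doc = group.get("fromDoc", "")
        to_doc = group.get("toDoc", "")
        for link in group.iter("link"):
            src, _, tgt = link.get("xtargets", ";").partition(";")
            # An empty side means the sentence has no aligned counterpart.
            yield from_doc, to_doc, src.split(), tgt.split()


def read_links_gz(path: str) -> List[Link]:
    """Read all links from a gzip-compressed file such as xx-yy.xml.gz."""
    with gzip.open(path, "rb") as stream:
        return list(read_links(stream))


# Small hand-written sample with hypothetical document paths.
SAMPLE = b"""<cesAlign version="1.0">
<linkGrp targType="s" fromDoc="en/1.xml.gz" toDoc="de/2.xml.gz">
<link id="SL1" xtargets="1;1" />
<link id="SL2" xtargets="2 3;2" />
<link id="SL3" xtargets="4;" />
</linkGrp>
</cesAlign>"""
links = list(read_links(io.BytesIO(SAMPLE)))
```

The sentence IDs refer to sentences in the language-specific subtitle XML files, so the text itself has to be looked up there.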
The subtitles in XML format are all included in the language-specific zip-files (see testsets above).
Note: The zipfiles of the aligned plain text files are password-protected to prevent crawlers from including these data sets (at least in this aligned form) in potential training data. The password is the file name without the .zip extension.
Multilingual testsets correspond to sets of multi-way parallel test data in which all subtitles are covered for all selected movies/series-episodes in all languages included in the testset. The alignments are entirely synchronized across all languages involved. We extracted a dataset that covers 40 languages and language variants and a selection of 16 subtitle files:
The zip-file contains sentence alignment files in standoff XCES Align annotation (langset/movieID/xx-yy.xml with langset replaced by the set of languages in the set, movieID referring to the movie/series that is covered by the subtitles, and xx and yy referring to source and target language codes) and aligned plain text files for each movie/series in the testset. The languages included in the data set are: ar bg cs da de el en es et fa fi fr he hi hr hu id it ko lt lv ms nl no pl pt pt_BR ro ru sk sl sr sv ta te tr uk vi zh_CN zh_TW
The plain text files are aligned across all languages in the testset, with corresponding text on identical lines in each subtitle file.
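Because corresponding text sits on identical lines, the plain text files can be read in lockstep. The following sketch takes one iterable of lines per language (for example, open file objects) and yields one dictionary per aligned line. The function name `iter_parallel_rows` is illustrative, and the file naming inside the archive is not assumed here, so the caller has to supply the files.

```python
from itertools import zip_longest
from typing import Dict, Iterable, Iterator


def iter_parallel_rows(columns: Dict[str, Iterable[str]]) -> Iterator[Dict[str, str]]:
    """Yield {language: sentence} for each line shared across all languages."""
    langs = list(columns)
    for row in zip_longest(*(columns[lang] for lang in langs)):
        # zip_longest pads with None when one input ends early.
        if any(line is None for line in row):
            raise ValueError("input files differ in their number of lines")
        yield {lang: line.rstrip("\n") for lang, line in zip(langs, row)}


# In-memory example; with real data, pass open file objects instead of lists.
rows = list(iter_parallel_rows({
    "en": ["Hello.\n", "Goodbye.\n"],
    "de": ["Hallo.\n", "Tschüss.\n"],
}))
print(rows[0])  # {'en': 'Hello.', 'de': 'Hallo.'}
```

Raising an error on a length mismatch is a design choice: in multi-way parallel data a missing line would silently shift every later alignment.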
Note: The zipfiles are password-protected to prevent crawlers from including the testsets (at least in this aligned form) in potential training data. The password is OpenSubtitles2024-multiset.
Besides this selected test set, we also provide alternative sets that have been extracted from OpenSubtitles2024. Those test sets differ in language and subtitle coverage and are based on different alignment thresholds. All download links are available from the following sub pages:
- linksets with alignment threshold 0.9
- linksets with alignment threshold 0.8
- linksets with no alignment threshold
The datasets have been extracted with alignment thresholds 0.8, 0.9 and no alignment threshold (= all). Each dataset is distributed in a separate zipfile.