OPUS - OpenSubtitles development and testdata

A repository of aligned subtitles from OpenSubtitles to be used as development and test data in machine translation. Subtitles from 2024 have been reserved to be heldout data for development and test data.

The essential resources are the following:

Bilingual test sets: OpenSubtitles2024-testset.zip
Bilingual development sets: OpenSubtitles2024-devset.zip
Multilingual test set: ar-bg-cs-da-de-el-en-es-et-fa-fi-fr-he-hi-hr-hu-id-it-ko-lt-lv-ms-nl-no-pl-pt-pt_BR-ro-ru-sk-sl-sr-sv-ta-te-tr-uk-vi-zh_CN-zh_TW.zip

Note: The zipfiles of the aligned plain text files are password protected to avoid crawlers to include the testsets (at least in this aligned form) in potential training data. The password is the same as the file name without the file extension .zip. For the multilingual test set the password is OpenSubtitles2024-multiset.

Some more details about the data sets and how they have been created are given below.

Alignment scores

Alignment scores are computed for each pair of subtitles from a movie/series-episode. Ths score gives the proportion of non-empty alignments assuming that subtitles that align without gaps are better aligned than subtitle pairs with a lot of empty sentence alignments (i.e. text that does not have a corresponding translation with an overlapping time slot). This score is used to select high-quality subsitles in the test sets below.

Bilingual testsets

Bilingual testsets are not multiparallel (i.e. do not cover the same movies for each language pair) and have been extracted to include at least one movie/series-episode and at most 5 movies/series-episodes per language pair. Alignment scores need to be above 0.8 and the movies are selected to have the best alignment score.

Testsets in aligned plain text format: Zipfile of all aligned plain text files with sentences on corresponding lines (Moses format).
Testset Sentence Alignment in XML: Sentence alignments as standoff annotation in XCES Align format (xx-yy.xml.gz files with xx and yy being source and target language ID's; xx-yy.xml.gz.scores list alignment scores for the selected subtitle pairs)
Subtitle XML files (untokenized): Subtitle files in XML format (xx.zip with xx being a language ID); Files can be downloaded from https://object.pouta.csc.fi/OPUS-OpenSubtitle-devtest/devtest-raw/xx.zip (replacing xx with the language ID of interest)
Subtitle XML files (tokenized): Tokenized subtitle files in XML format (xx.zip with xx being a language ID) Files can be downloaded from https://object.pouta.csc.fi/OPUS-OpenSubtitle-devtest/devtest-xml/xx.zip (replacing xx with the language ID of interest)

Non-selected subtitle files are available as aligned development data:

Devsets in aligned plain text format: Zipfile of all aligned plain text files with sentences on corresponding lines (Moses format).
Devset Sentence Alignment in XML: Sentence alignments as standoff annotation in XCES Align format (xx-yy.xml.gz files with xx and yy being source and target language ID's; xx-yy.xml.gz.scores list alignment scores for the selected subtitle pairs)

The subtitles in XML format are all includes in the language-specific zip-files (see testsets above)

Note: The zipfiles of the aligned plain text files are password protected to avoid crawlers to include the testsets (at least in this aligned form) in potential training data. The password is the same as the file name without the file extension .zip.

Multilingual testsets

Multilingual testsets corersponds to sets of multi-way parallel test data in which all subtitles are covered for all selected movies/series-episodes for all languages included in the testset. The alignments are entirely synchronized across all languages involved. We extracted a dataset that covers 40 languages and language variants and a selection of 16 subtitle files:

opensubtitles2024-multitest

The zip-file contains sentence alignment files in standoff XCES Align annotation (langset/movieID/xx-yy.xml with langset replaced by the set of languages in the set, movieID referring to the movie/series that is covered by the substitles, and xx and yy referring to source and target language codes) and aligned plain text files for each movie/series in the testset. The languages included in the data set are: ar bg cs da de el en es et fa fi fr he hi hr hu id it ko lt lv ms nl no pl pt pt_BR ro ru sk sl sr sv ta te tr uk vi zh_CN zh_TW

The plain text files are aligned across all languages in the text with corresponding text on identical lines in each subtitle file.

Note: The zipfiles are password protected to avoid crawlers to include the testsets (at least in this aligned form) in potential training data. The password is OpenSubtitles2024-multiset.

Besides of this selected test set, we also provide alternative sets that have been extracted from OpenSubtitles2024. Those test sets have different kinds of language and subtitle coverage and are also based on different alignment thresholds. All download links are available from the following sub pages:

The datasets have been extracted with alignment thresholds 0.8, 0.9 and no alignment threshold (= all). Each dataset is distributed in a separate zipfile.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
devtest-raw		devtest-raw
devtest-xml		devtest-xml
scripts		scripts
Makefile		Makefile
Makefile.def		Makefile.def
Makefile.submit		Makefile.submit
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

OPUS - OpenSubtitles development and testdata

Alignment scores

Bilingual testsets

Multilingual testsets

About

Uh oh!

Releases

Packages

Languages

Helsinki-NLP/OpenSubtitles-devtest

Folders and files

Latest commit

History

Repository files navigation

OPUS - OpenSubtitles development and testdata

Alignment scores

Bilingual testsets

Multilingual testsets

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages