m2m-100 Model custom first token #97

@frederik-uni

Description

A solution would be to remove the tokenizer from the Translator and pass it as an argument to translate_batch and translate_batch_with_target_prefix instead.

Something like this would be usable. Wrapping from in an Arc<Mutex<_>> would also be possible, but is not optimal if translations run at the same time.

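// Wrapper that prepends the source-language token (e.g. "__en__") to every input.
// The struct layout is assumed here; the inner tokenizer can be any Tokenizer.
struct MyTokenizer<T: Tokenizer> {
    tokenizer: T,
    from: String,
}

impl<T: Tokenizer> Tokenizer for MyTokenizer<T> {
    fn encode(&self, input: &str) -> anyhow::Result<Vec<String>> {
        let mut encoded = self.tokenizer.encode(input)?;
        // M2M-100 expects the source-language token as the first token.
        encoded.insert(0, self.from.clone());
        Ok(encoded)
    }

    // Decoding is delegated unchanged to the inner tokenizer.
    fn decode(&self, tokens: Vec<String>) -> anyhow::Result<String> {
        self.tokenizer.decode(tokens)
    }
}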
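
To make the proposal concrete, here is a minimal sketch of a translator that takes the tokenizer per call, so each call can carry its own source-language token without any shared mutable state. Everything in it is hypothetical and only mirrors the shape of the snippet above: the Tokenizer trait is a local stand-in that uses a plain String error to stay dependency-free, Whitespace is a toy tokenizer, WithLang is a borrowed variant of the wrapper, and Translator is a mock that echoes its input instead of running a model.

```rust
/// Local stand-in for the crate's tokenizer trait (hypothetical, std-only errors).
pub trait Tokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>, String>;
    fn decode(&self, tokens: Vec<String>) -> Result<String, String>;
}

/// Toy whitespace tokenizer, for illustration only.
pub struct Whitespace;

impl Tokenizer for Whitespace {
    fn encode(&self, input: &str) -> Result<Vec<String>, String> {
        Ok(input.split_whitespace().map(str::to_owned).collect())
    }

    fn decode(&self, tokens: Vec<String>) -> Result<String, String> {
        Ok(tokens.join(" "))
    }
}

/// Borrows an inner tokenizer and prepends a fixed source-language token.
/// It holds only references, so building one per call is cheap.
pub struct WithLang<'a, T: Tokenizer> {
    pub inner: &'a T,
    pub from: &'a str,
}

impl<'a, T: Tokenizer> Tokenizer for WithLang<'a, T> {
    fn encode(&self, input: &str) -> Result<Vec<String>, String> {
        let mut tokens = self.inner.encode(input)?;
        tokens.insert(0, self.from.to_owned());
        Ok(tokens)
    }

    fn decode(&self, tokens: Vec<String>) -> Result<String, String> {
        self.inner.decode(tokens)
    }
}

/// Hypothetical translator that does not own a tokenizer.
pub struct Translator;

impl Translator {
    /// The tokenizer is an argument, so concurrent calls can use different ones.
    pub fn translate_batch<T: Tokenizer>(
        &self,
        tokenizer: &T,
        sources: &[&str],
    ) -> Result<Vec<String>, String> {
        sources
            .iter()
            .map(|&source| {
                let tokens = tokenizer.encode(source)?;
                // A real implementation would run the model here; this mock echoes the tokens.
                tokenizer.decode(tokens)
            })
            .collect()
    }
}
```

With this shape, two calls on the same Translator can use WithLang values with different from tokens, and neither call needs a lock.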
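
For comparison, this is how it works with the Python API, where the language tokens are added by hand:

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.128k.model")

# Prepend the source-language token to the tokenized input.
source = ["__en__"] + sp.encode("Hello world!", out_type=str)
# Force the output language by passing its token as the target prefix.
target_prefix = ["__de__"]

translator = ctranslate2.Translator("m2m_100_418m_ct2")
result = translator.translate_batch([source], target_prefix=[target_prefix])

# Drop the leading target-language token before detokenizing.
output = sp.decode(result[0].hypotheses[0][1:])
print(output)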
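
The token handling around the model call is plain list manipulation and could look like this in Rust. This is a sketch only: lang_token, with_source_lang and strip_target_prefix are hypothetical helper names, not part of any crate, and the "__xx__" format is taken from the tokens used in this issue.

```rust
/// Builds a language token in the "__xx__" form, e.g. "en" -> "__en__".
pub fn lang_token(code: &str) -> String {
    format!("__{}__", code)
}

/// Prepends the source-language token to already tokenized input.
pub fn with_source_lang(mut tokens: Vec<String>, lang: &str) -> Vec<String> {
    tokens.insert(0, lang_token(lang));
    tokens
}

/// Drops the first token of a hypothesis if it is the expected target-language token.
/// Any other hypothesis is returned unchanged.
pub fn strip_target_prefix<'a>(hypothesis: &'a [String], lang: &str) -> &'a [String] {
    match hypothesis.first() {
        Some(first) if *first == lang_token(lang) => &hypothesis[1..],
        _ => hypothesis,
    }
}
```

Unlike the unconditional slice in the Python snippet, strip_target_prefix checks the first token before removing it, so a hypothesis that does not start with the language token is left intact.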
