m2m-100 Model custom first token

A solution would be to remove the tokenizer from the `Translator` and pass it as an argument to `translate_batch` and `translate_batch_with_target_prefix` instead.
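A minimal, self-contained sketch of what that call shape could look like. Everything here is a mock for illustration: the `Tokenizer` trait is a local stand-in with a simplified `String` error type, and `Translator`, `WhitespaceTokenizer`, and the echo "translation" are hypothetical placeholders, not the crate's actual API.

```rust
/// Local stand-in for the tokenizer trait (error type simplified to `String`).
pub trait Tokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>, String>;
    fn decode(&self, tokens: Vec<String>) -> Result<String, String>;
}

/// Hypothetical translator that no longer owns a tokenizer.
pub struct Translator;

impl Translator {
    /// The tokenizer is supplied per call, so each request can bring its own.
    pub fn translate_batch<T: Tokenizer>(
        &self,
        tokenizer: &T,
        sources: &[&str],
    ) -> Result<Vec<String>, String> {
        sources
            .iter()
            .map(|s| {
                let tokens = tokenizer.encode(s)?;
                // Placeholder for the real model call: echo the tokens back.
                tokenizer.decode(tokens)
            })
            .collect()
    }
}

/// Toy tokenizer: splits on whitespace and prepends a language token.
pub struct WhitespaceTokenizer {
    pub from: String,
}

impl Tokenizer for WhitespaceTokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>, String> {
        let mut tokens = vec![self.from.clone()];
        tokens.extend(input.split_whitespace().map(String::from));
        Ok(tokens)
    }

    fn decode(&self, tokens: Vec<String>) -> Result<String, String> {
        Ok(tokens.join(" "))
    }
}
```

With this shape, one `Translator` can serve requests for different source languages by receiving a different tokenizer on each call.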

Something like the tokenizer below would be usable. Wrapping `from` in `Arc<Mutex<?>>` would also be possible, but is not optimal when translations run at the same time.

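```rs
impl Tokenizer for MyTokenizer {
    // Encode with the wrapped tokenizer, then prepend the source-language
    // token stored in `self.from` (e.g. `__en__`).
    fn encode(&self, input: &str) -> anyhow::Result<Vec<String>> {
        let mut encoded = self.tokenizer.encode(input)?;
        encoded.insert(0, self.from.clone());
        Ok(encoded)
    }

    // Decoding is delegated unchanged to the wrapped tokenizer.
    fn decode(&self, tokens: Vec<String>) -> anyhow::Result<String> {
        self.tokenizer.decode(tokens)
    }
}
```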
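To illustrate why a per-call tokenizer avoids the lock, here is a hedged sketch in which each request builds a cheap wrapper that borrows one shared base tokenizer and carries its own `from`. The trait is again a local stand-in with a `String` error type, and `BaseTokenizer`, `PrefixTokenizer`, and `encode_concurrently` are hypothetical names introduced only for this example.

```rust
use std::thread;

/// Local stand-in for the tokenizer trait (error type simplified to `String`).
pub trait Tokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>, String>;
    fn decode(&self, tokens: Vec<String>) -> Result<String, String>;
}

/// Toy base tokenizer shared by all requests.
pub struct BaseTokenizer;

impl Tokenizer for BaseTokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>, String> {
        Ok(input.split_whitespace().map(String::from).collect())
    }

    fn decode(&self, tokens: Vec<String>) -> Result<String, String> {
        Ok(tokens.join(" "))
    }
}

/// Per-request wrapper: borrows the shared tokenizer and holds its own `from`.
pub struct PrefixTokenizer<'a, T: Tokenizer> {
    pub tokenizer: &'a T,
    pub from: &'a str,
}

impl<T: Tokenizer> Tokenizer for PrefixTokenizer<'_, T> {
    fn encode(&self, input: &str) -> Result<Vec<String>, String> {
        let mut encoded = self.tokenizer.encode(input)?;
        encoded.insert(0, self.from.to_string());
        Ok(encoded)
    }

    fn decode(&self, tokens: Vec<String>) -> Result<String, String> {
        self.tokenizer.decode(tokens)
    }
}

/// Encodes `text` once per source language, one thread each, without locking.
pub fn encode_concurrently<T: Tokenizer + Sync>(
    base: &T,
    langs: &[&str],
    text: &str,
) -> Vec<Vec<String>> {
    thread::scope(|scope| {
        let handles: Vec<_> = langs
            .iter()
            .map(|&from| {
                scope.spawn(move || {
                    PrefixTokenizer { tokenizer: base, from }
                        .encode(text)
                        .unwrap()
                })
            })
            .collect();
        // Join in spawn order so results line up with `langs`.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Because no thread mutates shared state, nothing has to be serialized behind a `Mutex`; each request only reads the base tokenizer and its own prefix.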

```py
import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.128k.model")

# Prepend the source-language token to the tokenized input.
source = ["__en__"] + sp.encode("Hello world!", out_type=str)
# Select the target language through the target prefix.
target_prefix = ["__de__"]

translator = ctranslate2.Translator("m2m_100_418m_ct2")
result = translator.translate_batch([source], target_prefix=[target_prefix])

# Skip the first output token (the target-language prefix) before decoding.
output = sp.decode(result[0].hypotheses[0][1:])
print(output)
```

m2m-100 Model custom first token #97
