Open
Description
One solution would be to remove the tokenizer from the Translator and pass it as an argument to translate_batch and translate_batch_with_target_prefix instead.
Something like the following would then be usable. Wrapping the from field in Arc<Mutex<?>> would also be possible, but that is not optimal when several translations run at the same time, since they would contend for the lock.
impl Tokenizer for MyTokenizer {
    fn encode(&self, input: &str) -> anyhow::Result<Vec<String>> {
        // Prepend the source-language token to the encoded input.
        let mut encoded = self.tokenizer.encode(input)?;
        encoded.insert(0, self.from.clone());
        Ok(encoded)
    }

    fn decode(&self, tokens: Vec<String>) -> anyhow::Result<String> {
        self.tokenizer.decode(tokens)
    }
}

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.128k.model")

# The source sentence starts with the source-language token.
source = ["__en__"] + sp.encode("Hello world!", out_type=str)
# The target prefix selects the output language.
target_prefix = ["__de__"]

translator = ctranslate2.Translator("m2m_100_418m_ct2")
result = translator.translate_batch([source], target_prefix=[target_prefix])

# Drop the leading target-language token before decoding.
output = sp.decode(result[0].hypotheses[0][1:])
print(output)
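
The proposal above (passing the tokenizer per call instead of storing it in the Translator) could look roughly like the sketch below. It is a sketch under assumptions, not the crate's actual API: the Tokenizer trait here is a local stand-in that mirrors the signatures shown above, WhitespaceTokenizer and LangTokenizer are hypothetical helpers, the Translator is a mock that simply echoes its input tokens so the example runs without a model, and Box<dyn Error> replaces anyhow to keep it dependency-free.

```rust
use std::error::Error;

// Stand-in for anyhow::Result so the sketch needs no external crates.
type TokResult<T> = Result<T, Box<dyn Error>>;

/// Local stand-in for the tokenizer trait, mirroring the signatures in the issue.
trait Tokenizer {
    fn encode(&self, input: &str) -> TokResult<Vec<String>>;
    fn decode(&self, tokens: Vec<String>) -> TokResult<String>;
}

/// Hypothetical tokenizer that splits on whitespace (a real one would wrap SentencePiece).
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn encode(&self, input: &str) -> TokResult<Vec<String>> {
        Ok(input.split_whitespace().map(str::to_owned).collect())
    }

    fn decode(&self, tokens: Vec<String>) -> TokResult<String> {
        Ok(tokens.join(" "))
    }
}

/// Hypothetical per-request wrapper: borrows a shared tokenizer and
/// prepends its own source-language token on encode.
struct LangTokenizer<'a, T: Tokenizer> {
    inner: &'a T,
    from: String,
}

impl<T: Tokenizer> Tokenizer for LangTokenizer<'_, T> {
    fn encode(&self, input: &str) -> TokResult<Vec<String>> {
        let mut encoded = self.inner.encode(input)?;
        encoded.insert(0, self.from.clone());
        Ok(encoded)
    }

    fn decode(&self, tokens: Vec<String>) -> TokResult<String> {
        self.inner.decode(tokens)
    }
}

/// Mock translator that does not own a tokenizer.
struct Translator;

impl Translator {
    /// The tokenizer is borrowed for the duration of one call only.
    /// The "model" step is a placeholder that echoes the encoded tokens.
    fn translate_batch<T: Tokenizer>(
        &self,
        tokenizer: &T,
        sources: &[&str],
    ) -> TokResult<Vec<String>> {
        let mut outputs = Vec::with_capacity(sources.len());
        for &source in sources {
            let tokens = tokenizer.encode(source)?;
            outputs.push(tokenizer.decode(tokens)?);
        }
        Ok(outputs)
    }
}

/// Two calls with different source languages share one translator and
/// one base tokenizer, with no Mutex involved.
fn example() -> TokResult<Vec<String>> {
    let base = WhitespaceTokenizer;
    let translator = Translator;
    let en = LangTokenizer { inner: &base, from: "__en__".to_string() };
    let fr = LangTokenizer { inner: &base, from: "__fr__".to_string() };
    let mut out = translator.translate_batch(&en, &["Hello world!"])?;
    out.extend(translator.translate_batch(&fr, &["Bonjour"])?);
    Ok(out)
}
```

Because the language token lives in a short-lived wrapper that each caller builds for itself, concurrent calls never need to mutate shared state, which is the situation where Arc<Mutex<?>> would otherwise become a bottleneck.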