Commit 92eeef9

Add detection for Tiktoken llama4 models

1 parent c043e27

12 files changed: +2169989 -50 lines changed

README.md (16 additions & 11 deletions)
````diff
@@ -7,11 +7,11 @@

 **Tokenizer for language models.**

-<sup>**Tokenize text for Llama, Gemini, GPT-4, Mistral and many others; in the web, on the client and any platform.**</sup>
+<sup>**Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; in the web, on the client and any platform.**</sup>

 ```rust
 use kitoken::Kitoken;
-let encoder = Kitoken::from_file("models/llama3.kit")?;
+let encoder = Kitoken::from_file("models/llama4.kit")?;

 let tokens = encoder.encode("Your future belongs to me.", true)?;
 let string = String::from_utf8(encoder.decode(&tokens, true)?)?;
@@ -26,8 +26,8 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
   Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-  Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
+  Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
   Including unicode-aware normalization, pre-tokenization and post-processing options.
 - **Compact data format**\
   Definitions are stored in an efficient binary format and without merge list.
@@ -36,10 +36,15 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S

 Kitoken can load and convert many existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a variety of inputs to ensure correctness and compatibility.

+> [!NOTE]
+> Most models on [Hugging Face](https://huggingface.co) are supported. Just take the `tokenizer.json` or `spiece.model` of a model and load it into Kitoken.
+
+Kitoken aims to be output-identical with existing implementations for all models. See the notes below for differences in specific cases.
+
 ### SentencePiece

 ```rust
-let encoder = Kitoken::from_sentencepiece_file("models/mistral.model")?;
+let encoder = Kitoken::from_file("models/gemma.model")?;
 ```

 Kitoken can convert and initialize with SentencePiece models in `BPE` and `Unigram` format.
@@ -60,7 +65,7 @@ If the model does not contain a trainer definition, `Unigram` is assumed as the
 ### Tokenizers

 ```rust
-let encoder = Kitoken::from_tokenizers_file("models/llama3.json")?;
+let encoder = Kitoken::from_file("models/llama4.json")?;
 ```

 Kitoken can convert and initialize with HuggingFace Tokenizers definitions for `BPE`, `Unigram` and `WordPiece` models.
@@ -76,19 +81,19 @@ Some normalization, post-processing and decoding options used by Tokenizers are
 <details>
 <summary>Notes</summary>

-- When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can result in different encodings depending on vocabulary coverage and inputs in this scenario.
+- When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.
 - Tokenizers normalizes inputs character-by-character, while Kitoken normalizes inputs as one. This can result in differences during case-folding in some cases. For example, greek letter `Σ` has two lowercase forms, `σ` for within-word and `ς` for end-of-word use. Tokenizers will always lowercase `Σ` to `σ`, while Kitoken will lowercase it to either depending on the context.

 </details>

 ### Tiktoken

 ```rust
-let encoder = Kitoken::from_tiktoken_file("models/cl100k_base.tiktoken")?;
+let encoder = Kitoken::from_file("models/o200k_base.tiktoken")?;
 ```

-Tiktoken is a `BPE` tokenizer with a custom definition format used by OpenAI for GPT-3 and newer models using `BytePair` tokenization in byte mode.
+Tiktoken is a `BPE` tokenizer used by OpenAI for GPT-3 and newer models and uses `BytePair` tokenization in byte mode.

 Tiktoken definitions contain a sorted vocabulary of base64 encoded bytes and corresponding token ids without any additional metadata. Special tokens and the split regex are expected to be provided separately, but will be inferred from the data for common models including GPT-3, GPT-4 and GPT-4o.
 For other models, or depending on the data and requirements, these values can be adjusted manually.

````
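As the Tiktoken section above describes, each line of a `.tiktoken` definition is just a base64-encoded byte sequence paired with its token id. A minimal parsing sketch of that layout, assuming the `base64` crate; the helper name is illustrative, not Kitoken's API:

```rust
use base64::engine::general_purpose::STANDARD;
use base64::Engine;

/// Parses one line of a Tiktoken definition: a base64-encoded byte sequence
/// and its token id, separated by a space. For example, the line
/// "aGVsbG8= 31373" maps the bytes of "hello" to id 31373.
/// Illustrative sketch, not Kitoken's internal parser.
fn parse_tiktoken_line(line: &str) -> Option<(Vec<u8>, u32)> {
    let (piece, id) = line.split_once(' ')?;
    let bytes = STANDARD.decode(piece).ok()?;
    let id = id.trim().parse().ok()?;
    Some((bytes, id))
}
```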

````diff
@@ -95,10 +100,10 @@
 ### Tekken

 ```rust
-let encoder = Kitoken::from_tekken_file("models/tekken.json")?;
+let encoder = Kitoken::from_file("models/mistral.json")?;
 ```

-Tekken is a `BPE` tokenizer with a custom definition format based on Tiktoken, used by Mistral for NeMo and newer models using `BytePair` tokenization in byte mode.
+Tekken is a `BPE` tokenizer based on Tiktoken, used by Mistral for NeMo and newer models and uses `BytePair` tokenization in byte mode.

 Tekken definitions contain a sorted vocabulary of base64 encoded bytes and corresponding token ids, as well as metadata including the split regex and special tokens.

````
packages/javascript/README.md (9 additions & 4 deletions)
````diff
@@ -1,13 +1,18 @@
 # kitoken

+[![Crates.io](https://img.shields.io/crates/v/kitoken)](https://crates.io/crates/kitoken)
+[![NPM](https://img.shields.io/npm/v/kitoken)](https://www.npmjs.com/package/kitoken)
+[![PyPI](https://img.shields.io/pypi/v/kitoken)](https://pypi.org/project/kitoken)
+[![Tests & Checks](https://img.shields.io/github/actions/workflow/status/Systemcluster/kitoken/tests.yml?label=tests%20%26%20checks)](https://github.com/Systemcluster/kitoken/actions/workflows/tests.yml)
+
 **Tokenizer for language models.**

-<sup>**Tokenize text for Llama, Gemini, GPT-4, Mistral and many others; in the web, on the client and any platform.**</sup>
+<sup>**Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; in the web, on the client and any platform.**</sup>

 ```js
 import { Kitoken } from "kitoken/node"

-const model = fs.readFileSync("models/llama3.3.model")
+const model = fs.readFileSync("models/llama4.model")
 const encoder = new Kitoken(model)

 const tokens = encoder.encode("hello world!", true)
@@ -21,8 +26,8 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
   Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-  Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
+  Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
   Including unicode-aware normalization, pre-tokenization and post-processing options.
 - **Compact data format**\
   Definitions are stored in an efficient binary format and without merge list.
````

packages/python/README.md (10 additions & 5 deletions)
````diff
@@ -1,13 +1,18 @@
 # kitoken

+[![Crates.io](https://img.shields.io/crates/v/kitoken)](https://crates.io/crates/kitoken)
+[![NPM](https://img.shields.io/npm/v/kitoken)](https://www.npmjs.com/package/kitoken)
+[![PyPI](https://img.shields.io/pypi/v/kitoken)](https://pypi.org/project/kitoken)
+[![Tests & Checks](https://img.shields.io/github/actions/workflow/status/Systemcluster/kitoken/tests.yml?label=tests%20%26%20checks)](https://github.com/Systemcluster/kitoken/actions/workflows/tests.yml)
+
 **Tokenizer for language models.**

-<sup>**Tokenize text for Llama, Gemini, GPT-4, Mistral and many others; in the web, on the client and any platform.**</sup>
+<sup>**Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; in the web, on the client and any platform.**</sup>

 ```py
 from kitoken import Kitoken

-encoder = Kitoken.from_file("models/llama3.3.model")
+encoder = Kitoken.from_file("models/llama4.model")

 tokens = encoder.encode("hello world!", True)
 string = encoder.decode(tokens).decode("utf-8")
@@ -22,9 +27,9 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
   Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-  Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
-  Including unicode-aware normalization, pre-tokenization and post-processing options.
+  Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
+  Including unicode-aware normalization, pre-tokenization and post-decoding options.
 - **Compact data format**\
   Definitions are stored in an efficient binary format and without merge list.

````

src/convert/tiktoken.rs (137 additions & 30 deletions)
```diff
@@ -6,16 +6,16 @@ use std::io::Read;
 use std::path::Path;

 use alloc::format;
-use alloc::string::ToString;
+use alloc::string::{String, ToString};
 use alloc::vec::Vec;

 use base64::{alphabet, engine, Engine};
 use bstr::ByteSlice;

 use crate::convert::ConversionError;
 use crate::{
-    Configuration, Definition, Fallback, Kitoken, Metadata, Model, Regex, SpecialToken,
-    SpecialTokenKind, SpecialVocab, Split, SplitBehavior, Vocab,
+    Configuration, Definition, Fallback, InsertionPosition, Kitoken, Metadata, Model, Regex,
+    SpecialToken, SpecialTokenKind, SpecialVocab, Split, SplitBehavior, Template, Vocab,
 };

 static BASE64: engine::GeneralPurpose =
@@ -83,8 +83,91 @@ pub fn convert_tiktoken(data: impl AsRef<[u8]>) -> Result<Definition, Conversion
     let mut config = Configuration::default();
     config.fallback.push(Fallback::Skip);

-    let specials: &[(&str, u32)] = if vocab.len() >= 199990 {
-        config.split.push(Split::Pattern { pattern:
+    let mut specials = Vec::<(String, u32)>::with_capacity(2048);
+    let reserved = move |name, count, start, pos| {
+        (start..count + start)
+            .enumerate()
+            .map(move |(n, i)| (format!("<|{name}reserved_special_token_{i}|>"), (pos + n) as u32))
+    };
+    let sequential = move |list: &'static [&'static str], pos| {
+        list.iter().enumerate().map(move |(n, s)| (s.to_string(), (pos + n) as u32))
+    };
+    match vocab.len() {
+        len @ 200000 => {
+            log::debug!("Detected llama4 vocab");
+            config.split.push(Split::Pattern { pattern:
+                Regex::new(&[
+                    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
+                    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
+                    r"\p{N}{1,3}",
+                    r" ?[^\s\p{L}\p{N}]+[\r\n/]*",
+                    r"\s*[\r\n]+",
+                    r"\s+(?!\S)",
+                ].join("|"))?.into(),
+                behavior: SplitBehavior::Isolate
+            });
+            config.templates.push(Template {
+                content: "<|begin_of_text|>".to_string(),
+                position: InsertionPosition::SequenceStart,
+            });
+            config.templates.push(Template {
+                content: "<|end_of_text|>".to_string(),
+                position: InsertionPosition::SequenceEnd,
+            });
+            // Ref: https://github.com/meta-llama/llama-models/blob/main/models/llama4/tokenizer.py
+            specials.extend(sequential(
+                &[
+                    "<|begin_of_text|>",
+                    "<|end_of_text|>",
+                    "<|fim_prefix|>",
+                    "<|fim_middle|>",
+                    "<|fim_suffix|>",
+                    "<|header_start|>",
+                    "<|header_end|>",
+                    "<|eom|>",
+                    "<|eot|>",
+                    "<|step|>",
+                ],
+                len,
+            ));
+            specials.extend(reserved("text_post_train_", 6, 0, len + specials.len()));
+            specials.extend(sequential(
+                &[
+                    "<|python_start|>",
+                    "<|python_end|>",
+                    "<|finetune_right_pad|>",
+                ],
+                len + specials.len(),
+            ));
+            specials.extend(reserved("text_post_train_", 61, 8, len + specials.len()));
+            specials.extend(sequential(
+                &[
+                    "<|image_start|>",
+                    "<|image_end|>",
+                    "<|vision_reserved_special_token_0|>",
+                    "<|vision_reserved_special_token_1|>",
+                    "<|tile_x_separator|>",
+                    "<|tile_y_separator|>",
+                    "<|vision_reserved_special_token_2|>",
+                    "<|vision_reserved_special_token_3|>",
+                    "<|vision_reserved_special_token_4|>",
+                    "<|vision_reserved_special_token_5|>",
+                    "<|image|>",
+                    "<|vision_reserved_special_token_6|>",
+                    "<|patch|>",
+                ],
+                len + specials.len(),
+            ));
+            specials.extend(reserved("vision_", 1041, 7, len + specials.len()));
+            specials.extend(reserved("reasoning_", 7, 0, len + specials.len()));
+            specials.extend(sequential(
+                &["<|reasoning_thinking_start|>", "<|reasoning_thinking_end|>"],
+                len + specials.len(),
+            ));
+        }
+        199990.. => {
+            log::debug!("Detected o200k vocab");
+            config.split.push(Split::Pattern { pattern:
             Regex::new(&[
                 r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
                 r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
@@ -95,42 +178,66 @@ pub fn convert_tiktoken(data: impl AsRef<[u8]>) -> Result<Definition, Conversion
             ].join("|"))?.into(),
             behavior: SplitBehavior::Isolate
         });
-        &[("<|endoftext|>", 199999), ("<|endofprompt|>", 200018)]
-    } else if vocab.len() >= 100000 {
-        config.split.push(Split::Pattern { pattern:
+            specials.extend([
+                ("<|endoftext|>".to_string(), 199999),
+                ("<|endofprompt|>".to_string(), 200018),
+            ]);
+        }
+        100000.. => {
+            log::debug!("Detected cl100k vocab");
+            config.split.push(Split::Pattern { pattern:
             Regex::new(r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)")?.into(),
             behavior: SplitBehavior::Isolate
         });
-        &[
-            ("<|endoftext|>", 100257),
-            ("<|fim_prefix|>", 100258),
-            ("<|fim_middle|>", 100259),
-            ("<|fim_suffix|>", 100260),
-            ("<|endofprompt|>", 100276),
-            ("<|im_start|>", 100264),
-            ("<|im_end|>", 100265),
-        ]
-    } else {
-        config.split.push(Split::Pattern {
-            pattern: Regex::new(r"'(?:[sdmt]|ll|ve|re)|\s?\p{L}+|\s?\p{N}+|\s?[^\s\p{L}\p{N}]+")?
+            specials.extend(
+                [
+                    ("<|endoftext|>", 100257),
+                    ("<|fim_prefix|>", 100258),
+                    ("<|fim_middle|>", 100259),
+                    ("<|fim_suffix|>", 100260),
+                    ("<|endofprompt|>", 100276),
+                    ("<|im_start|>", 100264),
+                    ("<|im_end|>", 100265),
+                ]
+                .map(|(s, n)| (s.to_string(), n)),
+            );
+        }
+        _ => {
+            log::debug!("Detected p50k vocab");
+            config.split.push(Split::Pattern {
+                pattern: Regex::new(
+                    r"'(?:[sdmt]|ll|ve|re)|\s?\p{L}+|\s?\p{N}+|\s?[^\s\p{L}\p{N}]+",
+                )?
             .into(),
-            behavior: SplitBehavior::Isolate,
-        });
-        &[
-            ("<|endoftext|>", 50256),
-            ("<|fim_prefix|>", 50281),
-            ("<|fim_middle|>", 50282),
-            ("<|fim_suffix|>", 50283),
-        ]
+                behavior: SplitBehavior::Isolate,
+            });
+            specials.extend(
+                [
+                    ("<|endoftext|>", 50256),
+                    ("<|fim_prefix|>", 50281),
+                    ("<|fim_middle|>", 50282),
+                    ("<|fim_suffix|>", 50283),
+                ]
+                .map(|(s, n)| (s.to_string(), n)),
+            );
+        }
     };
     let mut specials = specials
         .iter()
         .enumerate()
-        .map(|(i, &(s, t))| SpecialToken {
+        .map(|(i, &(ref s, t))| SpecialToken {
             id: t,
             bytes: s.as_bytes().to_vec(),
             kind: SpecialTokenKind::Control,
-            ident: None,
+            ident: match s.as_str() {
+                "<|begin_of_text|>" => Some("bos"),
+                "<|end_of_text|>" | "<|endoftext|>" => Some("eos"),
+                "<|eot|>" => Some("eot"),
+                "<|eom|>" => Some("eom"),
+                "<|finetune_right_pad|>" => Some("pad"),
+                _ => None,
+            }
+            .map(|s| s.to_string()),
             score: i as f32,
             extract: true,
         })
```