@@ -26,8 +26,8 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
 Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
+Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
 Including unicode-aware normalization, pre-tokenization and post-processing options.
 - **Compact data format**\
 Definitions are stored in an efficient binary format and without merge list.
@@ -36,10 +36,15 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 Kitoken can load and convert many existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a variety of inputs to ensure correctness and compatibility.
+
+> [!NOTE]
+> Most models on [Hugging Face](https://huggingface.co) are supported. Just take the `tokenizer.json` or `spiece.model` of a model and load it into Kitoken.
+
 Kitoken aims to be output-identical with existing implementations for all models. See the notes below for differences in specific cases.
 
 Kitoken can convert and initialize with HuggingFace Tokenizers definitions for `BPE`, `Unigram` and `WordPiece` models.
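For illustration, the `tokenizer.json` files mentioned above share a common top-level shape: a `model` object whose `type` field names the algorithm (`BPE`, `Unigram` or `WordPiece`) alongside its vocabulary. A minimal sketch in Python, with a made-up three-entry vocabulary; real files carry many more fields such as `normalizer`, `pre_tokenizer` and `added_tokens`:

```python
import json

# Hypothetical, heavily trimmed tokenizer.json; only the fields needed to
# identify the model type and its vocabulary are shown.
tokenizer_json = json.loads("""
{
  "model": {
    "type": "BPE",
    "vocab": {"h": 0, "i": 1, "hi": 2},
    "merges": ["h i"]
  }
}
""")

model_type = tokenizer_json["model"]["type"]
vocab_size = len(tokenizer_json["model"]["vocab"])
print(model_type, vocab_size)  # BPE 3
```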
@@ -76,29 +81,29 @@ Some normalization, post-processing and decoding options used by Tokenizers are
 <details>
 <summary>Notes</summary>
 
-- When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can result in different encodings depending on vocabulary coverage and inputs in this scenario.
+- When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.
 - Tokenizers normalizes inputs character-by-character, while Kitoken normalizes inputs as one. This can result in differences during case-folding in some cases. For example, the Greek letter `Σ` has two lowercase forms: `σ` for within-word and `ς` for end-of-word use. Tokenizers will always lowercase `Σ` to `σ`, while Kitoken will lowercase it to either depending on the context.
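The final-sigma behavior in the note above can be reproduced with Python's built-in case conversion, which applies the Unicode context-sensitive rule when lowercasing a whole string but not when lowercasing characters in isolation:

```python
# Greek capital sigma (Σ) lowercases to ς at the end of a word and to σ
# elsewhere. Lowercasing the whole string keeps the context; lowercasing
# character-by-character loses it.
word = "ΟΔΟΣ"  # a Greek word ending in capital sigma

whole = word.lower()
per_char = "".join(c.lower() for c in word)

print(whole[-1])     # ς (final form)
print(per_char[-1])  # σ (non-final form)
```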
-Tiktoken is a `BPE` tokenizer with a custom definition format used by OpenAI for GPT-3 and newer models using `BytePair` tokenization in byte mode.
+Tiktoken is a `BPE` tokenizer used by OpenAI for GPT-3 and newer models and uses `BytePair` tokenization in byte mode.
 
 Tiktoken definitions contain a sorted vocabulary of base64-encoded bytes and corresponding token ids without any additional metadata. Special tokens and the split regex are expected to be provided separately, but will be inferred from the data for common models including GPT-3, GPT-4 and GPT-4o.
 For other models, or depending on the data and requirements, these values can be adjusted manually.
 
-Tekken is a `BPE` tokenizer with a custom definition format based on Tiktoken, used by Mistral for NeMo and newer models using `BytePair` tokenization in byte mode.
+Tekken is a `BPE` tokenizer based on Tiktoken, used by Mistral for NeMo and newer models and uses `BytePair` tokenization in byte mode.
 
 Tekken definitions contain a sorted vocabulary of base64-encoded bytes and corresponding token ids, as well as metadata including the split regex and special tokens.
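The vocabulary layout described above, one base64-encoded token and its integer id per line, can be parsed with a few lines of Python. The sample entries here are invented for illustration and do not come from a real model file:

```python
import base64

# Three made-up entries in the Tiktoken line format: "<base64 token> <id>".
sample = b"""IQ== 0
aGk= 1
aGVsbG8= 2
"""

vocab = {}
for line in sample.splitlines():
    if not line.strip():
        continue
    token_b64, token_id = line.split()
    vocab[base64.b64decode(token_b64)] = int(token_id)

print(vocab)  # {b'!': 0, b'hi': 1, b'hello': 2}
```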
@@ -21,8 +26,8 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
 Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
+Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
 Including unicode-aware normalization, pre-tokenization and post-processing options.
 - **Compact data format**\
 Definitions are stored in an efficient binary format and without merge list.
@@ -22,9 +27,9 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
 Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
-Including unicode-aware normalization, pre-tokenization and post-processing options.
+Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
+Including unicode-aware normalization, pre-tokenization and post-decoding options.
 - **Compact data format**\
 Definitions are stored in an efficient binary format and without merge list.