Commit 92eeef9

Add detection for Tiktoken llama4 models

1 parent c043e27

12 files changed: +2169989 -50 lines changed

README.md (16 additions & 11 deletions)
````diff
@@ -7,11 +7,11 @@

 **Tokenizer for language models.**

-<sup>**Tokenize text for Llama, Gemini, GPT-4, Mistral and many others; in the web, on the client and any platform.**</sup>
+<sup>**Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; in the web, on the client and any platform.**</sup>

 ```rust
 use kitoken::Kitoken;
-let encoder = Kitoken::from_file("models/llama3.kit")?;
+let encoder = Kitoken::from_file("models/llama4.kit")?;

 let tokens = encoder.encode("Your future belongs to me.", true)?;
 let string = String::from_utf8(encoder.decode(&tokens, true)?)?;
@@ -26,8 +26,8 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
   Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-  Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
+  Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
   Including unicode-aware normalization, pre-tokenization and post-processing options.
 - **Compact data format**\
   Definitions are stored in an efficient binary format and without merge list.
@@ -36,10 +36,15 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S

 Kitoken can load and convert many existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a variety of inputs to ensure correctness and compatibility.

+> [!NOTE]
+> Most models on [Hugging Face](https://huggingface.co) are supported. Just take the `tokenizer.json` or `spiece.model` of a model and load it into Kitoken.
+
+Kitoken aims to be output-identical with existing implementations for all models. See the notes below for differences in specific cases.
+
 ### SentencePiece

 ```rust
-let encoder = Kitoken::from_sentencepiece_file("models/mistral.model")?;
+let encoder = Kitoken::from_file("models/gemma.model")?;
 ```

 Kitoken can convert and initialize with SentencePiece models in `BPE` and `Unigram` format.
@@ -60,7 +65,7 @@ If the model does not contain a trainer definition, `Unigram` is assumed as the
 ### Tokenizers

 ```rust
-let encoder = Kitoken::from_tokenizers_file("models/llama3.json")?;
+let encoder = Kitoken::from_file("models/llama4.json")?;
 ```

 Kitoken can convert and initialize with HuggingFace Tokenizers definitions for `BPE`, `Unigram` and `WordPiece` models.
@@ -76,19 +81,19 @@ Some normalization, post-processing and decoding options used by Tokenizers are
 <details>
 <summary>Notes</summary>

-- When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can result in different encodings depending on vocabulary coverage and inputs in this scenario.
+- When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.
 - Tokenizers normalizes inputs character-by-character, while Kitoken normalizes inputs as one. This can result in differences during case-folding in some cases. For example, greek letter `Σ` has two lowercase forms, `σ` for within-word and `ς` for end-of-word use. Tokenizers will always lowercase `Σ` to `σ`, while Kitoken will lowercase it to either depending on the context.

 </details>

 ### Tiktoken

 ```rust
-let encoder = Kitoken::from_tiktoken_file("models/cl100k_base.tiktoken")?;
+let encoder = Kitoken::from_file("models/o200k_base.tiktoken")?;
 ```

-Tiktoken is a `BPE` tokenizer with a custom definition format used by OpenAI for GPT-3 and newer models using `BytePair` tokenization in byte mode.
+Tiktoken is a `BPE` tokenizer used by OpenAI for GPT-3 and newer models and uses `BytePair` tokenization in byte mode.

 Tiktoken definitions contain a sorted vocabulary of base64 encoded bytes and corresponding token ids without any additional metadata. Special tokens and the split regex are expected to be provided separately, but will be inferred from the data for common models including GPT-3, GPT-4 and GPT-4o.
 For other models, or depending on the data and requirements, these values can be adjusted manually.

````
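As the Tiktoken section above describes, each line of a `.tiktoken` definition is just a base64-encoded byte sequence paired with its token id. A minimal parsing sketch of that layout, assuming the `base64` crate; the helper name is illustrative, not Kitoken's API:

```rust
use base64::engine::general_purpose::STANDARD;
use base64::Engine;

/// Parses one line of a Tiktoken definition: a base64-encoded byte sequence
/// and its token id, separated by a space. For example, the line
/// "aGVsbG8= 31373" maps the bytes of "hello" to id 31373.
/// Illustrative sketch, not Kitoken's internal parser.
fn parse_tiktoken_line(line: &str) -> Option<(Vec<u8>, u32)> {
    let (piece, id) = line.split_once(' ')?;
    let bytes = STANDARD.decode(piece).ok()?;
    let id = id.trim().parse().ok()?;
    Some((bytes, id))
}
```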

````diff
@@ -95,10 +100,10 @@
 ### Tekken

 ```rust
-let encoder = Kitoken::from_tekken_file("models/tekken.json")?;
+let encoder = Kitoken::from_file("models/mistral.json")?;
 ```

-Tekken is a `BPE` tokenizer with a custom definition format based on Tiktoken, used by Mistral for NeMo and newer models using `BytePair` tokenization in byte mode.
+Tekken is a `BPE` tokenizer based on Tiktoken, used by Mistral for NeMo and newer models and uses `BytePair` tokenization in byte mode.

 Tekken definitions contain a sorted vocabulary of base64 encoded bytes and corresponding token ids, as well as metadata including the split regex and special tokens.

````
packages/javascript/README.md (9 additions & 4 deletions)
````diff
@@ -1,13 +1,18 @@
 # kitoken

+[![Crates.io](https://img.shields.io/crates/v/kitoken)](https://crates.io/crates/kitoken)
+[![NPM](https://img.shields.io/npm/v/kitoken)](https://www.npmjs.com/package/kitoken)
+[![PyPI](https://img.shields.io/pypi/v/kitoken)](https://pypi.org/project/kitoken)
+[![Tests & Checks](https://img.shields.io/github/actions/workflow/status/Systemcluster/kitoken/tests.yml?label=tests%20%26%20checks)](https://github.com/Systemcluster/kitoken/actions/workflows/tests.yml)
+
 **Tokenizer for language models.**

-<sup>**Tokenize text for Llama, Gemini, GPT-4, Mistral and many others; in the web, on the client and any platform.**</sup>
+<sup>**Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; in the web, on the client and any platform.**</sup>

 ```js
 import { Kitoken } from "kitoken/node"

-const model = fs.readFileSync("models/llama3.3.model")
+const model = fs.readFileSync("models/llama4.model")
 const encoder = new Kitoken(model)

 const tokens = encoder.encode("hello world!", true)
@@ -21,8 +26,8 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
   Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-  Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
+  Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
   Including unicode-aware normalization, pre-tokenization and post-processing options.
 - **Compact data format**\
   Definitions are stored in an efficient binary format and without merge list.
````

packages/python/README.md (10 additions & 5 deletions)
````diff
@@ -1,13 +1,18 @@
 # kitoken

+[![Crates.io](https://img.shields.io/crates/v/kitoken)](https://crates.io/crates/kitoken)
+[![NPM](https://img.shields.io/npm/v/kitoken)](https://www.npmjs.com/package/kitoken)
+[![PyPI](https://img.shields.io/pypi/v/kitoken)](https://pypi.org/project/kitoken)
+[![Tests & Checks](https://img.shields.io/github/actions/workflow/status/Systemcluster/kitoken/tests.yml?label=tests%20%26%20checks)](https://github.com/Systemcluster/kitoken/actions/workflows/tests.yml)
+
 **Tokenizer for language models.**

-<sup>**Tokenize text for Llama, Gemini, GPT-4, Mistral and many others; in the web, on the client and any platform.**</sup>
+<sup>**Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; in the web, on the client and any platform.**</sup>

 ```py
 from kitoken import Kitoken

-encoder = Kitoken.from_file("models/llama3.3.model")
+encoder = Kitoken.from_file("models/llama4.model")

 tokens = encoder.encode("hello world!", True)
 string = encoder.decode(tokens).decode("utf-8")
@@ -22,9 +27,9 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 - **Fast and efficient tokenization**\
   Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
-  Native in Rust and with bindings for Web, Node and Python; see [kitoken.dev](https://kitoken.dev) for a web demo.
-- **Support for normalization and pre-tokenization**\
-  Including unicode-aware normalization, pre-tokenization and post-processing options.
+  Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
+- **Supports input and output processing**\
+  Including unicode-aware normalization, pre-tokenization and post-decoding options.
 - **Compact data format**\
   Definitions are stored in an efficient binary format and without merge list.

````

src/convert/tiktoken.rs (137 additions & 30 deletions)
```diff
@@ -6,16 +6,16 @@ use std::io::Read;
 use std::path::Path;

 use alloc::format;
-use alloc::string::ToString;
+use alloc::string::{String, ToString};
 use alloc::vec::Vec;

 use base64::{alphabet, engine, Engine};
 use bstr::ByteSlice;

 use crate::convert::ConversionError;
 use crate::{
-    Configuration, Definition, Fallback, Kitoken, Metadata, Model, Regex, SpecialToken,
-    SpecialTokenKind, SpecialVocab, Split, SplitBehavior, Vocab,
+    Configuration, Definition, Fallback, InsertionPosition, Kitoken, Metadata, Model, Regex,
+    SpecialToken, SpecialTokenKind, SpecialVocab, Split, SplitBehavior, Template, Vocab,
 };

 static BASE64: engine::GeneralPurpose =
@@ -83,8 +83,91 @@ pub fn convert_tiktoken(data: impl AsRef<[u8]>) -> Result<Definition, Conversion
     let mut config = Configuration::default();
     config.fallback.push(Fallback::Skip);

-    let specials: &[(&str, u32)] = if vocab.len() >= 199990 {
-        config.split.push(Split::Pattern { pattern:
+    let mut specials = Vec::<(String, u32)>::with_capacity(2048);
+    let reserved = move |name, count, start, pos| {
+        (start..count + start)
+            .enumerate()
+            .map(move |(n, i)| (format!("<|{name}reserved_special_token_{i}|>"), (pos + n) as u32))
+    };
+    let sequential = move |list: &'static [&'static str], pos| {
+        list.iter().enumerate().map(move |(n, s)| (s.to_string(), (pos + n) as u32))
+    };
+    match vocab.len() {
+        len @ 200000 => {
+            log::debug!("Detected llama4 vocab");
+            config.split.push(Split::Pattern { pattern:
+                Regex::new(&[
+                    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
+                    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
+                    r"\p{N}{1,3}",
+                    r" ?[^\s\p{L}\p{N}]+[\r\n/]*",
+                    r"\s*[\r\n]+",
+                    r"\s+(?!\S)",
+                ].join("|"))?.into(),
+                behavior: SplitBehavior::Isolate
+            });
+            config.templates.push(Template {
+                content: "<|begin_of_text|>".to_string(),
+                position: InsertionPosition::SequenceStart,
+            });
+            config.templates.push(Template {
+                content: "<|end_of_text|>".to_string(),
+                position: InsertionPosition::SequenceEnd,
+            });
+            // Ref: https://github.com/meta-llama/llama-models/blob/main/models/llama4/tokenizer.py
+            specials.extend(sequential(
+                &[
+                    "<|begin_of_text|>",
+                    "<|end_of_text|>",
+                    "<|fim_prefix|>",
+                    "<|fim_middle|>",
+                    "<|fim_suffix|>",
+                    "<|header_start|>",
+                    "<|header_end|>",
+                    "<|eom|>",
+                    "<|eot|>",
+                    "<|step|>",
+                ],
+                len,
+            ));
+            specials.extend(reserved("text_post_train_", 6, 0, len + specials.len()));
+            specials.extend(sequential(
+                &[
+                    "<|python_start|>",
+                    "<|python_end|>",
+                    "<|finetune_right_pad|>",
+                ],
+                len + specials.len(),
+            ));
+            specials.extend(reserved("text_post_train_", 61, 8, len + specials.len()));
+            specials.extend(sequential(
+                &[
+                    "<|image_start|>",
+                    "<|image_end|>",
+                    "<|vision_reserved_special_token_0|>",
+                    "<|vision_reserved_special_token_1|>",
+                    "<|tile_x_separator|>",
+                    "<|tile_y_separator|>",
+                    "<|vision_reserved_special_token_2|>",
+                    "<|vision_reserved_special_token_3|>",
+                    "<|vision_reserved_special_token_4|>",
+                    "<|vision_reserved_special_token_5|>",
+                    "<|image|>",
+                    "<|vision_reserved_special_token_6|>",
+                    "<|patch|>",
+                ],
+                len + specials.len(),
+            ));
+            specials.extend(reserved("vision_", 1041, 7, len + specials.len()));
+            specials.extend(reserved("reasoning_", 7, 0, len + specials.len()));
+            specials.extend(sequential(
+                &["<|reasoning_thinking_start|>", "<|reasoning_thinking_end|>"],
+                len + specials.len(),
+            ));
+        }
+        199990.. => {
+            log::debug!("Detected o200k vocab");
+            config.split.push(Split::Pattern { pattern:
             Regex::new(&[
                 r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
                 r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
@@ -95,42 +178,66 @@ pub fn convert_tiktoken(data: impl AsRef<[u8]>) -> Result<Definition, Conversion
             ].join("|"))?.into(),
             behavior: SplitBehavior::Isolate
         });
-        &[("<|endoftext|>", 199999), ("<|endofprompt|>", 200018)]
-    } else if vocab.len() >= 100000 {
-        config.split.push(Split::Pattern { pattern:
+            specials.extend([
+                ("<|endoftext|>".to_string(), 199999),
+                ("<|endofprompt|>".to_string(), 200018),
+            ]);
+        }
+        100000.. => {
+            log::debug!("Detected cl100k vocab");
+            config.split.push(Split::Pattern { pattern:
             Regex::new(r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)")?.into(),
             behavior: SplitBehavior::Isolate
         });
-        &[
-            ("<|endoftext|>", 100257),
-            ("<|fim_prefix|>", 100258),
-            ("<|fim_middle|>", 100259),
-            ("<|fim_suffix|>", 100260),
-            ("<|endofprompt|>", 100276),
-            ("<|im_start|>", 100264),
-            ("<|im_end|>", 100265),
-        ]
-    } else {
-        config.split.push(Split::Pattern {
-            pattern: Regex::new(r"'(?:[sdmt]|ll|ve|re)|\s?\p{L}+|\s?\p{N}+|\s?[^\s\p{L}\p{N}]+")?
+            specials.extend(
+                [
+                    ("<|endoftext|>", 100257),
+                    ("<|fim_prefix|>", 100258),
+                    ("<|fim_middle|>", 100259),
+                    ("<|fim_suffix|>", 100260),
+                    ("<|endofprompt|>", 100276),
+                    ("<|im_start|>", 100264),
+                    ("<|im_end|>", 100265),
+                ]
+                .map(|(s, n)| (s.to_string(), n)),
+            );
+        }
+        _ => {
+            log::debug!("Detected p50k vocab");
+            config.split.push(Split::Pattern {
+                pattern: Regex::new(
+                    r"'(?:[sdmt]|ll|ve|re)|\s?\p{L}+|\s?\p{N}+|\s?[^\s\p{L}\p{N}]+",
+                )?
             .into(),
-            behavior: SplitBehavior::Isolate,
-        });
-        &[
-            ("<|endoftext|>", 50256),
-            ("<|fim_prefix|>", 50281),
-            ("<|fim_middle|>", 50282),
-            ("<|fim_suffix|>", 50283),
-        ]
+                behavior: SplitBehavior::Isolate,
+            });
+            specials.extend(
+                [
+                    ("<|endoftext|>", 50256),
+                    ("<|fim_prefix|>", 50281),
+                    ("<|fim_middle|>", 50282),
+                    ("<|fim_suffix|>", 50283),
+                ]
+                .map(|(s, n)| (s.to_string(), n)),
+            );
+        }
     };
     let mut specials = specials
         .iter()
         .enumerate()
-        .map(|(i, &(s, t))| SpecialToken {
+        .map(|(i, &(ref s, t))| SpecialToken {
             id: t,
             bytes: s.as_bytes().to_vec(),
             kind: SpecialTokenKind::Control,
-            ident: None,
+            ident: match s.as_str() {
+                "<|begin_of_text|>" => Some("bos"),
+                "<|end_of_text|>" | "<|endoftext|>" => Some("eos"),
+                "<|eot|>" => Some("eot"),
+                "<|eom|>" => Some("eom"),
+                "<|finetune_right_pad|>" => Some("pad"),
+                _ => None,
+            }
+            .map(|s| s.to_string()),
             score: i as f32,
             extract: true,
         })
```