Benchmark and compare tokenizers across many languages using real FineWeb-2 and StarCoder data. Higher bytes/token = more efficient; higher unique tokens = better coverage.
👉 Open the live Tokka-Bench dashboard
- Any HuggingFace model or local tokenizer directory works
- Real streamed data, fast parallel execution
- Clean JSON results + interactive Streamlit dashboard
```bash
git clone <repository-url>
cd tokka-bench
uv sync
```
```bash
# Single tokenizer (2MB per language by default)
uv run benchmark tokenizer=openai-community/gpt2

# Multiple tokenizers in parallel
uv run benchmark tokenizers=openai-community/gpt2,meta-llama/Llama-3.1-8B max_workers=8

# Quick smoke test
uv run benchmark tokenizer=openai-community/gpt2 sample_size=0.1

# Custom output filename
uv run benchmark tokenizer=meta-llama/Meta-Llama-3-8B output_name=llama_results

# Run on a custom set of natural languages
uv run benchmark tokenizers=openai-community/gpt2 natural_lang_list=hin_Deva,jpn_Jpan,fra_Latn max_workers=8

# Example: comprehensive multi-tokenizer run
uv run benchmark tokenizers=openai-community/gpt2,google/gemma-3-27b-it,Xenova/gpt-4,meta-llama/Llama-3.1-8B,moonshotai/Kimi-K2-Instruct,Qwen/Qwen3-30B-A3B-Instruct-2507,openai/gpt-oss-120b max_workers=10
```
Results are saved to `data/results/{name}.json`.
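The results file is plain JSON, so it is easy to inspect or post-process outside the dashboard. A minimal sketch, assuming a run saved as `gpt2.json` (the filename and key layout here are illustrative, not a documented schema):

```python
import json

# Load one benchmark run; the filename matches the output_name you chose.
with open("data/results/gpt2.json") as f:
    results = json.load(f)

# Explore the structure by listing the top-level keys.
print(list(results.keys()))
```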
```bash
# Recommended launcher (pre-flight checks)
uv run dashboard

# Or run Streamlit directly
uv run streamlit run streamlit_app.py
```
- Train a tokenizer (e.g., with HuggingFace Tokenizers) and save it to a local folder, say `./my-tokenizer` (see the sketch after this list).
- Benchmark it by pointing to the folder:
  ```bash
  uv run benchmark tokenizer=./my-tokenizer output_name=my_tokenizer
  ```
- Launch the dashboard to compare it with others:
  ```bash
  uv run dashboard
  ```
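For the training step, here is a minimal sketch using the HuggingFace `tokenizers` library, assuming a local `corpus.txt`; the vocab size, special tokens, and byte-level BPE setup are placeholder choices, not recommendations:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a byte-level BPE tokenizer on a local text file.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Wrap and save in the standard HuggingFace directory layout so the
# benchmark can load it from ./my-tokenizer.
wrapped = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]")
wrapped.save_pretrained("./my-tokenizer")
```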
Tips:
- You can control language counts: `natural_n=99 code_n=20`
- Adjust speed vs. accuracy via `sample_size` (in MB)
- `bytes_per_token`: higher = more efficient
- `unique_tokens`: higher = better script coverage
- `subword_fertility`: subwords per word/character unit
- `continued_word_rate`: % of tokens that continue a word
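To make the first two metrics concrete, here is a rough sketch of how they can be computed for any HuggingFace tokenizer (illustrative only; the benchmark's own implementation may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
sample = "Tokenizers map text to integer IDs; efficient ones emit fewer tokens."
ids = tokenizer.encode(sample)

# bytes_per_token: UTF-8 bytes consumed per token produced (higher = more efficient).
bytes_per_token = len(sample.encode("utf-8")) / len(ids)

# unique_tokens: distinct vocabulary entries exercised (higher = broader coverage).
unique_tokens = len(set(ids))

print(f"{bytes_per_token:.2f} bytes/token, {unique_tokens} unique tokens")
```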
Tokka-Bench streams real text from FineWeb-2 (top N natural languages), FineWeb (English), and StarCoder (top N coding languages). It tokenizes equal-sized samples per language, computes metrics, aggregates global Unicode/script stats, and saves results in a simple JSON format for the dashboard.
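To illustrate the streaming step, the sketch below pulls a fixed-size language sample with the `datasets` library; the dataset ID `HuggingFaceFW/fineweb-2`, the `fra_Latn` config, and the 2 MB cutoff are assumptions mirroring the defaults above, not the benchmark's internal code:

```python
from datasets import load_dataset

# Stream one FineWeb-2 language config instead of downloading the full corpus.
stream = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn",
                      split="train", streaming=True)

texts, nbytes = [], 0
for row in stream:
    texts.append(row["text"])
    nbytes += len(row["text"].encode("utf-8"))
    if nbytes >= 2 * 1024 * 1024:  # stop at ~2 MB, the default per-language sample
        break
```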
`tokenizer`:
- Description: The tokenizer to benchmark. Usually a HuggingFace model tag, e.g. `openai-community/gpt2`, or a path to a local tokenizer directory.
- Usage: `tokenizer=openai-community/gpt2`

`tokenizers`:
- Description: Alternative to `tokenizer` for benchmarking multiple tokenizers at once; separate entries with commas.
- Usage: `tokenizers="openai-community/gpt2,meta-llama/Llama-3.1-8B"`

`sample_size`:
- Description: Amount of data to benchmark on per language, in MB (defaults to 2).
- Usage: `sample_size=100` (benchmark on 100 MB).

`output_name`:
- Description: Filename for the benchmark's output JSON, saved under `data/results/`.
- Usage: `output_name=MyOutput`

`output_names`:
- Description: Comma-separated output names for storing each tokenizer's results separately when benchmarking multiple tokenizers; the count must match the tokenizer count.
- Usage: `output_names="TokenizerBenchmark1,TokenizerBenchmark2"`

`max_workers`:
- Description: Maximum number of CPU workers assigned to the benchmarking task.
- Usage: `max_workers=8`

`natural_n`:
- Description: Benchmark the top N natural languages by total size in the FineWeb-2 corpus.
- Usage: `natural_n=10` (defaults to 99).

`code_n`:
- Description: Number of programming languages to benchmark on.
- Usage: `code_n=10` (defaults to 20).

`natural_lang_list`:
- Description: Benchmark a specific set of natural languages rather than the top N by size. Specify each language as its ISO 639 code plus a script indicator, e.g. `fra_Latn` for French in Latin script; see the `Subset` column of `src/fineweb-2-languages.csv` for valid values.
- Usage: `natural_lang_list="hin_Deva,fra_Latn"` (tests against Hindi and French).
- Dashboard says no results: run a benchmark first to populate `data/results/`
- Using private/remote models: set `HF_TOKEN` in your environment
- Slow runs: lower `sample_size`, reduce `natural_n`/`code_n`, or increase `max_workers`
PRs welcome! See CONTRIBUTING.md for guidelines.
MIT (or project license) — see LICENSE if provided.