
Tokka-Bench

Benchmark and compare tokenizers across many languages using real FineWeb-2 and StarCoder data. Higher bytes/token = more efficient; higher unique tokens = better coverage.

[Screenshot: Tokka-Bench dashboard (interactive tokenizer comparison)]

👉 Open the live Tokka-Bench dashboard

TL;DR

  • Any HuggingFace model or local tokenizer directory works
  • Real streamed data, fast parallel execution
  • Clean JSON results + interactive Streamlit dashboard

Install

git clone <repository-url>
cd tokka-bench
uv sync

Run a Benchmark

# Single tokenizer (2MB per language by default)
uv run benchmark tokenizer=openai-community/gpt2

# Multiple tokenizers in parallel
uv run benchmark tokenizers=openai-community/gpt2,meta-llama/Llama-3.1-8B max_workers=8

# Quick smoke test
uv run benchmark tokenizer=openai-community/gpt2 sample_size=0.1

# Custom output filename
uv run benchmark tokenizer=meta-llama/Meta-Llama-3-8B output_name=llama_results

# Run on a custom set of natural languages
uv run benchmark tokenizers=openai-community/gpt2 natural_lang_list=hin_Deva,jpn_Jpan,fra_Latn max_workers=8

# Example: comprehensive multi-tokenizer run
uv run benchmark tokenizers=openai-community/gpt2,google/gemma-3-27b-it,Xenova/gpt-4,meta-llama/Llama-3.1-8B,moonshotai/Kimi-K2-Instruct,Qwen/Qwen3-30B-A3B-Instruct-2507,openai/gpt-oss-120b max_workers=10

Results are saved to data/results/{name}.json.
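
To peek at what a run recorded, here's a minimal sketch (my_tokenizer.json is a placeholder for whatever output_name you used; the top-level keys are whatever the benchmark wrote):

import json

# Load a result file written by the benchmark (path follows data/results/{name}.json).
with open("data/results/my_tokenizer.json") as f:
    results = json.load(f)

# Inspect what the run recorded before digging into specific metrics.
print(list(results.keys()))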

Visualize Results

# Recommended launcher (pre-flight checks)
uv run dashboard

# Or run Streamlit directly
uv run streamlit run streamlit_app.py

Train Your Own Tokenizer (local) and Benchmark It

  1. Train a tokenizer (e.g., with HuggingFace Tokenizers; see the sketch after these steps) and save it to a local folder, say ./my-tokenizer.

  2. Benchmark it by pointing to the folder:

uv run benchmark tokenizer=./my-tokenizer output_name=my_tokenizer

  3. Launch the dashboard to compare with others:

uv run dashboard
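
If you don't have a tokenizer yet, here is a minimal training sketch with the HuggingFace tokenizers library. corpus.txt and the 32k vocabulary size are placeholders, and byte-level BPE is just one reasonable choice, not something tokka-bench requires:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a byte-level BPE tokenizer on a local corpus (corpus.txt is a placeholder).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Save in a layout that AutoTokenizer (and hence the benchmark) can load.
PreTrainedTokenizerFast(tokenizer_object=tokenizer).save_pretrained("./my-tokenizer")

Saving through PreTrainedTokenizerFast produces a directory that loads like any HuggingFace model tag, so the benchmark command above works unchanged.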

Tips:

  • You can control language counts: natural_n=99 code_n=20
  • Adjust speed vs accuracy via sample_size (MB)

What’s Measured

  • bytes_per_token: higher = more efficient (see the sketch below)
  • unique_tokens: higher = better script coverage
  • subword_fertility: subwords per word/character unit
  • continued_word_rate: % of tokens that continue a word
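
As an illustration of the first two metrics (not the benchmark's exact code), you can reproduce a rough bytes_per_token and unique-token count for any HuggingFace tokenizer:

from transformers import AutoTokenizer

# Rough illustration: UTF-8 bytes of a sample divided by the tokens produced.
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
text = "नमस्ते दुनिया"  # Hindi sample; expect low efficiency from a mostly-English BPE
ids = tok.encode(text)
print("bytes/token:", len(text.encode("utf-8")) / len(ids))
print("unique tokens:", len(set(ids)))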

How It Works (one paragraph)

Tokka-Bench streams real text from FineWeb-2 (top N natural languages), FineWeb (English), and StarCoder (top N coding languages). It tokenizes equal-sized samples per language, computes metrics, aggregates global Unicode/script stats, and saves results in a simple JSON format for the dashboard.
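
A rough sketch of the streaming step; the dataset path, config name, and text field are assumptions based on the public FineWeb-2 release on HuggingFace, not code from this repo:

from datasets import load_dataset

# Stream one FineWeb-2 language subset instead of downloading the whole corpus.
ds = load_dataset("HuggingFaceFW/fineweb-2", name="hin_Deva",
                  split="train", streaming=True)

# Accumulate roughly 2 MB of UTF-8 text, matching the default sample size.
sample, seen = [], 0
for row in ds:
    sample.append(row["text"])
    seen += len(row["text"].encode("utf-8"))
    if seen >= 2 * 1024 * 1024:
        break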

Reference: Command Line Arguments

  • tokenizer:
    • Description: The tokenizer to benchmark. Usually a HuggingFace model tag, e.g. openai-community/gpt2, or a local directory.
    • Usage: tokenizer=openai-community/gpt2
  • tokenizers:
    • Description: Alternative to tokenizer for benchmarking multiple tokenizers at once. Separate tokenizers with commas.
    • Usage: tokenizers="openai-community/gpt2,meta-llama/Llama-3.1-8B"
  • sample_size:
    • Description: Amount of data to benchmark per language (in MB).
    • Usage: sample_size=100 (benchmark on 100 MB per language).
  • output_name:
    • Description: Filename (without extension) for the benchmark output JSON in data/results/.
    • Usage: output_name=MyOutput
  • output_names:
    • Description: When benchmarking multiple tokenizers whose results should be stored separately, use this field. The count must match the tokenizer count.
    • Usage: output_names="TokenizerBenchmark1,TokenizerBenchmark2"
  • max_workers:
    • Description: Maximum number of CPU workers assigned to the benchmarking task.
    • Usage: max_workers=8
  • natural_n:
    • Description: Benchmark on the top N natural languages by total size in the FineWeb-2 corpus.
    • Usage: natural_n=10 (defaults to 99)
  • code_n:
    • Description: Number of programming languages to benchmark on.
    • Usage: code_n=10 (defaults to 20)
  • natural_lang_list:
    • Description: Test a specific set of natural languages rather than the top languages by size. Specify each language by its ISO 639-3 code plus a script indicator, e.g. fra_Latn for French in Latin script. See the Subset column of src/fineweb-2-languages.csv for the values matching your desired language/script pairs.
    • Usage: natural_lang_list="hin_Deva,fra_Latn" (test against Hindi and French).

Troubleshooting

  • Dashboard shows no results: run a benchmark first to populate data/results/
  • Using private or gated models: set HF_TOKEN in your environment
  • Slow runs: lower sample_size, reduce natural_n/code_n, or increase max_workers

Contributing

PRs welcome! See CONTRIBUTING.md for guidelines.

License

MIT (or the project's license); see LICENSE if provided.
