Faster Whitespace PreTokenizer (Drop-in Replacement) #1822
🚀 Faster Whitespace PreTokenizer (Drop-in Replacement)
This PR replaces the current `Whitespace` pre-tokenizer implementation with an optimized version that achieves consistent 10–30% performance improvements across short, medium, and long inputs, with identical output behavior.

🔧 Changes
✅ Replaced `whitespace.rs` with a new implementation using manual `char_indices()` traversal (no regex).
✅ Added a `WhitespaceSplit` variant for simpler whitespace-only tokenization.
✅ Updated unit tests to verify correctness and output compatibility.
✅ Added `whitespace_bench.rs` in `benches/`, using Criterion.
✅ Updated `Cargo.toml` to register the benchmark.

⚡ Benchmarks (Criterion)
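For reference, a Criterion benchmark is typically registered in `Cargo.toml` with a section like the following (the bench name here is assumed to match the file added above):

```toml
[[bench]]
name = "whitespace_bench"
harness = false  # disable the default test harness so Criterion supplies its own main()
```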
Benchmarks were run across five full test cycles to minimize outliers and assess stability.
🧪 Inputs
Short: `"Hello world!"` (~10–20 chars)
Medium: Sentences with spaces, tabs, punctuation (~100–150 chars)
Long: Large paragraphs repeated 3× (~5,000+ chars)
✅ Optimized Version (New)
🧬 Output Compatibility
Produces the exact same pre-tokenization splits as the current version.
Word boundaries, punctuation, and whitespace are handled identically.
Includes robust unit tests verifying span offsets and output strings.
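One implementation-independent invariant such tests can check is that every reported span slices back to exactly its substring in the original input. A minimal sketch (the piece data below is hand-written to mirror the expected splits for a sample input, not taken from the PR's test suite):

```rust
fn main() {
    let text = "Hello world!";
    // Hand-written expected splits: (substring, byte-offset span).
    let pieces = [("Hello", (0usize, 5usize)), ("world", (6, 11)), ("!", (11, 12))];
    for (s, (b, e)) in pieces {
        // Offsets must map back onto the original string exactly.
        assert_eq!(&text[b..e], s);
    }
    println!("all spans consistent");
}
```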
🧠 Technical Improvements
No regex: replaced with a simple and cache-efficient `char_indices()` iterator loop.
Span classification is done in-place: word, whitespace, punctuation.
Avoids unnecessary allocations or dependencies.
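The regex-free approach can be sketched as follows. This is an illustrative standalone version, not the PR's exact code: it walks the string with `char_indices()`, classifies each character as whitespace, word, or punctuation, and emits maximal runs of word or punctuation characters together with their byte-offset spans, skipping whitespace.

```rust
// Character classes used for in-place span classification.
#[derive(PartialEq, Clone, Copy)]
enum Kind {
    Space,
    Word,
    Punct,
}

fn kind(c: char) -> Kind {
    if c.is_whitespace() {
        Kind::Space
    } else if c.is_alphanumeric() || c == '_' {
        Kind::Word
    } else {
        Kind::Punct
    }
}

// Split `text` into (substring, byte-offset span) pieces by grouping
// consecutive characters of the same kind and dropping whitespace runs.
fn whitespace_split(text: &str) -> Vec<(String, (usize, usize))> {
    let mut out = Vec::new();
    let mut run: Option<(usize, Kind)> = None; // (start offset, kind of current run)
    for (i, c) in text.char_indices() {
        let k = kind(c);
        match run {
            Some((_, rk)) if rk == k => {} // same kind: extend the current run
            _ => {
                // Kind changed: flush the finished run unless it was whitespace.
                if let Some((start, rk)) = run {
                    if rk != Kind::Space {
                        out.push((text[start..i].to_string(), (start, i)));
                    }
                }
                run = Some((i, k));
            }
        }
    }
    // Flush the trailing run.
    if let Some((start, rk)) = run {
        if rk != Kind::Space {
            out.push((text[start..].to_string(), (start, text.len())));
        }
    }
    out
}

fn main() {
    for (piece, (b, e)) in whitespace_split("Hello world!") {
        println!("{piece:?} {b}..{e}");
    }
}
```

A single forward pass with no regex engine and no per-character allocation is what makes this shape cache-friendly; only the emitted pieces allocate.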
Fully backward-compatible and implements `impl_serde_type!`.

📎 Related Issue
Addresses the motivation in #1820:
While this PR doesn't solve that issue directly, it improves local testing coverage and adds Criterion-based benchmarks so others can independently validate behavior and performance — without needing external test datasets.
🙌 Closing
Whitespace is used everywhere in tokenization — from LLM pretraining to inference. Optimizing its performance has cascading effects at scale, especially in multithreaded and batched pipelines.
Thank you for maintaining this incredible library. Let me know if you'd like additional changes, such as splitting this into a side-by-side version (`WhitespaceFast`) for testing, but this PR is designed as a safe drop-in upgrade.

Best,
AndriaK