Hi Hugging Face team,

I'd like to propose replacing the current `Whitespace` PreTokenizer in `tokenizers` with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short, medium, and long inputs while preserving identical output behavior.
Why This Matters

`Whitespace` is a foundational component used in many pipelines, especially in LLM pretraining, tokenization benchmarks, and inference preprocessing. Any improvement here brings a compounding benefit at scale, particularly in multi-threaded, batched workflows.
Benchmarks (Criterion)

I benchmarked both implementations across multiple runs with consistent patterns:

Inputs

- Short: e.g., "Hello world!" (~10–20 characters)
- Medium: typical sentences (~100–150 characters)
- Long: paragraphs or documents (~5,000+ characters)
Optimized Version (mine)

| Input Type | Time (avg) | Change |
|---|---|---|
| Short | 549–559 ns | 10–15% faster |
| Medium | 3.86–4.01 µs | 5–30% faster |
| Long | 50.8–71 µs | 5–15% faster, more stable |
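For reference, here is a minimal sketch of the kind of Criterion harness used for numbers like these, assuming the crate's public `Whitespace`, `PreTokenizer`, and `PreTokenizedString` APIs; the input strings and benchmark IDs below are illustrative rather than the exact fixtures behind the table, and the optimized version would be benchmarked the same way:

```rust
// Illustrative Criterion harness; inputs and benchmark IDs are examples only.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::{PreTokenizedString, PreTokenizer};

fn bench_whitespace(c: &mut Criterion) {
    let inputs = [
        ("short", "Hello world!".to_string()),
        ("medium", "The quick brown fox jumps over the lazy dog. ".repeat(3)),
        ("long", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ".repeat(100)),
    ];
    let pretok = Whitespace {};
    for (name, input) in &inputs {
        c.bench_function(&format!("whitespace/{name}"), |b| {
            b.iter(|| {
                // Pre-tokenize a fresh PreTokenizedString on every iteration.
                let mut s = PreTokenizedString::from(black_box(input.as_str()));
                pretok.pre_tokenize(&mut s).unwrap();
            })
        });
    }
}

criterion_group!(benches, bench_whitespace);
criterion_main!(benches);
```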
Output Compatibility

- Produces the same pre-tokenization splits as the original
- Word boundaries, punctuation, and whitespace are handled identically
- Includes unit tests that confirm offset and string correctness (see the sketch below)
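To illustrate what those tests assert, here is a minimal sketch of an offset/string check against the current `Whitespace` behavior (`\w+|[^\w\s]+` splitting, whitespace removed); it assumes the re-exported `OffsetReferential`/`OffsetType` paths shown, and the optimized implementation is expected to pass identical assertions:

```rust
// Illustrative correctness check: assert the exact (substring, byte-offset)
// splits produced by the existing `Whitespace` pre-tokenizer. The optimized
// implementation should pass the very same assertions.
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::{OffsetReferential, OffsetType, PreTokenizedString, PreTokenizer};

#[test]
fn splits_and_offsets_are_preserved() {
    let pretok = Whitespace {};
    let mut pretokenized = PreTokenizedString::from("Hey man!");
    pretok.pre_tokenize(&mut pretokenized).unwrap();

    let splits: Vec<(&str, (usize, usize))> = pretokenized
        .get_splits(OffsetReferential::Original, OffsetType::Byte)
        .into_iter()
        .map(|(text, offsets, _)| (text, offsets))
        .collect();

    assert_eq!(
        splits,
        vec![("Hey", (0, 3)), ("man", (4, 7)), ("!", (7, 8))]
    );
}
```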
Technical Summary

- Replaces regex-based character matching with a manual `char_indices()` loop (sketched after this list)
- Classifies spans as word, whitespace, or punctuation without allocations
- No external dependencies
- Cleaner and more cache-friendly structure
- Fully backward compatible, including `impl_serde_type!`
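To make the approach concrete, here is a simplified, self-contained sketch of the span-classification idea (not the actual patch): a single `char_indices()` pass that classifies characters and emits byte-offset spans for word/punctuation runs with no per-span allocation. The word test below (alphanumeric or underscore) is only an approximation of `\w` for illustration; the real implementation must match the existing `\w+|[^\w\s]+` semantics exactly.

```rust
// Simplified sketch of the described approach; not the actual patch.

#[derive(Clone, Copy, PartialEq)]
enum Class {
    Word,  // alphanumeric or underscore (approximation of `\w`)
    Space, // whitespace, never emitted
    Punct, // everything else, grouped into its own runs
}

fn classify(c: char) -> Class {
    if c.is_whitespace() {
        Class::Space
    } else if c.is_alphanumeric() || c == '_' {
        Class::Word
    } else {
        Class::Punct
    }
}

/// Byte-offset spans of the pre-tokens, without allocating per-span strings.
fn whitespace_spans(s: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut current: Option<(usize, Class)> = None; // (run start, run class)

    for (i, c) in s.char_indices() {
        let class = classify(c);
        match current {
            Some((_, prev)) if prev == class => {} // still inside the same run
            Some((start, prev)) => {
                if prev != Class::Space {
                    spans.push((start, i)); // close the previous word/punct run
                }
                current = Some((i, class));
            }
            None => current = Some((i, class)),
        }
    }
    if let Some((start, class)) = current {
        if class != Class::Space {
            spans.push((start, s.len())); // close the trailing run
        }
    }
    spans
}

fn main() {
    // "Hello world!" -> [(0, 5), (6, 11), (11, 12)], i.e. "Hello", "world", "!"
    println!("{:?}", whitespace_spans("Hello world!"));
}
```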
Integration Options

I'd be happy to:

- Submit a PR replacing the current implementation
- Or submit it alongside as `WhitespaceFast` for side-by-side evaluation
Thanks again for maintaining this fantastic library. Let me know your preferences and I'll submit the PR accordingly!
Best,
AndriaK