Hi Hugging Face team,

I'd like to propose replacing the current `Whitespace` PreTokenizer in `tokenizers` with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short, medium, and long inputs while preserving identical output behavior.
Why This Matters

`Whitespace` is a foundational component used in many pipelines, especially in LLM pretraining, tokenization benchmarks, and inference preprocessing. Any improvement here brings a compounding benefit at scale, particularly in multi-threaded, batched workflows.
Benchmarks (Criterion)

I benchmarked both implementations across multiple runs with consistent patterns:

Inputs

- Short: e.g., "Hello world!" (~10–20 characters)
- Medium: typical sentences (~100–150 characters)
- Long: paragraphs or documents (~5,000+ characters)
Optimized Version (mine)

| Input Type | Time (avg) | Change |
|---|---|---|
| Short | 549–559 ns | 10–15% faster |
| Medium | 3.86–4.01 µs | 5–30% faster |
| Long | 50.8–71 µs | 5–15% faster, more stable |
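For reference, here is a minimal sketch of the kind of Criterion harness used for numbers like these, assuming the crate's public `Whitespace`, `PreTokenizer`, and `PreTokenizedString` APIs; the input strings and benchmark IDs below are illustrative rather than the exact fixtures behind the table, and the optimized version would be benchmarked the same way:

```rust
// Illustrative Criterion harness; inputs and benchmark IDs are examples only.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::{PreTokenizedString, PreTokenizer};

fn bench_whitespace(c: &mut Criterion) {
    let inputs = [
        ("short", "Hello world!".to_string()),
        ("medium", "The quick brown fox jumps over the lazy dog. ".repeat(3)),
        ("long", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ".repeat(100)),
    ];
    let pretok = Whitespace {};
    for (name, input) in &inputs {
        c.bench_function(&format!("whitespace/{name}"), |b| {
            b.iter(|| {
                // Pre-tokenize a fresh PreTokenizedString on every iteration.
                let mut s = PreTokenizedString::from(black_box(input.as_str()));
                pretok.pre_tokenize(&mut s).unwrap();
            })
        });
    }
}

criterion_group!(benches, bench_whitespace);
criterion_main!(benches);
```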
Output Compatibility

- Produces the same pre-tokenization splits as the original
- Word boundaries, punctuation, and whitespace are handled identically
- Includes unit tests that confirm offset and string correctness (see the sketch below)
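To illustrate what those tests assert, here is a minimal sketch of an offset/string check against the current `Whitespace` behavior (`\w+|[^\w\s]+` splitting, whitespace removed); it assumes the re-exported `OffsetReferential`/`OffsetType` paths shown, and the optimized implementation is expected to pass identical assertions:

```rust
// Illustrative correctness check: assert the exact (substring, byte-offset)
// splits produced by the existing `Whitespace` pre-tokenizer. The optimized
// implementation should pass the very same assertions.
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::{OffsetReferential, OffsetType, PreTokenizedString, PreTokenizer};

#[test]
fn splits_and_offsets_are_preserved() {
    let pretok = Whitespace {};
    let mut pretokenized = PreTokenizedString::from("Hey man!");
    pretok.pre_tokenize(&mut pretokenized).unwrap();

    let splits: Vec<(&str, (usize, usize))> = pretokenized
        .get_splits(OffsetReferential::Original, OffsetType::Byte)
        .into_iter()
        .map(|(text, offsets, _)| (text, offsets))
        .collect();

    assert_eq!(
        splits,
        vec![("Hey", (0, 3)), ("man", (4, 7)), ("!", (7, 8))]
    );
}
```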
Technical Summary

- Replaces regex-based character matching with a manual `char_indices()` loop (sketched after this list)
- Classifies spans as word, whitespace, or punctuation without allocations
- No external dependencies
- Cleaner and more cache-friendly structure
- Fully backward compatible, including `impl_serde_type!`
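To make the approach concrete, here is a simplified, self-contained sketch of the span-classification idea (not the actual patch): a single `char_indices()` pass that classifies characters and emits byte-offset spans for word/punctuation runs with no per-span allocation. The word test below (alphanumeric or underscore) is only an approximation of `\w` for illustration; the real implementation must match the existing `\w+|[^\w\s]+` semantics exactly.

```rust
// Simplified sketch of the described approach; not the actual patch.

#[derive(Clone, Copy, PartialEq)]
enum Class {
    Word,  // alphanumeric or underscore (approximation of `\w`)
    Space, // whitespace, never emitted
    Punct, // everything else, grouped into its own runs
}

fn classify(c: char) -> Class {
    if c.is_whitespace() {
        Class::Space
    } else if c.is_alphanumeric() || c == '_' {
        Class::Word
    } else {
        Class::Punct
    }
}

/// Byte-offset spans of the pre-tokens, without allocating per-span strings.
fn whitespace_spans(s: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut current: Option<(usize, Class)> = None; // (run start, run class)

    for (i, c) in s.char_indices() {
        let class = classify(c);
        match current {
            Some((_, prev)) if prev == class => {} // still inside the same run
            Some((start, prev)) => {
                if prev != Class::Space {
                    spans.push((start, i)); // close the previous word/punct run
                }
                current = Some((i, class));
            }
            None => current = Some((i, class)),
        }
    }
    if let Some((start, class)) = current {
        if class != Class::Space {
            spans.push((start, s.len())); // close the trailing run
        }
    }
    spans
}

fn main() {
    // "Hello world!" -> [(0, 5), (6, 11), (11, 12)], i.e. "Hello", "world", "!"
    println!("{:?}", whitespace_spans("Hello world!"));
}
```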
Integration Options

I'd be happy to:

- Submit a PR replacing the current implementation
- Or submit it alongside as `WhitespaceFast` for side-by-side evaluation
Thanks again for maintaining this fantastic library. Let me know your preferences and I'll submit the PR accordingly!
Best,
AndriaK