Proposal: Faster Whitespace PreTokenizer Implementation (10–30% Speedup) #1821

@8ria

Hi Hugging Face team 👋,

I’d like to propose replacing the current Whitespace PreTokenizer in tokenizers with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short, medium, and long inputs, while preserving identical output behavior.


🚀 Why This Matters

Whitespace is a foundational component used in many pipelines, especially in LLM pretraining, tokenization benchmarks, and inference preprocessing. Any improvement here brings a compounding benefit at scale, especially in multi-threaded, batched workflows.


⚑ Benchmarks (Criterion)

I benchmarked both implementations across multiple runs with consistent patterns:

🧪 Inputs

  • Short: e.g., "Hello world!" (~10–20 characters)

  • Medium: typical sentences (~100–150 characters)

  • Long: paragraphs or documents (~5,000+ characters)


✅ Optimized Version (mine)

| Input Type | Time (avg)   | Change                    |
|------------|--------------|---------------------------|
| Short      | 549–559 ns   | 10–15% faster             |
| Medium     | 3.86–4.01 µs | 5–30% faster              |
| Long       | 50.8–71 µs   | 5–15% faster, more stable |

🧬 Output Compatibility

  • Produces the same pre-tokenization splits as the original

  • Word boundaries, punctuation, and whitespace are handled identically

  • Includes unit tests that confirm offset and string correctness
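As a rough illustration of what "offset and string correctness" could mean in practice, here is a minimal sketch of an invariant check over a list of splits. The `check_splits` helper and its error messages are hypothetical and are not taken from the proposed patch or from the tokenizers test suite. It assumes each split is a `(token, (start, end))` pair with byte offsets into the input, and that only whitespace may be dropped between tokens.

```rust
/// Hypothetical helper: verifies that every token matches the input slice at
/// its byte offsets, that spans are ordered and non-overlapping, and that the
/// text skipped between spans is whitespace only.
fn check_splits(input: &str, splits: &[(&str, (usize, usize))]) -> Result<(), String> {
    let mut cursor = 0; // end offset of the previous span
    for &(token, (start, end)) in splits {
        if start < cursor || end <= start {
            return Err(format!("span {start}..{end} is empty, overlapping, or out of order"));
        }
        // `str::get` returns None for out-of-range or non-char-boundary offsets.
        let slice = input
            .get(start..end)
            .ok_or_else(|| format!("span {start}..{end} is not valid for the input"))?;
        if slice != token {
            return Err(format!("token {token:?} does not match input slice {slice:?}"));
        }
        // Anything skipped between two spans must be whitespace.
        if !input[cursor..start].chars().all(char::is_whitespace) {
            return Err(format!("non-whitespace text dropped before offset {start}"));
        }
        cursor = end;
    }
    // Anything left after the last span must be whitespace as well.
    if !input[cursor..].chars().all(char::is_whitespace) {
        return Err(format!("non-whitespace text dropped after offset {cursor}"));
    }
    Ok(())
}

fn main() {
    let input = "Hello world!";
    let splits = [("Hello", (0, 5)), ("world", (6, 11)), ("!", (11, 12))];
    println!("{:?}", check_splits(input, &splits));
}
```

A check like this only covers structural invariants. Confirming identical output would still require comparing the splits token by token against the existing implementation.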


🔧 Technical Summary

  • Replaces regex-based character matching with a manual char_indices() loop

  • Classifies spans as word, whitespace, or punctuation without allocations

  • No external dependencies

  • Cleaner and more cache-friendly structure

  • Fully backward compatible, including impl_serde_type!
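To make the approach in the list above concrete, here is a minimal sketch of a `char_indices()` loop that classifies characters and emits word and punctuation spans with byte offsets. It is an illustration only, not the code from the proposed patch. The names `Class`, `classify`, and `split_whitespace_sketch` are hypothetical. The word test `is_alphanumeric() || c == '_'` is an approximation of the regex class `\w` and may differ from it on some Unicode characters (for example combining marks). Spans borrow from the input, so there is no per-token string allocation, although the output `Vec` itself still allocates.

```rust
/// Coarse character classes used by this sketch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Class {
    Word,
    Space,
    Punct,
}

/// Approximate classification; NOT guaranteed to match regex `\w` and `\s`.
fn classify(c: char) -> Class {
    if c.is_whitespace() {
        Class::Space
    } else if c.is_alphanumeric() || c == '_' {
        Class::Word
    } else {
        Class::Punct
    }
}

/// Splits `input` into runs of word or punctuation characters, dropping
/// whitespace. Each item is a borrowed slice plus its (start, end) byte offsets.
fn split_whitespace_sketch(input: &str) -> Vec<(&str, (usize, usize))> {
    let mut out = Vec::new();
    let mut start = 0; // byte offset where the current run began
    let mut current: Option<Class> = None; // class of the current run

    for (i, c) in input.char_indices() {
        let class = classify(c);
        if current != Some(class) {
            // The class changed: close the previous run unless it was whitespace.
            if let Some(prev) = current {
                if prev != Class::Space {
                    out.push((&input[start..i], (start, i)));
                }
            }
            start = i;
            current = Some(class);
        }
    }
    // Close the final run, if any.
    if let Some(prev) = current {
        if prev != Class::Space {
            out.push((&input[start..], (start, input.len())));
        }
    }
    out
}

fn main() {
    for (text, (start, end)) in split_whitespace_sketch("Hello, world!") {
        println!("{text:?} @ {start}..{end}");
    }
}
```

Offsets here are byte offsets, which is what `char_indices()` yields, so multi-byte characters advance the index by more than one.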


📦 Integration Options

I'd be happy to:

  • Submit a PR replacing the current implementation

  • Or submit it alongside as WhitespaceFast for side-by-side evaluation


Thanks again for maintaining this fantastic library. Let me know your preferences and I’ll submit the PR accordingly! 🤗

Best,
AndriaK
