⚡ FastBunkai is a Python library that splits long Japanese and English texts into natural sentences. It offers an API that is highly compatible with megagonlabs/bunkai, which is written in pure Python, while its Rust core delivers roughly 40–285× faster segmentation than that original implementation. Existing bunkai users can switch by changing their import to `from fast_bunkai import FastBunkai as Bunkai`, which makes it a drop-in replacement.
## Table of Contents
- ✨ Highlights
- 🚀 Quick Start
- 🧰 CLI Examples
- 📊 Benchmarks
- 🧠 Architecture Snapshot
- 🛠️ Development Workflow
- 🧪 Testing & Quality Gates
- 🙏 Acknowledgements
- 📄 License
- 👤 Author

## ✨ Highlights

- 🔁 Drop-in replacement: mirrors the `FastBunkai` / `Bunkai` APIs and annotations, including Janome-based morphological spans.
- 🦀 Rust-powered core: heavy annotators (facemark, emoji, dot exceptions, indirect quotes, etc.) run inside a PyO3 module that releases the Python GIL.
- ⚡ Serious speed: the bundled benchmarks show 40×–285× faster segmentation than pure-Python bunkai (details below).
- 🧵 Thread-safe by design: no global mutable state; calling `FastBunkai` concurrently from threads or asyncio tasks is supported.
- 🛫 CLI parity: ships a `fast-bunkai` executable compatible with bunkai's pipe-friendly interface and `--ma` morphological mode.
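
The thread-safety point above can be exercised with a small sketch like the one below. It assumes only what the Quick Start shows, namely that a `FastBunkai()` instance can be called on a string and yields sentences. `split_many` and `load_splitter` are hypothetical helper names, and the regex fallback is a crude stand-in that is used only when fast-bunkai is not installed.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Any callable that takes a text and yields sentences.
Splitter = Callable[[str], Iterable[str]]


def split_many(splitter: Splitter, texts: List[str], max_workers: int = 4) -> List[List[str]]:
    """Split every text in a thread pool, sharing a single splitter instance."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(lambda text: list(splitter(text)), texts))


def load_splitter() -> Splitter:
    """Return FastBunkai() if available, otherwise a naive stand-in."""
    try:
        from fast_bunkai import FastBunkai  # import path as shown in the Quick Start

        return FastBunkai()
    except ImportError:
        # Stand-in only: splits after sentence-final punctuation and is
        # NOT equivalent to bunkai's rules.
        return lambda text: [s for s in re.split(r"(?<=[。!?!?])", text) if s]


texts = ["宿を予約しました。楽しみです。", "羽田から出発します。また行きたいな。"]
results = split_many(load_splitter(), texts)
for sentences in results:
    print(sentences)
```

Sharing one splitter across workers is the point of the sketch: since the library states it keeps no global mutable state, there should be no need to build one instance per thread.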

## 🚀 Quick Start

    uv pip install fast-bunkai

    from fast_bunkai import FastBunkai

    splitter = FastBunkai()
    text = "羽田から✈️出発して、友だちと🍣食べました。最高!また行きたいな😂でも、予算は大丈夫かな…?"
    for sentence in splitter(text):
        print(sentence)

Output:

    羽田から✈️出発して、友だちと🍣食べました。
    最高!
    また行きたいな😂
    でも、予算は大丈夫かな…?

## 🧰 CLI Examples

fast-bunkai provides the same pipe-friendly command-line interface as bunkai.

    echo -e '宿を予約しました♪!▁まだ2ヶ月も先だけど。▁早すぎかな(笑)楽しみです★\n2文書目です。▁改行を含みます。' \
      | uvx fast-bunkai

Output (sentence boundaries marked with │, newlines preserved via ▁):

    宿を予約しました♪!▁│まだ2ヶ月も先だけど。▁│早すぎかな(笑)│楽しみです★
    2文書目です。▁│改行を含みます。

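
When the CLI output is consumed by another program, the two marker characters can be undone in a few lines. This is a minimal sketch that assumes exactly the format shown above (│ between sentences, ▁ standing in for a newline, one document per line). `parse_cli_line` is a hypothetical helper, not part of fast-bunkai.

```python
from typing import List

BOUNDARY = "│"  # sentence-boundary marker printed by the CLI
NEWLINE = "▁"   # placeholder the CLI uses for a newline inside a document


def parse_cli_line(line: str) -> List[str]:
    """Turn one line of CLI output into a list of sentences."""
    parts = line.rstrip("\n").split(BOUNDARY)
    # Restore real newlines where the placeholder was used.
    return [part.replace(NEWLINE, "\n") for part in parts]


sample = "宿を予約しました♪!▁│まだ2ヶ月も先だけど。▁│早すぎかな(笑)│楽しみです★"
sentences = parse_cli_line(sample)
for sentence in sentences:
    print(repr(sentence))
```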
Morphological output is also available:

    echo -e '形態素解析し▁ます。結果を 表示します!' | uvx fast-bunkai --ma

Output:

    形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
    解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
    し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
    ▁
    EOS
    ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
    。 記号,句点,*,*,*,*,。,。,。
    EOS
    結果 名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
    を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
      記号,空白,*,*,*,*, ,*,*
    表示 名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
    し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
    ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
    ! 記号,一般,*,*,*,*,!,!,!
    EOS

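
The `--ma` output can be grouped back into sentences by splitting on the `EOS` lines. The sketch below assumes that the surface form and the feature list are separated by a tab, as in Janome's usual token format; the listing above does not make the separator visible, so treat that as an assumption. `parse_morph_output` is a hypothetical helper name.

```python
from typing import List, Tuple

# (surface, feature fields); the features list is empty for marker lines such as "▁".
Token = Tuple[str, List[str]]


def parse_morph_output(output: str) -> List[List[Token]]:
    """Group token lines into sentences, using "EOS" lines as terminators."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in output.splitlines():
        if not line:
            continue
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        # Assumption: a tab separates the surface from the feature fields.
        surface, _, features = line.partition("\t")
        # Note: a plain comma split would mis-handle a token whose surface is a comma.
        current.append((surface, features.split(",") if features else []))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "解析\t名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ\n"
    "し\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\n"
    "▁\n"
    "EOS\n"
    "ます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n"
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "EOS\n"
)
parsed = parse_morph_output(sample)
for tokens in parsed:
    print([surface for surface, _ in tokens])
```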
## 📊 Benchmarks

Reproduce the bundled benchmark suite (correctness check + timing vs. bunkai):

    uv run python scripts/benchmark.py --repeats 3 --jp-loops 100 --en-loops 100 --custom-loops 10

Latest local run (2025-10-11) reported:

| Corpus | Docs | bunkai (mean) | fast-bunkai (mean) | Speedup |
|---|---|---|---|---|
| Japanese | 200 | 253.92 ms | 5.55 ms | 45.72× |
| English | 200 | 209.77 ms | 4.94 ms | 42.48× |
| Long text* | 20 | 1330.95 ms | 4.67 ms | 285.10× |
*Long text corpus contains mixed Japanese/English paragraphs with emojis and edge cases; the Rust pipeline processes characters in a single pass, whereas pure Python bunkai stacks regex scans, so the gap widens dramatically on longer documents.
Actual numbers vary by hardware, but the Rust core consistently outperforms pure Python bunkai by an order of magnitude or more.
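
For readers who want to time their own corpus, a mean-time comparison of this kind can be sketched as follows. This is not `scripts/benchmark.py`; it is an independent illustration, and how the bundled script defines its "mean" is not stated here. `mean_ms`, `speedup`, and `naive_split` are hypothetical names, and `naive_split` exists only so the sketch runs without either library installed.

```python
import statistics
import time
from typing import Callable, Iterable, List

Splitter = Callable[[str], Iterable[str]]


def mean_ms(splitter: Splitter, docs: List[str], repeats: int = 3) -> float:
    """Mean wall-clock time in milliseconds for one pass over all docs."""
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            # Consume the result so lazy splitters do their work inside the timer.
            for _sentence in splitter(doc):
                pass
        runs.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(runs)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def naive_split(text: str) -> List[str]:
    """Stand-in splitter, used only to make this sketch runnable."""
    return [part + "。" for part in text.split("。") if part]


docs = ["宿を予約しました。楽しみです。"] * 200
baseline = mean_ms(naive_split, docs)
print(f"mean per pass: {baseline:.2f} ms")
```

To compare the two libraries, one would call `mean_ms` once with a bunkai splitter and once with a fast-bunkai splitter on the same `docs`, then pass both means to `speedup`.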

## 🧠 Architecture Snapshot

- 🦀 Rust core (`src/lib.rs`): facemark & emoji annotators, dot/number exceptions, indirect quote handling, and more. Uses PyO3 `abi3` bindings and releases the GIL with `py.allow_threads`.
- 😀 Emoji metadata (`src/emoji_data.rs`): generated via `scripts/generate_emoji_data.py`, mapping Unicode codepoints to bunkai-compatible categories.
- 🐍 Python layer (`fast_bunkai/`): wraps the Rust `segment` function, mirrors bunkai annotations with dataclasses, and builds Janome spans through `MorphAnnotatorJanome` for drop-in parity.
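
To make the role of the Python layer more concrete, the sketch below shows one way span-style annotations held in dataclasses could be turned into sentence strings. It is purely illustrative: the `Span` class, its field names, and `sentences_from_spans` are hypothetical and are not taken from the fast-bunkai or bunkai source.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation: a rule marked text[start_index:end_index] as a sentence end."""

    rule_name: str
    start_index: int
    end_index: int  # exclusive; the sentence is cut right after this position


def sentences_from_spans(text: str, spans: List[Span]) -> Iterator[str]:
    """Cut the text at every distinct span end, in order."""
    start = 0
    for end in sorted({span.end_index for span in spans}):
        if start < end <= len(text):
            yield text[start:end]
            start = end
    if start < len(text):
        # Whatever follows the last boundary is the final sentence.
        yield text[start:]


text = "宿を予約しました。楽しみです★"
spans = [Span(rule_name="period", start_index=8, end_index=9)]
result = list(sentences_from_spans(text, spans))
print(result)
```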

## 🛠️ Development Workflow

    uv sync --reinstall
    uv run python scripts/generate_emoji_data.py  # regenerate emoji table when emoji libs change
    uv run tox -e pytests,lint,typecheck,rust-fmt,rust-clippy

For manual Rust checks:

    cargo test face_mark_detection_matches_reference
    cargo fmt --all
    cargo clippy --all-targets -- -D warnings

## 🧪 Testing & Quality Gates

- ✅ pytest (`tests/test_compatibility.py`): ensures Japanese and English texts, emoji-heavy samples, and parallel execution match bunkai outputs.
- 🧹 Ruff: lint + format checks via `tox -e lint,format-check`.
- 🧠 Pyright: type-checks the Python API surface.
- 🧪 Rust unit tests: validate that annotator logic remains in sync with reference behaviour.
- 📈 Benchmarks: `scripts/benchmark.py` validates speed + correctness; normally executed in CI to avoid long local runs.

## 🙏 Acknowledgements

FastBunkai stands on the shoulders of the megagonlabs/bunkai project. Thank you very much!

## 📄 License

Apache License 2.0

## 👤 Author

Yuichi Tateno (@hotchpotch)