Skip to content

Latest commit

 

History

History
63 lines (54 loc) · 4.01 KB

File metadata and controls

63 lines (54 loc) · 4.01 KB

Blog Post Outline: How I Built the World’s Fastest Whitespace Trimmer

Working Title

  • How I Built the World’s Fastest Whitespace Trimmer

Hook & Motivation

  • lefthook-driven workflow; editor auto-trim usually good enough.
  • Recent surge of generated/AI-written code sneaking in trailing spaces and missing final newlines.
  • Existing solutions (bash script, Python pre-commit hook, npm package, other Rust tool) would work but feel unsatisfying.
  • Reading Bun’s performance blog post rekindled the idea: build something absurdly fast for fun and for the blog.

Background & Context

  • DocSpring focus: useful open source tools/blog posts for developers; no interest in marketing fluff.
  • Personal curiosity about algorithms/performance despite limited formal CS background.
  • Renamify boilerplate: copied repo to trim-trailing-whitespace, used Renamify to rewrite identifiers.
  • Initial cleanup: delete MCP server, VS Code extension, legacy docs, all original Rust code; keep Taskfiles, lefthook, CI scaffolding.

Early Ideas & Questions

  • Bloom filter cache: auto-sized, tuned on DocSpring repo; handle false positives via resizes.
  • Where to store cache? Options: temp dir, .git, OS cache dirs.
  • Compiling .gitignore: convert patterns into compact machine-friendly representation; rebuild on mtime changes/additions.
  • Per-file delta detection: subdivide files into segments; can we avoid reading every byte each run?
  • Quick research (ChatGPT): deleting arbitrary bytes without rewriting is mostly impossible; file rewrite is still best.

Defining the Scope

  • Deliverables: Rust core crate + CLI only, whitespace transformations only.
  • Respect Git’s ignore configuration (.gitignore, .git/info/exclude, global excludes).
  • No IDE/MCP integrations; project is a one-off experiment tied to the blog post.
  • Performance-first mindset: compile-time feature flags to toggle optimisations, benchmarking pipeline to show improvements.

Architecture Plan Highlights

  • whitespace-core: transcode/trim logic, line ending normalization, tab/space conversion, SIMD implementations.
  • whitespace-cli: Clap-based binary, recursive walker, binary detection, --check mode, feature toggles.
  • Cache design:
    • Cache paths per OS; repo ID via blake3-128 of canonical root + volume info.
    • Files: keys.bin (sorted u128 keys), meta.bin (header, OS journal checkpoints, repo stats), lock guard.
    • Key formula: blake3-128(path, dev, ino, size, mtime_ns, ctime_ns, mode).
    • Racy guard: rescan when mtime_sec == index_write_time_sec.
    • Change detection: macOS FSEvents, Windows USN Journal, Linux fallback walk.

Implementation Strategy

  • Build naive baseline implementation first (byte-by-byte, no cache, always rewrite).
  • Add SIMD and other optimisations behind cargo features.
  • Introduce compile-time feature matrix (baseline → simd → parallel walk → cache → mmap) for benchmarks.
  • Use hyperfine to benchmark each tier on /Users/ndbroadbent/code/docspring (cold vs warm cache).
  • Document results with charts comparing stages (baseline vs optimised).

Content Sections (Rough)

  1. Introduction + why the tool exists.
  2. Scaffolding and cleaning the repo.
  3. Exploring performance ideas in the car-side brain dump (Bloom filters, gitignore compilation, cache placement).
  4. Lessons from talking to AI (file rewrite reality, SIMD suggestions, other micro-optimisations).
  5. Designing the definitive cache (keys, meta, OS journals, racy guard).
  6. Naive baseline implementation and first measurements.
  7. Iterative optimisations with feature flags + benchmark plots.
  8. Integrating with lefthook and everyday workflow.
  9. Final thoughts (was it worth it? fun factor? what’s next optional, but keep minimal until work is done).

Notes & TODOs

  • Insert references/links: Bun performance blog, API client post, lefthook, DocSpring.
  • Collect benchmark data with hyperfine once implementations land.
  • Generate diagrams for cache structure (keys.bin, meta.bin).
  • Keep tone conversational, emphasise experimentation over production dogma.