
tinyprompt

A tiny, Unix-style CLI tool for semantic prompt compression using LLMLingua-2. It cuts LLM prompt tokens by 10-20x with >90% fidelity while preserving meaning, saving latency and money.

Install

# with uv (recommended)
uv pip install -e .
# or with pip
pip install -e .

Requires Python 3.9+.
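
A common setup flow, assuming you are working from a clone of the repository (the virtual environment step is optional but typical with uv):

# create and activate a fresh virtual environment, then install in editable mode
uv venv
source .venv/bin/activate
uv pip install -e .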

Quick start

# compress from stdin (default ratio 0.7)
echo "long prompt" | tinyprompt

# compress a file
tinyprompt --input prompt.txt

# compress from clipboard (macOS)
pbpaste | tinyprompt

# JSON output (prints {"compressed_prompt": ...})
tinyprompt --input prompt.txt --format json
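
The JSON form composes with standard tooling; for example, pulling out just the compressed text with jq (a sketch, assuming the compressed_prompt key shown above):

# compress, then extract the compressed text from the JSON envelope
tinyprompt --input prompt.txt --format json | jq -r '.compressed_prompt'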

Improved performance

# faster defaults on CPU
tinyprompt --input prompt.txt --fast --cpu

# keep a warm server running
tinyprompt --serve --port 8012
# forward CLI to the warm server
tinyprompt --input prompt.txt --server-url http://127.0.0.1:8012
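
Put together, a typical warm-server session might look like this (a sketch; the port, the sleep, and the file names are illustrative):

# start the warm server in the background and remember its PID
tinyprompt --serve --port 8012 &
SERVER_PID=$!
sleep 5   # crude wait for the model to finish loading (adjust as needed)

# subsequent calls skip model load by forwarding to the server
for f in prompts/*.txt; do
  tinyprompt --input "$f" --server-url http://127.0.0.1:8012
done

# shut the server down when finished
kill "$SERVER_PID"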

Common flags

  • --ratio FLOAT (default 0.7): target compression level
  • --target-tokens INT: aim for a fixed token budget (overrides --ratio; see the example after this list)
  • --input PATH|TEXT: file path or literal text (use - for stdin)
  • --format {text,json}: output format
  • --fast: speed-tuned defaults (works great on CPU)
  • --cpu: force CPU (ignore CUDA/MPS)
  • --threads INT: limit CPU threads
  • --cache-dir PATH: set HF cache location
  • --offline: use local cache only (no downloads)
  • Server: --serve, --port, --server-url
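
For example, pinning the output to a fixed budget while staying offline (the token count is illustrative):

# squeeze a prompt to roughly 500 tokens using only the local model cache
tinyprompt --input prompt.txt --target-tokens 500 --offline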

Environment variables

  • TINYPROMPT_MODEL – override model id
  • TINYPROMPT_RATIO – default ratio when --ratio not passed
  • TINYPROMPT_PORT – default port for --serve
  • HF_HOME, TRANSFORMERS_CACHE – Hugging Face cache dir
  • HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE – offline mode
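
These can be exported in your shell profile or set per invocation; for instance (values are illustrative):

# pin the model cache location and a default compression ratio
export HF_HOME="$HOME/.cache/huggingface"
export TINYPROMPT_RATIO=0.5
echo "long prompt" | tinyprompt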

Troubleshooting

  • First run is slow? Models download once to your HF cache. Reuse with --serve.
  • Port in use? Pick a different --port.
  • Need fully offline? Run once online to populate the cache, then use --offline (spelled out below).
  • Want fewer tokens, not a ratio? Use --target-tokens.
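
The offline flow, spelled out (the file name is illustrative):

# first run online to populate the Hugging Face model cache
tinyprompt --input prompt.txt

# later runs can skip the network entirely
tinyprompt --input prompt.txt --offline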

Dev

# quote the extra so zsh doesn't glob it
uv pip install -e '.[test]'
uv run pytest -q
