A tiny UNIX-style tool for prompt compression built on LLMLingua-2. It cuts token count while preserving meaning, reducing both latency and cost.
# with uv (recommended)
uv pip install -e .
# or with pip
pip install -e .
Requires Python 3.9+.
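To sanity-check the install, you can ask the CLI for its help text. This assumes a conventional `--help` flag is available (it is not listed in the flag reference below, so treat it as an assumption):

```sh
# quick smoke test after installation; --help is assumed to exist
tinyprompt --help
```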
# compress from stdin (default ratio 0.7)
echo "long prompt" | tinyprompt
# compress a file
tinyprompt --input prompt.txt
# compress from clipboard (macOS)
pbpaste | tinyprompt
# JSON output (prints {"compressed_prompt": ...})
tinyprompt --input prompt.txt --format json
# faster defaults on CPU
tinyprompt --input prompt.txt --fast --cpu
# keep a warm server running
tinyprompt --serve --port 8012
# forward CLI to the warm server
tinyprompt --input prompt.txt --server-url http://127.0.0.1:8012
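For scripting, the JSON output shown above can be piped through `jq` to pull out just the compressed text. This sketch assumes `jq` is installed and relies on the flat `{"compressed_prompt": ...}` shape documented above:

```sh
# extract only the compressed text from the JSON output (requires jq)
tinyprompt --input prompt.txt --format json | jq -r '.compressed_prompt'
```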
- `--ratio FLOAT` (default 0.7): target compression level
- `--target-tokens INT`: aim for a fixed token budget (overrides ratio)
- `--input PATH|TEXT`: file path or literal text (use `-` for stdin)
- `--format {text,json}`: output format
- `--fast`: speed-tuned defaults (works great on CPU)
- `--cpu`: force CPU (ignore CUDA/MPS)
- `--threads INT`: limit CPU threads
- `--cache-dir PATH`: set HF cache location
- `--offline`: use local cache only (no downloads)
- Server: `--serve`, `--port`, `--server-url`
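As a sketch of how these flags compose, the following run targets a fixed token budget with CPU-friendly settings and JSON output; the numeric values are only illustrative:

```sh
# aim for roughly 500 output tokens, force CPU with 4 threads, emit JSON
tinyprompt --input prompt.txt --target-tokens 500 --fast --cpu --threads 4 --format json
```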
- `TINYPROMPT_MODEL` – override model id
- `TINYPROMPT_RATIO` – default ratio when `--ratio` not passed
- `TINYPROMPT_PORT` – default port for `--serve`
- `HF_HOME`, `TRANSFORMERS_CACHE` – Hugging Face cache dir
- `HF_HUB_OFFLINE`, `TRANSFORMERS_OFFLINE` – offline mode
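For per-session configuration, these variables can be exported before invoking the CLI; the values below are placeholders, not recommended defaults:

```sh
# point the Hugging Face cache at an explicit location and lower the default ratio
export HF_HOME="$HOME/.cache/huggingface"
export TINYPROMPT_RATIO=0.5
tinyprompt --input prompt.txt
```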
- First run is slow? Models download once to your HF cache. Reuse with `--serve`.
- Port in use? Pick a different `--port`.
- Need fully offline? Run once online, then use `--offline`.
- Want fewer tokens, not a ratio? Use `--target-tokens`.
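Putting the offline advice together, one possible sequence (file names are placeholders) is to warm the cache online once, then pin later runs to the local cache:

```sh
# 1) while online: first run downloads the model into the HF cache
tinyprompt --input prompt.txt

# 2) afterwards: stay fully offline, using only the cached model
HF_HUB_OFFLINE=1 tinyprompt --input prompt.txt --offline
```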
uv pip install -e .[test]
uv run pytest -q