Intelligent Compression Proxy for Large Language Model APIs
KrunchWrapper is a sophisticated, high-performance compression proxy that acts as a middleman between your applications and LLM APIs. It intelligently compresses prompts using dynamic analysis to reduce token count, forwards requests to target LLMs, and decompresses responses - all while maintaining full OpenAI API compatibility.
- Content-Agnostic Analysis: Analyzes each prompt on-the-fly to find the most valuable compression patterns
- Model-Aware Validation: Uses correct tokenizers (tiktoken, transformers, SentencePiece) to ensure real token savings
- Multi-Pass Optimization: Advanced compression with up to 3 optimization passes for maximum efficiency
- Conversation State Management: Maintains compression context across conversation turns for improved efficiency
- OpenAI-Compatible: Drop-in replacement for the OpenAI API - just change the `base_url`
- Multi-Provider Support: Works with any OpenAI-compatible API (LocalAI, Ollama, etc.)
- Native Anthropic Support: Direct Claude API integration with native format support
- Intelligent Interface Detection: Auto-detects Cline, WebUI, SillyTavern, and Anthropic requests
- Streaming Support: Full support for both streaming and non-streaming responses
- Multiple Endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models` (see the curl example after this list)
- Async Logging: 1000x performance improvement with non-blocking logging system
- Persistent Token Cache: Intelligent caching with automatic cleanup and memory management
- Optimized Model Validation: 95%+ faster cached validations with thread-safe operations
- Adaptive Threading: Multi-threaded compression analysis with intelligent thread scaling
- Comment Stripping: Optional removal of code comments with language-specific safety rules
- Tool Call Protection: Automatically preserves JSON tool calls and structured data
- Markdown Preservation: Maintains formatting for tables, lists, and links
- System Prompt Intelligence: Advanced system prompt interception and merging
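To illustrate the drop-in compatibility, the sketch below sends a plain OpenAI-style request straight to the proxy with curl; the model name is a placeholder for whatever model your backend serves:

curl -X POST http://localhost:5002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'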
# Clone the repository
git clone https://github.com/thad0ctor/KrunchWrapper.git
cd KrunchWrapper
# Run the installation script
./install.sh # Linux/Mac
# or
.\install.ps1 # Windows
# Start the server (automatically starts on port 5002)
./start.sh # Linux/Mac
# or
.\start.ps1 # Windows
# This will start both the KrunchWrapper server and the WebUI
# Server: http://localhost:5002
# WebUI: http://localhost:5173
import openai
# Point to your KrunchWrapper server
client = openai.OpenAI(
base_url="http://localhost:5002/v1",
api_key="dummy-key" # Not used but required by the client
)
# Use exactly like a regular OpenAI client
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a Python function to calculate Fibonacci numbers."}
]
)
print(response.choices[0].message.content)
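Streaming works the same way. A minimal sketch with the client above, using the standard `stream=True` flag of the OpenAI Python SDK (v1.x):

# Stream the response chunk by chunk; decompression is applied transparently
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain memoization in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)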
Quick Setup for Cline Users:

- Configure Cline Settings - Create/edit `.vscode/settings.json` in your project root:

  {
    "cline.anthropicBaseUrl": "http://localhost:5002",
    "cline.anthropicApiKey": "sk-ant-your-actual-anthropic-api-key-here"
  }

  File Location Example:

  your-project/
  ├── .vscode/
  │   └── settings.json   ← Create this file here
  ├── src/
  └── README.md

- Start KrunchWrap - Run the server (default port 5002):

  ./start.sh    # Linux/Mac
  .\start.ps1   # Windows

- Use Cline Normally - KrunchWrap automatically:
  - Detects Cline requests via auto-detection
  - Compresses prompts before sending to Anthropic
  - Decompresses responses back to Cline
  - Saves 15-40% tokens on every request
Key Points:
- Port: Use `5002` (the KrunchWrap server port)
- No `/v1/messages`: Don't add endpoint paths to the base URL
- Real API Key: Replace with your actual `sk-ant-...` Anthropic key
- Auto-Detection: No manual configuration needed - works automatically!
Troubleshooting:
- Not seeing requests in the terminal? Set `"log_level": "DEBUG"` in `config/server.jsonc`
- Still no activity? Check that your API key starts with `sk-ant-` and restart VS Code
- 404 errors? Restart the KrunchWrap server after adding the Anthropic integration
For non-Cline usage, KrunchWrap provides native Anthropic API support:
import anthropic
# Point to KrunchWrapper for automatic compression
client = anthropic.Anthropic(
api_key="your-anthropic-api-key",
base_url="http://localhost:5002" # KrunchWrap proxy URL
)
# Native Anthropic API format with automatic compression
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
system="You are a helpful coding assistant.",
messages=[
{"role": "user", "content": "Write a Python function to calculate factorial."}
],
max_tokens=1024
)
Features:
- Auto-Detection: Automatically detects Anthropic API requests
- Native Format: Supports Anthropic's system parameter structure
- Full Compression: 15-40% token savings with Claude models
- Streaming Support: Real-time response streaming
- Multiple Interfaces: Works with the SDK, raw HTTP requests, and other frontends
See `documentation/ANTHROPIC_INTEGRATION.md` for the complete usage guide.
KrunchWrapper includes pre-configured setups for common scenarios. Simply edit `config/server.jsonc` and uncomment the configuration you want to use:
Perfect for LM Studio, Ollama, Text Generation WebUI, vLLM, LocalAI, etc.
Flow: Client → KrunchWrap (compression) → Local Server → External APIs
Configuration: Already active in `config/server.jsonc`
{
"target_host": "localhost",
"target_port": 1234, // Change to match your server
"target_use_https": false,
"api_key": ""
}
Common Local Server Ports:
- LM Studio: `1234`
- Ollama: `11434`
- Text Generation WebUI: `5000` or `7860`
- vLLM: `8000`
- LocalAI: `8080`
Client Setup Options:
- Embedded WebUI: `http://localhost:5173` (starts automatically - recommended for beginners!)
- SillyTavern: API URL = `http://localhost:5002/v1`
- Cline: Use the OpenAI provider with `http://localhost:5002/v1`
Perfect for Cline with direct Anthropic API access.
Flow: Cline → KrunchWrap (compression) → api.anthropic.com
Status: ✅ Fully tested and debugged - compression fix implemented
Configuration: In `config/server.jsonc`, comment out the localhost config and uncomment:
{
// "target_host": "api.anthropic.com",
// "target_port": 443,
// "target_use_https": true,
// "api_key": "sk-ant-your-actual-anthropic-api-key-here"
}
Cline Setup (`.vscode/settings.json`):
{
"cline.anthropicBaseUrl": "http://localhost:5002",
"cline.anthropicApiKey": "sk-ant-your-actual-anthropic-api-key-here"
}
The direct Anthropic integration required significant debugging and fixes. Direct OpenAI integration may have similar issues that need to be resolved.
Use at your own risk - may not work properly without additional development.
For reliable OpenAI access, use the Local Server setup with your local proxy.
Theoretical Configuration: In `config/server.jsonc`:
{
// "target_host": "api.openai.com", // NOT TESTED
// "target_port": 443, // MAY NOT WORK
// "target_use_https": true, // EXPERIMENTAL
// "api_key": "sk-your-actual-openai-api-key-here"
}
Additional configurations in `config/server.jsonc` are not implemented or tested:
- Google Gemini: Would need custom endpoint handlers and testing
- DeepSeek: Would need testing and possible custom handling
- Custom Remote Servers: Only works if server uses OpenAI-compatible format
To actually implement these, see `documentation/EXTENDING_KRUNCHWRAP.md` for the development guide.
KrunchWrap includes a built-in browser-based chat interface that automatically gets compression benefits:
Quick Start:
- Run `./start.sh` (Linux/Mac) or `.\start.ps1` (Windows)
- Open `http://localhost:5173` in your browser
- Start chatting with automatic 15-40% token compression!
Features:
- Responsive design (works on desktop and mobile)
- Automatic compression on all messages
- Built-in settings panel
- Modern React-based interface
- No external client configuration needed
Flow: Browser → WebUI (5173) → KrunchWrap (5002) → Your Local Server
- Open `config/server.jsonc`
- Comment out the currently active configuration (add `//` before each line)
- Uncomment your desired configuration (remove `//` from each line)
- Update any specific values (ports, API keys, etc.)
- Restart KrunchWrap: `./start.sh` or `.\start.ps1`
KrunchWrapper can be configured via command line arguments, environment variables, or JSON configuration files.
{
"host": "0.0.0.0",
"port": 5002,
"target_host": "localhost",
"target_port": 1234,
"min_compression_ratio": 0.05,
"api_key": "your-llm-api-key",
"verbose_logging": false,
"file_logging": true,
"log_level": "INFO"
}
{
"compression": {
"min_characters": 250,
"threads": 4,
"min_token_savings": 1,
"min_compression_ratio": 0.05,
"aggressive_mode": false,
"large_file_threshold": 5000,
"cline_preserve_system_prompt": true,
"selective_tool_call_compression": true
},
"dynamic_dictionary": {
"enabled": true,
"compression_threshold": 0.01,
"multipass_enabled": true,
"max_passes": 3
},
"comment_stripping": {
"enabled": true,
"preserve_license_headers": true,
"preserve_shebang": true,
"preserve_docstrings": true
},
"conversation_compression": {
"kv_cache_threshold": 20
},
"streaming": {
"preserve_sse_format": true,
"validate_json_chunks": true,
"cline_compatibility_mode": true
},
"model_tokenizer": {
"custom_model_mappings": {
"qwen3": ["qwen3", "qwen-3", "your-custom-qwen3-variant"]
}
},
"logging": {
"verbose": true,
"console_level": "DEBUG"
}
}
| Variable | Default | Description |
|---|---|---|
| `KRUNCHWRAPPER_PORT` | `5002` | Server port |
| `KRUNCHWRAPPER_HOST` | `0.0.0.0` | Server host |
| `LLM_API_URL` | `http://localhost:1234/v1` | Target LLM API URL |
| `MIN_COMPRESSION_RATIO` | `0.05` | Minimum compression ratio |
| `KRUNCHWRAPPER_VERBOSE` | `false` | Enable verbose logging |
| `KRUNCHWRAPPER_FILE_LOGGING` | `false` | Enable file logging |
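For example, the same settings can be supplied through the environment before starting the server (the values shown are illustrative):

export KRUNCHWRAPPER_PORT=5002
export KRUNCHWRAPPER_HOST=0.0.0.0
export LLM_API_URL=http://localhost:1234/v1
export MIN_COMPRESSION_RATIO=0.05
export KRUNCHWRAPPER_VERBOSE=true
./start.sh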
KrunchWrapper provides comprehensive performance metrics for every request:
Performance Metrics (chat/completions):
t/s (avg): 42.5 // Average tokens per second
pp t/s: 150.2 // Prompt processing tokens per second
gen t/s: 35.8 // Generation tokens per second
compression %: 15.3% // Content size reduction
compression tokens used: 89 // Tokens saved through compression
total context used: 1,847 // Total tokens consumed
input tokens: 1,245 // Input tokens
output tokens: 602 // Generated tokens
total time: 1.85s (prep: 0.08s, llm: 1.77s) // Timing breakdown
Enable detailed content logging to see exactly what's being compressed:
Verbose Logging (chat/completions):
================================================================================
ORIGINAL MESSAGES:
[user] Here's some Python code that needs optimization...
COMPRESSED MESSAGES:
[user] Here's α code β optimization...
LLM RESPONSE:
Great question! Here are several ways to optimize your Python code...
================================================================================
KrunchWrapper automatically:
- Analyzes Content: Identifies repeated patterns, tokens, and structures
- Generates Symbols: Assigns optimal Unicode symbols from priority-based pools
- Validates Efficiency: Uses model-specific tokenizers to ensure real token savings
- Adds Decoder: Includes minimal decompression instructions only when beneficial
- Decompresses Responses: Restores original tokens in responses seamlessly
- Normal Mode (250-999 characters): Token-optimized compression prioritizing actual token savings
- Aggressive Mode (1000+ characters): Character-optimized compression for maximum reduction
- Multipass Mode: Up to 3 optimization passes for complex content
- 20-30% compression for typical source code files
- 40-50% compression for files with repeated patterns
- 10-15% compression for unique/generated content
- 30-60% additional savings with comment stripping enabled
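Conceptually, the symbol substitution at the heart of these steps works like the simplified sketch below. This is an illustration only, not KrunchWrapper's actual code; the real engine chooses patterns and symbols dynamically per prompt and validates them with the model's tokenizer:

# Simplified illustration of dictionary-based symbol substitution
substitutions = {"def calculate_fibonacci": "α", "return": "β"}  # chosen per prompt in practice

def compress(text: str, subs: dict) -> str:
    for pattern, symbol in subs.items():
        text = text.replace(pattern, symbol)
    return text

def decompress(text: str, subs: dict) -> str:
    for pattern, symbol in subs.items():
        text = text.replace(symbol, pattern)
    return text

prompt = "def calculate_fibonacci(n): ..."
compressed = compress(prompt, substitutions)      # shorter prompt goes to the LLM
restored = decompress(compressed, substitutions)  # LLM output is mapped back for the client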
KrunchWrapper automatically detects your model and uses the appropriate tokenizer for accurate token counting:
| Model Family | Detection Patterns | Tokenizer Library | Examples |
|---|---|---|---|
| OpenAI | `gpt-4`, `gpt-3.5`, `turbo` | tiktoken | `gpt-4`, `gpt-3.5-turbo`, `openai/gpt-4` |
| Anthropic | `claude`, `anthropic` | SentencePiece | `claude-3-5-sonnet`, `anthropic/claude-3-haiku` |
| LLaMA | `llama`, `llama2`, `llama-3` | SentencePiece/tiktoken | `meta-llama/Llama-3-8B-Instruct`, `llama-2-7b` |
| Mistral | `mistral`, `mixtral` | SentencePiece | `mistralai/Mistral-7B-Instruct`, `mixtral-8x7b` |
| Qwen | `qwen`, `qwen2`, `qwen3` | tiktoken | `Qwen/Qwen2.5-Coder-32B-Instruct`, `qwen-7b` |
| Google | `gemini`, `bard`, `palm` | SentencePiece | `google/gemini-pro`, `palm2` |
| Others | `yi-`, `deepseek`, `phi-` | Various | `01-ai/Yi-34B-Chat`, `deepseek-coder`, `microsoft/phi-2` |
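The validation idea can be sketched as follows. This is illustrative only (the real logic lives in `model_tokenizer_validator.py` and covers more families and fallbacks), and the Hugging Face checkpoint name is just an example taken from the table above:

# Pick a model-appropriate tokenizer for token counting, else fall back to estimation
def count_tokens(text: str, model: str) -> int:
    name = model.lower()
    if any(p in name for p in ("gpt-4", "gpt-3.5", "turbo")):
        import tiktoken
        return len(tiktoken.encoding_for_model("gpt-4").encode(text))
    if any(p in name for p in ("llama", "mistral")):
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")
        return len(tok.encode(text))
    return max(1, len(text) // 4)  # rough character-based estimate as a last resort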
Step-by-Step Guide:

- Edit your configuration file (`config/config.jsonc`):

  {
    "model_tokenizer": {
      "custom_model_mappings": {
        "gpt-4": ["my-custom-gpt", "exactly-a"],
        "claude": ["my-claude", "internal-assistant"],
        "llama": ["local-llama", "company-llm"]
      }
    }
  }

- Restart KrunchWrapper to reload the configuration:

  ./start.sh   # Linux/Mac
  # or
  .\start.ps1  # Windows

- Verify in the logs that the custom mappings are loaded:

  INFO - Loading 3 custom model mappings
  INFO - Extended gpt-4 patterns with: ['my-custom-gpt', 'exactly-a']

- Test your model detection:

  # Your API calls with custom model names will now work:
  curl -X POST http://localhost:5002/v1/chat/completions \
    -d '{"model": "my-custom-gpt", "messages": [...]}'
❌ Common Issue: `WARNING - Unknown model family for: a`
This warning appears when your model name doesn't match any supported patterns. The system falls back to character-based estimation, which still works but is less accurate.
✅ Solutions:
- Check your API configuration - Ensure you're sending a real model name like `gpt-4` instead of generic names like `"a"`
- Verify provider settings - Many providers allow setting the model name in environment variables or config files
- Add custom patterns - You can extend model detection in `config/config.jsonc`:
{
"model_tokenizer": {
"custom_model_mappings": {
"gpt-4": ["my-custom-gpt", "company-model"],
"claude": ["my-claude", "internal-assistant"],
"generic_model": ["exactly-a", "model-v1"]
}
}
}
- Patterns are matched as case-insensitive substrings
- Use specific patterns to avoid false matches (e.g., `"exactly-a"` instead of `"a"`)
- Pattern `"a"` would incorrectly match `"llama"`, `"claude"`, etc.
- All patterns are automatically converted to lowercase
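To see why short patterns are risky, matching is (per the notes above) essentially a case-insensitive substring test:

patterns = ["exactly-a", "my-custom-gpt"]
model_name = "My-Custom-GPT-2024"
matched = any(p.lower() in model_name.lower() for p in patterns)  # True
# A bare pattern like "a" would also match "llama" or "claude" - hence the advice above.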
Expected Model Names:
- ✅ `gpt-4`, `claude-3-5-sonnet`, `llama-3-8b-instruct`
- ❌ `a`, `model`, `llm`, `ai`
Language-aware comment removal with safety features:
- Multi-language support: Python, JavaScript, C/C++, HTML, CSS, SQL, Shell
- Smart preservation: License headers, shebangs, docstrings
- Significant savings: 30-60% token reduction on heavily commented code
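For example, with the default settings above, comment stripping would turn an illustrative snippet like this (not actual KrunchWrapper output):

#!/usr/bin/env python3
"""Utility helpers."""
# helper used by the CLI entry point
def add(a, b):
    return a + b  # simple sum

into this, with the shebang and docstring preserved and the inline comments removed:

#!/usr/bin/env python3
"""Utility helpers."""
def add(a, b):
    return a + b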
Advanced system prompt processing:
- Multi-source interception: Handles various prompt formats and sources
- Intelligent merging: Priority-based combination of user and compression instructions
- Format conversion: Seamless transformation between ChatML, Claude, Gemini formats
- Cline integration: Specialized handling for Cline development tool requests
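As a rough illustration of the format conversion involved (simplified request bodies, not the exact payloads KrunchWrapper produces), the same system prompt is carried differently by ChatML/OpenAI-style and Anthropic-native requests:

# ChatML / OpenAI-style: the system prompt travels as a message
openai_style = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi"},
    ]
}

# Anthropic-native: the system prompt is a top-level parameter
anthropic_native = {
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 1024,
}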
- Anthropic Integration: `documentation/ANTHROPIC_INTEGRATION.md` - Native Claude API support guide
- Logging Guide: `documentation/LOGGING_GUIDE.md` - Complete logging configuration guide
- Debug Categories: `documentation/DEBUG_CATEGORIES.md` - Fine-grained debug logging control
- Model Tokenizer Setup: How to Use Custom Model Mappings
- Troubleshooting: Model Detection Issues
- Configuration: `config/README.md` - Detailed configuration guide
- API Reference: `api/README.md` - Complete API documentation
- Architecture: `charts/README.md` - System flow diagrams
- Fix "Unknown model family" warning: Add custom patterns in
config/config.jsonc
- Test configuration: Run
python tests/test_case_insensitive_tokenizer.py
- Monitor performance: Enable verbose logging to see compression stats
- Troubleshoot compression: Check logs for compression ratios and token savings
- `api/`: FastAPI server and request handling
- `core/`: Compression engine and intelligence modules
- `config/`: Configuration files and schemas
- `dictionaries/`: Priority-based symbol pools for compression
- `documentation/`: Detailed feature documentation
- `charts/`: System flow diagrams and architecture charts
- `tests/`: Comprehensive test suite
- `utils/`: Analysis and debugging utilities
- Dynamic Analysis: `dynamic_dictionary.py` - On-the-fly pattern analysis
- Compression Engine: `compress.py` - Main compression orchestration
- System Prompts: `system_prompt_interceptor.py` - Intelligent prompt handling
- Model Validation: `model_tokenizer_validator.py` - Accurate token counting
- Performance: `async_logger.py`, `persistent_token_cache.py` - High-performance utilities
- Enabled by default for 1000x performance improvement
- Environment detection: Smart defaults for development vs production
- 100,000+ messages/second throughput capability
- Result caching: 95%+ faster for repeated validations
- Batch operations: Efficient processing of multiple validations
- Thread safety: Proper locking with no performance penalty
- Intelligent caching: Automatic cleanup and memory management
- Disk persistence: Survives server restarts
- Statistics monitoring: Built-in performance tracking
KrunchWrapper: Making LLM APIs more efficient, one token at a time.
If you get this error when running `.\start.ps1` or `./start.sh`, it means the virtual environment isn't properly activated. This has been fixed in recent versions, but if you encounter it:
Solution:
- Run the install script first: `.\install.ps1` (Windows) or `./install.sh` (Linux/Mac)
- Verify installation: The install script now includes dependency verification
- Try starting again: `.\start.ps1` or `./start.sh`
Manual verification (if needed):
# Activate virtual environment manually
.venv\Scripts\Activate.ps1 # Windows PowerShell
# or
source .venv/bin/activate # Linux/Mac
# Test dependencies
python -c "import uvicorn, fastapi; print('β
Dependencies OK')"
# Start server using the provided scripts
./start.sh # Linux/Mac
# or
.\start.ps1 # Windows
This is normal behavior! The start script:
- ✅ Shows startup message
- ✅ Creates a separate window where the server runs
- ✅ Returns control to your original terminal
Look for a new PowerShell/Terminal window where the actual services are running.
If only the server (port 5002) starts but not the WebUI (port 5173):
- Check Node.js: `node --version` and `npm --version`
- Install WebUI deps: `cd webui && npm install`
- Start manually: `cd webui && npm run dev`
Symptoms:
- API requests fail when sending very short messages (1-5 characters)
- Logs show "KV cache threshold: 0 chars"
- Compression disabled due to "poor efficiency trend"
Cause: KV cache optimization is disabled when threshold is set to 0
Solution:
// In config/config.jsonc
"conversation_compression": {
"kv_cache_threshold": 20 // Enable KV cache for messages < 20 chars
}
Symptoms:
- Logs show "tiktoken not available", "transformers not available"
- All tokenizer validation falls back to character estimation
- Poor compression efficiency calculations
Solution:
# Install required tokenizer libraries
source .venv/bin/activate
pip install tiktoken transformers sentencepiece
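You can confirm the libraries import cleanly from the same environment with a quick check, mirroring the dependency check shown earlier:

python -c "import tiktoken, transformers, sentencepiece; print('Tokenizers OK')"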
Symptoms:
- Logs show "Unknown model family for: [model_name]"
- Model-specific tokenization falls back to generic methods
Solution:
Add custom model mappings in `config/config.jsonc`:
"model_tokenizer": {
"custom_model_mappings": {
"qwen3": ["qwen3", "qwen-3", "your-custom-qwen3-variant"]
}
}
Symptoms:
- Responses not showing up in Cline/Cursor
- "Unexpected API Response" errors
- SSE streaming failures
Solution:
Ensure proper configuration in `config/config.jsonc`:
"compression": {
"cline_preserve_system_prompt": true,
"selective_tool_call_compression": true
},
"streaming": {
"preserve_sse_format": true,
"validate_json_chunks": true,
"cline_compatibility_mode": true
}
For maximum compression:
- Set `"kv_cache_threshold": 0` to disable KV cache
- Enable `"multi_pass_adaptive": true` for advanced compression
- Increase `"max_dictionary_size": 300` for larger dictionaries

For maximum speed:
- Set `"kv_cache_threshold": 30` for aggressive KV cache usage
- Enable `"smart_decompression": true` for faster streaming
- Use `"min_compression_ratio": 0.05` to skip marginal compression
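Put together, a speed-oriented block in `config/config.jsonc` might look like the sketch below; the placement of `smart_decompression` under `streaming` is an assumption, so match it to wherever that key actually lives in your config:

{
  "conversation_compression": {
    "kv_cache_threshold": 30        // aggressive KV cache usage for short messages
  },
  "compression": {
    "min_compression_ratio": 0.05   // skip compression when savings are marginal
  },
  "streaming": {
    "smart_decompression": true     // assumed section - faster streaming decompression
  }
}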
Enable verbose logging to troubleshoot issues:
"logging": {
"verbose": true,
"console_level": "DEBUG"
}
Common debug patterns to look for:
- `[KV CACHE]` - KV cache optimization triggers
- `Dynamic compression` - Compression analysis
- `❌ Error in` - System errors requiring attention
- `⚠️ WARNING` - Non-critical issues that may affect performance