Levy

Levy is a semantic caching engine for LLM APIs, designed as a research prototype for a Computer Science Capstone project. It sits between your application and an LLM provider (like OpenAI) to optimize costs and latency by reusing responses for identical or semantically similar prompts.

Features

  • Exact Match Caching: Extremely fast retrieval for identical prompts.
  • Semantic Caching: Uses vector embeddings (via sentence-transformers) to find and reuse answers for semantically similar queries (e.g., "What is the capital of France?" vs "Tell me France's capital"); see the sketch below this list.
  • Metrics: Automatically tracks cache hit rates, latency and estimated token savings.
  • Pluggable Architecture: Easy to swap LLM providers or Vector Stores.
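
A semantic cache hit comes down to comparing prompt embeddings against a similarity threshold. A minimal sketch of that idea with sentence-transformers (the model name is illustrative and the threshold is taken from the Configuration example below; Levy's actual defaults may differ):

from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model, not necessarily Levy's default.
model = SentenceTransformer("all-MiniLM-L6-v2")

cached_prompt = "What is the capital of France?"
new_prompt = "Tell me France's capital"

# Encode both prompts and compare them with cosine similarity.
emb_cached, emb_new = model.encode([cached_prompt, new_prompt])
score = util.cos_sim(emb_cached, emb_new).item()

if score >= 0.85:  # same value as similarity_threshold in the Configuration example
    print(f"Semantic hit (similarity={score:.2f}) - reuse the cached answer")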

Project Structure

levy/
├── levy/               # Core package
│   ├── cache/          # Cache logic (Exact, Semantic, Store)
│   ├── llm_client.py   # LLM interaction (Mock, OpenAI)
│   ├── embeddings.py   # Vector embedding logic
│   ├── engine.py       # Main orchestration engine
│   └── models.py       # Data classes
├── examples/           # Demo scripts
└── tests/              # Unit tests

Installation

Using Conda (Recommended)

  1. Ensure you have Conda installed.
  2. Create the environment:
    conda env create -f environment.yml
  3. Activate the environment:
    conda activate levy

Usage

Quick Start (Python)

from levy import LevyEngine, LevyConfig

# Initialize with defaults (Mock LLM, Exact Cache only)
engine = LevyEngine()

# First call - hits the "LLM"
result1 = engine.generate("Hello world")
print(result1.source) # 'llm'

# Second call - hits the cache
result2 = engine.generate("Hello world")
print(result2.source) # 'exact_cache'

Running the Experiment Script

A replay script is provided to demonstrate the cache behavior:

python examples/simple_replay.py

It runs a sequence of prompts through three configurations (sketched in code after this list):

  1. No Cache
  2. Exact Cache Only
  3. Semantic Cache (uses sentence-transformers if available)
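
In LevyConfig terms, the three configurations look roughly like this (enable_exact_cache is a guessed field name; only enable_semantic_cache is documented in the Configuration section, so check examples/simple_replay.py for the flags the script actually uses):

from levy import LevyConfig

# enable_exact_cache is an assumed name for the exact-match toggle;
# only enable_semantic_cache appears in the Configuration section below.
no_cache = LevyConfig(enable_exact_cache=False, enable_semantic_cache=False)
exact_only = LevyConfig(enable_exact_cache=True, enable_semantic_cache=False)
semantic = LevyConfig(enable_exact_cache=True, enable_semantic_cache=True)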

Running with Ollama (Local Models)

  1. Install and run Ollama.
  2. Pull required models:
    ollama pull llama3.2
    ollama pull mxbai-embed-large
  3. Run the demo (a configuration sketch follows these steps):
    python examples/ollama_demo.py
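
To wire the engine to Ollama from your own code, the configuration would look something like this (the field names besides enable_semantic_cache are assumptions; examples/ollama_demo.py shows the actual wiring):

from levy import LevyEngine, LevyConfig

# Assumed field names for provider and model selection.
config = LevyConfig(
    llm_provider="ollama",
    llm_model="llama3.2",                 # chat model pulled in step 2
    embedding_model="mxbai-embed-large",  # embedding model pulled in step 2
    enable_semantic_cache=True,
)

# Passing a config to the constructor is assumed from the LevyConfig docs below.
engine = LevyEngine(config)
print(engine.generate("Hello from Ollama").source)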

Using Redis Stack (Docker)

To use Redis for persistence:

  1. Start Redis:
    docker-compose up -d
  2. Configure LevyConfig to use cache_store_type="redis" (see the example below).
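
For example (only the cache_store_type field is documented here; any Redis connection settings would come from your docker-compose defaults or additional LevyConfig fields):

from levy import LevyConfig

config = LevyConfig(cache_store_type="redis")  # from step 2 above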

Configuration

You can configure Levy via LevyConfig:

config = LevyConfig(
    llm_provider="openai",
    openai_api_key="sk-...",
    enable_semantic_cache=True,
    similarity_threshold=0.85
)
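
Assuming the engine accepts a config object (the Quick Start only shows the default constructor), usage looks like:

engine = LevyEngine(config)
result = engine.generate("What is the capital of France?")
print(result.source)  # 'llm' on the first call, a cache source afterwards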

License

Apache-2.0
