Clippy is a FastAPI service and script suite for style-aware visual retrieval backed by Qdrant. It accepts a sketch and/or text prompt, runs hybrid similarity search over image/edge/text embeddings, and returns reference images. Generative image models typically perform much better when references are provided, especially in style-transfer workflows (transferring one art style to another), where references help the model capture the differences between styles. The project aims to refine the generative image workflow by making it easy for artists to retrieve references across art styles and use them to generate better images, without requiring any prior knowledge of prompt engineering.
| Layer | Primary | Notes / Optional |
|---|---|---|
| Language & Runtime | Python 3.10+ | venv/uv recommended |
| API Framework | FastAPI, Uvicorn | Pydantic for schema validation |
| Embeddings | PyTorch, open-clip-torch | Pillow, opencv-python for I/O & edge maps |
| Vector Database | Qdrant Server (120,000 images of digital artworks), qdrant-client | HNSW index; supports named vectors |
| Storage | Filesystem under IMAGES_ROOT | Image paths stored in Qdrant payloads |
| Config | python-dotenv | .env in repo root |
| Observability/Utils | tqdm (progress), logging via Uvicorn/stdlib | |
| Testing & Tools | pytest, httpie/curl, VS Code REST Client | |
| Frontend Client | Vite + React + TypeScript | Tailwind CSS, shadcn/ui, lucide-react, Fabric.js for sketch UI |
| Edit/Gen | Vertex AI / SDXL workflow | |
This is the easiest way to get started with Clippy. The only prerequisites are Docker and Docker Compose.
Create a .env file in the repository root. The docker-compose.yml file is configured to load this file, so all environment variables for the Docker setup should be managed here.
You can copy the example below, but be sure to fill in your Google Cloud credentials if you want to use the image generation features.
```env
# Qdrant - These are the defaults for the docker-compose setup
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY= #optional
QDRANT_COLLECTION=safebooru_union_clip
# Embeddings
OPENCLIP_MODEL=ViT-bigG-14
OPENCLIP_PRETRAINED=laion2b_s39b_b160k
OPENCLIP_DEVICE=cpu # or cuda if you have a GPU and nvidia-docker
OPENCLIP_PRECISION=fp32
# Ingestion
IMAGES_ROOT=/app/data/safebooru # This path is inside the container
# Image Gen (Optional)
GOOGLE_GENAI_USE_VERTEXAI=true
GOOGLE_CLOUD_PROJECT= # your-gcp-project-id
GOOGLE_CLOUD_LOCATION= # us-central1
GOOGLE_APPLICATION_CREDENTIALS= # /app/gcp-credentials.json
PROMPT_LOG_LEVEL=DEBUG
GEMINI_VISION_MODEL=gemini-2.5-flash
```

Then build and start the stack:

```bash
docker-compose up --build
```

This command will:
- Build the frontend and backend Docker images.
- Start the FastAPI application and the Qdrant vector database.
You can access the application at http://localhost:8000.
To ingest your data, you'll need to run the ingestion scripts inside the app container.
First, place your images in the data/safebooru directory (or the directory you specified in IMAGES_ROOT).
Then, run the following commands:
```bash
# Initialize the Qdrant collection
docker-compose exec app python lib/qdrant_init.py \
--qdrant-url $QDRANT_URL \
--collection $QDRANT_COLLECTION \
--model ViT-bigG-14 \
--pretrained laion2b_s39b_b160k \
--device cuda \
--add-edge-vector
# Download the dataset to the container or add the volume to docker-compose.yml
docker-compose exec app python scripts/retrieve_safebooru.py \
--tags-file ./tags.txt \
--out ./data/safebooru \
--union --workers 16 --max-pages 20 --resume
# Ingest the images
docker-compose exec app python qdrant/embed_and_upsert.py \
--manifest "$MANIFEST" \
--qdrant-url "$QDRANT_URL" \
--collection "$QDRANT_COLLECTION" \
--model "$MODEL_NAME" --pretrained "$PRETRAINED" \
--device "$DEVICE" \
--clip-batch 4 --upsert-batch 64 --gc-every 512 \
--text-template "an illustration with {tags}"
```

To stop the stack:

```bash
docker-compose down
```

For a manual (non-Docker) setup you will need:

- Python 3.10+
- Qdrant (Docker recommended)
- Optional GPU for faster embedding (OpenCLIP)
- A dataset (e.g., your Danbooru/Safebooru subset) and `tags.txt`

Start Qdrant:

```bash
docker run -p 6333:6333 -p 6334:6334 \
-v $PWD/qdrant_storage:/qdrant/storage \
qdrant/qdrant:latest
```

Create a virtual environment and install the Python dependencies:

```bash
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
# .\.venv\Scripts\activate
# Upgrade basics
pip install -U pip setuptools wheel
# Install project deps from requirements.txt
pip install -r requirements.txt
# (Optional) If Torch is NOT pinned in requirements.txt or you want a specific build:
# CUDA 12.1 (Linux/Windows GPU)
# pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
# CPU-only (no GPU)
# pip install --index-url https://download.pytorch.org/whl/cpu torch torchvision torchaudio
# Apple Silicon (MPS): standard PyPI wheels are fine
# pip install torch torchvision torchaudio
# Install your package in editable mode (if you have pyproject.toml / setup.cfg)
pip install -e .
```

Create .env in the repo root:

```env
# Qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY= #optional
QDRANT_COLLECTION=safebooru_union_clip
# Embeddings
OPENCLIP_MODEL=ViT-bigG-14
OPENCLIP_PRETRAINED=laion2b_s39b_b160k
OPENCLIP_DEVICE=cpu # or cuda if you have a GPU and nvidia-docker
OPENCLIP_PRECISION=fp32
# Ingestion
IMAGES_ROOT=/abs/path/to/images
# Image Gen
GOOGLE_GENAI_USE_VERTEXAI=true
GOOGLE_CLOUD_PROJECT=
GOOGLE_CLOUD_LOCATION=
GOOGLE_APPLICATION_CREDENTIALS= #json key for service account
PROMPT_LOG_LEVEL=DEBUG
GEMINI_VISION_MODEL=gemini-2.5-flash
```
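For a local run, the application reads these values via python-dotenv. A minimal sketch of how the settings might be loaded (illustrative module and defaults, not the project's actual config code):

```python
# config_sketch.py - illustrative only; the real app may structure its config differently
import os
from dotenv import load_dotenv

# Load variables from .env in the repository root into the process environment.
load_dotenv()

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY") or None  # empty string -> no key
QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "safebooru_union_clip")

OPENCLIP_MODEL = os.getenv("OPENCLIP_MODEL", "ViT-bigG-14")
OPENCLIP_PRETRAINED = os.getenv("OPENCLIP_PRETRAINED", "laion2b_s39b_b160k")
OPENCLIP_DEVICE = os.getenv("OPENCLIP_DEVICE", "cpu")

IMAGES_ROOT = os.getenv("IMAGES_ROOT", "./data/safebooru")
```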
```bash
# Initialize DB with the required configuration
python lib/qdrant_init.py \
--qdrant-url $QDRANT_URL \
--collection $QDRANT_COLLECTION \
--model ViT-bigG-14 \
--pretrained laion2b_s39b_b160k \
--device cuda \
--add-edge-vector
```
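Under the hood, initialization boils down to creating a collection with one named vector per modality. A hedged qdrant-client sketch, assuming 1280-dimensional ViT-bigG-14 embeddings, cosine distance, and generic HNSW settings (lib/qdrant_init.py may choose different parameters):

```python
# collection_init_sketch.py - illustrative; see lib/qdrant_init.py for the real logic
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")

EMBED_DIM = 1280  # ViT-bigG-14 embedding size (assumption)

client.recreate_collection(  # drops and recreates the collection
    collection_name="safebooru_union_clip",
    vectors_config={
        # One named vector per modality so queries can weight them independently.
        "image": models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
        "edge": models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
        "text": models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
    },
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=128),  # HNSW index settings
)
```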
```bash
# Retrieval Script retrieve_safebooru.py
python scripts/retrieve_safebooru.py \
--tags-file ./tags.txt \
--out ./data/safebooru \
--union --workers 16 --max-pages 20 --resume
```
```bash
# Ingesting Dataset
python qdrant/embed_and_upsert.py \
--manifest "$MANIFEST" \
--qdrant-url "$QDRANT_URL" \
--collection "$QDRANT_COLLECTION" \
--model "$MODEL_NAME" --pretrained "$PRETRAINED" \
--device "$DEVICE" \
--clip-batch 4 --upsert-batch 64 --gc-every 512 \
--text-template "an illustration with {tags}"
```
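Conceptually, ingestion computes one OpenCLIP embedding per modality and upserts them as named vectors together with a payload. A simplified sketch of that step (the file path, tags, and point ID are placeholders; the real script handles manifests, batching, and error handling):

```python
# ingest_sketch.py - simplified illustration of the embed-and-upsert step
import cv2
import torch
import open_clip
from PIL import Image
from qdrant_client import QdrantClient
from qdrant_client.http import models

device = "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k", device=device
)
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
client = QdrantClient(url="http://localhost:6333")

def embed_image(pil_img: Image.Image) -> list[float]:
    with torch.no_grad():
        feats = model.encode_image(preprocess(pil_img).unsqueeze(0).to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    return feats[0].tolist()

def embed_text(text: str) -> list[float]:
    with torch.no_grad():
        feats = model.encode_text(tokenizer([text]).to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats[0].tolist()

path, tags = "data/safebooru/0001.png", "1girl, watercolor"  # placeholders
img = Image.open(path).convert("RGB")

# Edge map via Canny, replicated to 3 channels so it can be embedded with CLIP too.
edges = cv2.Canny(cv2.imread(path, cv2.IMREAD_GRAYSCALE), 100, 200)
edge_img = Image.fromarray(cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB))

client.upsert(
    collection_name="safebooru_union_clip",
    points=[
        models.PointStruct(
            id=1,  # placeholder point ID
            vector={
                "image": embed_image(img),
                "edge": embed_image(edge_img),
                "text": embed_text(f"an illustration with {tags}"),
            },
            payload={"path": path, "tags": tags},
        )
    ],
)
```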
```bash
# Backfill missing vectors
python scripts/backfill_vectors.py \
--manifest "$MANIFEST" \
--qdrant-url "$QDRANT_URL" \
--collection "$QDRANT_COLLECTION" \
--model "$MODEL_NAME" --pretrained "$PRETRAINED" \
--device "$DEVICE" \
--which both --clip-batch 4 --upsert-batch 64 --gc-every 512
```

Run the API:

```bash
uvicorn main:app --reload --port 8000
```

The system consists of the following components:
- Client (Vite + React + TS + Fabric.js)
  Lets artists sketch, tune weights (`wImg`, `wEdge`, `wTxt`), and preview results.
- FastAPI App (`main.py`, `app/`)
  - Receives multipart or JSON requests.
  - Preprocesses inputs (e.g., edge-map from sketch).
  - Computes embeddings via OpenCLIP.
  - Assembles a hybrid query with normalized weights.
  - Calls Qdrant (ANN search) and optionally exact re-rank.
  - Resolves `path` payloads to serve preview images (e.g., `GET /image/{id}`).
- Embedding Workers (OpenCLIP)
  - Image embeddings from the original images.
  - Edge embeddings from edge-maps (sketch/shape signal).
  - Text embeddings from prompts/tags/captions.
- Qdrant (Vector DB)
  - Stores vectors + payloads (`path`, `tags`, etc.).
  - Prefer named vectors (`image` | `edge` | `text`) for flexible query-time weighting.
  - HNSW index for fast approximate search; `ef_search` tunes quality/speed.
- Image Store (filesystem path rooted at `IMAGES_ROOT`)
  - Original assets referenced by payload `path`.
  - API enforces safe path resolution (no escaping root).
- Edit/Generation Provider
  Pluggable adapter that powers `/images/generate` (see `generate_image.py`). Supports style references and subject references. The API accepts a `refs` array; each ref declares a role and a weight:

  ```jsonc
  {
    "prompt": "pop art still life, magenta outline",
    "count": 1,
    "refs": [
      { "role": "style", "id": 12345, "weight": 0.7 },   // style reference from retrieval
      { "role": "subject", "url": "https://.../apple.png", "weight": 0.6 }
    ],
    "sketch_weight": 0.0
  }
  ```
Providers map these roles as follows:
- Google Imagen 3 (Vertex AI) – Use style customization with one or more reference images to steer look & feel; mask-based editing is also supported. The adapter converts `refs.role == "style"` into Imagen's style reference inputs; `subject` refs can be fed as additional reference images or via mask+edit flows depending on the task. (Google Cloud)
- SDXL pipelines – Use image-to-image for coarse subject retention and add IP-Adapter (style / face / composition variants) for stronger style and identity adherence. This approach improves controllability and preserves style/subjects more reliably than prompt-only generation. (Hugging Face)
Why references help: Research on IP-Adapter shows that adding image prompts (style or subject) to diffusion models yields comparable or better results than fine-tuning for many tasks, and crucially improves style/identity control without changing the base model. In practice with SDXL, combining image-to-image + IP-Adapter often outperforms text-only prompts on style fidelity. (arXiv)
- Style-first: 1–3 style refs, `style_weight` ≈ 0.5–0.8, moderate CFG/guidance; keep `subject_weight` low or 0.
- Subject-first: 1–2 subject refs (clean, centered), `subject_weight` ≈ 0.6–0.9; optional low `style_weight` for finish.
- SDXL img2img: start with a subject ref as the initial image, low denoise/strength (e.g., 0.2–0.35) to retain identity; layer IP-Adapter style refs for look (see the sketch below). (Hugging Face)
- Imagen editing: when you need to preserve a subject tightly, use mask-based edit with a subject ref and a mask to constrain changes. (Google Cloud)
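For the SDXL path, a hedged sketch of combining img2img with an IP-Adapter style reference using Hugging Face diffusers. The checkpoint IDs, adapter weights, scale, and file names are common public defaults standing in for whatever Clippy's adapter actually uses:

```python
# sdxl_refs_sketch.py - illustrative SDXL img2img + IP-Adapter combination
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# IP-Adapter adds image-prompt conditioning on top of the text prompt.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale(0.7)  # roughly the "style_weight" knob

subject = load_image("subject_ref.png")  # placeholder: clean, centered subject image
style = load_image("style_ref.png")      # placeholder: style reference from Clippy retrieval

result = pipe(
    prompt="pop art still life, magenta outline",
    image=subject,            # img2img keeps coarse subject/composition
    strength=0.3,             # low denoise to retain identity (0.2-0.35)
    ip_adapter_image=style,   # style reference via IP-Adapter
).images[0]
result.save("out.png")
```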
Discussed more in → How_it_works
I tried abstract sketches with very satisfactory results.
Results from style transfer with the Google Vertex AI Imagen 3 model
Ingestion (offline):
- Scan dataset → build file list + optional tags.
- Compute embeddings (`image`, `edge`, `text`).
- Upsert to Qdrant with payloads.
- Index/optimize HNSW for recall.
Query (online):
- Client sends multipart (sketch/image) and/or JSON (`queryText`, weights, filters).
- API preprocesses → embeds → weights → hybrid query.
- Qdrant returns top-K; API optionally exact re-ranks; responses include `id`, `score`, `path`, `payload`.
- Client previews images; edit/generation step can run thereafter.
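A hedged server-side sketch of the hybrid query, assuming the simplest fusion strategy: one qdrant-client search per named vector, with scores combined using the normalized weights (the actual API may instead use Qdrant's native multi-vector querying and an exact re-rank step):

```python
# hybrid_search_sketch.py - weighted fusion over the named vectors (illustrative)
from collections import defaultdict
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "safebooru_union_clip"

def hybrid_search(img_vec, edge_vec, txt_vec, w_img=0.5, w_edge=0.3, w_txt=0.2, top_k=20):
    """Run one ANN search per modality and fuse the scores with normalized weights."""
    total = w_img + w_edge + w_txt
    weights = {"image": w_img / total, "edge": w_edge / total, "text": w_txt / total}
    vectors = {"image": img_vec, "edge": edge_vec, "text": txt_vec}

    fused = defaultdict(float)
    payloads = {}
    for name, vec in vectors.items():
        if vec is None or weights[name] == 0:
            continue  # modality not supplied in this request
        hits = client.search(
            collection_name=COLLECTION,
            query_vector=(name, vec),   # search against a specific named vector
            limit=top_k * 3,            # over-fetch so fusion has candidates to merge
            with_payload=True,
        )
        for hit in hits:
            fused[hit.id] += weights[name] * hit.score
            payloads[hit.id] = hit.payload

    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return [{"id": pid, "score": score, "payload": payloads[pid]} for pid, score in ranked]
```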
- Vite: fast HMR and lean builds.
- React + TS: type-safety and a mature ecosystem.
- Fabric.js: rich HTML5 canvas sketching (brush, layers, undo).
- Tailwind + shadcn/ui + lucide-react: rapid, modern UI.
Hybrid similarity using image/edge/text with per-request weights.
Serves the image resolved from payload path, with safe path checks.
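A minimal sketch of the safe path check, assuming a pathlib-based guard (illustrative helper, not the project's actual implementation):

```python
# safe_path_sketch.py - resolving a payload path without escaping IMAGES_ROOT
from pathlib import Path

IMAGES_ROOT = Path("/abs/path/to/images").resolve()

def resolve_image_path(payload_path: str) -> Path:
    """Resolve a path stored in a Qdrant payload, rejecting anything outside IMAGES_ROOT."""
    candidate = (IMAGES_ROOT / payload_path).resolve()
    if not candidate.is_relative_to(IMAGES_ROOT):  # Python 3.9+: blocks ../ escapes
        raise ValueError(f"path escapes IMAGES_ROOT: {payload_path}")
    if not candidate.is_file():
        raise FileNotFoundError(candidate)
    return candidate
```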
Uses the provided references to generate images.
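A small usage example of calling `/images/generate` with the `refs` payload shown earlier (the response shape is not documented here, so it is simply printed):

```python
# generate_request_sketch.py - calling the /images/generate endpoint described above
import requests

payload = {
    "prompt": "pop art still life, magenta outline",
    "count": 1,
    "refs": [
        {"role": "style", "id": 12345, "weight": 0.7},                     # style ref from retrieval
        {"role": "subject", "url": "https://.../apple.png", "weight": 0.6},  # placeholder URL from the example above
    ],
    "sketch_weight": 0.0,
}

resp = requests.post("http://localhost:8000/images/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json())
```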
Last updated: 2025-09-16




