Clippy is a FastAPI service and script suite for style-aware visual retrieval backed by Qdrant. It accepts a sketch and/or text prompt, runs hybrid similarity search over image/edge/text embeddings, and returns reference images. Generative image models typically perform much better when references are provided, especially in style-transfer workflows (transferring one art style to another), where references help the model capture the differences between styles. The project aims to refine the generative image workflow by making it easy for artists to retrieve references across art styles and use them to generate better images, without requiring any prior knowledge of prompt engineering.
| Layer | Primary | Notes / Optional |
|---|---|---|
| Language & Runtime | Python 3.10+ | venv/uv recommended |
| API Framework | FastAPI, Uvicorn | Pydantic for schema validation |
| Embeddings | PyTorch, open-clip-torch | Pillow, opencv-python for I/O & edge maps |
| Vector Database | Qdrant Server (120,000 images of digital artworks), qdrant-client | HNSW index; supports named vectors |
| Storage | Filesystem under IMAGES_ROOT | Image paths stored in Qdrant payloads |
| Config | python-dotenv | .env in repo root |
| Observability/Utils | tqdm (progress), logging via Uvicorn/stdlib | |
| Testing & Tools | pytest, httpie/curl, VS Code REST Client | |
| Frontend Client | Vite + React + TypeScript | Tailwind CSS, shadcn/ui, lucide-react, Fabric.js for sketch UI |
| Edit/Gen | Vertex AI / SDXL workflow | |
This is the easiest way to get started with Clippy. The only prerequisites are Docker and Docker Compose.
Create a .env file in the repository root. The docker-compose.yml file is configured to load this file, so all environment variables for the Docker setup should be managed here.
You can copy the example below, but be sure to fill in your Google Cloud credentials if you want to use the image generation features.
```env
# Qdrant - These are the defaults for the docker-compose setup
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY= #optional
QDRANT_COLLECTION=safebooru_union_clip
# Embeddings
OPENCLIP_MODEL=ViT-bigG-14
OPENCLIP_PRETRAINED=laion2b_s39b_b160k
OPENCLIP_DEVICE=cpu # or cuda if you have a GPU and nvidia-docker
OPENCLIP_PRECISION=fp32
# Ingestion
IMAGES_ROOT=/app/data/safebooru # This path is inside the container
# Image Gen (Optional)
GOOGLE_GENAI_USE_VERTEXAI=true
GOOGLE_CLOUD_PROJECT= # your-gcp-project-id
GOOGLE_CLOUD_LOCATION= # us-central1
GOOGLE_APPLICATION_CREDENTIALS= # /app/gcp-credentials.json
PROMPT_LOG_LEVEL=DEBUG
GEMINI_VISION_MODEL=gemini-2.5-flash
```

Then build and start the stack:

```bash
docker-compose up --build
```

This command will:
- Build the frontend and backend Docker images.
- Start the FastAPI application and the Qdrant vector database.
You can access the application at http://localhost:8000.
To ingest your data, you'll need to run the ingestion scripts inside the app container.
First, place your images in the data/safebooru directory (or the directory you specified in IMAGES_ROOT).
Then, run the following commands:
```bash
# Initialize the Qdrant collection
docker-compose exec app python lib/qdrant_init.py \
--qdrant-url $QDRANT_URL \
--collection $QDRANT_COLLECTION \
--model ViT-bigG-14 \
--pretrained laion2b_s39b_b160k \
--device cuda \
--add-edge-vector
# Download the dataset to the container or add the volume to docker-compose.yml
docker-compose exec app python scripts/retrieve_safebooru.py \
--tags-file ./tags.txt \
--out ./data/safebooru \
--union --workers 16 --max-pages 20 --resume
# Ingest the images
docker-compose exec app python qdrant/embed_and_upsert.py \
--manifest "$MANIFEST" \
--qdrant-url "$QDRANT_URL" \
--collection "$QDRANT_COLLECTION" \
--model "$MODEL_NAME" --pretrained "$PRETRAINED" \
--device "$DEVICE" \
--clip-batch 4 --upsert-batch 64 --gc-every 512 \
--text-template "an illustration with {tags}"
```

To stop the stack:

```bash
docker-compose down
```

For a manual (non-Docker) setup you will need:

- Python 3.10+
- Qdrant (Docker recommended)
- Optional GPU for faster embedding (OpenCLIP)
- A dataset (e.g., your Danbooru/Safebooru subset) and `tags.txt`

Start Qdrant:

```bash
docker run -p 6333:6333 -p 6334:6334 \
-v $PWD/qdrant_storage:/qdrant/storage \
qdrant/qdrant:latest
```

Create a virtual environment and install the Python dependencies:

```bash
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
# .\.venv\Scripts\activate
# Upgrade basics
pip install -U pip setuptools wheel
# Install project deps from requirements.txt
pip install -r requirements.txt
# (Optional) If Torch is NOT pinned in requirements.txt or you want a specific build:
# CUDA 12.1 (Linux/Windows GPU)
# pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
# CPU-only (no GPU)
# pip install --index-url https://download.pytorch.org/whl/cpu torch torchvision torchaudio
# Apple Silicon (MPS): standard PyPI wheels are fine
# pip install torch torchvision torchaudio
# Install your package in editable mode (if you have pyproject.toml / setup.cfg)
pip install -e .
```

Create .env in the repo root:

```env
# Qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY= #optional
QDRANT_COLLECTION=safebooru_union_clip
# Embeddings
OPENCLIP_MODEL=ViT-bigG-14
OPENCLIP_PRETRAINED=laion2b_s39b_b160k
OPENCLIP_DEVICE=cpu # or cuda if you have a GPU and nvidia-docker
OPENCLIP_PRECISION=fp32
# Ingestion
IMAGES_ROOT=/abs/path/to/images
# Image Gen
GOOGLE_GENAI_USE_VERTEXAI=true
GOOGLE_CLOUD_PROJECT=
GOOGLE_CLOUD_LOCATION=
GOOGLE_APPLICATION_CREDENTIALS= #json key for service account
PROMPT_LOG_LEVEL=DEBUG
GEMINI_VISION_MODEL=gemini-2.5-flash
```
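For a local run, the application reads these values via python-dotenv. A minimal sketch of how the settings might be loaded (illustrative module and defaults, not the project's actual config code):

```python
# config_sketch.py - illustrative only; the real app may structure its config differently
import os
from dotenv import load_dotenv

# Load variables from .env in the repository root into the process environment.
load_dotenv()

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY") or None  # empty string -> no key
QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "safebooru_union_clip")

OPENCLIP_MODEL = os.getenv("OPENCLIP_MODEL", "ViT-bigG-14")
OPENCLIP_PRETRAINED = os.getenv("OPENCLIP_PRETRAINED", "laion2b_s39b_b160k")
OPENCLIP_DEVICE = os.getenv("OPENCLIP_DEVICE", "cpu")

IMAGES_ROOT = os.getenv("IMAGES_ROOT", "./data/safebooru")
```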
```bash
# Initialize DB with the required configuration
python lib/qdrant_init.py \
--qdrant-url $QDRANT_URL \
--collection $QDRANT_COLLECTION \
--model ViT-bigG-14 \
--pretrained laion2b_s39b_b160k \
--device cuda \
--add-edge-vector
```
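Under the hood, initialization boils down to creating a collection with one named vector per modality. A hedged qdrant-client sketch, assuming 1280-dimensional ViT-bigG-14 embeddings, cosine distance, and generic HNSW settings (lib/qdrant_init.py may choose different parameters):

```python
# collection_init_sketch.py - illustrative; see lib/qdrant_init.py for the real logic
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")

EMBED_DIM = 1280  # ViT-bigG-14 embedding size (assumption)

client.recreate_collection(  # drops and recreates the collection
    collection_name="safebooru_union_clip",
    vectors_config={
        # One named vector per modality so queries can weight them independently.
        "image": models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
        "edge": models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
        "text": models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
    },
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=128),  # HNSW index settings
)
```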
```bash
# Retrieval Script retrieve_safebooru.py
python scripts/retrieve_safebooru.py \
--tags-file ./tags.txt \
--out ./data/safebooru \
--union --workers 16 --max-pages 20 --resume
```
```bash
# Ingesting Dataset
python qdrant/embed_and_upsert.py \
--manifest "$MANIFEST" \
--qdrant-url "$QDRANT_URL" \
--collection "$QDRANT_COLLECTION" \
--model "$MODEL_NAME" --pretrained "$PRETRAINED" \
--device "$DEVICE" \
--clip-batch 4 --upsert-batch 64 --gc-every 512 \
--text-template "an illustration with {tags}"
```
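Conceptually, ingestion computes one OpenCLIP embedding per modality and upserts them as named vectors together with a payload. A simplified sketch of that step (the file path, tags, and point ID are placeholders; the real script handles manifests, batching, and error handling):

```python
# ingest_sketch.py - simplified illustration of the embed-and-upsert step
import cv2
import torch
import open_clip
from PIL import Image
from qdrant_client import QdrantClient
from qdrant_client.http import models

device = "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k", device=device
)
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
client = QdrantClient(url="http://localhost:6333")

def embed_image(pil_img: Image.Image) -> list[float]:
    with torch.no_grad():
        feats = model.encode_image(preprocess(pil_img).unsqueeze(0).to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    return feats[0].tolist()

def embed_text(text: str) -> list[float]:
    with torch.no_grad():
        feats = model.encode_text(tokenizer([text]).to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats[0].tolist()

path, tags = "data/safebooru/0001.png", "1girl, watercolor"  # placeholders
img = Image.open(path).convert("RGB")

# Edge map via Canny, replicated to 3 channels so it can be embedded with CLIP too.
edges = cv2.Canny(cv2.imread(path, cv2.IMREAD_GRAYSCALE), 100, 200)
edge_img = Image.fromarray(cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB))

client.upsert(
    collection_name="safebooru_union_clip",
    points=[
        models.PointStruct(
            id=1,  # placeholder point ID
            vector={
                "image": embed_image(img),
                "edge": embed_image(edge_img),
                "text": embed_text(f"an illustration with {tags}"),
            },
            payload={"path": path, "tags": tags},
        )
    ],
)
```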
```bash
# Backfill missing vectors
python scripts/backfill_vectors.py \
--manifest "$MANIFEST" \
--qdrant-url "$QDRANT_URL" \
--collection "$QDRANT_COLLECTION" \
--model "$MODEL_NAME" --pretrained "$PRETRAINED" \
--device "$DEVICE" \
--which both --clip-batch 4 --upsert-batch 64 --gc-every 512
```

Run the API:

```bash
uvicorn main:app --reload --port 8000
```

The system consists of the following components:
- Client (Vite + React + TS + Fabric.js)
  Lets artists sketch, tune weights (`wImg`, `wEdge`, `wTxt`), and preview results.
- FastAPI App (`main.py`, `app/`)
  - Receives multipart or JSON requests.
  - Preprocesses inputs (e.g., edge-map from sketch).
  - Computes embeddings via OpenCLIP.
  - Assembles a hybrid query with normalized weights.
  - Calls Qdrant (ANN search) and optionally exact re-rank.
  - Resolves `path` payloads to serve preview images (e.g., `GET /image/{id}`).
- Embedding Workers (OpenCLIP)
  - Image embeddings from the original images.
  - Edge embeddings from edge-maps (sketch/shape signal).
  - Text embeddings from prompts/tags/captions.
- Qdrant (Vector DB)
  - Stores vectors + payloads (`path`, `tags`, etc.).
  - Prefer named vectors (`image` | `edge` | `text`) for flexible query-time weighting.
  - HNSW index for fast approximate search; `ef_search` tunes quality/speed.
- Image Store (filesystem path rooted at `IMAGES_ROOT`)
  - Original assets referenced by payload `path`.
  - API enforces safe path resolution (no escaping root).
- Edit/Generation Provider
  Pluggable adapter that powers `/images/generate` (see `generate_image.py`). Supports style references and subject references. The API accepts a `refs` array; each ref declares a role and a weight:

  ```jsonc
  {
    "prompt": "pop art still life, magenta outline",
    "count": 1,
    "refs": [
      { "role": "style", "id": 12345, "weight": 0.7 },   // style reference from retrieval
      { "role": "subject", "url": "https://.../apple.png", "weight": 0.6 }
    ],
    "sketch_weight": 0.0
  }
  ```
Providers map these roles as follows:
- Google Imagen 3 (Vertex AI) – Use style customization with one or more reference images to steer look & feel; mask-based editing is also supported. The adapter converts `refs.role == "style"` into Imagen's style reference inputs; `subject` refs can be fed as additional reference images or via mask+edit flows depending on the task. (Google Cloud)
- SDXL pipelines – Use image-to-image for coarse subject retention and add IP-Adapter (style / face / composition variants) for stronger style and identity adherence. This approach improves controllability and preserves style/subjects more reliably than prompt-only generation. (Hugging Face)
Why references help: Research on IP-Adapter shows that adding image prompts (style or subject) to diffusion models yields comparable or better results than fine-tuning for many tasks, and crucially improves style/identity control without changing the base model. In practice with SDXL, combining image-to-image + IP-Adapter often outperforms text-only prompts on style fidelity. (arXiv)
- Style-first: 1–3 style refs, `style_weight` ≈ 0.5–0.8, moderate CFG/guidance; keep `subject_weight` low or 0.
- Subject-first: 1–2 subject refs (clean, centered), `subject_weight` ≈ 0.6–0.9; optional low `style_weight` for finish.
- SDXL img2img: start with a subject ref as the initial image, low denoise/strength (e.g., 0.2–0.35) to retain identity; layer IP-Adapter style refs for look (see the sketch below). (Hugging Face)
- Imagen editing: when you need to preserve a subject tightly, use mask-based edit with a subject ref and a mask to constrain changes. (Google Cloud)
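For the SDXL path, a hedged sketch of combining img2img with an IP-Adapter style reference using Hugging Face diffusers. The checkpoint IDs, adapter weights, scale, and file names are common public defaults standing in for whatever Clippy's adapter actually uses:

```python
# sdxl_refs_sketch.py - illustrative SDXL img2img + IP-Adapter combination
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# IP-Adapter adds image-prompt conditioning on top of the text prompt.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale(0.7)  # roughly the "style_weight" knob

subject = load_image("subject_ref.png")  # placeholder: clean, centered subject image
style = load_image("style_ref.png")      # placeholder: style reference from Clippy retrieval

result = pipe(
    prompt="pop art still life, magenta outline",
    image=subject,            # img2img keeps coarse subject/composition
    strength=0.3,             # low denoise to retain identity (0.2-0.35)
    ip_adapter_image=style,   # style reference via IP-Adapter
).images[0]
result.save("out.png")
```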
Discussed more in → How_it_works
I tried abstract sketches with very satisfactory results.
Results from style transfer with the Google Vertex AI Imagen 3 model
Ingestion (offline):
- Scan dataset → build file list + optional tags.
- Compute embeddings (`image`, `edge`, `text`).
- Upsert to Qdrant with payloads.
- Index/optimize HNSW for recall.
Query (online):
- Client sends multipart (sketch/image) and/or JSON (`queryText`, weights, filters).
- API preprocesses → embeds → weights → hybrid query.
- Qdrant returns top-K; API optionally exact re-ranks; responses include `id`, `score`, `path`, `payload`.
- Client previews images; edit/generation step can run thereafter.
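A hedged server-side sketch of the hybrid query, assuming the simplest fusion strategy: one qdrant-client search per named vector, with scores combined using the normalized weights (the actual API may instead use Qdrant's native multi-vector querying and an exact re-rank step):

```python
# hybrid_search_sketch.py - weighted fusion over the named vectors (illustrative)
from collections import defaultdict
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "safebooru_union_clip"

def hybrid_search(img_vec, edge_vec, txt_vec, w_img=0.5, w_edge=0.3, w_txt=0.2, top_k=20):
    """Run one ANN search per modality and fuse the scores with normalized weights."""
    total = w_img + w_edge + w_txt
    weights = {"image": w_img / total, "edge": w_edge / total, "text": w_txt / total}
    vectors = {"image": img_vec, "edge": edge_vec, "text": txt_vec}

    fused = defaultdict(float)
    payloads = {}
    for name, vec in vectors.items():
        if vec is None or weights[name] == 0:
            continue  # modality not supplied in this request
        hits = client.search(
            collection_name=COLLECTION,
            query_vector=(name, vec),   # search against a specific named vector
            limit=top_k * 3,            # over-fetch so fusion has candidates to merge
            with_payload=True,
        )
        for hit in hits:
            fused[hit.id] += weights[name] * hit.score
            payloads[hit.id] = hit.payload

    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return [{"id": pid, "score": score, "payload": payloads[pid]} for pid, score in ranked]
```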
- Vite: fast HMR and lean builds.
- React + TS: type-safety and a mature ecosystem.
- Fabric.js: rich HTML5 canvas sketching (brush, layers, undo).
- Tailwind + shadcn/ui + lucide-react: rapid, modern UI.
Hybrid similarity using image/edge/text with per-request weights.
Serves the image resolved from payload path, with safe path checks.
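A minimal sketch of the safe path check, assuming a pathlib-based guard (illustrative helper, not the project's actual implementation):

```python
# safe_path_sketch.py - resolving a payload path without escaping IMAGES_ROOT
from pathlib import Path

IMAGES_ROOT = Path("/abs/path/to/images").resolve()

def resolve_image_path(payload_path: str) -> Path:
    """Resolve a path stored in a Qdrant payload, rejecting anything outside IMAGES_ROOT."""
    candidate = (IMAGES_ROOT / payload_path).resolve()
    if not candidate.is_relative_to(IMAGES_ROOT):  # Python 3.9+: blocks ../ escapes
        raise ValueError(f"path escapes IMAGES_ROOT: {payload_path}")
    if not candidate.is_file():
        raise FileNotFoundError(candidate)
    return candidate
```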
Uses the provided references to generate images.
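A small usage example of calling `/images/generate` with the `refs` payload shown earlier (the response shape is not documented here, so it is simply printed):

```python
# generate_request_sketch.py - calling the /images/generate endpoint described above
import requests

payload = {
    "prompt": "pop art still life, magenta outline",
    "count": 1,
    "refs": [
        {"role": "style", "id": 12345, "weight": 0.7},                     # style ref from retrieval
        {"role": "subject", "url": "https://.../apple.png", "weight": 0.6},  # placeholder URL from the example above
    ],
    "sketch_weight": 0.0,
}

resp = requests.post("http://localhost:8000/images/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json())
```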
Last updated: 2025-09-16




