Monorepo for PDF ingestion, bibliography extraction, metadata enrichment, and download queueing.
- `frontend/`: Svelte + Vite app
- `backend/`: Node.js API and orchestration routes
- `backend/scripts/daemon/worker.py`: queue consumer daemon
- `dl_lit_project/`: canonical Python pipeline package (`dl_lit`)
- `dl_lit/`: legacy scripts (not the canonical runtime)
The app is DB-first and queue-first.
- Backend writes jobs to `pipeline_jobs` in `dl_lit_project/data/literature.db`.
- `rag_feeder_worker` polls `pipeline_jobs` and executes jobs.
- Worker writes completion/failure payloads back to `pipeline_jobs.result_json`.
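The enqueue/poll handoff can be sketched with plain `sqlite3`. The `pipeline_jobs` columns used here (`job_type`, `status`, `result_json`) and the status values are assumptions for illustration, not the actual schema:

```python
import json
import sqlite3

def enqueue(conn, job_type):
    """Backend side: append a job row for the worker to pick up."""
    conn.execute(
        "INSERT INTO pipeline_jobs (job_type, status) VALUES (?, 'queued')",
        (job_type,),
    )
    conn.commit()

def poll_once(conn):
    """Worker side: claim the oldest queued job, run it, record the result."""
    row = conn.execute(
        "SELECT id, job_type FROM pipeline_jobs "
        "WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing to do this tick
    job_id, job_type = row
    conn.execute(
        "UPDATE pipeline_jobs SET status = 'in_progress' WHERE id = ?",
        (job_id,),
    )
    # ... real job execution would happen here ...
    result = {"ok": True, "job_type": job_type}
    conn.execute(
        "UPDATE pipeline_jobs SET status = 'done', result_json = ? WHERE id = ?",
        (json.dumps(result), job_id),
    )
    conn.commit()
    return job_id
```

Because both sides talk only to the SQLite table, the backend and worker need no direct connection to each other.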
Supported daemon job types in current code:
- `enrich`
- `download`
- `pipeline_tick` (mark -> enrich -> download)
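A minimal dispatcher for those three types could look like the sketch below. The step callables are hypothetical stand-ins for the real logic; the only fact taken from this README is that `pipeline_tick` chains mark -> enrich -> download:

```python
def run_job(job_type, steps):
    """Dispatch one daemon job.

    `steps` maps step names ('mark', 'enrich', 'download') to callables --
    placeholders here for the real pipeline functions.
    """
    if job_type == "enrich":
        return [steps["enrich"]()]
    if job_type == "download":
        return [steps["download"]()]
    if job_type == "pipeline_tick":
        # pipeline_tick runs the full cycle in order
        return [steps[name]() for name in ("mark", "enrich", "download")]
    raise ValueError(f"unsupported job type: {job_type}")
```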
Primary runtime tables:
- `works`: canonical work records, including metadata/download status and file info
- `corpus_works`: corpus membership join table
Status lives directly on works:
- `metadata_status`: `pending | in_progress | matched | failed`
- `download_status`: `not_requested | queued | in_progress | downloaded | failed`
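Because status lives directly on `works`, a progress overview needs no join. A sketch (column names from the list above; the rest of the `works` schema is assumed):

```python
import sqlite3

def status_counts(conn):
    """Count works per (metadata_status, download_status) pair."""
    return conn.execute(
        "SELECT metadata_status, download_status, COUNT(*) "
        "FROM works GROUP BY metadata_status, download_status"
    ).fetchall()
```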
- `rag_feeder_frontend` on `http://localhost:5175`
- `rag_feeder_backend` on `http://localhost:4000`
- `rag_feeder_worker` (no HTTP port)
- Set `.env` values (at least `GOOGLE_API_KEY`; `OPENALEX_API_KEY` is optional).
- Start the stack: `docker compose up -d`
- Open `http://localhost:5175`
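A minimal `.env` for the quick start might look like this; the values are placeholders, and only `GOOGLE_API_KEY` is required:

```shell
# Required
GOOGLE_API_KEY=your-google-api-key

# Optional
OPENALEX_API_KEY=your-openalex-api-key
```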
The production site does not hot reload. Rebuild the frontend when you want changes on the live stack.
Do not patch tracked compose files on the server. Keep the live host and Caddy labels in an untracked `docker-compose.override.yml` instead so a `git pull` cannot wipe them.
- Add the live host to `.env`:
  - `RAG_FEEDER_PUBLIC_HOST=corpus4uol.university-of-labour.de`
  - `RAG_FEEDER_PROXY_NETWORK=reverse_proxy`
- Create a local override once: `cp docker-compose.override.example.yml docker-compose.override.yml`
- Deploy normally after pulls: `docker compose up -d`
`docker compose` loads `docker-compose.override.yml` automatically, so the frontend keeps its Caddy labels and host overrides without requiring a special command.
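The exact shape of the override file is local and untracked, but it could look roughly like this. The label keys follow the caddy-docker-proxy convention and are an assumption; match them to whatever `docker-compose.override.example.yml` actually contains:

```yaml
# docker-compose.override.yml (untracked; survives git pull)
services:
  rag_feeder_frontend:
    labels:
      # caddy-docker-proxy style labels -- adjust to your proxy setup
      caddy: ${RAG_FEEDER_PUBLIC_HOST}
      caddy.reverse_proxy: "{{upstreams 80}}"
    networks:
      - reverse_proxy

networks:
  reverse_proxy:
    external: true
```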
- SQLite DB: `dl_lit_project/data/literature.db`
- Uploaded PDFs inside the container: `/usr/src/app/uploads`
- Upload volume: `rag_feeder_uploads`
- Logs volume: `rag_feeder_logs`
- Pipeline log file: `/usr/src/app/logs/backend-pipeline.log`
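For ad-hoc inspection of the DB on a running stack, it is safer to open the SQLite file read-only so a stray query can never write under the worker. A small sketch using the path above as the default:

```python
import sqlite3

def open_literature_db(path="dl_lit_project/data/literature.db"):
    """Open the pipeline DB read-only via SQLite's URI syntax."""
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    conn.row_factory = sqlite3.Row  # access columns by name
    return conn
```

Any attempt to write through this connection raises `sqlite3.OperationalError` instead of touching the live data.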
- `/api/ingest/process-marked`, `/api/downloads/worker/start`, and `/api/downloads/worker/run-once` queue real jobs immediately.
- `/api/pipeline/worker/start` and `/api/pipeline/worker/pause` currently only update in-memory API state; continuous interval scheduling is still transitional in the current implementation.
- Backend details:
backend/README.md - Frontend details:
frontend/README.md - Python pipeline details:
dl_lit_project/README.md