ricjhill/riskflow

RiskFlow

Automates the mapping of messy reinsurance spreadsheets (Bordereaux) to a standardized schema using Small Language Models (Groq/Llama 3.3).

Security prerequisite for any deployment. RiskFlow's authentication assumes the operator has configured Entra Conditional Access to require phishing-resistant authentication (FIDO2 / passkey / certificate-based / Windows Hello for Business) and compliant-device-only access. Without these policies, the deployment is vulnerable to adversary-in-the-middle phishing kits (evilginx, EvilProxy, Modlishka) which can defeat PKCE-only MFA. The operator-side runbook for these policies is the Entra ID auth operator runbook; see docs/azure-auth-implementation-plan.md §"Threat model" for the underlying analysis. This is not optional — it is the deployment's load-bearing security assumption.

GUI auth status (local dev). The Streamlit GUI's sign-in flow is currently broken end-to-end (#331). The replacement is a server-side Backend-For-Frontend (#332, accepted in ADR-0001). For local development today, leave ENTRA_TENANT_ID / ENTRA_AUDIENCE unset (the API's NullIdentityProvider engages) or mint a bearer token via az account get-access-token for direct API calls.
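For direct API calls with a minted token, a minimal Python sketch could look like the one below. It assumes the API is reachable at http://localhost:8000 and uses GET /schemas only as an example of a protected route. The helper names `mint_token` and `authorized_request` are hypothetical, and the `az` invocation leaves out any resource or scope argument your tenant may require.

```python
import subprocess
import urllib.request


def mint_token() -> str:
    """Ask the Azure CLI for an access token (needs a prior `az login`)."""
    result = subprocess.run(
        ["az", "account", "get-access-token", "--query", "accessToken", "-o", "tsv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def authorized_request(
    path: str, token: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) a request that carries the bearer token."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
    )
```

Sending it is then a matter of `urllib.request.urlopen(authorized_request("/schemas", mint_token()))`. Building the request separately from sending it keeps the header logic testable without a running API.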

Documentation

Prerequisites

  • Python 3.12+
  • uv
  • Docker & Docker Compose (for Redis)
  • A Groq account and API key (free tier available)

Getting Started

One command (Docker — recommended)

# Copy environment template and add your Groq API key
cp .env.example .env
# Edit .env and set GROQ_API_KEY=gsk_your_key_here

# Start everything: API + Redis + GUI
docker compose up -d

Open http://localhost:8501 for the GUI, or http://localhost:8000/docs for the API.

Local development (without Docker)

# Install dependencies
uv sync

# Copy environment template and add your Groq API key
cp .env.example .env

# Start Redis (still needs Docker)
docker compose up -d redis

# Run the API
uv run uvicorn src.entrypoint.main:app --reload --port 8000

# Run the GUI (in a separate terminal)
uv run streamlit run gui/app.py

Running with auth disabled (local testing only)

Protected routes (e.g. /upload) require an Entra bearer token. To exercise them locally without minting tokens — e.g. to drive the Groq mapping pipeline or the #197 rate-limit experiment — run the dev-only auth-disabled overlay:

docker compose -f docker-compose.yml -f docker-compose.dev-auth-disabled.yml up -d

⚠️ DEV ONLY. This overlay runs src.entrypoint.dev_main:app, which opens every protected route (no token required). Never use it in production or CI — those run src.entrypoint.main:app, which is fully auth-gated. The overlay is opt-in (explicit -f) and never auto-loads. See ADR-0003.

Running in GitHub Codespaces

The committed docker-compose.yml is tuned for a 4-CPU host (cpus: "4"; the Dockerfile CMD sets --workers 4). The default Codespaces machine has only 2 CPUs, so the unmodified compose file fails to start with the error "range of CPUs is from 0.01 to 2.00, as there are only 2 CPUs available".

docker-compose.codespaces.yml is a small overlay that caps the api service to 2 CPUs and overrides the command to --workers 2. Pass it alongside the base compose file:

docker compose -f docker-compose.yml -f docker-compose.codespaces.yml up -d

No other changes are required. The overlay only touches the api service; Redis and the GUI use the same configuration as the regular Docker quickstart.
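If you switch between hosts often, the choice of overlay can be scripted. The sketch below is an illustration only: the function name `compose_up_command` and the "fewer than 4 CPUs" threshold are assumptions, not part of the repository. Because the overlay caps the api service at 2 CPUs, a 1-CPU host would presumably still fail.

```python
import os

BASE_FILE = "docker-compose.yml"
CODESPACES_OVERLAY = "docker-compose.codespaces.yml"


def compose_up_command(cpu_count: int) -> list[str]:
    """Return the docker compose invocation suited to the host's CPU count."""
    files = [BASE_FILE]
    # Assumption: any host with fewer than 4 CPUs needs the 2-CPU overlay.
    if cpu_count < 4:
        files.append(CODESPACES_OVERLAY)
    command = ["docker", "compose"]
    for name in files:
        command += ["-f", name]
    return command + ["up", "-d"]


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to 1.
    print(" ".join(compose_up_command(os.cpu_count() or 1)))
```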

Development

# Run tests
uv run pytest -x -v tests/unit/

# Type checking
uv run mypy src/

# Lint and format
uv run ruff check src/
uv run ruff format src/

TDD Cycle

  1. Red — Write a failing test in tests/unit/
  2. Green — Implement the minimum code in src/domain/ or src/adapters/ to make it pass
  3. Check — Run uv run mypy src/ and uv run ruff check src/
  4. Commit — If all pass, commit with a descriptive message

Claude Code hooks enforce this: they block any commit where mypy, pytest, ruff check, or ruff format fails. GitHub Actions CI runs the same checks on PRs and pushes to main.

Architecture

Hexagonal (Ports & Adapters). Dependencies only point inward.

graph LR
    subgraph External
        Client([Client])
        GUI([Streamlit GUI<br>port 8501])
        Excel[(Excel/CSV)]
        Groq([Groq API])
        Redis[(Redis)]
    end

    subgraph Adapters
        HTTP[HTTP Adapter<br>FastAPI Routes]
        Parser[Parser Adapter<br>Polars Ingestor]
        SchemaLoader[Schema Loader<br>YAML Parser]
        SLM[SLM Adapter<br>Groq Mapper]
        Cache[Cache Adapter<br>Redis Client]
        CorrCache[Correction Cache<br>Redis Hash]
        SessionStore[Session Store<br>Redis + TTL]
        SchemaStore[Schema Store<br>Redis]
        JobStore[Job Store<br>Redis / In-Memory]
    end

    subgraph Ports
        IngestorPort{{IngestorPort}}
        MapperPort{{MapperPort}}
        CachePort{{CachePort}}
        CorrectionCachePort{{CorrectionCachePort}}
        SessionStorePort{{MappingSessionStorePort}}
        SchemaStorePort{{SchemaStorePort}}
        JobStorePort{{JobStorePort}}
        SchemaLoaderPort{{SchemaLoaderPort}}
    end

    subgraph Domain
        Service[MappingService]
        Session[MappingSession<br>CREATED to FINALISED]
        Models[TargetSchema<br>ColumnMapping<br>MappingResult<br>ConfidenceReport<br>Correction]
        RecordFactory[record_factory<br>Dynamic pydantic models]
        DateFormat[date_format<br>Column-level detection]
        Errors[Domain Errors]
    end

    GUI -->|HTTP| HTTP
    Client -->|upload, corrections, schemas| HTTP
    Client -->|sessions, async jobs| HTTP
    HTTP --> Service
    Service --> IngestorPort
    Service --> MapperPort
    Service --> CachePort
    Service --> CorrectionCachePort
    Service --> RecordFactory
    Service --> Models
    Service --> DateFormat
    IngestorPort -.-> Parser
    MapperPort -.-> SLM
    CachePort -.-> Cache
    CorrectionCachePort -.-> CorrCache
    SessionStorePort -.-> SessionStore
    SchemaStorePort -.-> SchemaStore
    JobStorePort -.-> JobStore
    SchemaLoaderPort -.-> SchemaLoader
    Parser --> Excel
    SLM --> Groq
    Cache --> Redis
    CorrCache --> Redis
    SessionStore --> Redis
    SchemaStore --> Redis

Data flows:

  • Batch: Upload → Parse headers → Check cache → (miss?) Check corrections → SLM maps uncorrected headers → Merge → Validate rows → Return results with confidence report
  • Interactive: Upload → SLM suggests → User edits mappings → Finalise → Validate rows → Return results
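The batch flow's cache, corrections, and SLM steps can be sketched as a single function. This is a simplified illustration, not the project's MappingService: the function name, the plain-dict cache, the sorted-headers cache key, and the `slm_map` callable are all assumptions made for the example.

```python
from typing import Callable, Mapping, MutableMapping, Sequence

Mapping_ = dict[str, str]  # source header -> target field


def resolve_mappings(
    headers: Sequence[str],
    cache: MutableMapping[tuple[str, ...], Mapping_],
    corrections: Mapping[str, str],
    slm_map: Callable[[list[str]], Mapping_],
) -> Mapping_:
    """Cache hit returns early; otherwise merge corrections with SLM output."""
    key = tuple(sorted(headers))  # assumed cache key: the header set
    cached = cache.get(key)
    if cached is not None:
        return dict(cached)

    # Human-verified corrections are applied first.
    corrected = {h: corrections[h] for h in headers if h in corrections}
    # Only headers without a correction are sent to the SLM.
    uncorrected = [h for h in headers if h not in corrected]
    suggested = slm_map(uncorrected) if uncorrected else {}

    # Corrections win over SLM suggestions on any overlap.
    merged = {**suggested, **corrected}
    cache[key] = merged
    return dict(merged)
```

Passing the SLM as a callable mirrors the ports-and-adapters idea: the merge logic can be tested with a fake mapper and no network access.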

Endpoints:

Every endpoint except /health, /ready, /live, /docs, /redoc, and /openapi.json requires Authorization: Bearer <token>. See docs/reference/api.md for the authentication contract and docs/azure-auth-implementation-plan.md for the Entra ID setup.

| Endpoint | Method | Auth | Description |
|---|---|---|---|
| /upload | POST | required | Synchronous upload with optional ?sheet_name, ?cedent_id, ?schema |
| /upload/async | POST | required | Async upload, returns job ID for polling |
| /jobs | GET | required | List all async jobs with filename and upload date |
| /jobs/{id} | GET | required | Poll async job status and result |
| /sheets | POST | required | List sheet names in an Excel file |
| /corrections | POST | required | Submit human-verified mapping corrections |
| /schemas | GET | required | List available target schemas |
| /schemas/{name} | GET | required | View a schema's full definition |
| /schemas | POST | required | Create a runtime schema from JSON |
| /schemas/{name} | DELETE | required | Delete a runtime schema |
| /sessions | POST | required | Upload file, get SLM suggestion + preview (interactive) |
| /sessions/{id} | GET | required | Current session state |
| /sessions/{id}/mappings | PUT | required | Edit mappings before finalising |
| /sessions/{id}/target-fields | PATCH | required | Add custom target fields to a session |
| /sessions/{id}/finalise | POST | required | Validate rows with user's mapping |
| /sessions/{id} | DELETE | required | Clean up session + temp file |
| /me | GET | required | Authenticated caller's identity + cedent assignments (Phase 1 returns empty list) |
| /health | GET | open | Combined health check (includes Redis status) |
| /ready | GET | open | Kubernetes readiness probe (503 if Redis unreachable) |
| /live | GET | open | Kubernetes liveness probe (200 if process alive) |

Unauthenticated requests to protected endpoints return 401 UNAUTHORIZED with an RFC 6750 WWW-Authenticate header. When the API can't reach Entra's JWKS endpoint, requests fail with 503 AUTH_INFRASTRUCTURE_UNAVAILABLE rather than 401 (the token isn't necessarily invalid — we just can't check it).
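On the client side, the two failure modes call for different reactions. The following sketch encodes the open-path list and the 401/503 distinction described above. The function names and the returned action strings are hypothetical, the path check is an exact match only, and a 503 from a protected route is treated as "retry later" without inspecting the response body.

```python
from typing import Optional

# Paths the README lists as not requiring a bearer token.
OPEN_PATHS = frozenset(
    {"/health", "/ready", "/live", "/docs", "/redoc", "/openapi.json"}
)


def requires_bearer_token(path: str) -> bool:
    """True when the path is not one of the documented open endpoints."""
    return path not in OPEN_PATHS


def auth_failure_action(status_code: int) -> Optional[str]:
    """Suggest a client reaction to an auth-related status code."""
    if status_code == 401:
        # Missing or invalid token: obtain a fresh one before retrying.
        return "reauthenticate"
    if status_code == 503:
        # Token could not be checked; it may still be valid, so keep it.
        return "retry_later"
    return None
```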

src/
  entrypoint/        # FastAPI wiring (composition root) — incl. auth + middleware wiring
  domain/
    model/           # TargetSchema, MappingSession, ColumnMapping, date_format, errors,
                     # User (authenticated caller identity)
    service/         # MappingService (orchestration)
  ports/
    input/           # IngestorPort
    output/          # MapperPort, CachePort, SessionStorePort, SchemaStorePort,
                     # IdentityProviderPort, ...
  adapters/
    http/            # FastAPI routes, RequestIdMiddleware, SecurityHeadersMiddleware,
                     # auth dependency (Depends(require_user))
    auth/            # EntraJwtValidator + JwksCache (Entra ID JWT validation)
    slm/             # Groq API mapper
    storage/         # Redis cache, session store, schema store, job store
    parsers/         # Polars ingestor, YAML schema loader

Target Schema

The default target schema (schemas/standard_reinsurance.yaml) maps Bordereaux data to:

| Field | Type | Constraints |
|---|---|---|
| Policy_ID | String | Not empty |
| Inception_Date | Date | Required |
| Expiry_Date | Date | Must not precede Inception_Date |
| Sum_Insured | Float | Non-negative |
| Gross_Premium | Float | Non-negative |
| Currency | Currency | USD, GBP, EUR, JPY |
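To make the constraints concrete, here is a standard-library sketch that checks one already-parsed row against them. It is not the project's record_factory (which builds pydantic models dynamically); `validate_row` and its error messages are hypothetical, and the sketch assumes Expiry_Date may be absent, since the table only constrains its ordering.

```python
from datetime import date

ALLOWED_CURRENCIES = {"USD", "GBP", "EUR", "JPY"}


def validate_row(row: dict) -> list[str]:
    """Return a list of constraint violations; an empty list means valid."""
    errors = []

    if not str(row.get("Policy_ID") or "").strip():
        errors.append("Policy_ID must not be empty")

    inception = row.get("Inception_Date")
    expiry = row.get("Expiry_Date")
    if not isinstance(inception, date):
        errors.append("Inception_Date is required")
    # Cross-field rule: only checked when both dates are present.
    elif isinstance(expiry, date) and expiry < inception:
        errors.append("Expiry_Date must not precede Inception_Date")

    for field in ("Sum_Insured", "Gross_Premium"):
        value = row.get(field)
        if not isinstance(value, (int, float)) or value < 0:
            errors.append(f"{field} must be a non-negative number")

    if row.get("Currency") not in ALLOWED_CURRENCIES:
        errors.append("Currency must be one of USD, GBP, EUR, JPY")

    return errors
```

Collecting all violations, rather than stopping at the first, suits a per-row report better than raising on the first failure would.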

The schema is configurable via YAML. Custom schemas can define different fields, types, constraints, cross-field rules, and SLM hints. See:

License

This project is dual-licensed:

  • Open Source: GNU General Public License v3.0 — free for open-source use, modification, and distribution under GPL terms.
  • Commercial: For use in proprietary or closed-source products without GPL obligations, a commercial license is available. Contact ricjhill for details.

© 2025-2026 ricjhill. All rights reserved.
