This document is the single reference for local development, architecture, configuration, CI/CD, and operations for this repository.
DataHelm provides a configurable data platform skeleton for:
- source ingestion orchestration
- dbt-based transformations
- notebook-driven dashboard jobs
- reusable source connector handlers
- optional local-LLM analytics query capabilities
The design goal is rapid onboarding of new data sources while reusing shared orchestration and connector patterns.
The project has four primary layers:
- `ingestion/` - source extraction and publish logic
- `analytics/` - dbt orchestration, dashboard helpers, and optional NL-to-SQL module
- `dagster_op/` - jobs, schedules, sensors, and repository registration
- `config/` - YAML-driven source/dbt/dashboard/runtime metadata
Execution is orchestrated by Dagster. Ingestion writes raw data to Postgres, dbt builds transformed models, and dashboard jobs generate notebook outputs.
- `config/api/` - ingestion source configs
- `config/dbt/` - dbt project and unit config
- `config/dashboard/` - dashboard unit config
- `config/analytics/` - semantic catalog for optional NL query workflows
- `handlers/` - provider/source handlers (api, sharepoint, gcs, s3, bigquery)
- `ingestion/native_ingestions/` - ingestion implementations
- `analytics/dbt_projects/` - dbt project definitions
- `analytics/notebooks/` - Dagstermill notebooks
- `scripts/` - local utility scripts
- `tests/` - unit test suite
- `docs/` - documentation (this file)
- Python 3.12+
- PostgreSQL instance reachable from your machine
- Optional: dbt CLI, Docker, and local Ollama runtime
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
```

Create a `.env` file in the repo root with at least:
- DB_HOST
- DB_PORT
- DB_USER
- DB_PASSWORD
- DB_NAME
- CLASHOFCLANS_API_TOKEN (for the current API example)
Optional examples:
- DAGSTER_HOME
- DAGSTER_HOME_DIR
- DBT_TARGET
- DBT_SCHEMA
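As a sketch of how the DB_* variables typically come together, the snippet below assembles a SQLAlchemy-style Postgres URL. The URL format and the sample values are illustrative assumptions; the repository's actual consumer of these variables may assemble the connection differently.

```python
import os

def postgres_url() -> str:
    """Build a SQLAlchemy-style Postgres URL from the DB_* variables
    loaded from .env. (Illustrative only -- the project's actual
    connection logic may differ.)"""
    return (
        f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
    )

# Sample values standing in for a populated .env file.
os.environ.update(DB_USER="dev", DB_PASSWORD="secret",
                  DB_HOST="localhost", DB_PORT="5432", DB_NAME="datahelm")
print(postgres_url())  # postgresql://dev:secret@localhost:5432/datahelm
```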
```bash
python scripts/run_dagster_dev.py
```

Use `--print-only` to inspect the resolved paths/command without launching:

```bash
python scripts/run_dagster_dev.py --print-only
```

YAML configs are designed for shared defaults and per-unit overrides using OmegaConf interpolation.
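The defaults-plus-override pattern can be sketched in plain Python. The helper below shallow-merges a per-unit override onto shared defaults and resolves `${key}` interpolations in the OmegaConf style; the key names (`schema`, `source`, `table`) are hypothetical, and the real configs use OmegaConf itself rather than this helper.

```python
import re

def resolve(defaults: dict, override: dict) -> dict:
    """Merge a per-unit override onto shared defaults and resolve ${key}
    interpolations, mimicking the OmegaConf pattern the configs rely on.
    (Sketch only -- the project uses OmegaConf, not this helper.)"""
    merged = {**defaults, **override}

    def interp(value):
        # Replace ${name} with the merged value of that key.
        if isinstance(value, str):
            return re.sub(r"\$\{(\w+)\}", lambda m: str(merged[m.group(1)]), value)
        return value

    return {key: interp(value) for key, value in merged.items()}

# Hypothetical keys: a shared table pattern specialised per unit.
shared = {"schema": "raw", "source": "base", "table": "${source}_events"}
unit = {"source": "clashofclans"}
print(resolve(shared, unit)["table"])  # clashofclans_events
```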
Each source config under `config/api/` defines:
- `ingest_type` - extraction init parameters
- extraction runtime params
- publish target and table mapping
- schedules
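Put together, one source entry might look roughly like the dictionary below. Every key name and value here is a hypothetical illustration of the fields listed above, not the exact schema used in `config/api/`.

```python
# Hypothetical shape of one source entry under config/api/ -- key names
# are assumptions that only mirror the fields listed above.
source_config = {
    "ingest_type": "api",                        # extraction init parameters
    "runtime_params": {"endpoint": "/clans"},    # extraction runtime params
    "publish": {"schema": "raw",                 # publish target and
                "table": "clashofclans_clans"},  # table mapping
    "schedule": "0 6 * * *",                     # cron schedule
}
print(sorted(source_config))
```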
Each unit under `config/dbt/` defines:
- source/project directory mappings
- profile/target settings
- unit-level select/exclude/vars
- schedules
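A unit's select/exclude/vars settings ultimately map onto dbt CLI flags. The function below sketches that translation under assumed key names; the repository's real dbt runner may assemble the invocation differently.

```python
import json

def dbt_build_command(unit: dict) -> list[str]:
    """Translate a hypothetical dbt unit config into a dbt CLI invocation.
    (Sketch only; key names are assumptions, and the project's runner may
    build the command differently.)"""
    cmd = ["dbt", "build", "--target", unit.get("target", "dev")]
    if unit.get("select"):
        cmd += ["--select", *unit["select"]]
    if unit.get("exclude"):
        cmd += ["--exclude", *unit["exclude"]]
    if unit.get("vars"):
        cmd += ["--vars", json.dumps(unit["vars"])]
    return cmd

print(dbt_build_command({"select": ["staging"], "vars": {"run_date": "2024-01-01"}}))
```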
Each unit under `config/dashboard/` defines:
- notebook path
- source table metadata
- chart fields and row limits
- schedules
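The table metadata, chart fields, and row limit combine into the bounded query a dashboard notebook would run. The builder below is a sketch with assumed config keys (`source_table`, `chart_fields`, `row_limit`), not the notebooks' actual query logic.

```python
def dashboard_query(cfg: dict) -> str:
    """Turn dashboard unit metadata into the bounded SELECT a notebook
    might execute. (Key names are illustrative assumptions.)"""
    fields = ", ".join(cfg["chart_fields"])
    return f'SELECT {fields} FROM {cfg["source_table"]} LIMIT {cfg["row_limit"]}'

print(dashboard_query({"source_table": "marts.clan_stats",
                       "chart_fields": ["clan_name", "trophies"],
                       "row_limit": 500}))
# SELECT clan_name, trophies FROM marts.clan_stats LIMIT 500
```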
`config/analytics/` defines metadata for optional NL-to-SQL generation:
- dataset registry
- table names
- dimensions/metrics
- business synonyms
- global query rules
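One way a catalog like this gets used is resolving business synonyms to canonical column names before SQL generation. The entry below is entirely hypothetical (table, dimension, metric, and synonym names are made up) and only illustrates the lookup pattern.

```python
# Hypothetical semantic catalog entry -- names are illustrative, not the
# repository's actual catalog schema.
catalog = {
    "tables": {"marts.clan_stats": {"dimensions": ["clan_name"],
                                    "metrics": ["trophies"]}},
    "synonyms": {"score": "trophies", "team": "clan_name"},
}

def resolve_term(term: str) -> str:
    """Map a business synonym to its canonical column name, passing
    through terms that are already canonical."""
    return catalog["synonyms"].get(term, term)

print(resolve_term("score"))  # trophies
```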
The project includes reusable connectors so new sources avoid repeated auth and IO boilerplate.
- `handlers/sharepoint/sharepoint.py` - Graph auth, site resolution, file download, folder listing
- `handlers/gcs/gcs.py` - upload/download/list/delete/signed-URL helpers
- `handlers/s3/s3.py` - upload/download/list/delete/presigned-URL helpers
- `handlers/bigquery/bigquery.py` - query execution, table reads, dataframe load, schema helper
These connectors are intentionally generic so ingestion implementations can focus on parsing and data contracts.
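The GCS and S3 handlers share the same upload/download/list/delete surface, so ingestion code can be written against that shape rather than a concrete backend. The protocol and fake below sketch this pattern; they are not the exact signatures in `handlers/`.

```python
from typing import Protocol

class ObjectStoreHandler(Protocol):
    """Common surface the GCS and S3 handlers both expose -- a sketch of
    the shared pattern, not the exact signatures in handlers/."""
    def upload(self, local_path: str, remote_key: str) -> None: ...
    def download(self, remote_key: str, local_path: str) -> None: ...
    def list(self, prefix: str) -> list[str]: ...
    def delete(self, remote_key: str) -> None: ...

def count_objects(handler: ObjectStoreHandler, prefix: str) -> int:
    """Ingestion code can depend on the protocol instead of one backend."""
    return len(handler.list(prefix))

class FakeHandler:  # in-memory stand-in for a real GCS/S3 handler
    def __init__(self, keys): self._keys = list(keys)
    def upload(self, local_path, remote_key): pass
    def download(self, remote_key, local_path): pass
    def list(self, prefix): return [k for k in self._keys if k.startswith(prefix)]
    def delete(self, remote_key): self._keys.remove(remote_key)

print(count_objects(FakeHandler(["raw/a.csv", "raw/b.csv", "tmp/x"]), "raw/"))  # 2
```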
analytics/nl_query/ provides an isolated local-Ollama NL-to-SQL scaffold.
Includes:
- semantic catalog loader
- SQL safety guardrails (read-only and bounded queries)
- minimal Ollama client
- NL query orchestration service
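The read-only/bounded-query guardrail can be sketched as a single check-and-rewrite step. The function below is illustrative of the kind of guard the scaffold applies, not its actual implementation, and the `MAX_ROWS` value is an assumption.

```python
import re

MAX_ROWS = 1000  # assumed bound; the scaffold's real limit may differ

def guard_sql(sql: str) -> str:
    """Reject non-SELECT statements and force a row bound -- the kind of
    read-only guardrail the NL-to-SQL scaffold applies (illustrative)."""
    stmt = sql.strip().rstrip(";")
    if not re.match(r"(?i)^select\b", stmt):
        raise ValueError("only SELECT statements are allowed")
    if re.search(r"(?i)\blimit\s+\d+\b", stmt) is None:
        stmt += f" LIMIT {MAX_ROWS}"
    return stmt

print(guard_sql("SELECT clan_name FROM marts.clan_stats"))
# SELECT clan_name FROM marts.clan_stats LIMIT 1000
```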
Important:
- this module is optional and not wired into existing ingestion flow by default
- no production behavior changes unless explicitly integrated
Run the full test suite:

```bash
.venv/bin/python -m pytest -q
```

Coverage currently includes:
- handler logic and edge cases
- ingestion factory and native run paths
- base ingestion helper branches
- analytics dbt runner and factory behavior
- script/bootstrap behavior
- connector modules (SharePoint, GCS, S3, BigQuery)
- NL query scaffold modules
Branching:
- `dev` = integration branch
- `master` = release/prod branch
Workflows:
- CI - validates tests for the development flow
- Docker Release - builds/pushes the image on `master`
- Deploy Release - supports auto/manual deployment paths
Deployment uses SSH + GHCR pull model and gracefully skips remote deployment when secrets are not yet configured.
Shared:
- GHCR_USERNAME
- GHCR_READ_TOKEN
Staging:
- STAGING_SSH_HOST
- STAGING_SSH_USER
- STAGING_SSH_KEY
- STAGING_APP_ENV_FILE
Production:
- PROD_SSH_HOST
- PROD_SSH_USER
- PROD_SSH_KEY
- PROD_APP_ENV_FILE
```bash
.venv/bin/python -m pytest -q
python scripts/run_dagster_dev.py
docker build -t datahelm:local .
```