diff --git a/README.md b/README.md
index 820cb00..cc2f0cf 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,11 @@ DataHelm is a data engineering framework focused on the following:
 
-- source ingestion and orchestration
+- Source ingestion and orchestration
 - dbt transformation workflows
-- notebook-based dashboard execution
-- reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
-- optional local LLM analytics query scaffolding
+- Notebook-based dashboard execution
+- Reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
+- Optional local LLM analytics query scaffolding
 
 ![DataHelm Architecture](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)
@@ -53,18 +53,20 @@ ingestion/
 tests/
 scripts/
 docs/
 ```
 
 ## Local Setup
 
 ### Prerequisites
 
 - Python 3.12+
 - PostgreSQL (accessible from the local environment)
 - Optional: Docker, local Ollama, dbt CLI
 
 ### Installation
 
+Run the following commands to set up the local environment:
+
 ```bash
 python3 -m venv .venv
 source .venv/bin/activate
@@ -74,9 +76,9 @@ pip install -e .
 ### Environment Variables
 
 Create a `.env` file in the repository root with the required values, for example:
 
-```env
+```text
 DB_HOST=${DB_HOST}
 DB_PORT=${DB_PORT}
 DB_USER=${DB_USER}
@@ -87,11 +89,13 @@ CLASHOFCLANS_API_TOKEN=${CLASHOFCLANS_API_TOKEN}
 
 ### Run Dagster Locally
 
+To start Dagster locally, run:
+
 ```bash
 python scripts/run_dagster_dev.py
 ```
 
-Useful option for quick verification:
+For a quick verification without executing jobs, run:
 
 ```bash
 python scripts/run_dagster_dev.py --print-only
@@ -99,51 +103,44 @@ python scripts/run_dagster_dev.py --print-only
 
 ## Configuration Model
 
 ### Ingestion Config (`config/api/*.yaml`)
 
 Defines source-level extraction, publish targets, schedules, and column mapping.
 
-Example currently included:
-
-- `CLASHOFCLANS_PLAYER_STATS`
+Example included: `CLASHOFCLANS_PLAYER_STATS`
 
 ### dbt Config (`config/dbt/projects.yaml`)
 
 Defines dbt units, selection/exclusion rules, vars, and schedules.
 
 ### Dashboard Config (`config/dashboard/projects.yaml`)
 
 Defines notebook path, source table mapping, chart columns, and cadence.
 
 ### Analytics Semantic Config (`config/analytics/semantic_catalog.yaml`)
 
 Defines dataset metadata for the isolated NL-to-SQL module.
 ## Reusable Connectors
 
 The repository includes reusable connector classes under `handlers/`:
 
-- `handlers/sharepoint/sharepoint.py`
-  - Microsoft Graph auth + site/file access helpers
-- `handlers/gcs/gcs.py`
-  - upload/download/list/delete/signed URL helpers
-- `handlers/s3/s3.py`
-  - upload/download/list/delete/presigned URL helpers
-- `handlers/bigquery/bigquery.py`
-  - query, row fetch, dataframe load, schema helpers
+- `handlers/sharepoint/sharepoint.py` – Microsoft Graph auth + site/file access helpers
+- `handlers/gcs/gcs.py` – Upload/download/list/delete/signed URL helpers
+- `handlers/s3/s3.py` – Upload/download/list/delete/presigned URL helpers
+- `handlers/bigquery/bigquery.py` – Query, row fetch, dataframe load, schema helpers
 
 ## Local LLM Analytics Module
 
 `analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:
 
-- semantic catalog loader
-- SQL read-only safety guard
-- Ollama client wrapper
-- orchestration service
+- Semantic catalog loader
+- SQL read-only safety guard
+- Ollama client wrapper
+- Orchestration service
 
 ## Testing
 
-Run all tests:
+Run all tests with the following command:
 
 ```bash
 .venv/bin/python -m pytest -q
@@ -151,26 +148,26 @@ Run all tests:
 
 The current test suite includes coverage for:
 
-- ingestion and handler behavior
-- analytics factory and runner logic
-- connector modules (SharePoint, GCS, S3, BigQuery)
-- script behavior
-- NL-query safety and service paths
+- Ingestion and handler behavior
+- Analytics factory and runner logic
+- Connector modules (SharePoint, GCS, S3, BigQuery)
+- Script behavior
+- NL-query safety and service paths
 
 ## CI/CD and Branching
 
 - `dev`: integration branch
 - `master`: release/production branch
 
 Workflows:
 
 - **CI**: tests on development and PR flows
 - **Docker Release**: image build/publish on `master`
 - **Deploy Release**: workflow_run/manual deployment orchestration
 
 ## Containerization
 
-Container image is defined via `Dockerfile`.
+The container image is defined via the `Dockerfile`.
 
 Default runtime command starts the Dagster gRPC server:
 
@@ -182,17 +179,11 @@ python -m dagster api grpc -m dagster_op.repository
 
 Deployment flow is workflow-based:
 
-- production auto-path after successful Docker release
-- manual staging/production dispatch path
-
-## Contributing and Governance
-
-- Contribution guide: `CONTRIBUTING.md`
-- Code of conduct: `CODE_OF_CONDUCT.md`
-- Security reporting: `SECURITY.md`
+- Production auto-path after successful Docker release
+- Manual staging/production dispatch path
 
 ## Detailed Technical Documentation
 
 For complete, long-form project documentation (operations, architecture, and runbook-style details), see:
 
 - `docs/document.md`
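The diff above lists an "SQL read-only safety guard" among the `analytics/nl_query/` components. A minimal sketch of such a guard follows; the function name, keyword list, and logic are illustrative assumptions, not the module's actual implementation:

```python
import re

# Hypothetical keyword blocklist for a read-only guard; the real module
# in analytics/nl_query/ may use a different strategy entirely.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|merge)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Accept only a single SELECT (or CTE-prefixed) statement."""
    statements = [s for s in sql.strip().rstrip(";").split(";") if s.strip()]
    if len(statements) != 1:
        return False  # reject multi-statement payloads
    first = statements[0].lstrip().lower()
    if not (first.startswith("select") or first.startswith("with")):
        return False
    # Reject any DML/DDL keyword anywhere in the statement.
    return not _FORBIDDEN.search(first)
```

A guard like this would sit between the Ollama client wrapper and query execution, so generated SQL that mutates state is rejected before it ever reaches the warehouse.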