diff --git a/README.md b/README.md
index 820cb00..cc2f0cf 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,11 @@
 
 DataHelm is a data engineering framework focused on the following:
 
-- source ingestion and orchestration
+- Source ingestion and orchestration
 - dbt transformation workflows
-- notebook-based dashboard execution
-- reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
-- optional local LLM analytics query scaffolding
+- Notebook-based dashboard execution
+- Reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
+- Optional local LLM analytics query scaffolding
 
 ![DataHelm Architecture](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)
 
@@ -53,18 +53,20 @@
 ingestion/
 tests/
 scripts/
 docs/
 ```
 
 ## Local Setup
 
 ### Prerequisites
 
 - Python 3.12+
 - PostgreSQL (accessible from the local environment)
 - Optional: Docker, local Ollama, dbt CLI
 
 ### Installation
 
+Run the following commands to set up the local environment:
+
 ```bash
 python3 -m venv .venv
 source .venv/bin/activate
@@ -74,9 +76,9 @@ pip install -e .
 ```
 
 ### Environment Variables
 
 Create a `.env` file in the repository root with the required values, for example:
 
-```env
+```text
 DB_HOST=${DB_HOST}
 DB_PORT=${DB_PORT}
 DB_USER=${DB_USER}
@@ -87,11 +89,13 @@ CLASHOFCLANS_API_TOKEN=${CLASHOFCLANS_API_TOKEN}
 
 ### Run Dagster Locally
 
+To start Dagster locally, run:
+
 ```bash
 python scripts/run_dagster_dev.py
 ```
 
-Useful option for quick verification:
+For quick verification without executing jobs, run:
 
 ```bash
 python scripts/run_dagster_dev.py --print-only
@@ -99,51 +103,44 @@ python scripts/run_dagster_dev.py --print-only
 ```
 
 ## Configuration Model
 
 ### Ingestion Config (`config/api/*.yaml`)
 
 Defines source-level extraction, publish targets, schedules, and column mapping.
 
-Example currently included:
-
-- `CLASHOFCLANS_PLAYER_STATS`
+Example included: `CLASHOFCLANS_PLAYER_STATS`
 
 ### dbt Config (`config/dbt/projects.yaml`)
 
 Defines dbt units, selection/exclusion rules, vars, and schedules.
 
 ### Dashboard Config (`config/dashboard/projects.yaml`)
 
 Defines notebook path, source table mapping, chart columns, and cadence.
 
 ### Analytics Semantic Config (`config/analytics/semantic_catalog.yaml`)
 
 Defines dataset metadata for the isolated NL-to-SQL module.
 
 ## Reusable Connectors
 
 The repository includes reusable connector classes under `handlers/`:
 
-- `handlers/sharepoint/sharepoint.py`
-  - Microsoft Graph auth + site/file access helpers
-- `handlers/gcs/gcs.py`
-  - upload/download/list/delete/signed URL helpers
-- `handlers/s3/s3.py`
-  - upload/download/list/delete/presigned URL helpers
-- `handlers/bigquery/bigquery.py`
-  - query, row fetch, dataframe load, schema helpers
+- `handlers/sharepoint/sharepoint.py`: Microsoft Graph auth + site/file access helpers
+- `handlers/gcs/gcs.py`: upload/download/list/delete/signed URL helpers
+- `handlers/s3/s3.py`: upload/download/list/delete/presigned URL helpers
+- `handlers/bigquery/bigquery.py`: query, row fetch, dataframe load, schema helpers
 
 ## Local LLM Analytics Module
 
 `analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:
 
-- semantic catalog loader
+- Semantic catalog loader
 - SQL read-only safety guard
 - Ollama client wrapper
-- orchestration service
+- Orchestration service
 
 ## Testing
 
 Run all tests:
 
 ```bash
 .venv/bin/python -m pytest -q
@@ -151,26 +148,26 @@ Run all tests:
 
 The current test suite includes coverage for:
 
-- ingestion and handler behavior
-- analytics factory and runner logic
-- connector modules (SharePoint, GCS, S3, BigQuery)
-- script behavior
+- Ingestion and handler behavior
+- Analytics factory and runner logic
+- Connector modules (SharePoint, GCS, S3, BigQuery)
+- Script behavior
 - NL-query safety and service paths
 
 ## CI/CD and Branching
 
 - `dev`: integration branch
 - `master`: release/production branch
 
 Workflows:
 
 - **CI**: tests on development and PR flows
 - **Docker Release**: image build/publish on `master`
 - **Deploy Release**: workflow_run/manual deployment orchestration
 
 ## Containerization
 
-Container image is defined via `Dockerfile`.
+The container image is defined in `Dockerfile`.
 
 Default runtime command starts the Dagster gRPC server:
 
@@ -182,17 +179,11 @@
 python -m dagster api grpc -m dagster_op.repository
 ```
 
 Deployment flow is workflow-based:
 
-- production auto-path after successful Docker release
-- manual staging/production dispatch path
-
-## Contributing and Governance
-
-- Contribution guide: `CONTRIBUTING.md`
-- Code of conduct: `CODE_OF_CONDUCT.md`
-- Security reporting: `SECURITY.md`
+- Production auto-path after a successful Docker release
+- Manual staging/production dispatch path
 
 ## Detailed Technical Documentation
 
 For complete, long-form project documentation (operations, architecture, and runbook-style details), see:
 
 - `docs/document.md`
diff --git a/scripts/lint_configs.py b/scripts/lint_configs.py
new file mode 100644
index 0000000..991e200
--- /dev/null
+++ b/scripts/lint_configs.py
@@ -0,0 +1,50 @@
+import os
+import argparse
+import sys
+
+import yaml
+
+
+def lint_directory(config_dir):
+    # --- FIX 1: Path Validation ---
+    if not os.path.isdir(config_dir):
+        print(f"🚨 Error: The path '{config_dir}' does not exist or is not a directory.")
+        sys.exit(1)
+
+    print(f"🔍 Linting YAML files in '{config_dir}/'...\n")
+
+    error_count = 0
+    file_count = 0
+
+    for root, _, files in os.walk(config_dir):
+        for file in files:
+            if file.endswith((".yaml", ".yml")):
+                file_count += 1
+                filepath = os.path.join(root, file)
+
+                # --- FIX 2: File Read Robustness ---
+                try:
+                    with open(filepath, 'r', encoding='utf-8') as f:
+                        yaml.safe_load(f)
+                except OSError as e:
+                    error_count += 1
+                    print(f"❌ IO Error in: {filepath}\n   Details: {e}\n")
+                except yaml.YAMLError as exc:
+                    error_count += 1
+                    print(f"❌ Syntax Error in: {filepath}")
+                    if hasattr(exc, 'problem_mark'):
+                        mark = exc.problem_mark
+                        print(f"   Hint: Check line {mark.line + 1}, column {mark.column + 1}.\n")
+                    else:
+                        print(f"   Details: {exc}\n")
+
+    if error_count == 0:
+        print(f"✅ Success! Checked {file_count} files and found no errors.")
+    else:
+        print(f"🚨 Failed: Found {error_count} error(s).")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Lint YAML configuration files.")
+    parser.add_argument("--path", type=str, default="config", help="Path to config directory")
+    args = parser.parse_args()
+    lint_directory(args.path)
diff --git a/tests/test_lint_configs.py b/tests/test_lint_configs.py
new file mode 100644
index 0000000..a2f2ab8
--- /dev/null
+++ b/tests/test_lint_configs.py
@@ -0,0 +1,16 @@
+import subprocess
+import sys
+
+def test_lint_success():
+    # Runs the linter against the default 'config' folder, which should be valid
+    result = subprocess.run([sys.executable, "scripts/lint_configs.py", "--path", "config"], capture_output=True, text=True)
+    assert result.returncode == 0
+    assert "Success" in result.stdout
+
+def test_invalid_path():
+    # A non-existent directory should fail fast with a clear message
+    result = subprocess.run([sys.executable, "scripts/lint_configs.py", "--path", "does-not-exist"], capture_output=True, text=True)
+    assert result.returncode == 1
+    assert "Error: The path" in result.stdout
+
+# More tests (e.g., malformed YAML fixtures) can be added later; this covers the fail-fast requirement.
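
The `problem_mark` branch in `scripts/lint_configs.py` relies on PyYAML attaching a source position to most syntax errors, which holds for its scanner and parser errors. A minimal sketch of that behavior, using a made-up malformed document:

```python
import yaml

# A deliberately malformed document: the flow sequence is never closed.
broken = "schedule: [hourly, daily"

hint = None
try:
    yaml.safe_load(broken)
except yaml.YAMLError as exc:
    # Scanner/parser errors carry a problem_mark with 0-based
    # line/column coordinates; convert to 1-based like the linter does.
    if hasattr(exc, "problem_mark") and exc.problem_mark is not None:
        mark = exc.problem_mark
        hint = (mark.line + 1, mark.column + 1)

print(hint)
```

Not every `YAMLError` carries a mark, which is why the script keeps a fallback branch that prints the exception itself.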
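
The walk-and-parse loop in `lint_directory` can also be exercised without touching the real `config/` tree. This sketch (file names invented for illustration) replicates the core check against a throwaway directory:

```python
import os
import tempfile

import yaml

errors = 0
with tempfile.TemporaryDirectory() as tmp:
    # One valid file and one broken file, mirroring what the linter walks over.
    with open(os.path.join(tmp, "good.yaml"), "w", encoding="utf-8") as f:
        f.write("name: demo\nschedule: daily\n")
    with open(os.path.join(tmp, "bad.yaml"), "w", encoding="utf-8") as f:
        f.write("schedule: [hourly, daily\n")  # unclosed flow sequence

    # Same pattern as the script: walk, filter on extension, try to parse.
    for root, _, files in os.walk(tmp):
        for name in files:
            if name.endswith((".yaml", ".yml")):
                try:
                    with open(os.path.join(root, name), encoding="utf-8") as fh:
                        yaml.safe_load(fh)
                except yaml.YAMLError:
                    errors += 1

print(errors)  # exactly one of the two files fails to parse
```

Wrapping this pattern in a `tmp_path`-based pytest case would cover the malformed-YAML branch the current test file leaves for later.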