99 changes: 45 additions & 54 deletions README.md

DataHelm is a data engineering framework focused on the following:

- Source ingestion and orchestration
- dbt transformation workflows
- Notebook-based dashboard execution
- Reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
- Optional local LLM analytics query scaffolding

![DataHelm Architecture](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)

ingestion/
tests/
scripts/
docs/
```

## Local Setup

### Prerequisites

- Python 3.12+
- PostgreSQL (accessible from the local environment)
- Optional: Docker, local Ollama, dbt CLI

### Installation

Run the following commands to set up the local environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

### Environment Variables

Create a `.env` file in the repository root with the required values, for example:

```env
DB_HOST=${DB_HOST}
DB_PORT=${DB_PORT}
DB_USER=${DB_USER}
CLASHOFCLANS_API_TOKEN=${CLASHOFCLANS_API_TOKEN}
```
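The snippet below sketches one way to load these values into the process environment using only the standard library; `load_env_file` is an illustrative helper, not part of the framework (a real setup might use `python-dotenv` instead):

```python
import os


def load_env_file(path: str = ".env") -> dict:
    """Minimal .env parser: KEY=VALUE lines; blank lines and '#' comments ignored."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    # Make the values visible to the rest of the process
    os.environ.update(values)
    return values
```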

### Run Dagster Locally

To start Dagster locally, run:

```bash
python scripts/run_dagster_dev.py
```

For a quick verification without executing jobs, run:

```bash
python scripts/run_dagster_dev.py --print-only
```

## Configuration Model

### Ingestion Config (`config/api/*.yaml`)

Defines source-level extraction, publish targets, schedules, and column mapping.
Example included: `CLASHOFCLANS_PLAYER_STATS`
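To make the shape of such a config concrete, a source entry could look roughly like this — the field names below are illustrative assumptions, not the actual schema:

```yaml
# Hypothetical sketch of a config/api/*.yaml entry
source: CLASHOFCLANS_PLAYER_STATS
extraction:
  endpoint: players
  schedule: "0 6 * * *"      # daily at 06:00
publish:
  target: postgres
  table: raw.clashofclans_player_stats
column_mapping:
  tag: player_tag
  trophies: trophies
```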

### dbt Config (`config/dbt/projects.yaml`)

Defines dbt units, selection/exclusion rules, vars, and schedules.

### Dashboard Config (`config/dashboard/projects.yaml`)

Defines notebook path, source table mapping, chart columns, and cadence.

### Analytics Semantic Config (`config/analytics/semantic_catalog.yaml`)

Defines dataset metadata for the isolated NL-to-SQL module.
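A semantic catalog entry might look like the following — the keys shown are assumptions for illustration; the real schema lives in `config/analytics/semantic_catalog.yaml`:

```yaml
# Hypothetical dataset metadata for NL-to-SQL grounding
datasets:
  - name: player_stats
    table: analytics.player_stats
    description: "Per-player trophy and clan metrics"
    columns:
      - name: player_tag
        description: "Unique player identifier"
      - name: trophies
        description: "Current trophy count"
```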

## Reusable Connectors

The repository includes reusable connector classes under `handlers/`:

- `handlers/sharepoint/sharepoint.py` – Microsoft Graph auth + site/file access helpers
- `handlers/gcs/gcs.py` – upload/download/list/delete/signed URL helpers
- `handlers/s3/s3.py` – upload/download/list/delete/presigned URL helpers
- `handlers/bigquery/bigquery.py` – query, row fetch, dataframe load, schema helpers

## Local LLM Analytics Module

`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:

- Semantic catalog loader
- SQL read-only safety guard
- Ollama client wrapper
- Orchestration service
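As an illustration of the read-only safety-guard idea, a minimal check might look like the sketch below; the function name, keyword list, and rules are assumptions, not the module's actual API:

```python
import re

# Write/DDL keywords that should never appear in generated analytics SQL
# (illustrative list -- the real guard may differ).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Accept only a single SELECT/WITH statement with no write keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement payloads
        return False
    if not re.match(r"(?i)^\s*(select|with)\b", stripped):
        return False
    return not FORBIDDEN.search(stripped)
```

Rejecting anything that is not a single `SELECT`/`WITH` statement keeps an LLM-generated query from ever mutating the warehouse, even if the model is prompted adversarially.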

## Testing

Run all tests:

```bash
.venv/bin/python -m pytest -q
```

The current test suite includes coverage for:

- Ingestion and handler behavior
- Analytics factory and runner logic
- Connector modules (SharePoint, GCS, S3, BigQuery)
- Script behavior
- NL-query safety and service paths

## CI/CD and Branching

- `dev`: integration branch
- `master`: release/production branch

Workflows:

- **CI**: tests on development and PR flows
- **Docker Release**: image build/publish on `master`
- **Deploy Release**: workflow_run/manual deployment orchestration

## Containerization

Container image is defined via `Dockerfile`.

Default runtime command starts the Dagster gRPC server:

```bash
python -m dagster api grpc -m dagster_op.repository
```
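A containerfile along these lines would produce that behavior; the base image, paths, and port below are assumptions for illustration, not the repository's actual `Dockerfile`:

```dockerfile
# Hypothetical sketch -- see the real Dockerfile for the authoritative build
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -e .
# Default runtime command: Dagster gRPC code server
CMD ["python", "-m", "dagster", "api", "grpc", "-m", "dagster_op.repository", "-h", "0.0.0.0", "-p", "4000"]
```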

Deployment flow is workflow-based:

- Production auto-path after successful Docker release
- Manual staging/production dispatch path

## Contributing and Governance

- Contribution guide: `CONTRIBUTING.md`
- Code of conduct: `CODE_OF_CONDUCT.md`
- Security reporting: `SECURITY.md`

## Detailed Technical Documentation

For complete, long-form project documentation (operations, architecture, and runbook-style details), see:

- `docs/document.md`
48 changes: 48 additions & 0 deletions scripts/lint_configs.py
```python
import os
import sys
import argparse
import yaml


def lint_directory(config_dir):
    # Fail fast if the target path is missing or not a directory.
    if not os.path.isdir(config_dir):
        print(f"🚨 Error: The path '{config_dir}' does not exist or is not a directory.")
        sys.exit(1)

    print(f"🔍 Linting YAML files in '{config_dir}/'...\n")

    error_count = 0
    file_count = 0

    for root, _, files in os.walk(config_dir):
        for file in files:
            if file.endswith((".yaml", ".yml")):
                file_count += 1
                filepath = os.path.join(root, file)

                # Tolerate unreadable files and report parse errors with positions.
                try:
                    with open(filepath, "r", encoding="utf-8") as f:
                        yaml.safe_load(f)
                except OSError as e:
                    error_count += 1
                    print(f"❌ IO Error in: {filepath}\n   Details: {e}\n")
                except yaml.YAMLError as exc:
                    error_count += 1
                    print(f"❌ Syntax Error in: {filepath}")
                    if hasattr(exc, "problem_mark"):
                        mark = exc.problem_mark
                        print(f"   Hint: Check line {mark.line + 1}, column {mark.column + 1}.\n")
                    else:
                        print(f"   Details: {exc}\n")

    if error_count == 0:
        print(f"✅ Success! Checked {file_count} files and found no errors.")
    else:
        print(f"🚨 Failed: Found {error_count} error(s).")
        sys.exit(1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Lint YAML configuration files.")
    parser.add_argument("--path", type=str, default="config", help="Path to config directory")
    args = parser.parse_args()
    lint_directory(args.path)
```
17 changes: 17 additions & 0 deletions tests/test_lint_configs.py
```python
import subprocess
import sys


def test_lint_success():
    # The default 'config' directory should lint cleanly.
    result = subprocess.run(
        [sys.executable, "scripts/lint_configs.py", "--path", "config"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0
    assert "Success" in result.stdout


def test_invalid_path():
    # A non-existent directory should fail fast with a clear error.
    result = subprocess.run(
        [sys.executable, "scripts/lint_configs.py", "--path", "does-not-exist"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 1
    assert "Error: The path" in result.stdout


# These cover the fail-fast requirement; more detailed cases can be added later.
```