Merged
47 commits
4eaedb7
Add an initial Terraform project to provision cloud infrastructure
martyngigg Jan 23, 2026
3ba932d
WIP: Remove Ansible provisioning code
martyngigg Jan 23, 2026
9dfdae7
Add base configuration to each deploy role
martyngigg Jan 23, 2026
afd6eff
WIP: Move to single site.yml playbook
martyngigg Jan 23, 2026
29630c2
Successful postgres deployment.
martyngigg Jan 24, 2026
58f15eb
Remove traefik deploy playbook
martyngigg Jan 24, 2026
7c95a01
Move keycloak deploy to site.yml
martyngigg Jan 24, 2026
c602e2e
Working Traefik/Keycloak on dev-analytics.isis.cclrc.ac.uk
martyngigg Jan 26, 2026
fc975f0
Working Keycloak deployment in dev
martyngigg Jan 26, 2026
e934c02
Deploy a minio instance for dev to simulate S3 Echo
martyngigg Jan 26, 2026
3b32f15
Begin deploying Lakekeeper
martyngigg Jan 27, 2026
b0d0df2
Don't bootstrap Lakekeeper admin yet. OpenFGA is not available
martyngigg Jan 28, 2026
de20ca2
Always bootstrap Keycloak admin
martyngigg Jan 28, 2026
45c65f7
Use default Lakekeeper project
martyngigg Jan 29, 2026
69da568
Use common image and ssh key in terraform scripts
martyngigg Jan 29, 2026
e658ba2
Working Trino dev deployment
martyngigg Jan 29, 2026
0cf1573
Remove old sample inventory
martyngigg Jan 29, 2026
217f7c2
WIP: Reorg of Superset role to simplify variable mappings
martyngigg Jan 29, 2026
88055a2
Allow alternate redis database to be used for Superset
martyngigg Feb 3, 2026
f62d546
Working dev version of superset_farm node
martyngigg Feb 3, 2026
0b2cee1
Update secrets.
martyngigg Feb 4, 2026
ff33909
Rename ansible-docker -> ansible and move terraform inside.
martyngigg Feb 4, 2026
548fe73
Add playbook for updating running packages.
martyngigg Feb 4, 2026
52cbe27
Update README and documentation
martyngigg Feb 4, 2026
bdf149f
Add deployment of ELT node
martyngigg Feb 4, 2026
c5ad7c1
Use interactive container for superset-cli
martyngigg Feb 4, 2026
9b96f9f
Disable Superset alerting
martyngigg Feb 4, 2026
46513d1
Update Superset image to v6-based image.
martyngigg Feb 4, 2026
96a449b
Default to verify=True for Trino maintenance jobs
martyngigg Feb 4, 2026
17cf1a8
Fix type conversion
martyngigg Feb 4, 2026
28b3552
Fixup access controls for Trino and client secrets
martyngigg Feb 4, 2026
a75dbaf
Produce more output with GitHub compose action
martyngigg Feb 4, 2026
653a54e
Fix paths after dir renaming
martyngigg Feb 4, 2026
e99a653
Fix gitattributes path
martyngigg Feb 4, 2026
6d10bcc
Use pydantic_settings to configure cli arguments
martyngigg Feb 4, 2026
27c54d6
Fix variable name
martyngigg Feb 4, 2026
8338161
Fix file permissions and ownership
martyngigg Feb 4, 2026
f4ab0e1
Fix typo
martyngigg Feb 4, 2026
1d03fb2
Prevent clash of Iceberg maintenance jobs
martyngigg Feb 4, 2026
d40bb09
Upgrade Keycloak to address CVEs
martyngigg Feb 4, 2026
5d58e30
Remove duplicated key
martyngigg Feb 4, 2026
828d01f
Avoid newlines passed to docker command
martyngigg Feb 4, 2026
2f42cf4
Fix permissions on secrets files
martyngigg Feb 4, 2026
8c40b9e
Fix env mapping syntax
martyngigg Feb 4, 2026
7b9761d
Add redis healthcheck
martyngigg Feb 4, 2026
9e18388
Fix directory modes
martyngigg Feb 4, 2026
4894f74
Fix trino.list_tables test
martyngigg Feb 4, 2026
4 changes: 3 additions & 1 deletion .gitattributes
@@ -1 +1,3 @@
infra/ansible-docker/group_vars/all/vault.yml diff=ansible-vault
infra/ansible/group_vars/all/vault.yml diff=ansible-vault
infra/ansible/inventories/dev/group_vars/all/vault.yml diff=ansible-vault
infra/ansible/inventories/qa/group_vars/all/vault.yml diff=ansible-vault
4 changes: 2 additions & 2 deletions .github/workflows/warehouses_e2e_tests.yml
@@ -32,10 +32,10 @@ jobs:
        uses: actions/checkout@v6

      - name: Bring up Docker Compose services
        uses: hoverkraft-tech/compose-action@v2.3.0
        uses: hoverkraft-tech/compose-action@v2.5.0
        with:
          compose-file: infra/local/docker-compose.yml
          up-flags: --quiet-pull --wait --wait-timeout 300
          up-flags: --wait --wait-timeout 300
          down-flags: --volumes --remove-orphans

      - name: Add adp-router to /etc/hosts
6 changes: 3 additions & 3 deletions AGENTS.md
@@ -33,8 +33,8 @@ common developer tasks.

## Cloud deployment

- Use the Ansible playbooks in `infra/ansible-docker/` together with the
`inventory*.yml` files. See the `infra/ansible-docker/readme.md` for role and
- Use the Ansible playbooks in `infra/ansible/` together with the
`inventories/**` files. See the `infra/ansible/readme.md` for role and
variable guidance.

## Pull Request Guidelines
@@ -55,7 +55,7 @@ common developer tasks.
- Docker resource issues: the local compose stack can be resource-heavy. Ensure
Docker Desktop has enough CPU/memory.
- Ansible role errors: ensure you have required galaxy roles (see
`infra/ansible-docker/ansible-galaxy-requirements.yaml`) and the correct Python
`infra/ansible/ansible-galaxy-requirements.yaml`) and the correct Python
and Ansible versions installed.

## Where to go next
83 changes: 18 additions & 65 deletions docs-devel/deployment/index.md
@@ -9,78 +9,31 @@ _It is not yet a production-grade, HA system._
There are several prerequisite steps required before deployment can begin. Please
read [them here](./prerequisites.md).

## Ansible
## Provision VMs

Ansible playbooks in `<repo_root>/ansible-docker/playbooks` control the deployment
of the system. All commands in this section assume that the current working directory
is `infra/ansible-docker`.
Choose the environment you are configuring for, `dev` or `qa`, and provision the cloud resources:

### Networking

Create the private VM network:

```sh
ansible-playbook playbooks/cloud/private_network_create.yml
```

Create a node for Traefik (also acts as an SSH jump node):

```sh
ansible-playbook \
-e openstack_cloud_name=<cloud_name> \
-e openstack_key_name=<ssh_key> \
-e vm_config_file=$PWD/playbooks/traefik/group_vars/traefik.yml
playbooks/cloud/vm_create.yml
```bash
> cd infra/ansible/terraform
> tofu init
> tofu plan -var-file <tfvars-file> -var cloud_name=<name-in-clouds-yaml>
> tofu apply -var-file <tfvars-file> -var cloud_name=<name-in-clouds-yaml>
```

where _cloud\_name_ and _ssh\_key_ are described in the [prerequisites section](./prerequisites.md#openstack-api--vm-credentials).
Take a note of the new node ip address and create a new inventory file:
Move the newly generated inventory `.ini` file to `infra/ansible/inventories/<dev|qa>`.

```sh
cp inventory-sample.yml inventory.yml
```
## Services

Fill in the new Traefik ip address and deploy Traefik:
Deploy the services using Ansible:

```sh
ansible-playbook -i inventory.yml playbooks/traefik/deploy.yml
```bash
> cd infra/ansible
> ansible-playbook -i inventories/<dev|qa>/inventory.ini site.yml
```

Once deployed check the Traefik dashboard is available at `https://<domain>/traefik/dashboard/.`
The passwords are in Keeper.

### Services

Now we deploy the remaining services. The deployment order is important as some
services depend on others being available. Each service has a single VM with the
exception of Superset that hosts multiple instances on a single node.

First create the VMs:

```sh
for svc in keycloak lakekeeper trino elt; do
ansible-playbook \
-e openstack_cloud_name=<cloud_name> \
-e openstack_key_name=<ssh_key> \
-e vm_config_file=$PWD/playbooks/$svc/group_vars/$svc.yml
playbooks/cloud/vm_create.yml
done
# Now Superset
ansible-playbook \
-e openstack_cloud_name=<cloud_name> \
-e openstack_key_name=<ssh_key> \
-e vm_config_file=$PWD/playbooks/superset/vm_vars.yml
playbooks/cloud/vm_create.yml
```

Gather the new ip addresses of each VM and fill in the appropriate section of the new `inventory.yml` created above.

Now deploy the services:

```sh
for svc in keycloak lakekeeper trino elt superset; do
ansible-playbook -i inventory.yml playbooks/$svc/deploy.yml
done
```
Once deployed, the services are available at:

Superset should be available at `https://<domain>/workspace/accelerator`.
- Keycloak: <https://\<domain\>/authn>
- Lakekeeper: <https://\<domain\>/iceberg>
- Superset instances:
  - <https://\<domain\>/workspace/accelerator>
43 changes: 28 additions & 15 deletions docs-devel/deployment/prerequisites.md
@@ -2,13 +2,26 @@

The following resources are required before deployment can proceed:

- [OpenTofu](#terraformopentofu)
- [Python environment configured for running Ansible](#python-environment)
- [Ansible vault password](#ansible-vault)
- [Openstack clouds.yaml](#openstack-api--vm-credentials)
- [A shared filesystem through a Manila share](#manila-share)
- [Object storage](#object-store)
- [Networking](#networking)

## Terraform/OpenTofu

[Terraform](https://developer.hashicorp.com/terraform) is used to provision resources on an Openstack
cloud. See the resource definitions in [terraform](../../infra/ansible/terraform/).

*Note: Terraform no longer has an open-source license. [OpenTofu](https://opentofu.org/) is a*
*drop-in replacement stewarded by the Linux Foundation; the `terraform` command can be replaced by `tofu`*
*wherever it appears in external documentation.*

- Install OpenTofu using their documented method for your platform: <https://opentofu.org/docs/intro/install/>
- Run `tofu init` in the `../../infra/ansible/terraform/` directory.

## Python environment

Ansible requires a Python environment. These instructions assume the use of the
@@ -46,40 +59,40 @@ stored locally in a `<repo_root>/ansible/.vault_pass` file. **Do not share this

## Manila share

_Used for: Persistent storage for running system services, e.g. database data. Not used for user data._
*Used for: Persistent storage for running system services, e.g. database data. Not used for user data.*

A Manila/CephFS share of at least 5TB is required. Once a quota has been assigned to the project:

- Create a new share, under _Project->Share->Shares_, and mark it private.
- Click on the share, make note of the _Export locations.Path_ value.
- Create a new share, under *Project->Share->Shares*, and mark it private.
- Click on the share, make note of the *Export locations.Path* value.
- Edit the `vault_cephfs_export_path` variable to match the value from the previous step.
- On the main _Shares_ page click the down arrow on the side of the _EDIT SHARE_
button and go to _Manage Rules_.
- On the main *Shares* page click the down arrow on the side of the *EDIT SHARE*
button and go to *Manage Rules*.
- Add a new rule and, once created, make note of the *Access Key* value.
- Edit the `vault_cephfs_access_secret` variable to match the value from the previous step.

## Object store

_Used for: Persistent storage of user data._
*Used for: Persistent storage of user data.*

This is currently expected to be configured to use the Echo object store.
The S3 endpoint is configured through the `s3_endpoint` ansible variable
in `<repo_root>/ansible/group_vars/all/s3.yml`.
in [infra/ansible](inventories/qa/group_vars/all/all.yml).

An access key and secret are configured in the vault. They cannot be managed through
the Openstack web interface; instead, new keys and secrets are created using the
`openstack ec2 credentials` command.

In the `<repo_root>/ansible` directory run `uv run openstack --os-cloud=<cloud_name> ec2 credentials create`
In the [infra/ansible](../../infra/ansible) directory run `uv run openstack --os-cloud=<cloud_name> ec2 credentials create`
to create a new access key/secret pair. Update the Ansible vault accordingly.

## Networking

A floating IP is required for the Traefik load balancer node.

Using the web interface create one from _Project->Network->Floating IPs_, using _ALLOCATE IP TO PROJECT_, ensuring
a description is provided.
Requirements:

Update the value of `openstack_reverse_proxy_fip` in `<repo_root>/ansible/group_vars/all/openstack.yml`.
The `openstack_reverse_proxy_fip` value must match the value configured
in the DNS record for the domain defined in `<repo_root>/ansible/group_vars/all/domains.yml`
- floating IP:
  - Using the web interface create one from *Project->Network->Floating IPs*,
    using *ALLOCATE IP TO PROJECT*, ensuring a description is provided.
  - Place the value in the Terraform [environments tf vars file](../../infra/ansible/terraform/environments).
- DNS record pointing at the above floating IP:
  - Place the value in [inventories/dev/group_vars/all/all.yml](../../infra/ansible/inventories/dev/group_vars/all/all.yml).
2 changes: 1 addition & 1 deletion docs-devel/index.md
@@ -24,7 +24,7 @@ the future.
├── elt-common/ # Reusable Python package with common ETL/ELT helpers used by the warehouses
├── docs/ # User and developer documentation using MkDocs. See `docs/src` for content used in the published docs site.
├── infra/
│ ├── ansible-docker/ # Ansible playbooks/roles to deploy the system to the STFC (OpenStack) cloud.
│ ├── ansible/ # Ansible playbooks/roles to deploy the system to the STFC (OpenStack) cloud.
│ ├── container-images/ # Container definitions for deployed services
│ └── local/ # docker-compose configuration for a local development environment and end-to-end CI tests.
└── warehouses/ # One subdirectory per (Lakekeeper) warehouse. Each contains ELT code to extract, transform and load data from external sources into Iceberg tables.
21 changes: 18 additions & 3 deletions elt-common/src/elt_common/iceberg/maintenance/__init__.py
@@ -2,15 +2,29 @@
from typing import Sequence

import click
from elt_common.iceberg.trino import TrinoCredentials, TrinoQueryEngine
from elt_common.iceberg.trino import TrinoQueryEngine
from pydantic_settings import BaseSettings, SettingsConfigDict

ENV_PREFIX = "ELT_COMMON_ICEBERG_MAINT_TRINO_"

ENV_PREFIX = ""
LOG_FORMAT = "%(asctime)s:%(module)s:%(levelname)s:%(message)s"
LOG_FORMAT_DATE = "%Y-%m-%dT%H:%M:%S"

LOGGER = logging.getLogger(__name__)


class TrinoCredentials(BaseSettings):
    model_config = SettingsConfigDict(env_prefix=ENV_PREFIX)

    host: str
    port: str
    catalog: str
    user: str | None
    password: str | None
    http_scheme: str = "https"
    verify: bool = True


class IcebergTableMaintenaceSql:
    """See https://trino.io/docs/current/connector/iceberg.html#alter-table-execute"""

@@ -66,7 +80,8 @@ def cli(table: Sequence[str], retention_threshold: str, log_level: str):
    )
    LOGGER.setLevel(log_level)

    trino = TrinoQueryEngine(TrinoCredentials.from_env(ENV_PREFIX))
    trino_creds = TrinoCredentials()  # type: ignore
    trino = TrinoQueryEngine(**trino_creds.model_dump(mode="python"))
    iceberg_maintenance = IcebergTableMaintenaceSql(trino)

    if not table:
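For context, the new `TrinoCredentials` model above resolves each field from an `ELT_COMMON_ICEBERG_MAINT_TRINO_`-prefixed environment variable, replacing the hand-rolled `from_env` helper removed below. A minimal sketch of that resolution, assuming only the class definition added in this diff (host, user, and password values are hypothetical):

```python
import os

from elt_common.iceberg.maintenance import TrinoCredentials

# Each field maps to ENV_PREFIX + FIELD_NAME.upper(); the values are made up.
os.environ["ELT_COMMON_ICEBERG_MAINT_TRINO_HOST"] = "trino.dev.example"
os.environ["ELT_COMMON_ICEBERG_MAINT_TRINO_PORT"] = "443"
os.environ["ELT_COMMON_ICEBERG_MAINT_TRINO_CATALOG"] = "iceberg"
os.environ["ELT_COMMON_ICEBERG_MAINT_TRINO_USER"] = "maintenance"
os.environ["ELT_COMMON_ICEBERG_MAINT_TRINO_PASSWORD"] = "s3cret"

creds = TrinoCredentials()  # pydantic-settings reads the prefixed variables
assert creds.host == "trino.dev.example"
assert creds.http_scheme == "https"  # defaults apply when no variable is set
assert creds.verify is True
```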
56 changes: 16 additions & 40 deletions elt-common/src/elt_common/iceberg/trino.py
@@ -1,7 +1,5 @@
import contextlib
import dataclasses
import logging
import os
import re
from typing import Sequence

@@ -15,33 +15,6 @@
LOGGER = logging.getLogger(__name__)


@dataclasses.dataclass
class TrinoCredentials:
    host: str
    port: str
    catalog: str
    user: str | None
    password: str | None
    http_scheme: str = "https"

    @classmethod
    def from_env(cls, env_prefix: str) -> "TrinoCredentials":
        def _get_env(field: dataclasses.Field):
            key = f"{env_prefix}{field.name.upper()}"
            val = os.getenv(key)
            if val is not None:
                return val
            elif field.default is not dataclasses.MISSING:
                return field.default
            elif getattr(field, "default_factory", dataclasses.MISSING) is not dataclasses.MISSING:
                return field.default_factory()  # type: ignore
            else:
                raise KeyError(f"Missing required environment variable: {key}")

        kwargs = {f.name: _get_env(f) for f in dataclasses.fields(cls)}
        return cls(**kwargs)


class TrinoQueryEngine:
    @property
    def engine(self) -> Engine:
@@ -51,10 +22,19 @@ def engine(self) -> Engine:
    def url(self) -> str:
        return self._url

    def __init__(self, credentials: TrinoCredentials):
    def __init__(
        self,
        host: str,
        port: str,
        catalog: str,
        user: str,
        password: str,
        http_scheme="https",
        verify=True,
    ):
        """Initialize an object and create an Engine"""
        self._url = f"trino://{credentials.host}:{credentials.port}/{credentials.catalog}"
        self._engine = self._create_engine(credentials)
        self._url = f"trino://{host}:{port}/{catalog}"
        self._engine = self._create_engine(user, password, http_scheme=http_scheme, verify=verify)

    def execute(self, stmt: str, connection: Connection | None = None):
        """Execute a SQL statement and return the results.
@@ -103,17 +83,13 @@ def validate_retention_threshold(cls, retention_threshold: str):
            raise ValueError(f"Invalid retention threshold format: {retention_threshold}")

    # private
    def _create_engine(self, credentials: TrinoCredentials) -> Engine:
        if credentials.user is None or credentials.password is None:
    def _create_engine(self, user: str, password: str, **connect_args) -> Engine:
        if user is None or password is None:
            auth = BasicAuthentication("trino", "")
        else:
            auth = BasicAuthentication(credentials.user, credentials.password)
            auth = BasicAuthentication(user, password)

        return create_engine(
            self.url,
            connect_args={
                "auth": auth,
                "http_scheme": credentials.http_scheme,
                "verify": False,
            },
            connect_args=dict(auth=auth, **connect_args),
        )
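With the `TrinoQueryEngine` constructor now taking plain keyword arguments, a settings object can be splatted straight into it, and `http_scheme`/`verify` are forwarded to the Trino driver through `connect_args` instead of the previously hard-coded `"verify": False`. A rough usage sketch with hypothetical connection details:

```python
from elt_common.iceberg.maintenance import TrinoCredentials
from elt_common.iceberg.trino import TrinoQueryEngine

# Hypothetical values; the maintenance CLI populates these from the
# ELT_COMMON_ICEBERG_MAINT_TRINO_* environment variables instead.
creds = TrinoCredentials(
    host="trino.dev.example",
    port="443",
    catalog="iceberg",
    user="maintenance",
    password="s3cret",
    verify=True,  # TLS verification is now configurable and defaults to True
)

# model_dump() keys line up one-to-one with the new __init__ parameters.
trino = TrinoQueryEngine(**creds.model_dump(mode="python"))
rows = trino.execute("SHOW SCHEMAS")
```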