Merged
118 changes: 0 additions & 118 deletions .github/workflows/ci-cd.yml

This file was deleted.

54 changes: 54 additions & 0 deletions .github/workflows/pr-tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
name: PR Tests

on:
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:14
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: medallion
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          version: 1.5.1
          virtualenvs-create: true
          virtualenvs-in-project: true

      - name: Install dependencies
        run: poetry install --no-interaction

      - name: Install Spacy
        run: poetry run python -m spacy download en_core_web_lg

      - name: Run tests
        run: poetry run pytest
        env:
          DB_HOST: localhost
          DB_PORT: 5432
          DB_NAME: medallion
          DB_USER: postgres
          DB_PASSWORD: postgres
          NEWSAPI_KEY: ${{ secrets.NEWSAPI_KEY }}
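
The `env` block on the "Run tests" step suggests that the test suite reads its database settings from `DB_*` environment variables. Below is a minimal sketch of how a test helper might assemble a connection URL from them. The function name `db_url_from_env`, the fallback defaults, and the `postgresql://` URL shape are assumptions for illustration, not code taken from this repository.

```python
import os
from urllib.parse import quote


def db_url_from_env(env=None):
    """Build a PostgreSQL URL from DB_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("DB_HOST", "localhost")   # assumed default
    port = int(env.get("DB_PORT", "5432"))   # assumed default
    name = env["DB_NAME"]
    user = quote(env["DB_USER"], safe="")
    # Percent-encode the password so characters such as "@" or "/" stay valid.
    password = quote(env["DB_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    # Values copied from the workflow's env block above.
    ci_env = {
        "DB_HOST": "localhost",
        "DB_PORT": "5432",
        "DB_NAME": "medallion",
        "DB_USER": "postgres",
        "DB_PASSWORD": "postgres",
    }
    print(db_url_from_env(ci_env))
```

Passing the mapping explicitly, rather than always reading `os.environ`, keeps the helper easy to exercise in unit tests without mutating the process environment.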
26 changes: 26 additions & 0 deletions .gitignore
@@ -1,6 +1,32 @@
# Created by https://www.toptal.com/developers/gitignore/api/python,spark,terraform,pycharm,git
# Edit at https://www.toptal.com/developers/gitignore?templates=python,spark,terraform,pycharm,git

### Terraform ###
# Local .terraform directories
**/.terraform/*

# .tfstate files
*.tfstate
*.tfstate.*

# Crash log files
crash.log
crash.*.log

# Exclude all .tfvars files, which are likely to contain sensitive data
*.tfvars
*.tfvars.json

# Ignore override files as they are usually used for local dev
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# Ignore CLI configuration files
.terraformrc
terraform.rc

### Git ###
# Created by git for backups. To disable backups in Git:
# $ git config --global mergetool.keepBackup false
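
The Terraform patterns above ignore real variable files while leaving the committed example file trackable. The sketch below checks that with Python's `fnmatch`, which only approximates gitignore matching for simple basename patterns like these. The helper name `is_ignored` and the sample file names are illustrative assumptions.

```python
from fnmatch import fnmatchcase

# The two variable-file patterns from the Terraform section of .gitignore.
TFVARS_PATTERNS = ["*.tfvars", "*.tfvars.json"]


def is_ignored(filename, patterns=TFVARS_PATTERNS):
    """Return True if a bare file name matches any pattern (approximation)."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ["terraform.tfvars", "prod.tfvars.json", "terraform.tfvars.example"]:
        print(name, is_ignored(name))
```

Under this approximation, `terraform.tfvars` is ignored while `terraform.tfvars.example` is not, because `*.tfvars` must match through the end of the file name.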
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
 - id: trailing-whitespace
 - id: end-of-file-fixer
 - id: check-yaml
-- id: check-added-large-files
+# - id: check-added-large-files
ByteMeDirk marked this conversation as resolved.
 - id: check-toml
 - id: check-json

99 changes: 95 additions & 4 deletions README.md
@@ -12,14 +12,14 @@ This project implements a medallion architecture for data lakes, which organizes

## Tech Stack

-- **Data Processing**: PySpark, Delta Lake
+- **Data Processing**: PySpark
 - **Database**: PostgreSQL
-- **Orchestration**: Prefect
-- **Transformation**: dbt
-- **Data Quality**: Great Expectations
+- **Orchestration**: None
+- **Transformation**: PySpark
 - **Reporting & Visualization**: Metabase
 - **Local Development**: Docker, Poetry
 - **External APIs**: NewsAPI
+- **Infrastructure as Code**: Terraform

## Project Structure

@@ -29,6 +29,13 @@ semantic-medallion-data-platform/
├── data/ # Data files
│ └── known_entities/ # Known entities data files
├── docs/ # Documentation
├── infrastructure/ # Infrastructure as Code
│ └── terraform/ # Terraform configuration for DigitalOcean
│ ├── main.tf # Main Terraform configuration
│ ├── variables.tf # Variable definitions
│ ├── outputs.tf # Output definitions
│ ├── terraform.tfvars.example # Example variables file
│ └── setup.sh # Setup script for Terraform
├── semantic_medallion_data_platform/ # Main package
│ ├── bronze/ # Bronze layer processing
│ │ ├── brz_01_extract_newsapi.py # Extract news articles from NewsAPI
@@ -183,6 +190,90 @@ This will:

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

## Infrastructure Setup with Terraform

This project uses Terraform to manage infrastructure on DigitalOcean, including a managed PostgreSQL database. Follow these steps to set up the infrastructure:

### Prerequisites

- [Terraform](https://www.terraform.io/downloads.html) (version 1.0.0 or later)
- [DigitalOcean account](https://www.digitalocean.com/)
- [DigitalOcean API token](https://cloud.digitalocean.com/account/api/tokens)

### Setup Instructions

1. Navigate to the Terraform directory:
   ```bash
   cd infrastructure/terraform
   ```

2. Create a `terraform.tfvars` file from the example:
   ```bash
   cp terraform.tfvars.example terraform.tfvars
   ```

3. Edit the `terraform.tfvars` file to add your DigitalOcean API token:
   ```bash
   # Open with your favorite editor
   nano terraform.tfvars
   ```

4. Initialize Terraform:
   ```bash
   terraform init
   ```

5. Plan the infrastructure changes:
   ```bash
   terraform plan -out=tfplan
   ```

6. Apply the infrastructure changes:
   ```bash
   terraform apply tfplan
   ```

7. After a successful apply, Terraform outputs the connection details for your PostgreSQL database:
   - Database host
   - Database port
   - Database name
   - Database user
   - Database password (sensitive)
   - Database URI (sensitive)
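
The outputs listed above can be read programmatically with `terraform output -json`, which prints each output as an object with `value`, `type`, and `sensitive` fields and includes sensitive values in plain text. Below is a minimal sketch that maps such JSON onto the `DB_*` variables used by the test workflow. The output names in `OUTPUT_TO_ENV` are hypothetical, since the real names are defined in `outputs.tf`, and the sample values are made up.

```python
import json

# Hypothetical output names; check outputs.tf for the real ones.
OUTPUT_TO_ENV = {
    "database_host": "DB_HOST",
    "database_port": "DB_PORT",
    "database_name": "DB_NAME",
    "database_user": "DB_USER",
    "database_password": "DB_PASSWORD",
}


def env_from_terraform_outputs(raw_json, mapping=OUTPUT_TO_ENV):
    """Convert `terraform output -json` text into a dict of env variables."""
    outputs = json.loads(raw_json)
    env = {}
    for output_name, env_name in mapping.items():
        if output_name in outputs:
            # Each output is an object; the actual data sits under "value".
            env[env_name] = str(outputs[output_name]["value"])
    return env


if __name__ == "__main__":
    # Made-up sample in the shape produced by `terraform output -json`.
    sample = json.dumps({
        "database_host": {"sensitive": False, "type": "string", "value": "db.example.com"},
        "database_port": {"sensitive": False, "type": "number", "value": 25060},
        "database_password": {"sensitive": True, "type": "string", "value": "s3cret"},
    })
    print(env_from_terraform_outputs(sample))
```

Because the JSON form exposes sensitive values, it is safer to pipe it straight into a process than to write it to a file or a CI log.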

### Infrastructure Components

The Terraform configuration creates the following resources on DigitalOcean:

- **PostgreSQL Database Cluster**:
  - Version: PostgreSQL 15
  - Size: db-s-1vcpu-1gb (1 vCPU, 1 GB RAM)
  - Region: Configurable (default: London, `lon1`)
  - Node Count: 1

- **Database**:
  - Name: semantic_data_platform

- **Database User**:
  - Name: semantic_app_user

### Managing Infrastructure

- To update the infrastructure after making changes to the Terraform files:
  ```bash
  terraform plan -out=tfplan # Preview changes
  terraform apply tfplan # Apply changes
  ```

- To destroy the infrastructure when no longer needed:
  ```bash
  terraform destroy
  ```

For more detailed information about the infrastructure setup, see [INFRASTRUCTURE.md](docs/INFRASTRUCTURE.md).

For deployment instructions, see [DEPLOYMENT.md](docs/DEPLOYMENT.md).

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.