203 changes: 146 additions & 57 deletions README.md
@@ -1,6 +1,37 @@
# Semantic Medallion Data Platform
# Semantic Medallion Data Platform - Research Project (Derby University)

A modern data platform implementing the medallion architecture with local processing.
## Abstract

This study explores how to improve data lineage, attribute resolution, and contextual understanding across heterogeneous
data sources by incorporating Natural Language Processing (NLP) technologies into contemporary data processing
pipelines. Conventional constraints and keys are the mainstays of traditional Extract, Transform, Load (ETL) pipelines
for attribute resolution; these techniques are insufficiently sophisticated for increasingly complex and varied data
environments. By creating and putting into practice intelligent data processing pipelines that make use of NLP
capabilities within a medallion architecture framework, this study overcomes this limitation.

In order to extract individuals, organisations, and locations from real-time news data acquired through NewsAPI, the
research uses spaCy for named entity recognition as part of a practical implementation approach. In order to replicate
cloud-native solutions, the system architecture combines Hugging Face NLP models with Apache Spark processing
capabilities, which are deployed in containerised environments. PostgreSQL databases are used to store the results of
data processing, and Metabase offers reporting and visualisation features to show how effective the pipeline is.

The approach focusses on creating NLP-enhanced ETL pipelines that perform keyword extraction for topic affiliation,
entity extraction to create intelligent data lineages across various sources, and sentiment analysis on unstructured
blob data.
Performance evaluation measures improvements in attribute resolution accuracy, processing efficiency, and data lineage
completeness by contrasting NLP-integrated pipelines with conventional data processing techniques.

Key findings show that while retaining scalable processing performance, NLP integration greatly improves automated data
attribute resolution capabilities. The study shows quantifiable gains in contextual comprehension and data
categorisation accuracy, with entity extraction mechanisms effectively creating more intelligent data lineages than
traditional techniques. However, when integrating NLP processing in production ETL environments, the implications for
computational overhead must be carefully taken into account.

By offering concrete proof of the advantages of integrating NLP into enterprise data processing systems, this work
advances the expanding field of AI-enhanced data engineering. By bridging the gap between theoretical AI capabilities
and practical data engineering challenges, the research provides a repeatable framework for businesses looking to add
intelligent, context-aware capabilities to their data processing infrastructure.

---

## Architecture Overview

@@ -14,74 +45,72 @@ This project implements a medallion architecture for data lakes, which organizes

```mermaid
graph TD
%% Bronze Layer
subgraph "Bronze Layer"
B1[brz_01_extract_newsapi.py]
B2[brz_01_extract_known_entities.py]
end

%% Silver Layer
subgraph "Silver Layer"
S1[slv_02_transform_nlp_known_entities.py]
S2[slv_02_transform_nlp_newsapi.py]
S3[slv_03_transform_entity_to_entity_mapping.py]
S4[slv_02_transform_sentiment_newsapi.py]
end

%% Gold Layer
subgraph "Gold Layer"
G1[gld_04_load_entities.py]
G2[gld_04_load_newsapi.py]
G3[gld_04_load_newsapi_sentiment.py]
end

%% Data Sources
NewsAPI[NewsAPI] --> B1
KnownEntitiesCSV[Known Entities CSV] --> B2

%% Bronze to Silver
B1 --> |bronze.newsapi| S2
B2 --> |bronze.known_entities| S1

%% Silver Processing
S1 --> |silver.known_entities_entities| S3
S2 --> |silver.newsapi_entities| S3

%% Silver to Gold
S1 --> |silver.known_entities| G1
S1 --> |silver.known_entities| G2
S3 --> |silver.entity_to_entity_mapping| G1
S3 --> |silver.entity_to_source_mapping| G2
S2 --> |silver.newsapi| G2

%% Gold to Reporting
G1 --> |gold.entity_affiliations_complete| Metabase[Metabase Dashboards]
G2 --> |gold.entity_to_newsapi| Metabase
end
%% Bronze to Silver
B1 -->|bronze.newsapi| S2
B2 -->|bronze.known_entities| S1
%% Silver Processing
S1 -->|silver.known_entities_entities| S3
S2 -->|silver.newsapi_entities| S3
%% Silver to Gold
S1 -->|silver.known_entities| G1
S1 -->|silver.known_entities| G2
S1 -->|silver.known_entities| G3
S3 -->|silver.entity_to_entity_mapping| G1
S3 -->|silver.entity_to_source_mapping| G2
S3 -->|silver.entity_to_source_mapping| G3
S2 -->|silver.newsapi| G2
S2 -->|silver.newsapi| G3
S2 -->|silver.newsapi| S4
S4 -->|silver.newsapi_sentiment| G3
%% Gold to Reporting
G1 -->|gold.entity_affiliations_complete| Metabase[Metabase Dashboards]
G2 -->|gold.entity_to_newsapi| Metabase
G3 -->|gold.entity_to_newsapi_sentiment| Metabase
```

### System Architecture

```mermaid
graph LR
%% External Systems
NewsAPI[NewsAPI]
CSVFiles[CSV Files]

%% Data Processing
PySpark[PySpark Processing]

%% Storage
PostgreSQL[PostgreSQL Database]

%% Visualization
Metabase[Metabase]

%% Flow
NewsAPI --> PySpark
CSVFiles --> PySpark
PySpark --> PostgreSQL
PostgreSQL --> Metabase

%% Layers within PostgreSQL
subgraph PostgreSQL
Bronze[Bronze Schema]
Silver[Silver Schema]
@@ -95,8 +124,8 @@ graph LR

- **Data Processing**: PySpark
- **Database**: PostgreSQL
- **Orchestration**: None
- **Transformation**: pyspark
- **NLP & Sentiment Analysis**: spaCy, Hugging Face Transformers
- **Reporting & Visualization**: Metabase
- **Local Development**: Docker, Poetry
- **External APIs**: NewsAPI
@@ -166,7 +195,8 @@ semantic-medallion-data-platform/
cp .env.example .env
```

Edit the `.env` file to set your database credentials and other environment variables. Make sure to set your NewsAPI
key if you plan to use the news article extraction functionality:
```
NEWSAPI_KEY=your_newsapi_key_here
```
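
As a minimal sketch of how a script might pick up this key, the snippet below reads `NEWSAPI_KEY` from the process environment and fails early when it is missing. The helper name `get_newsapi_key` is hypothetical and is not taken from the project's code. It also assumes the values in `.env` have already been exported into the environment, for example by Docker Compose or a dotenv loader.

```python
import os


def get_newsapi_key(env=None):
    """Return the NewsAPI key, or raise a clear error if it is not set.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("NEWSAPI_KEY", "").strip()
    if not key:
        raise RuntimeError("NEWSAPI_KEY is not set; add it to your .env file")
    return key
```

Failing at startup keeps a missing key from surfacing later as an opaque HTTP authentication error.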
@@ -183,9 +213,22 @@ docker-compose up -d
```

This will start:

- Local PostgreSQL database
- Metabase (data visualization and reporting tool) accessible at http://localhost:3000

### Running the ETL Pipeline

You can run the entire ETL pipeline using the provided shell script:

```bash
./local_run.sh
```

This script will execute all the necessary steps in the correct order, from the Bronze layer to the Gold layer.

Alternatively, you can run each step individually as described in the sections below.

### Running Tests

```bash
@@ -204,6 +247,7 @@ python -m semantic_medallion_data_platform.bronze.brz_01_extract_newsapi --days_
```

This will:

1. Fetch known entities from the database
2. Query NewsAPI for articles mentioning each entity
3. Store the articles in the bronze.newsapi table
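
Step 2 above can be sketched as follows. This is an illustration only, assuming the script calls NewsAPI's `/v2/everything` endpoint with the `q`, `from`, and `apiKey` query parameters. The function name `build_newsapi_url` and the `lookback_days` parameter are hypothetical and are not taken from the project's code.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

NEWSAPI_EVERYTHING_URL = "https://newsapi.org/v2/everything"


def build_newsapi_url(entity_name, api_key, lookback_days=7, today=None):
    """Build a NewsAPI 'everything' URL for articles mentioning one entity."""
    today = today or date.today()
    params = {
        # Quote the name so multi-word entities are searched as an exact phrase.
        "q": f'"{entity_name}"',
        # Only return articles published on or after this ISO date.
        "from": (today - timedelta(days=lookback_days)).isoformat(),
        "apiKey": api_key,
    }
    return f"{NEWSAPI_EVERYTHING_URL}?{urlencode(params)}"
```

One URL would be built per known entity, and the JSON response for each would then be written to the bronze table.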
@@ -218,6 +262,7 @@ python -m semantic_medallion_data_platform.bronze.brz_01_extract_known_entities
```

This will:

1. Read entity data from CSV files in the specified directory
2. Process and transform the data
3. Store the entities in the bronze.known_entities table
@@ -234,6 +279,7 @@ python -m semantic_medallion_data_platform.silver.slv_02_transform_nlp_known_ent
```

This will:

1. Copy known entities from bronze.known_entities to silver.known_entities
2. Extract entities (locations, organizations, persons) from entity descriptions using NLP
3. Store the extracted entities in the silver.known_entities_entities table
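
The extraction in step 2 above might look roughly like the sketch below. It assumes a spaCy English pipeline, whose entity labels include `PERSON`, `ORG`, `GPE`, and `LOC`. The model name `en_core_web_sm`, the helper names, and the grouping into three kinds are assumptions made for illustration, not the project's actual code.

```python
from collections import defaultdict

# Map spaCy entity labels to the three kinds this README mentions.
LABEL_TO_KIND = {
    "PERSON": "person",
    "ORG": "organization",
    "GPE": "location",
    "LOC": "location",
}


def load_nlp(model_name="en_core_web_sm"):
    """Load a spaCy pipeline; the default model name is an assumption."""
    import spacy  # imported lazily so extract_entities has no hard dependency

    return spacy.load(model_name)


def extract_entities(text, nlp):
    """Return {kind: sorted unique entity texts} found in one description."""
    found = defaultdict(set)
    for ent in nlp(text).ents:
        kind = LABEL_TO_KIND.get(ent.label_)
        if kind:  # ignore labels such as DATE or MONEY
            found[kind].add(ent.text.strip())
    return {kind: sorted(names) for kind, names in found.items()}
```

Passing the pipeline in as an argument means it is loaded once and reused for every row, which matters because model loading is far slower than a single inference call.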
@@ -248,6 +294,7 @@ python -m semantic_medallion_data_platform.silver.slv_02_transform_nlp_newsapi
```

This will:

1. Copy news articles from bronze.newsapi to silver.newsapi
2. Extract entities from article title, description, and content using NLP
3. Store the extracted entities in the silver.newsapi_entities table
@@ -262,11 +309,31 @@ python -m semantic_medallion_data_platform.silver.slv_03_transform_entity_to_ent
```

This will:

1. Create entity-to-source mappings between known_entities_entities and newsapi_entities
2. Create entity-to-entity mappings within known_entities_entities using fuzzy matching
3. Store the mappings in the silver.entity_to_source_mapping and silver.entity_to_entity_mapping tables

The entity mapping process uses fuzzy matching with RapidFuzz to identify similar entities across different data
sources. This enables semantic connections between entities even when there are slight variations in naming or
formatting.
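
To illustrate the idea, the sketch below pairs names whose similarity score reaches a threshold. It uses `difflib.SequenceMatcher` from the standard library as a stand-in so that it runs without extra dependencies. It is not the project's RapidFuzz code, and the threshold of 85 and the function names are assumptions. Because `rapidfuzz.fuzz.ratio` also takes two strings and returns a score from 0 to 100, it could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (stdlib stand-in scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def map_entities(known, extracted, threshold=85.0, scorer=similarity):
    """Pair each known entity with extracted names scoring at or above threshold."""
    matches = []
    for known_name in known:
        for extracted_name in extracted:
            score = scorer(known_name, extracted_name)
            if score >= threshold:
                matches.append((known_name, extracted_name, round(score, 1)))
    return matches


pairs = map_entities(["Apple Inc."], ["apple inc", "Microsoft"])
```

Here `pairs` holds a single match between "Apple Inc." and "apple inc", while "Microsoft" falls below the threshold. The nested loop compares every pair, so for large tables a blocking step or a vectorised scorer would be needed.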

#### Processing News Articles with Sentiment Analysis

To analyze sentiment in news articles:

```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.silver.slv_02_transform_sentiment_newsapi
```

This will:

1. Read news articles from silver.newsapi
2. Apply sentiment analysis to the content of each article using Hugging Face Transformers
3. Store the sentiment scores and labels in the silver.newsapi_sentiment table

The sentiment analysis process uses a pre-trained BERT model to classify each article as positive, negative, or neutral and to report a confidence score for that classification.
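
A minimal sketch of this scoring step is shown below. It assumes a Hugging Face `pipeline("sentiment-analysis", ...)` classifier, which returns one dict with `label` and `score` keys per input text. The model name is left as a required argument because this README does not name the model, and the `url` and `content` field names, the output field names, and the character cap are hypothetical.

```python
def load_classifier(model_name):
    """Build a Hugging Face sentiment pipeline for the given model name."""
    from transformers import pipeline  # lazy import: heavy optional dependency

    return pipeline("sentiment-analysis", model=model_name)


def score_articles(articles, classifier, max_chars=2000):
    """Attach a sentiment label and confidence score to each article dict."""
    # Crude character cap so very long articles stay within the model's input limit.
    texts = [(a.get("content") or "")[:max_chars] for a in articles]
    results = classifier(texts)  # one {"label", "score"} dict per input text
    return [
        {
            "url": a.get("url"),
            "sentiment_label": r["label"].lower(),
            "sentiment_score": float(r["score"]),
        }
        for a, r in zip(articles, results)
    ]
```

Because `score_articles` accepts any callable with the same input and output shape, it can be exercised with a stub classifier in tests without downloading a model.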

### Running Gold Layer Processes

@@ -280,6 +347,7 @@ python -m semantic_medallion_data_platform.gold.gld_04_load_entities
```

This will:

1. Join entity-to-entity mappings with known entities data
2. Create a wide table with entity information and their fuzzy match affiliations
3. Create a bidirectional relationship table for complete entity affiliation analysis
@@ -295,19 +363,38 @@ python -m semantic_medallion_data_platform.gold.gld_04_load_newsapi
```

This will:

1. Join entity-to-source mappings with known entities and NewsAPI data
2. Create a wide table with entity information and their mentions in news sources
3. Store the results in the gold.entity_to_newsapi table

These gold layer tables provide the foundation for the analytics and visualizations in Metabase dashboards, enabling comprehensive entity analysis and news source insights.
#### Creating NewsAPI Sentiment Analysis Table

To create a table with sentiment analysis of news articles:

```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.gold.gld_04_load_newsapi_sentiment
```

This will:

1. Join news articles from silver.newsapi with sentiment analysis from silver.newsapi_sentiment
2. Create a table with article information and their sentiment scores and labels
3. Store the results in the gold.entity_to_newsapi_sentiment table
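
The join in step 1 above can be illustrated in plain Python. This is a sketch, not the project's PySpark code, where the same result would typically come from `DataFrame.join`. The join key `url` and the row shapes are assumptions.

```python
def join_articles_with_sentiment(articles, sentiments, key="url"):
    """Inner-join article rows with sentiment rows on a shared key."""
    # Index the sentiment rows once so each article lookup is O(1).
    sentiment_by_key = {s[key]: s for s in sentiments}
    joined = []
    for article in articles:
        sentiment = sentiment_by_key.get(article[key])
        if sentiment is not None:  # inner join: drop articles with no score
            joined.append({**article, **sentiment})
    return joined
```

With an inner join, articles that have not yet been scored are left out of the gold table. A left join would keep them with empty sentiment columns.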

These gold layer tables provide the foundation for the analytics and visualizations in Metabase dashboards, enabling
comprehensive entity analysis, news source insights, and sentiment analysis.

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull
requests.

## Infrastructure Setup with Terraform

This project uses Terraform to manage infrastructure on DigitalOcean, including a PostgreSQL database. Follow these
steps to set up the infrastructure:

### Prerequisites

@@ -349,28 +436,28 @@ This project uses Terraform to manage infrastructure on Digital Ocean, including
```

7. After successful application, Terraform will output connection details for your PostgreSQL database:
- Database host
- Database port
- Database name
- Database user
- Database password (sensitive)
- Database URI (sensitive)

### Infrastructure Components

The Terraform configuration creates the following resources on DigitalOcean:

- **PostgreSQL Database Cluster**:
- Version: PostgreSQL 15
- Size: db-s-1vcpu-1gb (1 vCPU, 1GB RAM)
- Region: Configurable (default: London - lon1)
- Node Count: 1

- **Database**:
- Name: semantic_data_platform

- **Database User**:
- Name: semantic_app_user

### Managing Infrastructure

@@ -391,7 +478,8 @@ For deployment instructions, see [DEPLOYMENT.md](docs/DEPLOYMENT.md).

## Visualization and Reporting

The Semantic Medallion Data Platform includes built-in visualization and reporting capabilities using Metabase. The gold
layer tables are designed to be easily consumed by Metabase for creating dashboards and reports.

### Metabase Dashboards

@@ -436,7 +524,8 @@ The platform provides a rich set of analytics capabilities through SQL queries t
- **Entity Mention Context Analysis**: Analysis of the context of entity mentions
- **Entity Type Correlation**: Correlation between different entity types

These queries are stored in the `data/metabase_questions/` directory and can be imported into Metabase for creating
custom dashboards and reports.

### Database Client View
