demo_airflow_etl

A small demo project showcasing how to use Apache Airflow for ETL pipelines.

Table of Contents

  • Project Overview
  • Project Structure
  • Demo Project: Airflow ETL Pipeline
  • ETL: Extract, Transform, Load – Overview
  • Common Tech Stack
  • Further Reading

Project Overview

This project is designed to demonstrate the capabilities of Apache Airflow in managing ETL (Extract, Transform, Load) pipelines. It provides a practical example of how to set up and orchestrate data workflows, making it easier to understand the ETL process and the tools involved.

Project Structure

```
.
├── airflow           -> Contains Airflow DAGs and configurations
├── backend           -> Backend services for data retrieval and processing
├── docs              -> Documentation for the project
├── jobs              -> Contains job definitions and configurations
├── LICENSE           -> License file for the project
├── README.md         -> Project overview and documentation
├── requirements.txt  -> Python dependencies for the project
├── setup.py          -> Setup script for the project
└── utils             -> Utility functions and scripts
```

Demo Project: Airflow ETL Pipeline

This project demonstrates how to set up an ETL pipeline using Apache Airflow. The pipeline extracts data from various sources, transforms it into a clean and standardized format, and loads it into a target system for analysis.
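
A rough sketch of what such a DAG might look like, assuming Airflow 2.4+; the DAG id, schedule, and task bodies below are placeholders for illustration, not the actual code in this repository's airflow folder:

```python
# dags/etl_demo.py -- illustrative only, not the DAG shipped in this repository
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pull raw data from the source systems into a staging area (placeholder logic)
    print("extracting raw data to the staging area")


def transform(**context):
    # Clean and standardize the staged data into a unified schema (placeholder logic)
    print("transforming staged data")


def load(**context):
    # Write the cleaned data into the target system (placeholder logic)
    print("loading transformed data into the warehouse")


with DAG(
    dag_id="etl_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The ETL ordering is expressed as task dependencies
    extract_task >> transform_task >> load_task
```

The `>>` operator declares the dependencies, so Airflow runs extract, transform, and load in order.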

ETL: Extract, Transform, Load – Overview

Extract

  • Goal: Pull raw data from various sources (e.g., databases, flat files, APIs).
  • Common Sources: CRM systems, relational databases, JSON/CSV files, cloud data services.
  • Why a Staging Area? To buffer and preserve the raw data before transformation, so the pipeline can recover or re-run if an error occurs downstream.
  • Format Handling: Works with structured sources (SQL tables) as well as semi-structured ones (JSON files, API responses, etc.); see the sketch below.
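
As a minimal, hypothetical example of the extract step with Requests and pandas (the API URL and staging path are placeholders, not endpoints used by this project):

```python
# Illustrative extract step: pull JSON from an API and stage it as a CSV.
# The URL and file path are placeholders, not endpoints used by this project.
import pandas as pd
import requests


def extract_to_staging(api_url: str, staging_path: str) -> str:
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors
    records = response.json()    # assumes the API returns a JSON array of records
    pd.DataFrame(records).to_csv(staging_path, index=False)  # preserve the raw pull
    return staging_path


# extract_to_staging("https://api.example.com/orders", "/tmp/staging/orders_raw.csv")
```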

Transform

  • Goal: Clean and standardize data into a unified schema.
  • Key Tasks:
    • Cleaning: Fix nulls, correct inconsistencies, and remove duplicates and unnecessary attributes (see the pandas sketch after this list).
    • Auditing: Ensure accuracy, completeness, and consistency.
    • Compliance: Protect sensitive data to meet laws (e.g., GDPR, HIPAA).
    • Formatting: Join tables, reformat columns, standardize types (e.g., date/currency).
      • Normalization:
        • Convert data types (e.g., string to integer, date formats)
        • Standardize units (e.g., convert all distances to meters)
        • String manipulations
          • Trimming whitespace, removing special characters
          • Converting to lowercase/uppercase
        • Replace codes with text
          • E.g., replace a country code with the country name
        • Further normalization for consistency
          • Look-up tables, dictionaries to find valid data, synonyms, abbreviations
          • Normalize spellings of names, addresses, etc.
      • Building keys:
        • Combine data from multiple sources using common keys.
        • A mapping table matches production keys to surrogate data warehouse (DW) keys.
        • Handle history correctly, especially for a total reload.
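
A minimal pandas sketch covering a few of the cleaning and normalization tasks above; the column names and look-up table are invented for illustration:

```python
# Illustrative transform step. Column names ("country_code", "distance_km", ...)
# and the look-up table are made up for this example.
import pandas as pd

COUNTRY_NAMES = {"DE": "Germany", "FR": "France", "US": "United States"}  # look-up table


def transform(staging_path: str) -> pd.DataFrame:
    df = pd.read_csv(staging_path)

    # Cleaning: drop duplicates and rows missing required fields
    df = df.drop_duplicates().dropna(subset=["order_id", "country_code"])

    # String normalization: trim whitespace, standardize case
    df["customer_name"] = df["customer_name"].str.strip().str.title()

    # Type conversion and unit standardization (kilometres -> metres)
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["distance_m"] = df["distance_km"].astype(float) * 1000

    # Replace codes with text via the look-up table
    df["country"] = df["country_code"].map(COUNTRY_NAMES)

    return df
```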

Load

  • Goal: Move cleaned, transformed data to a target system (data warehouse/lake).
  • Types:
    • Initial Load: Full data migration.
    • Incremental Load: Only changes/new data.
  • Scheduling: Typically automated via batch jobs or streaming systems.
  • Outcome: Central, analysis-ready data repository for dashboards, analytics, or ML.
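
A simple, hypothetical load step with pandas and SQLAlchemy against PostgreSQL (the connection string and table name are placeholders):

```python
# Illustrative load step: write the transformed frame into PostgreSQL.
# The connection string and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine


def load_to_warehouse(df: pd.DataFrame, table_name: str = "orders_clean") -> None:
    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/warehouse")

    # if_exists="replace" mimics an initial (full) load; "append" suits an incremental load
    df.to_sql(table_name, engine, if_exists="replace", index=False)
```

Switching `if_exists` from "replace" to "append" is the simplest way to move from a full reload toward incremental loads.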


Common Tech Stack

The tech stack used in this demo project is shown in bold below, but there are many alternatives available for each task.

Extract

| Task | Tools |
| --- | --- |
| APIs, Files, Databases | **Python**, **Requests**, **pandas**, **BeautifulSoup**, Selenium |
| Relational DBs | SQLAlchemy, psycopg2, pyodbc |
| NoSQL / Cloud | MongoDB Connector, **boto3** (for S3), Google Cloud Storage SDK |
| Stream Data | Kafka, Flink, Apache NiFi, Spark Structured Streaming |
| Workflow Triggers | **Apache Airflow**, Dagster, Prefect |

Transform

| Task | Tools |
| --- | --- |
| Batch Transformation | **PySpark**, **pandas**, dbt, SQL |
| Large-scale Transform | **Apache Spark**, Apache Beam, Dask, Flink |
| Geospatial Data | **GeoPandas**, **Shapely**, **PostGIS**, Osmnx |
| Cleaning + QA | Great Expectations, pandera, pyspark.sql.functions |
| Business Logic Layer | dbt (data build tool, for analytics-ready models) |

Load

| Target | Tools |
| --- | --- |
| Data Lakes | Amazon S3, Delta Lake, Apache Hudi, Iceberg |
| Data Warehouses | Snowflake, Google BigQuery, Redshift, **PostgreSQL**, Azure Synapse |
| Real-Time Store | **Redis**, ClickHouse, Elasticsearch |
| Geospatial Store | **PostGIS**, **GeoParquet**, Tile38 |

Orchestration & Scheduling

| Tool | Description |
| --- | --- |
| Airflow | Popular for DAG-based ETL scheduling and task dependencies |
| Dagster | Modern orchestration with strong typing and observability |
| Prefect | Lightweight alternative to Airflow, Python-native |
| Luigi | Simpler pipelines with dependency chaining |

Data Formats

| Format | Use Case |
| --- | --- |
| Parquet | Columnar, efficient for large-scale processing |
| CSV | Simple, but not optimized for performance |
| JSON | Good for semi-structured data or API responses |
| Avro / ORC | Often used with Kafka or Hadoop ecosystems |
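
As a small illustration of the CSV vs. Parquet trade-off in pandas (writing Parquet assumes the optional pyarrow dependency is installed):

```python
# Write the same frame as CSV and as Parquet to compare the two formats.
import pandas as pd

df = pd.DataFrame({"id": range(1_000), "value": [i * 0.5 for i in range(1_000)]})

df.to_csv("sample.csv", index=False)          # human-readable, row-oriented
df.to_parquet("sample.parquet", index=False)  # columnar, compressed, faster to scan
```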

Credentials & Secrets

| Tool | Use Case |
| --- | --- |
| AWS Secrets Manager / GCP Secret Manager | Managed secrets |
| .env + dotenv | Local development or simple deployments |
| Vault (HashiCorp) | Enterprise-grade secret management |
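
For the .env + dotenv option, a minimal usage sketch with python-dotenv; the variable name is hypothetical, and the .env file itself should never be committed:

```python
# Load credentials from a local .env file with python-dotenv.
# DB_PASSWORD is a hypothetical variable name for this example.
import os

from dotenv import load_dotenv

load_dotenv()                           # reads key=value pairs from .env into the environment
db_password = os.getenv("DB_PASSWORD")  # returns None if the variable is not set
```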

Monitoring / CI/CD

| Tool | Description |
| --- | --- |
| GitHub Actions / GitLab CI | CI/CD automation pipelines |
| Prometheus + Grafana | Metrics + dashboards for pipeline health |
| Sentry, PagerDuty | Alerting & error logging |

Further Reading

  1. ETL PPT
  2. Airflow Documentation
