namantiwari2002/StockData_ETL

ETL Stocks — Most‑Active Stocks DAG

This repository contains an Apache Airflow DAG that scrapes the Most Active Stocks table from Yahoo Finance and stores the top‑30 rows in a PostgreSQL table each day.


🗺️ Project Structure

├── airflow/
│   └── dags/
│       └── fetch_and_store_most_active_stocks.py  # the DAG described below
└── README.md                                      # you are here

⚙️ What the DAG Does

| Step | Task ID | Operator | Description |
|------|---------|----------|-------------|
| 1 | `fetch_stock_data` | PythonOperator | Scrape https://finance.yahoo.com/markets/stocks/most-active and collect the first 30 rows |
| 2 | `create_stocks_table` | PostgresOperator | Create the `most_active_stocks` table if it does not exist |
| 3 | `insert_stock_data` | PythonOperator | Upsert the scraped rows into Postgres (`ON CONFLICT (symbol) DO UPDATE`) |
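The fetch step can be sketched against a static HTML fragment. Note this is an illustration only: the real DAG scrapes the live Yahoo Finance page, whose markup may differ, and `parse_rows` is an assumed helper name, not the DAG's actual code.

```python
from bs4 import BeautifulSoup

# Illustrative snippet shaped like rows of a "Most Active" table;
# the live page's markup is an assumption here.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>NVDA</td><td>NVIDIA Corporation</td><td>131.26</td></tr>
    <tr><td>TSLA</td><td>Tesla, Inc.</td><td>248.98</td></tr>
  </tbody>
</table>
"""

def parse_rows(html, limit=30):
    """Collect up to `limit` rows of (symbol, name, price) from the table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table tbody tr")[:limit]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append({"symbol": cells[0], "name": cells[1], "price": float(cells[2])})
    return rows

print(parse_rows(SAMPLE_HTML)[0]["symbol"])  # NVDA
```

The `limit=30` parameter mirrors the "first 30 rows" behaviour from the table above.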

The table schema:

symbol         TEXT PRIMARY KEY,
name           TEXT,
price          NUMERIC,
change         NUMERIC,
change_pct     NUMERIC,
volume         BIGINT,
avg_volume_3m  BIGINT,
market_cap     TEXT,
pe_ratio       NUMERIC,
ingested_at    TIMESTAMP DEFAULT NOW()
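Yahoo renders volume and percent columns as abbreviated display strings ("65.3M", "+1.25%"), which have to be normalized before they fit the `BIGINT` and `NUMERIC` columns above (`market_cap` stays `TEXT`, so it can be stored raw). A minimal sketch; the suffix rules are assumptions about Yahoo's formatting, not guarantees:

```python
def parse_abbrev(text):
    """Convert strings like '65.3M' or '1.2B' to an integer count."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000, "T": 1_000_000_000_000}
    text = text.strip()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(float(text.replace(",", "")))

def parse_pct(text):
    """Convert '+1.25%' or '-0.8%' to a plain float for the NUMERIC column."""
    return float(text.strip().rstrip("%").replace("+", ""))

print(parse_abbrev("65.3M"))  # 65300000
print(parse_pct("+1.25%"))    # 1.25
```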

🛠️ Prerequisites

| Tool | Version | Notes |
|------|---------|-------|
| Python | 3.9+ | Pinned in the Airflow image |
| Apache Airflow | 2.6+ | Any executor (Local, Celery, etc.) |
| PostgreSQL | 12+ | Can be an external RDS instance or a local container |
| Requests / BeautifulSoup | — | Installed via the DAG requirements |

### Airflow Connection

Create a Postgres connection in the Airflow UI named stocks_connection pointing to your database. Example URI:

postgresql://airflow:airflow@postgres:5432/airflow
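The same connection can also be created from the CLI instead of the UI (the host and credentials below are the example values above, not requirements):

```shell
airflow connections add 'stocks_connection' \
    --conn-uri 'postgresql://airflow:airflow@postgres:5432/airflow'
```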

🚀 Setup & Run

  1. Clone / copy the DAG file into your AIRFLOW_HOME/dags folder.

  2. Ensure the Airflow worker image has requests and beautifulsoup4 installed, e.g. via pip install -r requirements.txt with:

    requests>=2.31
    beautifulsoup4>=4.12
    pandas>=2.2
  3. Add the stocks_connection connection.

  4. Trigger the DAG manually or wait for the scheduled daily run.
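Before the first triggered run, the upsert semantics from step 3 can be checked locally without Postgres, since SQLite supports the same `ON CONFLICT ... DO UPDATE` clause. The two-column table below is a simplified illustration, not the DAG's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE most_active_stocks (symbol TEXT PRIMARY KEY, price NUMERIC)")

upsert = """
    INSERT INTO most_active_stocks (symbol, price) VALUES (?, ?)
    ON CONFLICT(symbol) DO UPDATE SET price = excluded.price
"""
conn.execute(upsert, ("NVDA", 131.26))
conn.execute(upsert, ("NVDA", 134.50))  # second run updates instead of duplicating

rows = conn.execute("SELECT symbol, price FROM most_active_stocks").fetchall()
print(rows)  # [('NVDA', 134.5)]
```

Because `symbol` is the primary key, re-running the DAG refreshes existing rows rather than accumulating duplicates.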


⏰ Schedule

The DAG is configured with schedule_interval=timedelta(days=1) and catchup=False, so it runs once every 24 hours measured from its start_date, and intervals that elapsed before the scheduler was running are not backfilled.
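A quick sketch of the interval arithmetic (the start_date is an assumed example; Airflow fires a run once its data interval closes):

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 1, 1)   # assumed example start_date
interval = timedelta(days=1)        # matches schedule_interval in the DAG

# The first run fires one full interval after start_date,
# covering the data interval [start_date, start_date + interval).
first_run = start_date + interval
print(first_run)  # 2024-01-02 00:00:00
```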
