This repository contains an Apache Airflow DAG that scrapes the Most Active Stocks table from Yahoo Finance and stores the top‑30 rows in a PostgreSQL table each day.
```
.
├── airflow/
│   └── dags/
│       └── fetch_and_store_most_active_stocks.py  # the DAG described below
└── README.md                                      # you are here
```
| Step | Task ID | Operator | Description |
|---|---|---|---|
| 1 | `fetch_stock_data` | PythonOperator | Scrape https://finance.yahoo.com/markets/stocks/most-active and collect the first 30 rows |
| 2 | `create_stocks_table` | PostgresOperator | Create the `most_active_stocks` table if it does not exist |
| 3 | `insert_stock_data` | PythonOperator | Upsert the scraped rows into Postgres (`ON CONFLICT (symbol) DO UPDATE`) |
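A minimal sketch of the scrape in step 1, assuming Yahoo Finance still serves the table as static HTML; the CSS selector, the browser-style `User-Agent` header, and the function name are illustrative assumptions, and the real DAG may parse the cells differently:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://finance.yahoo.com/markets/stocks/most-active"

def fetch_stock_data(**context):
    # Yahoo tends to reject requests without a browser-like User-Agent.
    resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    rows = []
    # Take the first 30 data rows of the page's stock table.
    for tr in soup.select("table tbody tr")[:30]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(cells)

    # Returned value is pushed to XCom for the downstream insert task.
    return rows
```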
The table schema (step 2 creates it if it does not already exist):

```sql
CREATE TABLE IF NOT EXISTS most_active_stocks (
    symbol        TEXT PRIMARY KEY,
    name          TEXT,
    price         NUMERIC,
    change        NUMERIC,
    change_pct    NUMERIC,
    volume        BIGINT,
    avg_volume_3m BIGINT,
    market_cap    TEXT,
    pe_ratio      NUMERIC,
    ingested_at   TIMESTAMP DEFAULT NOW()
);
```
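For step 3, the upsert can be sketched as below, assuming the scraped rows arrive via XCom in the schema's column order; the helper names are illustrative, but the `ON CONFLICT (symbol) DO UPDATE` clause matches the behavior described above:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

UPSERT_SQL = """
INSERT INTO most_active_stocks
    (symbol, name, price, change, change_pct, volume,
     avg_volume_3m, market_cap, pe_ratio)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (symbol) DO UPDATE SET
    name          = EXCLUDED.name,
    price         = EXCLUDED.price,
    change        = EXCLUDED.change,
    change_pct    = EXCLUDED.change_pct,
    volume        = EXCLUDED.volume,
    avg_volume_3m = EXCLUDED.avg_volume_3m,
    market_cap    = EXCLUDED.market_cap,
    pe_ratio      = EXCLUDED.pe_ratio,
    ingested_at   = NOW();
"""

def insert_stock_data(**context):
    # Pull the rows scraped by the upstream task.
    rows = context["ti"].xcom_pull(task_ids="fetch_stock_data")
    hook = PostgresHook(postgres_conn_id="stocks_connection")
    conn = hook.get_conn()
    # The connection context manager commits on success.
    with conn, conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
```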
### Requirements

| Tool | Version | Notes |
|---|---|---|
| Python | 3.9+ | Pinned in Airflow image |
| Apache Airflow | 2.6+ | Any executor (Local, Celery, etc.) |
| PostgreSQL | 12+ | Can be external RDS or local container |
| Requests / BeautifulSoup | 2.31+ / 4.12+ | Installed via DAG requirements |
### Airflow Connection

Create a Postgres connection in the Airflow UI named `stocks_connection` pointing to your database. Example URI:

```
postgresql://airflow:airflow@postgres:5432/airflow
```
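If you prefer the CLI over the UI, the same connection can be created with a one-liner like this (credentials as in the example URI above; adjust for your database):

```bash
airflow connections add stocks_connection \
    --conn-uri 'postgresql://airflow:airflow@postgres:5432/airflow'
```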
- Clone / copy the DAG file into your `AIRFLOW_HOME/dags` folder.
- Ensure the Airflow worker image has `requests` and `beautifulsoup4` installed, e.g. via `pip install -r requirements.txt` with:

  ```
  requests>=2.31
  beautifulsoup4>=4.12
  pandas>=2.2
  ```

- Add the `stocks_connection` connection (see above).
- Trigger the DAG manually (see the command below) or wait for the scheduled daily run.
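For a manual run from the scheduler container, something like this should work, assuming the DAG id matches the filename:

```bash
airflow dags trigger fetch_and_store_most_active_stocks
```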
The DAG is configured with `schedule_interval=timedelta(days=1)` and `catchup=False`, so it runs once per day relative to its `start_date` and does not backfill intervals missed before deployment.
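In code, the scheduling block looks roughly like this sketch; the `start_date` shown is an assumption, as the README does not pin one:

```python
from datetime import datetime, timedelta
from airflow import DAG

with DAG(
    dag_id="fetch_and_store_most_active_stocks",
    start_date=datetime(2024, 1, 1),      # assumed; the real DAG pins its own
    schedule_interval=timedelta(days=1),  # one run per day
    catchup=False,                        # skip intervals before deployment
) as dag:
    ...  # fetch_stock_data >> create_stocks_table >> insert_stock_data
```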