Scrape paid or free articles from a Substack newsletter, saving both HTML and Markdown versions.
- Clone the repository and enter the directory:
    git clone https://github.com/gitgithan/substack_scraper.git
    cd substack_scraper
- (Recommended) Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate
- Install dependencies:
    pip install requests beautifulsoup4 lxml markdownify selenium
- Install ChromeDriver and ensure it's in your PATH.
- Edit substack_scraper.py:
  - Set BASE_URL to your newsletter's main URL (e.g., https://newsletter.eng-leadership.com)
  - Set SITEMAP_STRING to the sitemap path (e.g., /sitemap.xml)
- Scrape free articles:
    python substack_scraper.py
- Scrape paid articles (manual login required):
    python substack_scraper.py --paid
  This launches a browser so you can log in manually (either email OTP or password works). After solving the captcha and logging in, press Enter in the terminal to continue scraping.
Note: If paid content does not load correctly, you may need to increase the sleep duration in the script (see sleep() in scrape_article_selenium). Paid articles sometimes take longer to render after login.
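If raising the fixed sleep becomes tedious, one alternative is to poll until the content is actually there. The sketch below is a generic, standard-library helper and not code from substack_scraper.py; the names wait_until, timeout, and interval are hypothetical, and the predicate you pass in is whatever readiness check suits the page.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the timeout was hit.
    (Hypothetical helper; not part of substack_scraper.py.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # Pause briefly so the loop does not spin at full CPU.
        time.sleep(interval)
```

With Selenium, the predicate could be something like `lambda: len(driver.find_elements(By.TAG_NAME, "article")) > 0`; which element signals a fully rendered paid article is an assumption to verify against the real page. Selenium's own WebDriverWait serves the same purpose if you prefer a built-in.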
- HTML files: html_files/
- Markdown files: md_files/
- Article metadata: articles.json
- List of URLs: urls.txt
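The URL list presumably comes from the sitemap configured through BASE_URL and SITEMAP_STRING. As a rough illustration of that step, the sketch below builds the sitemap address and pulls article links out of sitemap XML. It assumes the standard sitemaps.org format and that post URLs contain "/p/"; the function names sitemap_url and extract_post_urls are hypothetical and may not match what substack_scraper.py does internally.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Example values from the configuration step; replace with your own.
BASE_URL = "https://newsletter.eng-leadership.com"
SITEMAP_STRING = "/sitemap.xml"

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_url(base_url: str, sitemap_path: str) -> str:
    """Join the newsletter base URL and the sitemap path into one URL."""
    return urljoin(base_url.rstrip("/") + "/", sitemap_path.lstrip("/"))


def extract_post_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> entries that look like posts.

    Assumption: post URLs contain "/p/"; other pages (about, archive)
    are skipped.
    """
    root = ET.fromstring(sitemap_xml)
    locs = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in locs if "/p/" in url]
```

To try it against a live newsletter, fetch `sitemap_url(BASE_URL, SITEMAP_STRING)` with requests and pass the response text to `extract_post_urls`.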