# Recursive Link Crawler

A simple, production-friendly Streamlit web app that recursively crawls links from a start page up to a configurable maximum depth. It crawls breadth-first (BFS), obeys robots.txt, filters out non-HTML resources, and exports results as CSV.
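Robots.txt compliance can be handled with Python's standard-library parser. A minimal sketch (the rules below are made-up examples for illustration, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, fed in as lines for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Paths under /private/ are disallowed; everything else is allowed.
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/docs"))          # True
```

In a real crawler you would fetch each host's `/robots.txt` (e.g. via `rp.set_url(...)` and `rp.read()`) instead of parsing hard-coded lines.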
## ✨ Features

| Feature | Description |
| --- | --- |
| Recursive BFS crawling | Crawl links up to a configurable maximum depth |
| Domain scoping | Restrict crawling to the same domain, or include subdomains |
| robots.txt awareness | Toggle robots.txt compliance on or off |
| Rate limiting & timeouts | Polite crawling with per-request delays and timeouts |
| Smart URL normalization | Normalize and deduplicate URLs |
| Binary content filtering | Skip images, videos, documents, and other non-HTML resources |
| Live results table | View depth, status, content type, and notes in real time |
| CSV download | Export crawl results as a CSV file |
| Configurable via sidebar | All options adjustable in the Streamlit sidebar |
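The core idea behind the first two features can be sketched with a depth-limited BFS over normalized, deduplicated URLs. This is an illustrative sketch, not the app's actual code: the toy `site` dict and `get_links` callback stand in for fetching and parsing real pages.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

def normalize(base, link):
    # Resolve relative links against the current page and drop
    # #fragments so trivially duplicate URLs collapse to one entry.
    return urldefrag(urljoin(base, link)).url

def bfs_crawl(start, get_links, max_depth):
    seen = {start}
    queue = deque([(start, 0)])   # (url, depth) pairs
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # don't expand links beyond the depth limit
        for link in get_links(url):
            u = normalize(url, link)
            if u not in seen:
                seen.add(u)
                queue.append((u, depth + 1))
    return order

# Toy "site": each page maps to the raw links found on it.
site = {
    "https://ex.com/":  ["/a", "/b#top", "/a"],  # duplicates and fragments
    "https://ex.com/a": ["/b"],
    "https://ex.com/b": ["/c"],
    "https://ex.com/c": [],
}
result = bfs_crawl("https://ex.com/", lambda u: site.get(u, []), max_depth=2)
print(result)
# → [('https://ex.com/', 0), ('https://ex.com/a', 1),
#    ('https://ex.com/b', 1), ('https://ex.com/c', 2)]
```

A real implementation would fetch each URL with `requests`, extract `<a href>` values with BeautifulSoup, and apply the domain-scoping, robots.txt, and content-type filters before enqueueing.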
## Project Structure

| File/Folder | Description |
| --- | --- |
| `app.py` | Main Streamlit application |
| `requirements.txt` | Python dependencies |
## Requirements

| Requirement | Version |
| --- | --- |
| Python | 3.9+ |
| streamlit | >=1.32 |
| requests | >=2.31 |
| beautifulsoup4 | >=4.12 |
| pandas | >=2.0 |
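Based on the versions in the table above, a matching `requirements.txt` would look like:

```text
streamlit>=1.32
requests>=2.31
beautifulsoup4>=4.12
pandas>=2.0
```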
## How to Run

```bash
# 1. Clone or download the repository
git clone https://github.com/your-username/recursive-link-crawler.git
cd recursive-link-crawler

# 2. (Optional) Create and activate a virtual environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the Streamlit app
streamlit run app.py
```