
๐Ÿ•ท๏ธ Recursive Link Crawler (Streamlit)

Crawler screenshot

A simple, production-friendly Streamlit web app that recursively crawls links from a start page up to a specified maximum depth.
It supports BFS crawling, obeys robots.txt, filters non-HTML resources, and exports results as CSV.


✨ Features

| Feature | Description |
| --- | --- |
| 🌐 Recursive BFS crawling | Crawl links up to a configurable max depth |
| 🧭 Domain scoping | Restrict to same domain or include subdomains |
| 🚫 robots.txt awareness | Toggle robots.txt compliance on or off |
| 🕰️ Rate limiting & timeouts | Polite crawling with delays and request timeouts |
| 🧹 Smart URL normalization | Normalize and deduplicate URLs |
| 🗂️ Binary content filtering | Skip images, videos, docs, and other non-HTML resources |
| 📊 Live results table | View depth, status, content type, and notes in real time |
| 💾 Download results as CSV | Export crawl results |
| ⚙️ Configurable via sidebar | All options adjustable in the Streamlit sidebar |

๐Ÿ“ Project Structure

File/Folder Description
app.py Main Streamlit application
requirements.txt Python dependencies

🧰 Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.9+ |
| streamlit | >=1.32 |
| requests | >=2.31 |
| beautifulsoup4 | >=4.12 |
| pandas | >=2.0 |
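Based on the version bounds above, the `requirements.txt` presumably contains something like:

```
streamlit>=1.32
requests>=2.31
beautifulsoup4>=4.12
pandas>=2.0
```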

🚀 How to Run

```bash
# 1. Clone or download the repository
git clone https://github.com/your-username/recursive-link-crawler.git
cd recursive-link-crawler

# 2. (Optional) Create and activate a virtual environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the Streamlit app
streamlit run app.py
```

โš™๏ธ Configuration Options

Setting Description
Start URL Initial page to start crawling
Max Depth Maximum recursion level
Restrict to Same Domain Limit crawling to the same host
Include Subdomains Include links from subdomains
Delay Between Requests Delay (seconds) between requests
Request Timeout Maximum wait time for a response
Respect Robots.txt Skip URLs blocked by robots.txt
User-Agent Identify your crawler politely
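The robots.txt and URL-normalization options can both be handled with the standard library. A minimal sketch (function names and the user-agent string are illustrative; the real app may differ — note the robots.txt content is passed in as lines so the check works without a network fetch):

```python
from urllib.parse import urlparse, urlunparse
from urllib.robotparser import RobotFileParser

def normalize_url(url):
    """Lowercase scheme/host and drop the fragment so duplicates collapse."""
    p = urlparse(url)
    return urlunparse((
        p.scheme.lower(),
        p.netloc.lower(),
        p.path or "/",   # treat "https://host" and "https://host/" the same
        p.params,
        p.query,
        "",              # fragments never change the fetched document
    ))

def allowed_by_robots(robots_lines, url, user_agent="RecursiveCrawler/1.0"):
    """Check url against an already-fetched robots.txt, given as lines."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, url)
```

With `Respect Robots.txt` enabled, the crawler would fetch each host's `/robots.txt` once, then run every candidate URL through a check like `allowed_by_robots` before requesting it.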
