Crunchbase Scraper

A tool for scraping company data from Crunchbase. It uses browser automation to work around protections such as Cloudflare and bot detection.

Features

- Browser Automation: Uses Selenium with Chrome WebDriver for stealthy scraping
- Data Extraction: Extracts company name, location, description, funding details, and investors
- Web Dashboard: Flask-based web interface for viewing data and managing scraping
- JSON Storage: Saves scraped data to JSON files
- Logging: Comprehensive logging for debugging and monitoring
- Rate Limiting: Built-in delays to avoid detection

Installation

Install Python Dependencies:
```
pip install -r requirements.txt
```
Chrome WebDriver: The webdriver-manager package automatically downloads the appropriate ChromeDriver.

Usage

Command Line Scraping

Run the scraper directly from the command line:

```
python scraper.py
```

This will scrape the default company (OpenAI) and save data to company_data.json.

Web Dashboard

Start the web dashboard:

```
python dashboard.py
```

Then open your browser to http://localhost:5000 to:

- View scraped company data
- Enter new Crunchbase URLs to scrape
- Monitor scraping status

Configuration

Adding More Companies

Edit the urls list in scraper.py:

```
urls = [
    "https://crunchbase.com/organization/openai",
    "https://crunchbase.com/organization/tesla",
    "https://crunchbase.com/organization/microsoft",
    # Add more URLs here
]
```
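As the list grows, a malformed or duplicated entry is easy to miss. The following is a minimal sketch of a pre-flight check, assuming every entry should look like `https://crunchbase.com/organization/<slug>` (with or without `www.`). The helper names `is_organization_url` and `clean_urls` are hypothetical and are not part of scraper.py.

```python
from urllib.parse import urlparse


def is_organization_url(url: str) -> bool:
    """Return True if url looks like https://crunchbase.com/organization/<slug>."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    parts = [part for part in parsed.path.split("/") if part]
    return (
        parsed.scheme in ("http", "https")
        and host == "crunchbase.com"
        and len(parts) == 2
        and parts[0] == "organization"
    )


def clean_urls(urls):
    """Drop malformed entries and duplicates while keeping the original order."""
    seen = set()
    result = []
    for url in urls:
        url = url.strip()
        if is_organization_url(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


urls = clean_urls([
    "https://crunchbase.com/organization/openai",
    "https://crunchbase.com/organization/openai",   # duplicate, dropped
    "https://example.com/organization/tesla",       # wrong host, dropped
    "https://crunchbase.com/organization/microsoft",
])
print(urls)
```

Running the list through a check like this before scraping avoids spending a browser session on a URL that cannot succeed.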

Custom Data Fields

Modify the data extraction section in scraper.py to add more fields:

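```
# Add more data extraction
# (place this import with the other imports at the top of scraper.py)
from selenium.common.exceptions import NoSuchElementException

try:
    website_element = driver.find_element(By.CSS_SELECTOR, '.website')
    data['website'] = website_element.text.strip()
except NoSuchElementException:
    # Catch only the "element not found" case so other errors still surface
    data['website'] = 'N/A'
```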
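When several fields follow the same pattern, the try/except can be factored into one loop. This is a sketch only: `extract_fields` is a hypothetical helper, the `.location` and `.founded` selectors are made-up placeholders, and a plain dictionary stands in for the page so the example runs without a browser.

```python
def extract_fields(find_text, selectors, missing=LookupError, default="N/A"):
    """Build a dict of field name -> text, using default when a lookup fails.

    find_text: callable that takes a CSS selector and returns the element text.
    missing:   exception type raised when no element matches the selector.
    """
    data = {}
    for field, selector in selectors.items():
        try:
            data[field] = find_text(selector).strip()
        except missing:
            data[field] = default
    return data


# Stand-in for a loaded page: selector -> text (placeholder values).
fake_page = {".website": " https://example.com ", ".location": "San Francisco"}

# Hypothetical selectors; real ones must be read from the live page.
selectors = {"website": ".website", "location": ".location", "founded": ".founded"}

data = extract_fields(fake_page.__getitem__, selectors, missing=KeyError)
print(data)
```

With Selenium, `find_text` could be `lambda sel: driver.find_element(By.CSS_SELECTOR, sel).text` and `missing` could be `NoSuchElementException`, which keeps the fallback behavior identical to the snippet above.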

Stealth Features

The scraper includes several anti-detection measures:

- Headless Chrome operation
- Randomized user agents
- Disabled automation indicators
- Realistic browser fingerprints
- Rate limiting between requests
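To illustrate how some of these measures map onto Chrome command-line flags, here is a minimal sketch that builds the argument list. It is not the exact configuration used in scraper.py: `build_chrome_arguments` is a hypothetical helper, and the two user-agent strings are examples that would need to be kept current.

```python
import random

# Example user-agent strings (assumed values for illustration only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def build_chrome_arguments(headless=True, rng=random):
    """Return Chrome flags for headless mode, a random user agent, and
    a disabled 'AutomationControlled' Blink feature."""
    args = [
        "--disable-blink-features=AutomationControlled",
        "--window-size=1920,1080",
        f"--user-agent={rng.choice(USER_AGENTS)}",
    ]
    if headless:
        args.insert(0, "--headless=new")
    return args


for argument in build_chrome_arguments():
    print(argument)
```

Each string would then be passed to `chrome_options.add_argument(...)` before the driver is created. Keeping the flags in a plain list makes them easy to log and to adjust when a measure stops working.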

Troubleshooting

Common Issues

- ChromeDriver Issues: The webdriver-manager package should handle this automatically. If it does not, download ChromeDriver manually.
- Cloudflare Blocking: If blocked, try:
  - Increasing wait times
  - Using different user agents
  - Adding proxy support
- Data Not Extracting: Crunchbase may have changed its page structure. Inspect the page and update the CSS selectors.
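One way to apply the "increasing wait times" suggestion is to retry a blocked request with a delay that doubles on each attempt. The sketch below assumes a hypothetical `fetch` callable that raises an exception when the page is blocked; the function name, the delay values, and the `flaky_fetch` stand-in are all illustrative and not taken from scraper.py.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); after each failure wait longer, then try again.

    The wait doubles per attempt (2s, 4s, 8s, ...) plus up to 1s of random
    jitter. The last failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed ({error}); waiting {delay:.1f}s")
            sleep(delay)


# Demonstration with a stand-in fetch that is "blocked" twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("blocked")
    return f"<html>{url}</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_backoff(
    flaky_fetch,
    "https://crunchbase.com/organization/openai",
    sleep=waits.append,
)
print(page)
```

In real use, the default `sleep=time.sleep` would be left in place so the delays actually happen; the demo replaces it only so the example finishes instantly.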

Adding Proxy Support

To add proxy support, modify the Chrome options:

```
chrome_options.add_argument('--proxy-server=http://your-proxy:port')
```

Data Output

Scraped data is saved to company_data.json in the following format:

```
{
    "company_name": "OpenAI",
    "location": "San Francisco, California, United States",
    "description": "OpenAI is an AI research and deployment company...",
    "funding_details": "Series A - $1M, Series B - $10M...",
    "investors": "Andreessen Horowitz, Sequoia Capital..."
}
```
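The output file can be checked after a run to see which fields were actually extracted. This is a minimal sketch under two assumptions: that the file holds either one object in the format above or a list of such objects, and that a failed extraction is stored as `'N/A'`. The helpers `load_companies` and `missing_fields` are hypothetical, and the demo writes a temporary sample file rather than reading real output.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_FIELDS = (
    "company_name",
    "location",
    "description",
    "funding_details",
    "investors",
)


def load_companies(path):
    """Load the output file and return a list of company records."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    # Assumption: the file holds either one object or a list of objects.
    return payload if isinstance(payload, list) else [payload]


def missing_fields(record):
    """Return the expected fields that are absent, empty, or set to 'N/A'."""
    return [name for name in EXPECTED_FIELDS if record.get(name) in (None, "", "N/A")]


# Demonstration with a temporary sample file.
sample = {
    "company_name": "OpenAI",
    "location": "San Francisco, California, United States",
    "description": "N/A",
}
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "company_data.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    for record in load_companies(path):
        print(record["company_name"], "missing:", missing_fields(record))
```

A non-empty result for many records is a hint that the CSS selectors no longer match the page.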

Legal and Ethical Considerations

- Respect Crunchbase's Terms of Service
- Use reasonable request rates
- Consider the legal implications of web scraping
- This tool is for educational and research purposes only

Development

Project Structure

```
crunchbase-scraper/
├── scraper.py            # Main scraping script
├── scraper_advanced.py   # Advanced scraping script with undetected-chromedriver
├── scraper_undetected.py # Alternative scraping script
├── dashboard.py          # Web dashboard
├── requirements.txt      # Python dependencies
├── company_data.json     # Scraped data output
├── TODO.md               # Development tasks
└── README.md             # This file
```

Final Testing and Documentation

- Run scraper.py to verify data extraction
- Run dashboard.py and test web interface functionality
- Verify JSON output format and data accuracy
- Update README.md with any final changes or improvements
- Mark final testing as complete

Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request

License

This project is for educational purposes. Please check local laws regarding web scraping.
