A comprehensive tool for scraping company data from Crunchbase using browser automation to bypass protections like Cloudflare and bot detection.
- Browser Automation: Uses Selenium with Chrome WebDriver for stealthy scraping
- Data Extraction: Extracts company name, location, description, funding details, and investors
- Web Dashboard: Flask-based web interface for easy data viewing and scraping management
- JSON Storage: Saves scraped data to JSON files
- Logging: Comprehensive logging for debugging and monitoring
- Rate Limiting: Built-in delays to avoid detection
- Install Python Dependencies:
pip install -r requirements.txt
- Chrome WebDriver: webdriver-manager automatically downloads the appropriate ChromeDriver.
Run the scraper directly from the command line:
python scraper.py
This scrapes the default company (OpenAI) and saves the data to company_data.json.
Start the web dashboard:
python dashboard.py
Then open your browser to http://localhost:5000 to:
- View scraped company data
- Enter new Crunchbase URLs to scrape
- Monitor scraping status
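Because the dashboard accepts arbitrary URLs, it can help to check that a submitted URL looks like a Crunchbase organization page before starting a scrape. The following is a minimal sketch of such a check; `is_crunchbase_org_url` and `ALLOWED_HOSTS` are hypothetical names, not part of the project's code, and the accepted hosts are an assumption based on the URL examples in this README.

```python
from urllib.parse import urlparse

# Hypothetical helper: the hosts below are assumed from this README's URL examples.
ALLOWED_HOSTS = {"crunchbase.com", "www.crunchbase.com"}


def is_crunchbase_org_url(url: str) -> bool:
    """Return True for URLs shaped like https://crunchbase.com/organization/<slug>."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return False
    # hostname is lowercased by urlparse and is None when no host is present
    if parsed.hostname not in ALLOWED_HOSTS:
        return False
    # Accept /organization/<slug> and any sub-page below it
    parts = [p for p in parsed.path.split("/") if p]
    return len(parts) >= 2 and parts[0] == "organization"
```

A dashboard route could call this before queuing a scrape and return an error message for anything that fails the check.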
Edit the urls list in scraper.py:
urls = [
    "https://crunchbase.com/organization/openai",
    "https://crunchbase.com/organization/tesla",
    "https://crunchbase.com/organization/microsoft",
    # Add more URLs here
]
Modify the data extraction section in scraper.py to add more fields:
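from selenium.common.exceptions import NoSuchElementException

# Extract the company website; fall back to 'N/A' if the element is missing
try:
    website_element = driver.find_element(By.CSS_SELECTOR, '.website')
    data['website'] = website_element.text.strip()
except NoSuchElementException:
    data['website'] = 'N/A'
The scraper includes several anti-detection measures: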
- Headless Chrome operation
- Randomized user agents
- Disabled automation indicators
- Realistic browser fingerprints
- Rate limiting between requests
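To make the list above more concrete, here is a minimal sketch of how some of these measures map onto Chrome command-line switches. It is not the project's actual configuration: `build_chrome_arguments` and the `USER_AGENTS` pool are hypothetical, and the user-agent strings are illustrative values only.

```python
import random

# Illustrative values only; the project's real user-agent pool is not shown in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def build_chrome_arguments(rng=random):
    """Return Chrome switches for headless, less bot-like browsing (hypothetical helper)."""
    return [
        "--headless=new",                                 # headless operation
        "--disable-blink-features=AutomationControlled",  # hide the automation indicator
        f"--user-agent={rng.choice(USER_AGENTS)}",        # randomized user agent
        "--window-size=1920,1080",                        # common desktop viewport
    ]


# Usage with Selenium (assumes selenium is installed):
#   from selenium.webdriver.chrome.options import Options
#   chrome_options = Options()
#   for arg in build_chrome_arguments():
#       chrome_options.add_argument(arg)
```

Keeping the switches in a plain list makes them easy to log and to adjust when a particular measure stops working.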
- ChromeDriver Issues: webdriver-manager should handle this automatically. If it does not, download ChromeDriver manually.
- Cloudflare Blocking: If blocked, try:
  - Increasing wait times
  - Using different user agents
  - Adding proxy support
- Data Not Extracting: Crunchbase may have changed its page structure. Inspect the page and update the CSS selectors.
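One way to apply the "increasing wait times" suggestion is to retry a blocked request with exponentially growing delays. The sketch below assumes a hypothetical `fetch` callable that raises an exception when a page is blocked or fails to load; `fetch_with_backoff` and its default values are illustrative and not taken from the project.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponentially increasing waits.

    The wait before retry n (0-based) is base_delay * 2**n seconds plus up to
    one second of random jitter. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

The `sleep` parameter exists so the waiting behaviour can be replaced in tests; in normal use the default `time.sleep` applies.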
To add proxy support, modify the Chrome options:
chrome_options.add_argument('--proxy-server=http://your-proxy:port')
Scraped data is saved to company_data.json in the following format:
{
  "company_name": "OpenAI",
  "location": "San Francisco, California, United States",
  "description": "OpenAI is an AI research and deployment company...",
  "funding_details": "Series A - $1M, Series B - $10M...",
  "investors": "Andreessen Horowitz, Sequoia Capital..."
}
- Respect Crunchbase's Terms of Service
- Use reasonable request rates
- Consider the legal implications of web scraping
- This tool is for educational and research purposes only
crunchbase-scraper/
├── scraper.py # Main scraping script
├── scraper_advanced.py # Advanced scraping script with undetected-chromedriver
├── scraper_undetected.py # Alternative scraping script
├── dashboard.py # Web dashboard
├── requirements.txt # Python dependencies
├── company_data.json # Scraped data output
├── TODO.md # Development tasks
└── README.md # This file
- Run scraper.py to verify data extraction
- Run dashboard.py and test web interface functionality
- Verify JSON output format and data accuracy
- Update README.md with any final changes or improvements
- Mark final testing as complete
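The JSON verification step can be partly automated. The sketch below checks each record against the five fields shown in the sample output earlier in this README; `missing_fields` and `check_output` are hypothetical helpers, and treating the file as either a single object or a list of objects is an assumption, since the README does not say how multiple companies are stored.

```python
import json

# Field names taken from the sample output shown earlier in this README.
EXPECTED_FIELDS = ("company_name", "location", "description", "funding_details", "investors")


def missing_fields(record: dict) -> list:
    """Return the expected fields that are absent, empty, or set to 'N/A'."""
    return [f for f in EXPECTED_FIELDS if not record.get(f) or record.get(f) == "N/A"]


def check_output(path="company_data.json"):
    """Map each company name in the output file to its list of missing fields."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: the file holds one object or a list of objects
    records = data if isinstance(data, list) else [data]
    return {r.get("company_name", "<unknown>"): missing_fields(r) for r in records}
```

An empty list for every company means all expected fields were populated; it does not confirm that the values are accurate, which still needs a manual comparison against the live page.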
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is for educational purposes. Please check local laws regarding web scraping.