Wangombe550/scrapping

Crunchbase Scraper

A comprehensive tool for scraping company data from Crunchbase using browser automation to bypass protections like Cloudflare and bot detection.

Features

  • Browser Automation: Uses Selenium with Chrome WebDriver for stealthy scraping
  • Data Extraction: Extracts company name, location, description, funding details, and investors
  • Web Dashboard: Flask-based web interface for easy data viewing and scraping management
  • JSON Storage: Saves scraped data to JSON files
  • Logging: Comprehensive logging for debugging and monitoring
  • Rate Limiting: Built-in delays to avoid detection

Installation

  1. Install Python Dependencies:

    pip install -r requirements.txt
  2. Chrome WebDriver: webdriver-manager automatically downloads a ChromeDriver build that matches the installed Chrome version.

Usage

Command Line Scraping

Run the scraper directly from the command line:

python scraper.py

This will scrape the default company (OpenAI) and save data to company_data.json.

Web Dashboard

Start the web dashboard:

python dashboard.py

Then open your browser to http://localhost:5000 to:

  • View scraped company data
  • Enter new Crunchbase URLs to scrape
  • Monitor scraping status

Configuration

Adding More Companies

Edit the urls list in scraper.py:

urls = [
    "https://crunchbase.com/organization/openai",
    "https://crunchbase.com/organization/tesla",
    "https://crunchbase.com/organization/microsoft",
    # Add more URLs here
]
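Because a mistyped entry in this list only fails once the browser has loaded it, it can help to check URLs up front. The following is a minimal sketch, not part of scraper.py: the helper name is_organization_url and the ALLOWED_HOSTS set (including the www. host) are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumption: both the bare and the www. host are acceptable.
ALLOWED_HOSTS = {"crunchbase.com", "www.crunchbase.com"}


def is_organization_url(url: str) -> bool:
    """Return True for URLs shaped like https://crunchbase.com/organization/<slug>."""
    parsed = urlparse(url)
    # Drop empty segments so a trailing slash does not matter.
    parts = [p for p in parsed.path.split("/") if p]
    return (
        parsed.scheme == "https"
        and parsed.netloc.lower() in ALLOWED_HOSTS
        and len(parts) == 2
        and parts[0] == "organization"
    )


urls = [
    "https://crunchbase.com/organization/openai",
    "https://crunchbase.com/organization/tesla",
    "https://example.com/organization/openai",  # wrong host, filtered out
]

# Keep only entries that look like Crunchbase organization pages.
valid_urls = [u for u in urls if is_organization_url(u)]
print(valid_urls)
```

Filtering before the browser starts keeps obviously wrong entries out of the run and out of the logs.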

Custom Data Fields

Modify the data extraction section in scraper.py to add more fields:

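# Extract an additional field; fall back to 'N/A' when the element is missing
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

try:
    website_element = driver.find_element(By.CSS_SELECTOR, '.website')
    data['website'] = website_element.text.strip()
except NoSuchElementException:
    # Catch only the "element not found" case so other errors stay visible
    data['website'] = 'N/A'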
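When several fields are added this way, the try/except pattern above can be factored into a loop over a field-to-selector table. This is a sketch under assumptions, not the code in scraper.py: extract_fields and EXTRA_FIELDS are hypothetical names, and the selectors are placeholders that must be replaced with ones found by inspecting the live page.

```python
try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:
    # Fallback so the sketch can be read and exercised without Selenium installed.
    class NoSuchElementException(Exception):
        pass

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS = "css selector"

# Hypothetical selectors; replace them after inspecting the page.
EXTRA_FIELDS = {
    "website": ".website",
    "founded": ".founded-date",
}


def extract_fields(driver, selectors, default="N/A"):
    """Return {field: text} for each selector, using `default` when an element is missing."""
    data = {}
    for field, selector in selectors.items():
        try:
            data[field] = driver.find_element(CSS, selector).text.strip()
        except NoSuchElementException:
            data[field] = default
    return data
```

With a live driver, data.update(extract_fields(driver, EXTRA_FIELDS)) would merge the extra fields into the existing record.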

Stealth Features

The scraper includes several anti-detection measures:

  • Headless Chrome operation
  • Randomized user agents
  • Disabled automation indicators
  • Realistic browser fingerprints
  • Rate limiting between requests

Troubleshooting

Common Issues

  1. ChromeDriver Issues: webdriver-manager should handle this automatically. If it does not, manually download the ChromeDriver build that matches your installed Chrome version.

  2. Cloudflare Blocking: If blocked, try:

    • Increasing wait times
    • Using different user agents
    • Adding proxy support
  3. Data Not Extracting: Crunchbase may have changed its page structure. Inspect the page and update the CSS selectors in scraper.py.

Adding Proxy Support

To add proxy support, modify the Chrome options:

chrome_options.add_argument('--proxy-server=http://your-proxy:port')

Data Output

Scraped data is saved to company_data.json in the following format:

{
    "company_name": "OpenAI",
    "location": "San Francisco, California, United States",
    "description": "OpenAI is an AI research and deployment company...",
    "funding_details": "Series A - $1M, Series B - $10M...",
    "investors": "Andreessen Horowitz, Sequoia Capital..."
}
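A quick way to check a scrape is to load the file and list the fields that came back empty. The sketch below assumes the file holds a single object with the five keys shown above. The helper names are hypothetical, and treating "N/A" as "missing" is an assumption carried over from the fallback value used in the custom-field example.

```python
import json
from pathlib import Path

EXPECTED_FIELDS = (
    "company_name",
    "location",
    "description",
    "funding_details",
    "investors",
)


def load_company(path="company_data.json") -> dict:
    """Read one scraped company record from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def missing_fields(record: dict) -> list:
    """Return the expected fields that are absent, empty, or 'N/A'."""
    return [f for f in EXPECTED_FIELDS if record.get(f) in (None, "", "N/A")]


sample = {
    "company_name": "OpenAI",
    "location": "San Francisco, California, United States",
    "description": "OpenAI is an AI research and deployment company...",
    "funding_details": "Series A - $1M, Series B - $10M...",
    "investors": "N/A",
}
print(missing_fields(sample))  # ['investors']
```

A non-empty result usually points at a selector that no longer matches the page.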

Legal and Ethical Considerations

  • Respect Crunchbase's Terms of Service
  • Use reasonable request rates
  • Consider the legal implications of web scraping
  • This tool is for educational and research purposes only
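One way to keep request rates reasonable is to pause for a randomized interval between pages. This is a minimal sketch, not necessarily how scraper.py implements its delays: polite_delay, scrape_all, and the 2 to 5 second defaults are assumptions chosen for illustration.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random interval in [min_seconds, max_seconds] and return it."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def scrape_all(urls, scrape_one, **delay_kwargs):
    """Call scrape_one(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, one pause before each later one.
            polite_delay(**delay_kwargs)
        results.append(scrape_one(url))
    return results
```

The sleep parameter is injectable so the pacing logic can be tested without actually waiting.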

Development

Project Structure

crunchbase-scraper/
├── scraper.py             # Main scraping script
├── scraper_advanced.py    # Advanced scraping script with undetected-chromedriver
├── scraper_undetected.py  # Alternative scraping script
├── dashboard.py           # Web dashboard
├── requirements.txt       # Python dependencies
├── company_data.json      # Scraped data output
├── TODO.md                # Development tasks
└── README.md              # This file

Final Testing and Documentation

  • Run scraper.py to verify data extraction
  • Run dashboard.py and test web interface functionality
  • Verify JSON output format and data accuracy
  • Update README.md with any final changes or improvements
  • Mark final testing as complete

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

License

This project is for educational purposes. Please check local laws regarding web scraping.
