A web crawler built with the Scrapy framework to collect job listings from GradConnection Australia. The spider targets computer science jobs by default but can easily be customized to scrape listings for any field.
- 🔍 Scrapes job listings from GradConnection Australia
- 📊 Extracts key job information: title, company, type, location, deadline, and link
- 🔧 Easily customizable for different job categories
- 📄 Exports data to various formats (JSON, CSV, XML)
- ⚡ Efficient pagination handling for complete data collection
```
job/
├── spiders/
│   ├── gradconnection.py   # Main spider for GradConnection
│   └── prosple.py          # Additional spider for Prosple
├── items.py                # Data structure definitions
├── pipelines.py            # Data processing pipelines
├── settings.py             # Scrapy configuration
├── middlewares.py          # Custom middlewares
└── README.md               # This file
```
The scraper collects the following information for each job:
- Title: Job position title
- Company: Employer name
- Type: Job type (full-time, part-time, internship, etc.)
- Location: Job location
- Deadline: Application deadline
- Link: Direct link to the job posting
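The Deadline field usually arrives as relative text such as "Closing in 15 days". A small stdlib helper (hypothetical, not part of the spider) can convert it to an absolute date for easier filtering:

```python
import re
from datetime import date, timedelta

def deadline_to_date(text, today=None):
    """Convert a relative deadline like 'Closing in 15 days' to a date.

    Returns None when the text does not match the expected pattern.
    """
    today = today or date.today()
    match = re.search(r"(\d+)\s+day", text)
    if match:
        return today + timedelta(days=int(match.group(1)))
    return None
```

This could run in a pipeline or as post-processing on the exported file; the exact deadline wording on the site may vary, so the regex is a best-effort guess.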
- Python 3.7+
- Scrapy framework
- Clone this repository:

```
git clone <your-repository-url>
cd job
```

- Install required dependencies:

```
pip install scrapy
```

Run the spider to scrape computer science jobs:

```
scrapy crawl gradconnection
```

Export scraped data to JSON:

```
scrapy crawl gradconnection -o jobs.json
```

Export to CSV:

```
scrapy crawl gradconnection -o jobs.csv
```

Export to XML:

```
scrapy crawl gradconnection -o jobs.xml
```

To scrape jobs for a different field, modify `start_urls` in `spiders/gradconnection.py`:
```python
# Current URL for computer science jobs
start_urls = ["https://au.gradconnection.com/jobs/computer-science/"]

# Example: Change to engineering jobs
start_urls = ["https://au.gradconnection.com/jobs/engineering/"]

# Example: Change to business jobs
start_urls = ["https://au.gradconnection.com/jobs/business/"]
```

Common job categories on GradConnection include:

- computer-science
- engineering
- business
- finance
- marketing
- law
- science
- healthcare
To add or modify the data fields being scraped, update the `JobItem` class in `items.py`:
```python
class JobItem(scrapy.Item):
    title = scrapy.Field()
    company = scrapy.Field()
    type = scrapy.Field()
    location = scrapy.Field()
    deadline = scrapy.Field()
    link = scrapy.Field()
    # Add new fields here
    salary = scrapy.Field()
    description = scrapy.Field()
```

Then update the corresponding extraction logic in `spiders/gradconnection.py`.
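As a rough sketch of that extraction step, the spider's parse method might populate the new fields like this. The CSS selectors and listing-container class here are illustrative guesses, not the spider's actual ones; inspect the live page structure before relying on them:

```python
# from job.items import JobItem  # already imported in the spider

def parse(self, response):
    # Each listing box on the results page -- selector is a placeholder
    for box in response.css("div.box-container"):
        item = JobItem()
        item["title"] = box.css("h3 a::text").get()
        item["salary"] = box.css("p.salary::text").get()        # new field
        item["description"] = box.css("p.ellipsis::text").get()  # new field
        yield item
```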
Key settings can be modified in settings.py:
- CONCURRENT_REQUESTS: Number of concurrent requests (default: 8)
- ROBOTSTXT_OBEY: Whether to obey robots.txt (default: False)
- ITEM_PIPELINES: Data processing pipelines
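A minimal sketch of how those settings might look in `settings.py`. The pipeline path is a placeholder, assuming a pipeline class exists in `job/pipelines.py`; substitute the actual class name from your project:

```python
# settings.py -- values shown match the defaults described above
CONCURRENT_REQUESTS = 8
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    # Path is illustrative; lower numbers run earlier (range 0-1000)
    "job.pipelines.JobPipeline": 300,
}
```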
To be respectful to the target website, consider adding delays between requests:
```python
# In settings.py
DOWNLOAD_DELAY = 2               # 2-second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # waits between 0.5 * and 1.5 * DOWNLOAD_DELAY
```

The scraper outputs data in the following format:
```json
{
    "title": "Software Engineer Graduate",
    "company": "Tech Company",
    "type": "Graduate Program",
    "location": "Sydney, NSW",
    "deadline": "Closing in 15 days",
    "link": "https://au.gradconnection.com/jobs/12345/"
}
```

Here's an example of the actual scraped computer science job data:
Example output showing scraped computer science job listings from GradConnection, including job titles, companies, locations, deadlines, and direct links to job postings.
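Values scraped from the page often carry stray whitespace. A minimal cleaning pipeline (a sketch, not the project's actual `pipelines.py`) could normalize fields before export:

```python
class JobCleanerPipeline:
    """Strips surrounding whitespace from every string field in an item."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item
```

Enable it by adding its path to ITEM_PIPELINES in `settings.py`.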
- 🤖 Always respect the website's robots.txt file
- ⏱️ Implement appropriate delays between requests
- 📊 Use scraped data responsibly and in accordance with the website's terms of service
- 🔒 Do not overload the target server with excessive requests
- No data scraped: check whether the website structure has changed
- Rate limiting: increase DOWNLOAD_DELAY in settings.py
- Connection errors: check your internet connection and the target website's availability
Run the spider in debug mode for troubleshooting:
```
scrapy crawl gradconnection -L DEBUG
```

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is intended for educational and research purposes. Users are responsible for ensuring their use complies with GradConnection's terms of service, its robots.txt, and applicable laws.
