RSS-driven article crawler and scraper.
To get started, first install the base requirements:
pip install -r base_requirements.txt
If JPype fails to install, try:
sudo apt-get install python-jpype
Then install the rest of the requirements:
pip install -r requirements.txt
Next, add your seed RSS feeds to resources/rss.txt,
then enter the src folder and run:
python webCrawler.py
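
For orientation, here is a minimal sketch of the two steps an RSS-driven crawler typically performs: reading the seed list and pulling article links out of a feed. It is not the code in webCrawler.py. It assumes resources/rss.txt holds one feed URL per line (the format is not documented here), and the function names read_seed_feeds and article_links are hypothetical. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def read_seed_feeds(text):
    """Return the feed URLs found in seed-file text.

    Assumption: one URL per line; blank lines are skipped.
    """
    feeds = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            feeds.append(line)
    return feeds


def article_links(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        # findtext only looks at direct children, so the channel-level
        # <link> (the site URL) is not picked up here.
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links
```

In a real run, the text passed to read_seed_feeds would come from resources/rss.txt, and each returned URL would be downloaded before being passed to article_links.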
Dependencies:
- Misja's python-boilerpipe (follow the installation instructions)
  Installing it will also install JPype and chardet.
The scraper is based on Boilerpipe's HTML ArticleExtractor.
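
As a rough illustration of what the ArticleExtractor does, the sketch below calls it through python-boilerpipe, following the usage shown in that project's README. Whether webCrawler.py invokes it in exactly this way is an assumption, and the wrapper name extract_article_text is hypothetical. It requires python-boilerpipe and a working JPype install.

```python
def extract_article_text(url):
    """Download url and return the main article text.

    Uses Boilerpipe's ArticleExtractor via python-boilerpipe.
    """
    # Imported lazily so this module can still be loaded on a machine
    # where boilerpipe or JPype is not installed yet.
    from boilerpipe.extract import Extractor

    extractor = Extractor(extractor="ArticleExtractor", url=url)
    return extractor.getText()
```

Passing an article URL taken from a feed returns the extracted body text with navigation, ads, and other page boilerplate removed.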