Skip to content

Conversation

@further-reading
Copy link

Updating the scrapy price monitor example project as part of ecommerce content review.

Some notable changes from the original:

  • Only scrapes books.toscrape.com to prevent need to update site specific scrapers
  • For interacting with collections I used my collection helper to simplify it
  • I use an ItemLoader for handling the item to allow for addign standard processors
  • I use price-parser to handle extraction of the price from string

Old version is still present in the _scrapy_price_monitor_OLD directory.

@further-reading further-reading self-assigned this Nov 10, 2021

class CollectionHelper:
"""Adapter to make interacting with scraping collection easier"""
def __init__(self, proj_id, collection_name, api_key=None, create=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only problem I see with this is that it will force people to use our Collections, I'm not sure if it's needed, maybe we can just leave it out? It will make code easier to use by newbies. If someone new to Scrapy will start using it without apikey he will get errors, he might not know what to do to tix it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The project is intended to run on scrapy cloud, which will default to its enviornment apikey when api_key is None. The alternative would require setting up persistance to keep some sort of database which is far more complex.

for name, urls in products.items():
for url in urls:
if self.name in url:
now = datetime.now().strftime('%Y/%m/%d %H:%M:%S')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just use isoformat here

datetime.now().isoformat()
'2021-11-19T09:30:27.341132'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually why is there need to pass item in meta? just for showing how to pass in meta? product name is already available on product page, retailer is static and now can also be taken from product page.

Output is somewhat confusing

{'product_name': 'soumission', 'retailer': 'books.toscrape.com', 'when': '2021/11/19 09:23:19', 'url': 'https://books.toscrape.com/catalogue/soumission_998/index.html', 'name': 'Soumission', 'price': 50.1}

so there is product name twice

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible that the name on the website may not exactly be the name on the json. The json name seems to be more of a label to indicate what product the urls should be linking to.

AWS_ACCESS_KEY = os.getenv('$AWS_ACCESS_KEY')
AWS_SECRET_KEY = os.getenv('$AWS_SECRET_KEY')
EMAIL_ALERT_FROM = 'Price Monitor <[email protected]>'
EMAIL_ALERT_TO = ['[email protected]']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you tried if e-mail sending actually works?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I don't personally have an AWS email service set up for this so I wasn't able to test if this part still works.

@pawelmhm
Copy link
Member

In my opinion we could skip e-mail sending, as this is extra configuration step and it's good to keep things easy for newcomers. We can just print somethign in bin/monitor without sending e-mail and add comment asking users to configure somethign, they can add slack integration, e-mail whaterver.

@further-reading
Copy link
Author

further-reading commented Nov 19, 2021

In my opinion we could skip e-mail sending, as this is extra configuration step and it's good to keep things easy for newcomers. We can just print somethign in bin/monitor without sending e-mail and add comment asking users to configure somethign, they can add slack integration, e-mail whaterver.

I'm not certain if the intended audience of the post are total newcomers. One of the main selling points of the blog appears to be the email step so that the user gets passive notifications and I'm not sure if it's still a valuable sample if it defaults to a log you have to manually check every 30 minutes.

I wonder if maybe a disclaimer explaining that this notification script is using an external service that can be changed and the user could write an alternative would be better?

Update: I updated the alert step to seperate the ASN relevant code to make it easier for a user to update and/or replace this content for their preferred alert approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants