Save crawl file in a cloud bucket #396

antoineeripret · 2024-12-05T13:41:40Z

It's me again with a new idea.

Context

I'm trying to build crawl logic (including your library) in Airflow. It works great; **the issue is the amount of data the library creates in local storage.** I'm trying to rely on local storage as little as possible, because the amount of data could be quite large in the use case I'm working on at the moment.

Idea

What I'm attempting to implement is the following:

1. Launch the crawl
2. Every 1,000 URLs (for instance), stop the crawl, upload the JL file to a cloud bucket, and empty it
3. Resume the crawl
4. ...

By doing so, the file would stay small, while the output of advertools wouldn't be affected. The issue is that I'm not sure whether there is a way of "stopping" the crawl without losing the queue.

Do you happen to have an idea?

Thank you!

eliasdabbas · 2024-12-05T17:52:28Z

Maintainer

Thanks again @antoineeripret !

1. Cloud storage: this is already supported by Scrapy, and you should be able to use it when crawling with advertools. Under custom_settings you can use AWS S3 and GCS, and these have dedicated settings for them. Please check out the Scrapy settings page for the details.
2. To start, pause, and resume crawling you can use the JOBDIR custom setting as well. It keeps track of the pages that were already crawled and avoids crawling them again. All you have to do is set it once, and then crawl again using the same setting. There is a tip here if you want. If the first option works, I don't think you'll need to use the second one.
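
For readers who want to try both options, here is a minimal sketch of how the settings might be assembled. It is not taken from the thread: the helper names `build_custom_settings` and `run_crawl`, the bucket name, and the directory name are hypothetical. `JOBDIR`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `FEED_EXPORT_BATCH_ITEM_COUNT` are Scrapy settings. Passing an `s3://` URI as the advertools output file is an assumption here, and S3 export in Scrapy needs an AWS client library such as `botocore` installed.

```python
from typing import Optional


def build_custom_settings(
    jobdir: str,
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
    batch_item_count: Optional[int] = None,
) -> dict:
    """Assemble Scrapy settings to pass as `custom_settings` (hypothetical helper)."""
    # JOBDIR is where Scrapy persists crawl state so a crawl can be resumed.
    settings = {"JOBDIR": jobdir}

    # Credentials are optional: they can also come from the environment.
    if aws_access_key_id and aws_secret_access_key:
        settings["AWS_ACCESS_KEY_ID"] = aws_access_key_id
        settings["AWS_SECRET_ACCESS_KEY"] = aws_secret_access_key

    # Optional: split the feed into several files of N items each.
    # Scrapy then expects %(batch_id)d or %(batch_time)s in the feed URI.
    if batch_item_count:
        settings["FEED_EXPORT_BATCH_ITEM_COUNT"] = batch_item_count

    return settings


def run_crawl(start_url: str, output_uri: str, settings: dict) -> None:
    """Run the crawl; advertools is imported lazily so the helper above
    can be used and tested without it installed."""
    import advertools as adv

    # Assumption: the output path may be an s3:// URI ending in .jl
    adv.crawl(start_url, output_uri, follow_links=True, custom_settings=settings)


# Example usage (hypothetical bucket and paths), left commented out
# because it starts a real crawl:
# settings = build_custom_settings("crawl_job_state", batch_item_count=1000)
# run_crawl(
#     "https://example.com",
#     "s3://my-hypothetical-bucket/crawls/example-%(batch_id)d.jl",
#     settings,
# )
```

Running the same call again with the same `JOBDIR` value should resume from the saved state rather than start over, as described in the second option. The batch setting is only one possible way to approximate the "every 1,000 URLs" idea from the question, and it is worth checking against the Scrapy feed export documentation for the installed version.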

Let me know if that works, and how it goes.

Here's another interesting case study :)

5 replies

antoineeripret Dec 6, 2024
Author

Hey @eliasdabbas,

Awesome, great news! I'll have a look next week and keep you posted!

I didn't forget the article I need to write btw, but I want to finalize what I'm doing with your library first to cover all possible cases :)

Thank you!

eliasdabbas Dec 6, 2024
Maintainer

Great!

Let me know how it goes (on all fronts :)

Have a great weekend!

antoineeripret Dec 9, 2024
Author

Sure, thank you for your help!

antoineeripret Dec 13, 2024
Author

Hey @eliasdabbas,

I've just tested the streaming to S3, and it works like a charm. Ridiculous how easy it is to set up.

Thanks again for your help!

eliasdabbas Dec 13, 2024
Maintainer

Wonderful to hear this!

Thanks for letting me know.
