- support uploading WACZ to s3-compatible storage (via minio client)
- config storage loaded from env vars, enabled when WACZ output is used.
- support pinging either an http or a redis key-based webhook
- webhook: include 'completed' bool to indicate whether the crawl fully completed or was partial (e.g. interrupted via signal)
- consolidate redis init to redis.js
- support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@Crawlid)
- README: add docs for s3 storage, remove unused args
- update to pywb 2.6.2, browsertrix-behaviors 0.2.4
* fix to `limit` option, ensure limit check uses shared state
* bump version to 0.5.0-beta.1
will start a crawl with 3 workers, and show the screen of each of the workers from `http://localhost:9037/`.

### Uploading crawl output to S3-Compatible Storage

Browsertrix Crawler also includes support for uploading WACZ files to S3-compatible storage, and notifying a webhook when the upload succeeds.

(At this time, S3 upload is supported only when WACZ output is enabled, but WARC uploads may be added in the future.)

This feature can currently be enabled by setting environment variables (for security reasons, these settings are not passed in as part of the command-line or YAML config at this time).

<details>
<summary>Environment variables for S3-uploads include:</summary>

- `STORE_PATH` - optional path appended to endpoint, if provided
- `STORE_FILENAME` - filename or template for filename to put on S3
- `STORE_USER` - optional username to pass back as part of the webhook callback
- `CRAWL_ID` - unique crawl id (defaults to container hostname)
- `WEBHOOK_URL` - the URL of the webhook (can be http://, https:// or redis://)

</details>
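
Since `STORE_FILENAME` may be a template, and the change notes above mention interpolating the current timestamp (`@ts`) and the hostname (`@hostname`), the substitution could look roughly like the sketch below. This is an illustration only, not the crawler's actual code: `interpolateFilename`, `compactTimestamp`, the digits-only timestamp format, and the fallback template are all assumptions made for this example, and the user-provided id placeholder is left out.

```javascript
const os = require("os");

// Hypothetical helper (not the crawler's implementation): replaces every
// occurrence of each placeholder token in `vars` with its value.
function interpolateFilename(template, vars) {
  let result = template;
  for (const [token, value] of Object.entries(vars)) {
    result = result.split(token).join(String(value));
  }
  return result;
}

// Assumed timestamp format: the digits of an ISO 8601 UTC timestamp.
function compactTimestamp(date) {
  return date.toISOString().replace(/[^\d]/g, "");
}

// The fallback template is an assumption used only for this example.
const template = process.env.STORE_FILENAME || "crawl-@ts-@hostname.wacz";

const filename = interpolateFilename(template, {
  "@ts": compactTimestamp(new Date()),
  "@hostname": os.hostname(),
});

console.log(filename);
```

Using plain token replacement rather than a full template engine keeps the environment variable readable and avoids evaluating arbitrary expressions supplied through configuration.
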
#### Webhook Notification
The webhook URL can be either an HTTP(S) URL, which receives the JSON data as a POST request, or a Redis URL, which specifies a Redis list key to which the JSON data is pushed as a string.
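
As a rough sketch of the two delivery modes described above, the code below picks a mode from the URL scheme and builds the JSON string that would be sent. It makes no network calls. The function names `webhookKind` and `buildPayload` and the `user` field name are hypothetical; only the `completed` flag comes from the change notes, and the real payload may contain other fields.

```javascript
// Decide how a notification would be delivered, based only on the URL scheme.
function webhookKind(webhookUrl) {
  const { protocol } = new URL(webhookUrl);
  if (protocol === "http:" || protocol === "https:") {
    return "http-post"; // JSON body sent as a POST request
  }
  if (protocol === "redis:") {
    return "redis-list-push"; // JSON string pushed onto a Redis list
  }
  throw new Error(`Unsupported webhook scheme: ${protocol}`);
}

// Build the JSON string for the notification.
// `completed` is named in the change notes; `user` is a hypothetical field
// name standing in for the optional STORE_USER value.
function buildPayload({ completed, user }) {
  return JSON.stringify({
    completed: Boolean(completed),
    user: user || null,
  });
}

const url = process.env.WEBHOOK_URL || "https://example.com/hook";
console.log(webhookKind(url), buildPayload({ completed: true, user: process.env.STORE_USER }));
```

Because the payload is serialized to a string before the mode is chosen, the same data can be used either as an HTTP request body or as a Redis list entry.
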