If a task fails for a transient reason (for example, an internet blackout or a temporary DNS failure), the worker should send the failed message to a dead-letter queue (DLQ) instead of marking it as complete. The server should then provide APIs to: list the contents of the DLQ, resume all tasks from the DLQ, resume a particular task from the DLQ, and drop tasks from the DLQ.
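The four operations above can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the names `FailedTask`, `DeadLetterQueue`, `list_tasks`, `resume_all`, `resume`, and `drop` are hypothetical, the `requeue` callback stands in for whatever hands a task back to the worker queue, and a real DLQ would need persistent storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FailedTask:
    """A task that failed for a transient reason (hypothetical shape)."""
    task_id: str
    payload: dict
    reason: str


class DeadLetterQueue:
    """In-memory stand-in for the DLQ; a real one would be persistent."""

    def __init__(self, requeue: Callable[[FailedTask], None]) -> None:
        self._tasks: Dict[str, FailedTask] = {}
        # Assumed hook that puts a task back on the normal work queue.
        self._requeue = requeue

    def add(self, task: FailedTask) -> None:
        """Called by the worker instead of marking the task complete."""
        self._tasks[task.task_id] = task

    def list_tasks(self) -> List[FailedTask]:
        """List the contents of the DLQ in insertion order."""
        return list(self._tasks.values())

    def resume_all(self) -> int:
        """Requeue every task in the DLQ; returns how many were resumed."""
        tasks = list(self._tasks.values())
        for task in tasks:
            self._requeue(task)
        self._tasks.clear()
        return len(tasks)

    def resume(self, task_id: str) -> bool:
        """Requeue one task; returns False if the id is not in the DLQ."""
        task = self._tasks.pop(task_id, None)
        if task is None:
            return False
        self._requeue(task)
        return True

    def drop(self, task_id: Optional[str] = None) -> int:
        """Discard one task, or all tasks when no id is given."""
        if task_id is None:
            count = len(self._tasks)
            self._tasks.clear()
            return count
        return 1 if self._tasks.pop(task_id, None) is not None else 0
```

Each method would map naturally onto one server endpoint, but the routes and transport are left open here because the text does not specify them.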
The DLQ is meant only for failures that qualify for a future retry. It should NOT be used for failures that cannot be fixed by retrying, such as:
- the image not being available
- the request being unauthorized or forbidden
Effectively, only purely connection-related issues qualify for the DLQ right now.
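The rule above (connection problems go to the DLQ, permanent failures do not) could look roughly like the following check in the worker. The exception classes `ImageNotAvailableError` and `AccessDeniedError` and the function `qualifies_for_dlq` are hypothetical names for illustration; the retryable set uses standard Python exceptions. Treating every `socket.gaierror` as transient is a simplification, since some DNS errors are permanent.

```python
import socket


class ImageNotAvailableError(Exception):
    """Hypothetical: the requested image does not exist."""


class AccessDeniedError(Exception):
    """Hypothetical: the request was not authorized or was forbidden."""


# Connection-related errors that may succeed on a later attempt.
# Assumption: all DNS resolution errors are treated as temporary.
RETRYABLE_ERRORS = (ConnectionError, TimeoutError, socket.gaierror)

# Failures that a retry cannot fix; these must never reach the DLQ.
PERMANENT_ERRORS = (ImageNotAvailableError, AccessDeniedError)


def qualifies_for_dlq(error: BaseException) -> bool:
    """Return True only for purely connection-related failures."""
    if isinstance(error, PERMANENT_ERRORS):
        return False
    return isinstance(error, RETRYABLE_ERRORS)
```

Anything not explicitly listed as retryable is rejected, so unknown error types stay out of the DLQ by default.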