Handle failed runs
This guide shows you how to gracefully handle failed workflow runs. This involves best practices on resolving runtime errors, logging and manually retrying runs that have failed multiple times.
Why a workflow might fail
- A step in your workflow throws a database error that causes your code to fail at runtime.
- QStash calls your workflow URL, but the URL is not reachable - for example, because of a temporary outage of your deployment platform.
- A single step takes longer than your platform’s function execution limit.
QStash automatically retries a failed step three times with exponential backoff to allow temporary outages to resolve.
A failed step is automatically retried three times
If, even after all retries, your step does not succeed, we’ll move the failed run into your Dead Letter Queue (DLQ). That way, you can always manually retry it again and debug the issue.
Manually retry from the Dead-Letter-Queue (DLQ)
If you want to take an action (a cleanup/log), you can configure either failureFunction
or a failureUrl
on the serve
method of your workflow.
These options allow you to define custom logic or an external endpoint that will be triggered when a failure occurs.
Using a failureFunction
(recommended)
This feature is not yet available in workflow-py. See our Roadmap for feature parity plans and Changelog for updates.
The serve
function you use to create a workflow endpoint accepts a failureFunction
parameter - an easy way to gracefully handle errors (i.e. logging them to Sentry) or your custom handling logic.
Note: If you use a custom authorization method to secure your workflow endpoint, add authorization to the failureFunction
too. Otherwise, anyone could invoke your failure function. Read more here: securing your workflow endpoint.
Using a failureUrl
This feature is not yet available in workflow-py. See our Roadmap for feature parity plans and Changelog for updates.
The failureUrl
handles cases where the service hosting your workflow URL is unavailable. In this case, a workflow failure notification is sent to another reachable endpoint.
Debugging failed runs
In your DLQ, filter messages via the Workflow URL
or Workflow Run ID
to search for a particular failure. We include all request and response headers and bodies to simplify debugging failed runs.
For example, let’s debug the following failed run. Judging by the status code 404
, the Ngrok-Error-Code
header of ERR_NGROK_3200
and the returned HTML body, we know that the URL our workflow called does not exist.
Was this page helpful?