November 08, 2022

Failure Handling in Practice

Fitz

Imagine you're at a café trying to order a sandwich. Most of the time everything goes fine and you get a delicious meal. And sometimes, things don't go so well. Maybe they don't make your favorite sandwich, or maybe they've run out of cheese, or maybe you've shown up right at the beginning of the lunch rush.

Some of these failures are easier to deal with than others. Some of them you have to deal with (change your order, wait for a while, or decide it's too long and leave). Others the restaurant will need to deal with (bring in more line cooks, grab the backup cheese from the fridge, or go buy more).

It's no different for software, and with Temporal, we aim to make handling many types of failure easy.

My colleague Dominik Tornow recently outlined the theoretical and formal basis of failures and what to do about them. Fundamentally, he points out, failure has two dimensions: where it happens and when. Both dimensions have an element of what: retry the failure or not? Compensate or not? But also of how: how do we handle failures once we know how to classify them?

Let's consider a process for handling an online order with three major steps:

  1. Validate inventory (and possibly reserve the item(s))
  2. Charge customer
  3. Fulfill order

Dominik focused on a couple of potential failures that might crop up in step two (charging the customer's credit card). I'll focus on the same, but in the context of a Temporal Workflow. (For the rest of this post, I'll assume you have at least a passing familiarity with Temporal's core concepts.)

With Temporal, you'd implement this three-step process as a Temporal Workflow that executes three separate Activities. Here's an example of this Workflow in Go:

package app

import (
	"time"
	"go.temporal.io/sdk/workflow"
)

type OrderInfo struct { /* ... */ }

type Activities struct { /* ... */ }

func Workflow(ctx workflow.Context, order OrderInfo) error {
	activityoptions := workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Second,
	}
	ctx = workflow.WithActivityOptions(ctx, activityoptions)

	// Activity methods hang off this struct; a nil pointer is enough to
	// reference them in ExecuteActivity (the worker supplies the real
	// receiver at execution time).
	var a *Activities

	err := workflow.ExecuteActivity(ctx, a.CheckInventory, order).Get(ctx, nil)
	if err != nil {
		return err
	}

	err = workflow.ExecuteActivity(ctx, a.Charge, order).Get(ctx, nil)
	if err != nil {
		return err
	}

	err = workflow.ExecuteActivity(ctx, a.FulfillOrder, order).Get(ctx, nil)
	if err != nil {
		return err
	}

	return nil
}

That is, one activity to represent each of the major steps. But, of course, failures occur! What happens if one of these stages fails? Is that StartToCloseTimeout hanging out at the top of the function implying a retry after five seconds? Surely that doesn't cover everything that could possibly go wrong... Whatever will we do?!?

Where: Platform vs. Application

In the "where" dimension, what Dominik termed spatial, the Charge activity has two potential errors: InsufficientFundsError and ConnectionError.

One way to think about how to classify these is to ask yourself: Can the error, in most cases, be handled entirely by something low-level? (Think: TCP retransmission.) If yes, then it's most likely a platform-level failure.

From another perspective, you could ask yourself: Does handling the error require scenario or domain-specific logic? (Think: changing your sandwich order; the restaurant, aka platform, doesn't know what you want.) If yes, then it's most likely an application-level failure.

For the two example errors, nothing about the request itself—not which account we're charging, or how much money to charge—would stop ConnectionError from happening. That makes it platform-level. In Temporal, you can configure the platform to handle errors like this automatically:

	// temporal.RetryPolicy comes from the "go.temporal.io/sdk/temporal" package.
	retrypolicy := &temporal.RetryPolicy{
		InitialInterval:        time.Second,
		BackoffCoefficient:     1.0,
		NonRetryableErrorTypes: []string{"InsufficientFundsError"},
	}

	activityoptions := workflow.ActivityOptions{
		RetryPolicy: retrypolicy,
	}
	ctx = workflow.WithActivityOptions(ctx, activityoptions)
	err = workflow.ExecuteActivity(ctx, a.Charge, order).Get(ctx, nil)

But don't do this! A flat, no-backoff retry policy like this isn't great. I'll be working up to an exponential backoff in the next section.

After the first failure, the InitialInterval tells Temporal to wait one second before retrying. If the Charge activity fails again, the wait time will be the previous interval multiplied by the BackoffCoefficient for every subsequent attempt. As such, given retry attempt number N, we'll calculate the delay to be:

	delay(N) = InitialInterval × BackoffCoefficient^(N−1)

(That is, InitialInterval times BackoffCoefficient raised to the power of N − 1, so the first retry waits exactly InitialInterval.)

So, with the initial interval set to one second and the coefficient set to 1.0, this tells Temporal to retry a failed activity every second. This works great for most cases of ConnectionError. If there's a network blip, chances are good that the blip will be done after a single second.

I've listed InsufficientFundsError in NonRetryableErrorTypes because it's a totally different story. What would happen if that error were retried after a second? What are the chances the customer's account actually has enough money now? Pretty low, unless the activity execution caught them on payday at the exact moment the check was being deposited.

That makes InsufficientFundsError an application-level failure. Retrying it almost certainly won't make things better. You'll have to resort to business logic to handle this error, likely backtracking to display an error message to the customer and, if the item(s) were reserved in inventory, to release them back to the available pool.

Additionally, because the strategy for handling InsufficientFundsError involves undoing some work, you'd say that it's Backward Recovery.
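
In Workflow code, that backward recovery could look something like the following sketch, sitting right after the Charge call in the Workflow above. (ReleaseInventory is a hypothetical compensation Activity here, and the snippet assumes the "errors" and "go.temporal.io/sdk/temporal" imports.)

	err = workflow.ExecuteActivity(ctx, a.Charge, order).Get(ctx, nil)

	// Activity failures come back to the Workflow wrapped; errors.As unwraps
	// down to the temporal.ApplicationError, whose Type() matches the name we
	// declared non-retryable.
	var appErr *temporal.ApplicationError
	if err != nil && errors.As(err, &appErr) && appErr.Type() == "InsufficientFundsError" {
		// Backward recovery: release any reserved items, then report the failure.
		_ = workflow.ExecuteActivity(ctx, a.ReleaseInventory, order).Get(ctx, nil)
		return err
	}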

The handling strategy I've used here for ConnectionError is, well, mostly Forward Recovery, because if a retry succeeds, we can just continue on without undoing anything. But it's also ambiguous: whether a retry will ever succeed depends on how the failure behaves over time, which brings us to the second dimension.

When: temporal (but not Temporal) failures

Does retrying after a ConnectionError every second for eternity make sense? Like all things, it depends. (Spoiler alert: it probably doesn't make sense.)

Dominik explained three kinds of failures along the when dimension (that is, how their characteristics change over time): Transient, Intermittent, and Permanent.

Transient errors are the easiest to deal with because they tend to go away quickly. Momentary network congestion, a router restarting, a new server version rolling out mid-request... These will resolve themselves, and so the Retry Policy I wrote above, with no backoff, will be fine.

But that 1.0 coefficient won't carry you far, because you'll end up in the realm of Intermittent failures pretty quickly. Like transient errors, intermittent errors will likely resolve themselves, but it's going to take a moment; immediately retrying is probably going to give you the same error. Think rate limits, power outages, unplugged network cables. The fix? Actually back off on those retries, with ever-increasing waits:

	retrypolicy := &temporal.RetryPolicy{
		InitialInterval:        time.Second,
		BackoffCoefficient:     2.0,
		NonRetryableErrorTypes: []string{"InsufficientFundsError"},
	}

Recall the retry formula above. By setting BackoffCoefficient to something greater than 1, you've now got an exponential backoff on the retries with no additional code.
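
To make that concrete, here's a small standalone program that prints the delay sequence this policy produces. (This isn't Temporal code; retryDelay is my own helper applying the formula above, for illustration only.)

package main

import (
	"fmt"
	"math"
	"time"
)

// retryDelay applies the formula above: InitialInterval × BackoffCoefficient^(N−1)
// for retry attempt N, so the first retry waits exactly InitialInterval.
func retryDelay(initial time.Duration, coefficient float64, attempt int) time.Duration {
	return time.Duration(float64(initial) * math.Pow(coefficient, float64(attempt-1)))
}

func main() {
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, retryDelay(time.Second, 2.0, attempt))
	}
	// Output:
	// attempt 1: wait 1s
	// attempt 2: wait 2s
	// attempt 3: wait 4s
	// attempt 4: wait 8s
	// attempt 5: wait 16s
}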

Transient and Intermittent failures are closely related. Generally speaking, you can expect that a retry will eventually succeed. When it does, this would be a Forward Recovery.

The line between those and a Permanent failure is often hard to find. In some ways, the application-level error, InsufficientFundsError, is a permanent error because without intervention – someone depositing money in the account or switching to a different card – you can't expect this error will resolve itself.

Likewise, ConnectionError can be permanent. In some cases, you might be able to detect those permanent sub-errors explicitly (an authentication failure, a 404). If that's the case, just add them to the list of non-retryable types.

But trying to exhaustively list all known permanent, non-retryable errors would be exhausting. Instead, you have two mitigation strategies you can use.

First, set a couple of maximums on the retry policy, so that no matter what, the workflow won't just sit there spinning on something that'll never work. (Depending on the API, this could be expensive: retrying a call to Google's Nearby Places API every second for an hour, for example, is 3,600 requests and would cost $115.20.)

	retrypolicy := &temporal.RetryPolicy{
		InitialInterval:        time.Second,
		BackoffCoefficient:     2.0,
		MaximumInterval:        time.Second * 120,
		MaximumAttempts:        50,
		NonRetryableErrorTypes: []string{"InsufficientFundsError"},
	}
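
With these settings, the wait doubles from one second up to the two-minute MaximumInterval cap, and once MaximumAttempts (50) is exhausted, Temporal stops retrying altogether and hands the final error back to the Workflow.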

Then, consider modifying the activity itself to use NewNonRetryableApplicationError from the Temporal SDK. Roughly:

import (
	"context"
	"errors"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func Workflow(ctx workflow.Context, order OrderInfo) error {
	/* ... */

	retrypolicy := &temporal.RetryPolicy{
		// ...
		NonRetryableErrorTypes: []string{"InsufficientFundsError", "AuthFailure"},
	}

	/* ... */
}

func Charge(ctx context.Context, order OrderInfo) error {
	/* ... attempt to charge the credit card, yielding err ... */

	var connErr *ConnectionError
	if err != nil && errors.As(err, &connErr) {
		// Without manual intervention, a 403 Forbidden is going to keep happening. So
		// instead of returning the plain ConnectionError (which by default would be retried),
		// we'll wrap it in a NonRetryable error to prevent that from needlessly happening.
		if connErr.code == 403 {
			return temporal.NewNonRetryableApplicationError(
				connErr.message,
				"ConnectionError",
				err,
			)
		}
	}

	// May be nil or not. If not, let the caller (i.e., the Workflow) decide whether
	// it's retryable by listing non-retryable types in the Retry Policy and handling appropriately.
	return err
}

From there, if the workflow receives one of those non-retryable errors or exceeds the maximum retry attempts, you'll probably want to "recover backward" and undo anything critical that the process had committed (such as reserving inventory that the customer can't actually pay for).

In pseudocode, such a rewrite of the workflow might look like the following:

const (
	STAGE_NONE      = -1 // nothing to undo; just notify the customer
	STAGE_INVENTORY = 0
	STAGE_CHARGE    = 1
	STAGE_FULFILL   = 2
)

err := ExecuteActivity(CheckInventory)
if err != nil {
	recoverOrder(STAGE_INVENTORY)
	return err
}

err = ExecuteActivity(Charge)
if err != nil {
	recoverOrder(STAGE_CHARGE)
	return err
}

err = ExecuteActivity(FulfillOrder)
if err != nil {
	recoverOrder(STAGE_FULFILL)
	return err
}

// recoverOrder undoes the failed stage and, via fallthrough, every stage that
// completed before it. (Named recoverOrder to avoid shadowing Go's built-in recover.)
func recoverOrder(stage int) {
	switch stage {
	default:
		// For unknown stages of the process, attempt to undo the whole thing.
		fallthrough
	case STAGE_FULFILL:
		// Kick off undoing the entire order.
		ExecuteActivity(ReturnToWarehouse)
		fallthrough
	case STAGE_CHARGE:
		// Given a transaction ID, ensure the user isn't charged.
		ExecuteActivity(ReverseCharge)
		fallthrough
	case STAGE_INVENTORY:
		// Release any reserved items back to the available pool.
		ExecuteActivity(ReleaseInventory)
		fallthrough
	case STAGE_NONE:
		ExecuteActivity(NotifyCustomerFailure)
	}
}
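
The fallthrough chain does the heavy lifting in recoverOrder: whichever stage fails, its own compensation runs first, then every earlier stage's, always ending with the customer notification.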

Wrapping up

Errors and failures don't have to be scary. Thinking about where the error occurs and how long it might last will give you a strong framework for deciding what to do about it—either failing forward (retry until success) or backtracking (cease trying, undo actions as necessary, and report failure).

Temporal helps you deal easily and seamlessly with a wide range of failure modes. You'll still have to figure out how to nicely tell the user your API key expired, but at least now you'll know you handled it the best you could.

If you want to see the complete code used in this post, including some simulations of the various failure scenarios, it's up in this GitHub repo.