January 20, 2023

Building Reliable Distributed Systems in Node.js

Loren Sands-Ramshaw

Loren Sands-Ramshaw

This post introduces the concept of durable execution, which is used by Stripe, Netflix, Coinbase, HashiCorp, and many others to solve a wide range of problems in distributed systems. Then it shows how simple it is to write durable code using Temporal's TypeScript/JavaScript SDK.

For an updated version of this post, see durable-execution.pdf

Distributed systems

When building a request-response monolith backed by a single database that supports transactions, we don’t have many distributed systems concerns. We can have simple failure modes and easily maintain accurate state:

  • If the client can’t reach the server, the client retries.
  • If the client reaches the server, but the server can’t reach the database, the server responds with an error, and the client retries.
  • If the server reaches the database, but the transaction fails, the server responds with an error, and the client retries.
  • If the transaction succeeds but the server goes down before responding to the client, the client retries until the server is back up, and the transaction fails the second time (assuming the transaction has some check–like an idempotency token–to tell whether the update has already been applied), and the server reports to the client that the action has already been performed.

As soon as we introduce a second place for state to live, whether that’s a service with its own database or an external API, handling failures and maintaining consistency (accuracy across all data stores) gets significantly more complex. For example, if our server has to charge a credit card and also update the database, we can no longer write simple code like:

function handleRequest() {
  paymentAPI.chargeCard()
  database.insertOrder()
  return 200
}

If the first step (charging the card) succeeds, but the second step (adding the order to the database) fails, then the system ends up in an inconsistent state; we charged their card, but there’s no record of it in our database. To try to maintain consistency, we might have the second step retry until we can reach the database. However, it’s also possible that the process running our code will fail, in which case we’ll have no knowledge that the first step took place. To fix this, we need to do three things:

  • Persist the order details
  • Persist which steps of the program we’ve completed
  • Run a worker process that checks the database for incomplete orders and continues with the next step

That, along with persisting retry state and adding timeouts for each step, is a lot of code to write, and it’s easy to miss certain edge cases or failure modes (see the full, scalable architecture). We could build things faster and more reliably if we didn’t have to write and debug all that code. And we don’t have to, because we can use durable execution.

Durable execution

Durable execution systems run our code in a way that persists each step the code takes. If the process or container running the code dies, the code automatically continues running in another process with all state intact, including call stack and local variables.

Durable execution ensures that the code is executed to completion, no matter how reliable the hardware or how long downstream services are offline. Retries and timeouts are performed automatically, and resources are freed up when the code isn’t doing anything (for example while waiting on a sleep(‘30 days’) statement).

Durable execution makes it trivial or unnecessary to implement distributed systems patterns like event-driven architecture, task queues, sagas, circuit breakers, and transactional outboxes. It’s programming on a higher level of abstraction, where you don’t have to be concerned about transient failures like server crashes or network issues. It opens up new possibilities like:

  • Storing state in local variables instead of a database, because local variables are automatically stored for us
  • Writing code that sleeps for a month, because we don’t need to be concerned about the process that started the sleep still being there next month, or resources being tied up for the duration
  • Functions that can run forever, and that we can interact with (send commands to or query data from)

Some examples of durable execution systems are Azure Durable Functions, Amazon SWF, Uber Cadence, Infinitic, and Temporal (where I work). At the risk of being less than perfectly objective, I think Temporal is the best of these options 😊.

Durable JavaScript

Now that we’ve gone over consistency in distributed systems and what durable execution is, let’s look at a practical example. I built this food delivery app to show what durable code looks like and what problems it solves:

temporal.menu

Durable Delivery app menu

Don’t blame me for the logo—that’s just what Stable Diffusion gives you when you ask it for a durable delivery app logo. 🤷😄

The app has four main pieces of functionality:

  • Create an order and charge the customer
  • Get order status
  • Mark an order picked up
  • Mark an order delivered

The order process, showing both the menu and driver sites

When we order an item from the menu, it appears in the delivery driver site (drive.temporal.menu), and the driver can mark the order as picked up, and then as delivered.

All of this functionality can be implemented in a single function of durable JavaScript or TypeScript. We’ll be using the latter—I recommend TypeScript, and our library is named the TypeScript SDK, but it’s published to npm as JavaScript and can be used in any Node.js project.

Create an order

Let’s take a look at the code for this app. We’ll see a few API routes but mostly go over each piece of the single durable function named order. If you’d like to run the app or view the code on your machine, this will download and set up the project:

npx @temporalio/create@latest --sample food-delivery

You can also read through the code on GitHub.

When the user clicks the order button, the React frontend calls the createOrder mutation defined by the tRPC backend. The createOrder API route handler creates the order by starting a durable order function. Durable functions—called Workflows—are started using a Client instance from @temporalio/client, which has been added to the tRPC context under ctx.temporal. The route handler receives a validated input (an object with a productId number and orderId string) and it calls ctx.temporal.workflow.start to start an order Workflow, providing input.productId as an argument:

apps/menu/pages/api/[trpc].ts

import { initTRPC } from '@trpc/server'
import { z } from 'zod'
import { taskQueue } from 'common'
import { Context } from 'common/trpc-context'
import { order } from 'workflows'

const t = initTRPC.context<Context>().create()

export const appRouter = t.router({
  createOrder: t.procedure
    .input(z.object({ productId: z.number(), orderId: z.string() }))
    .mutation(async ({ input, ctx }) => {
      await ctx.temporal.workflow.start(order, {
        workflowId: input.orderId,
        args: [input.productId],
        taskQueue,
      })

      return 'Order received and persisted!'
    }),

The order function starts out validating the input, setting up the initial state, and charging the customer:

packages/workflows/order.ts

type OrderState = 'Charging card' | 'Paid' | 'Picked up' | 'Delivered' | 'Refunding'

export async function order(productId: number): Promise<void> {
  const product = getProductById(productId)
  if (!product) {
    throw ApplicationFailure.create({ message: `Product ${productId} not found` })
  }

  let state: OrderState = 'Charging card'
  let deliveredAt: Date

  try {
    await chargeCustomer(product)
  } catch (err) {
    const message = `Failed to charge customer for ${product.name}. Error: ${errorMessage(err)}`
    await sendPushNotification(message)
    throw ApplicationFailure.create({ message })
  }

  state = 'Paid'

Any functions that might fail are automatically retried. In this case, chargeCustomer and sendPushNotification both talk to services that might be down at the moment or might return transient error messages like “Temporarily unavailable.” Temporal will automatically retry running these functions (by default indefinitely with exponential backoff, but that’s configurable). The functions can also throw non-retryable errors like “Card declined,” in which case they won’t be retried. Instead, the error will be thrown out of chargeCustomer(product) and caught by the catch block; the customer receives a notification that their payment method failed, and we throw an ApplicationFailure to fail the order Workflow.

Get order status

The next bit of code requires some background: Normal functions can’t run for a long time, because they’ll take up resources while they’re waiting for things to happen, and at some point they’ll die when we deploy new code and the old containers get shut down. Durable functions can run for an arbitrary length of time for two reasons:

  • They don’t take up resources when they’re waiting on something.
  • It doesn’t matter if the process running them gets shut down, because execution will seamlessly be continued by another process.

So although some durable functions run for a short period of time—like a successful money transfer function—some run longer—like our order function, which ends when the order is delivered, and a customer function that lasts for the lifetime of the customer.

It’s useful to be able to interact with long-running functions, so Temporal provides what we call Signals for sending data into the function and Queries for getting data out of the function. The driver site shows the status of each order by sending Queries to the order functions through this API route:

apps/menu/pages/api/[trpc].ts

  getOrderStatus: t.procedure
    .input(z.string())
    .query(({ input: orderId, ctx }) => ctx.temporal.workflow.getHandle(orderId).query(getStatusQuery)),

It gets a handle to the specific instance of the order function (called a Workflow Execution), sends the getStatusQuery, and returns the result. The getStatusQuery is defined in the order file and handled in the order function:

packages/workflows/order.ts

import { defineQuery, setHandler } from '@temporalio/workflow'

export const getStatusQuery = defineQuery<OrderStatus>('getStatus')

export async function order(productId: number): Promise<void> {
  let state: OrderState = 'Charging card'
  let deliveredAt: Date

  // …

  setHandler(getStatusQuery, () => {
    return { state, deliveredAt, productId }
  })

When the order function receives the getStatusQuery, the function passed to setHandler is called, which returns the values of local variables. After the call to chargeCustomer succeeds, the state is changed to ’Paid’, and the driver site, which has been polling getStatusQuery, gets the updated state. It displays the “Pick up” button.

Picking up an order

When the driver taps the button to mark the order as picked up, the site sends a pickUp mutation to the API server, which sends a pickedUpSignal to the order function:

apps/driver/pages/api/[trpc].ts

  pickUp: t.procedure
    .input(z.string())
    .mutation(async ({ input: orderId, ctx }) => 
      ctx.temporal.workflow.getHandle(orderId).signal(pickedUpSignal)
    ),

The order function handles the Signal by updating the state:

packages/workflows/order.ts

export const pickedUpSignal = defineSignal('pickedUp')

export async function order(productId: number): Promise<void> {
  // …

  setHandler(pickedUpSignal, () => {
    if (state === 'Paid') {
      state = 'Picked up'
    }
  })

Meanwhile, further down in the function, after the customer was charged, the function has been waiting for the pickup to happen:

packages/workflows/order.ts

import { condition } from '@temporalio/workflow'

export async function order(productId: number): Promise<void> {
  // …

  try {
    await chargeCustomer(product)
  } catch (err) {
    // …
  }

  state = 'Paid'

  const notPickedUpInTime = !(await condition(() => state === 'Picked up', '1 min'))
  if (notPickedUpInTime) {
    state = 'Refunding'
    await refundAndNotify(
      product,
      '⚠️ No drivers were available to pick up your order. Your payment has been refunded.'
    )
    throw ApplicationFailure.create({ message: 'Not picked up in time' })
  }

await condition(() => state === 'Picked up', '1 min') waits for up to 1 minute for the state to change to Picked up. If a minute goes by without it changing, it returns false, and we refund the customer. (Either we have very high standards for the speed of our chefs and delivery drivers, or we want the users of a demo app to be able to see all the failure modes 😄.)

Delivery

Similarly, there’s a deliveredSignal sent by the “Deliver” button, and if the driver doesn’t complete delivery within a minute of pickup, the customer is refunded.

packages/workflows/order.ts

export const deliveredSignal = defineSignal('delivered')

export async function order(productId: number): Promise<void> {
  setHandler(deliveredSignal, () => {
    if (state === 'Picked up') {
      state = 'Delivered'
      deliveredAt = new Date()
    }
  })

  // …

  await sendPushNotification('🚗 Order picked up')

  const notDeliveredInTime = !(await condition(() => state === 'Delivered', '1 min'))
  if (notDeliveredInTime) {
    state = 'Refunding'
    await refundAndNotify(product, '⚠️ Your driver was unable to deliver your order. Your payment has been refunded.')
    throw ApplicationFailure.create({ message: 'Not delivered in time' })
  }

  await sendPushNotification('✅ Order delivered!')

If delivery was successful, the function waits for a minute for the customer to eat their meal and asks them to rate their experience.

  await sleep('1 min') // this could also be hours or even months

  await sendPushNotification(`✍️ Rate your meal. How was the ${product.name.toLowerCase()}?`)
}

After the final push notification, the order function’s execution ends, and the Workflow Execution completes successfully. Even though the function has completed, we can still send Queries, since Temporal has the final state of the function saved. And we can test that by refreshing the page a minute after an order has been delivered: the getStatusQuery still works and “Delivered” is shown as the status:

Poke order with Status: Delivered

Summary

We’ve seen how a multi-step order flow can be implemented with a single durable function. The function is guaranteed to complete in the presence of failures, including:

  • Temporary issues with the network, data stores, or downstream services
  • The process running the function failing
  • The underlying Temporal services or database going down

This addressed a number of distributed systems concerns for us, and meant that:

  • We could use local variables instead of saving state to a database.
  • We didn’t need to set timers in a database for application logic like canceling an order that takes too long or for the built-in functionality of retrying and timing out transient functions like chargeCustomer.
  • We didn’t need to set up a job queue that workers polled, either for progressing to the next step or picking up unfinished tasks that were dropped by failed processes.

In the next post, we look at more of the delivery app’s code and learn how Temporal is able to provide us with durable execution.

If you have any questions, I would be happy to help! Temporal’s mission is helping developers, and I also personally find joy in it 🤗. I’m @lorendsr on Twitter, I answer (and upvote 😄) any StackOverflow questions tagged with temporal-typescript, and am @Loren on the community Slack 💃.

Learn more

To learn more, I recommend these resources:

More blog posts about our TypeScript SDK:

💬 Discuss on Hacker News, Reddit, Twitter, or LinkedIn.

Thanks to Jessica West, Brian Hogan, Amelia Mango, and Jim Walker for reading drafts of this post.