March 16, 2023

How Durable Execution Works

Loren Sands-Ramshaw

Loren Sands-Ramshaw

This is part 2 of the series, "Building Reliable Distributed Systems in Node.js." In part 1, we went over what durable execution is, its benefits, and what a durable function looks like. In this post, we'll look at how Temporal can provide durable execution.

A function that can't fail, can last forever, and doesn't need to store data in a database? Sounds like magic. There must be a gotcha—like only a small subset of the language can be used, or it only works on specialized hardware. But in fact, it's just JavaScript—you can use the whole language, and it runs on any server that can run Node.js.

So how does this all work? You can take a look at the How Temporal Works diagram, which explains the process with Go code. In this post, we'll go through the process with the TypeScript code from the previous post in the series.

Client ↔️ Server ↔️ Worker

To start out, a Temporal application has three parts: the Client, the Server, and the Worker.

client-server-worker

The Client and Worker both connect to the Server, which has a database and maintains state. The Client says things like "start the order() durable function," "send it a delivered Signal," and "terminate the function." The Worker is a long-running Node.js process that has our code and polls the Server for tasks. Tasks look like "run the order() durable function" or "run the normal sendPushNotification() function." After the Worker runs the code, it reports the result back to the Server.

In our delivery app, we create the Client in temporal-client.ts and use it in our Next.js serverless functions:

We create the Worker in apps/worker/src /worker.ts:

import { NativeConnection, Worker } from '@temporalio/worker'
import * as activities from 'activities'
import { taskQueue } from 'common'
import { namespace, getConnectionOptions } from 'common/lib/temporal-connection'

async function run() {
 const connection = await NativeConnection.connect(getConnectionOptions())
 const worker = await Worker.create({
   workflowsPath: require.resolve('../../../packages/workflows/'),
   activities,
   connection,
   namespace,
   taskQueue,
 })

 await worker.run()
}

run().catch((err) => {
 console.error(err)
 process.exit(1)
})

We pass the Worker our code:

And set up Render to automatically build and deploy on pushes to main:

render

Running each part

In production, our web apps and their serverless functions are deployed to Vercel, our long-running Worker process is deployed to Render, and they both talk to a Server instance hosted by Temporal Cloud. The Server is an open source cluster of services which work with a database (SQL or Cassandra) and ElasticSearch. You can also host all that it yourself, or you can save a lot of time and get peace of mind with higher reliability and scale by paying the experts to host it 😄.

In development, we can run all three parts locally. First, we install the Temporal CLI, which has a development version of the Server:

  • Homebrew: brew install temporal
  • cURL: curl -sSf https://temporal.download/cli.sh | sh
  • Manual: Download and extract the latest release and then add it to your PATH.

We start the Server with:

temporal server start-dev

In another terminal, with Node v16 or higher, run:

npx @temporalio/create@latest ./temporal-delivery --sample food-delivery
cd ./temporal-delivery
npm run dev

The dev script runs the two Next.js web apps and the Worker. The menu is running at localhost:3000, where we can click "Order", and the driver portal is at localhost:3001, where we can mark the item we ordered as picked up and delivered. Once we've done that, we can see the "Delivered" status of the order in both of the web apps. We can also see the status of the corresponding Workflow Execution in the Server's Web UI at localhost:8233:

workflow-list

We can see its status is Completed, and when we click on it, we see the Event History—the list of events that took place during the execution of the order() function:

event-history

The first event is always WorkflowExecutionStarted, which contains the type of Workflow being started—in this case, an order Workflow. We'll look more at events in the next section.

In the Queries tab, we can select the getStatus Query, which (assuming our Worker is still running) will send a Query to the order function, which responds that the order was delivered, the time of delivery, and which item was delivered:

query

Sequence of events

Now let's look at what happened behind the scenes during our order.

Start order

When we clicked the "Order" button, the API handler used a Temporal Client to send a start command to the Server:

apps/menu/pages/api/[trpc].ts

start

The Server saves the WorkflowExecutionStarted event to the Event History and returns. It also creates a WorkflowTaskScheduled event, which results in a "Workflow Task" (an instruction to run the order() function) getting added to the task queue, which the Worker is polling on.

workflow-task

The Worker receives the task, the Server adds the WorkflowTaskStarted Event, and the Worker performs the task—in this case, calling order(3). The order function runs until it hits this line:

const { chargeCustomer, refundOrder, sendPushNotification } = proxyActivities<typeof activities>({ ... })

export async function order(productId: number): Promise<void> {
 ...

 await chargeCustomer(product)

Call Activity

When the function calls the chargeCustomer Activity, the Worker tells the Server:

call-activity

The Server adds the WorkflowTaskCompleted event (the Workflow Execution didn't complete—just the initial task of "run the order() function and see what happens") and an ActivityTaskScheduled event with the Activity type and arguments:

activity-task-scheduled

Then the Server adds an Activity Task (an instruction to run an Activity function) to the task queue, which the Worker picks up:

get-activity-task

In development, we're only running a single Worker process, so it's getting all the tasks, but in production, we'll have enough Workers to handle our load, and any of them can pick up the Activity Task—not just the one that ran the order function.

The Worker follows the Activity Task instructions, running the chargeCustomer() function:

packages/activities/index.ts

Which calls paymentService.charge():

packages/activities/services.ts

If the function throws an error, the Worker reports it back to the Server:

activity-failed

And the Server schedules a retry. The default initial interval is 1 second, so in 1 second, the Activity Task will be added back to the queue for a Worker to pick up.

If the function completes successfully, the Worker reports success (and the return value, but in this case there is none) back to the Server:

activity-completed

The Server adds the ActivityTaskStarted and ActivityTaskCompleted events. Now that the Activity is completed, the order() function can continue executing, so the Server adds another WorkflowTaskScheduled event. It also adds a corresponding Workflow Task to the queue, which the Worker picks up (at which point the Server adds another WorkflowTaskStarted event).

Second Workflow Task

If the Worker still has the execution context of the order function, it can just resolve the chargeCustomer(product) Promise, and the function will continue executing. If the Worker doesn't have the execution context (because it was evicted from cache in order to make room for another Workflow—see WorkerOptions.maxCachedWorkflows—or the process crashed or restarted), then the Worker fetches the Event History from the Server, creates a new isolate, and calls the function again:

get-history

This time, when the function hits the await chargeCustomer(product) line, the Worker knows from event 7, ActivityTaskCompleted, that chargeCustomer has already been run, so instead of sending a "Call Activity" command to the Server, it immediately resolves the Promise. The function continues running until the next await:

packages/workflows/order.ts

const notPickedUpInTime = !(await condition(() => state === 'Picked up', '1 min'))

condition() will wait until either the state becomes Picked up or 1 minute has passed. When it's called, the Worker tells the Server to set a timer:

set-timer

The Server adds the WorkflowTaskCompleted and TimerStarted events to the Event History, and sets a timer in the database. When the timer goes off, a TimerFired event will be added along with a WorkflowTaskScheduled and Workflow Task on the queue telling the Worker the 1 minute is up, at which point the Worker will know to resolve the condition() Promise.

Send Signal

But in our case, that didn't happen. Instead, before a minute was up, we clicked the "Pick up" button in the driver portal, which sent a pickedUp Signal to the Workflow:

send-signal

The Server then added two events: a WorkflowExecutionSignaled event with the Signal info, and another WorkflowTaskScheduled event. Then a Workflow Task was added to the queue with the Signal info, which was picked up by the Worker:

get-signal

The Worker then runs the pickedUp Signal handler:

setHandler(pickedUpSignal, () => {
 if (state === 'Paid') {
   state = 'Picked up'
 }
})

packages/workflows/order.ts

The handler changes the state to Picked up, and after Signal handlers have been called, the Worker runs all the condition() functions. Now () => state === 'Picked up' will return true, so the Worker will resolve the condition() Promise and continue executing, to see what the function does next, which will determine the next command(s) it sends to the Server.

Event History

All together, the part of the Event History we covered was:

just-history

The Event History is the core of what enables durable execution: a log of everything important that the Workflow does and gets sent. It allows us to Ctrl-C our Worker process, start it again, open up the UI, select our completed order Workflow, go to the Queries tab of the UI, and send the getStatus Query, which the Server will put on a queue for the Worker to pick up, which won't have the Workflow in cache, so it will fetch the Event History from the Server, and then call the order() function, immediately resolving any async functions with the original result from History, and then calling the getStatus handler function:

setHandler(getStatusQuery, () => {
 return { state, deliveredAt, productId }
})

packages/workflows/order.ts

Since the whole function and all the Signal handlers have been run, the { state, deliveredAt, productId } variables will all have their final values from when the function was originally executed, and they'll be returned to the Server, which returns them to the Client, which returns them to the UI to display on our screen:

query

Summary

We examined how durable code works under the hood—how we can:

  • Write functions that can't fail to complete executing (since when the Worker process dies, the next Worker to pick up the task will get the function's Event History and use the events to re-run the code until it's in the same state).
  • Retry the functions that might have transient failures (if the chargeCustomer Activity can't reach the payment service, the Server automatically schedules another Activity Task).

The best part is that these failures are transparent to the application developer—we just write our Workflow and Activity code, and Temporal handles reliably executing it. 💃

In the next post, you’ll learn more things you can do with durable functions. To get notified when it comes out, you can follow us on Twitter or LinkedIn. Also, check out our new Temporal 101 course!

🖖 till next time! —Loren

Thanks to Brian Hogan, Roey Berman, Patrick Rachford, and Dail Magee Jr for reading drafts of this post.