October 23, 2023

An Introduction to Worker Tuning

Fitz

When you were first learning Temporal, Workers probably seemed pretty simple:

// err handling omitted for brevity
func main() {
	c, err := client.NewClient(client.Options{})
	defer c.Close()

	w := worker.New(c, "task-queue", worker.Options{})
	w.RegisterWorkflow(app.Workflow)
	w.RegisterActivity(app.GreetingActivity)
	err = w.Run(worker.InterruptCh())
}

That is: register your Workflow and Activities, then run until interrupted. Is it really that simple? For a while, yes. But regardless of what’s happening behind the scenes, you will inevitably grow to the point where a single Worker is not enough to handle the number of new Workflows.

Whether you’re self-hosting or using Temporal Cloud, you need a pool of Workers to actually run your Workflows and Activities. This guide discusses a few Worker deployment patterns and some of the configurations that have the biggest impact on Worker performance.

Note that, as in all production settings, performance tuning is a much more involved process than what can be covered in a single general-purpose article. This guide hopefully gives you a starting point for what to look for, but you should expect Worker tuning to be a nuanced and ongoing process.

Setup

This guide assumes that you’re familiar with Temporal concepts and that you have written a proof of concept Workflow or have run one or more of the sample Workflows.

At the end of this guide, you will have the foundational knowledge necessary for deploying and managing your own Worker Pool.

This guide focuses on the Workers themselves and the strategies for performance tuning Worker-side code. For server-side tuning (in the case of a self-hosted deployment of the Temporal Cluster), see this article.

Background Terminology

First, some concepts you should be familiar with before we dive deeper.

Workers

The term "Worker" is used to refer to the entity that actually runs your code and makes forward progress on your Workflows and Activities. A single process can have one or many Workers.

Most of your first Worker code is minimal: create a client, register Activities and a Workflow, and then run until shut down. But a lot happens behind the scenes, including maintaining long-poll connections to the server, caching and batch-sending Events and Commands, receiving new Tasks from the Server, emitting metrics, streaming Event Histories when necessary, and more.
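As a quick illustration of a single process hosting more than one Worker, here’s a minimal sketch. The Task Queue names, OrderWorkflow, and SendEmail are placeholders standing in for your own definitions:

package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// Placeholder definitions so the sketch compiles.
func OrderWorkflow(ctx workflow.Context) error       { return nil }
func SendEmail(ctx context.Context, to string) error { return nil }

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Worker 1: polls the "orders" Task Queue for Workflow Tasks.
	orderWorker := worker.New(c, "orders", worker.Options{})
	orderWorker.RegisterWorkflow(OrderWorkflow)

	// Worker 2: polls a separate "notifications" Task Queue for Activity Tasks.
	notifyWorker := worker.New(c, "notifications", worker.Options{})
	notifyWorker.RegisterActivity(SendEmail)

	// Start the first Worker asynchronously, then block on the second until
	// the process is interrupted.
	if err := orderWorker.Start(); err != nil {
		log.Fatalln("unable to start order Worker:", err)
	}
	if err := notifyWorker.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to run notification Worker:", err)
	}
}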

Task Queues

Task Queues are the entities that Workers poll for more work, and it’s more than just a string you need to make sure is identical in multiple places. Task Queues serve as a logical separation of work, such as different types of Workflows or Activities with different functional requirements.
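For example, the Task Queue name used when starting a Workflow must match the one the Worker polls. A minimal sketch, assuming a connected client c and the usual SDK imports (GreetingWorkflow and the "greetings" Task Queue name are placeholders):

workflowOptions := client.StartWorkflowOptions{
	ID:        "greeting-workflow-1",
	TaskQueue: "greetings", // must match the Task Queue the Worker polls
}
we, err := c.ExecuteWorkflow(context.Background(), workflowOptions, GreetingWorkflow, "World")
if err != nil {
	log.Fatalln("unable to start Workflow:", err)
}
log.Println("started Workflow", we.GetID(), we.GetRunID())

// Elsewhere, a Worker polls the same Task Queue for that Workflow's Tasks:
w := worker.New(c, "greetings", worker.Options{})
w.RegisterWorkflow(GreetingWorkflow)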

Tasks

Task Queues contain Tasks created by the Temporal Server. Tasks represent, and contain the context needed for, the next piece of work a Worker must do. There are two types of Tasks: Workflow Tasks, which include the type of the Workflow and the Event History, and Activity Tasks, which specify the Activity to run and its inputs.

Commands

Commands are sent from Workers to the Temporal Server to indicate what needs to happen next, such as starting a Timer or scheduling an Activity to run.
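To make that concrete, here’s a minimal sketch of Workflow code annotated with the Commands each step causes the Worker to send (GreetingActivity is a placeholder for one of your registered Activities):

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func Workflow(ctx workflow.Context, name string) (string, error) {
	// workflow.Sleep yields and results in a StartTimer Command.
	if err := workflow.Sleep(ctx, time.Minute); err != nil {
		return "", err
	}

	ao := workflow.ActivityOptions{StartToCloseTimeout: 10 * time.Second}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// workflow.ExecuteActivity results in a ScheduleActivityTask Command; the
	// Workflow then waits for the corresponding ActivityTaskCompleted Event.
	var greeting string
	err := workflow.ExecuteActivity(ctx, GreetingActivity, name).Get(ctx, &greeting)
	return greeting, err
}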

Events

Events are the records of your Workflow’s progress and include, among other things, results of completed Commands. Events, rather than Commands, are what you see in the Workflow Execution Event History. You may commonly reference Events to see the results of an Activity: when a Worker successfully completes an Activity, it sends the results to the Temporal Server, which adds an ActivityTaskCompleted Event with those results to the Workflow’s history.

The other common use for Events is in the Worker: included in every Workflow Task is the history of events known by the server to have already completed.
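If you want to inspect an Event History yourself, the Go SDK client can stream it for you. A minimal sketch, assuming a connected client c and the import enumspb "go.temporal.io/api/enums/v1" (the Workflow ID is a placeholder):

iter := c.GetWorkflowHistory(
	context.Background(),
	"greeting-workflow-1", // Workflow ID (placeholder)
	"",                    // Run ID; empty means the latest run
	false,                 // don't long-poll for new Events
	enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT,
)
for iter.HasNext() {
	event, err := iter.Next()
	if err != nil {
		log.Fatalln("error reading history:", err)
	}
	// ActivityTaskCompleted Events, for example, carry the Activity's results.
	log.Println(event.GetEventId(), event.GetEventType())
}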

How Workers Work

Workers are responsible for making progress on your Workflow and they do so by picking up Tasks from a Task Queue. These Tasks are either Workflow or Activity tasks, depending on the state of the Workflow Execution.

When Workers start, they open asynchronous long-polling connections to the server to get new tasks on the specified Task Queue. In the simplest form, when a Worker receives a Task, it does the following:

  1. Validate that the Task is not malformed (e.g., missing the Event History when it should be non-zero length)
  2. Check that this Worker can run the requested work (i.e., that the Workflow or Activity types are known and valid)
  3. If a Workflow Task, validate that the unseen portion of the Event History in the Task matches the known Workflow Definition (i.e., check for determinism)
    • Then, run the Workflow code as far as possible (i.e., until all coroutines have yielded and are waiting on the Server)
  4. If an Activity Task, run the Activity with the given arguments
  5. Send “Task Completed” and any results (e.g., an Activity’s return values) back to the Server

At this point, the Temporal Server records the relevant updates to this Workflow instance’s Event History and generates new Tasks as necessary.

Aside: Why two different types of Tasks?

If both Workflow and Activity Tasks follow the same pattern—check that the type is registered, then run the matching code—why are they different?

Recall why the concept of Activities exists in the first place: to ensure that unreliable work is isolated and doesn’t cause unnecessary failures of the overall system. For many categories of unreliability, such as API or network timeouts, simply retrying until successful is all that’s necessary to effectively remove that unreliability.

This conceptual separation between unreliable work and deterministic work necessitates the two different Task types. This allows the Activity Tasks to be substantially simpler than Workflow Tasks. It doesn’t matter whether it’s the first time this Activity has run, the Nth time, or happening after a long Timer. The Worker’s job is to run the Activity specified in the Task using the provided inputs and return the results.

A Workflow, meanwhile, must be deterministic. If it’s not, Temporal cannot guarantee that the Worker is running the same Workflow implementation that the Server thinks it should be. The Tasks that facilitate this, Workflow Tasks, must therefore include the information needed for the Worker to validate determinism. It can’t simply execute (or, like with Activities, re-execute) the next step in the Workflow. The Worker must check that how it thinks that step is reached is how the Server thinks it was reached. This is done by comparing the Event history given in the Workflow Task to what the Worker thinks the history would be if it re-ran the Workflow. We call this process “Replay.”
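You can exercise Replay yourself: the Go SDK ships a WorkflowReplayer, commonly used in unit tests to check a Workflow Definition against a recorded history. A minimal sketch (the Workflow name and history file path are placeholders; the history can be exported from the Web UI or CLI):

import (
	"testing"

	"go.temporal.io/sdk/worker"
)

func TestReplayWorkflowHistory(t *testing.T) {
	replayer := worker.NewWorkflowReplayer()
	replayer.RegisterWorkflow(Workflow)

	// Replays the recorded Event History against the current Workflow code and
	// fails if the Commands it produces diverge from the recorded Events.
	err := replayer.ReplayWorkflowHistoryFromJSONFile(nil, "workflow_history.json")
	if err != nil {
		t.Fatal(err)
	}
}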

How many Workers do you need?

When to know you need more

As your application grows, it's inevitable that a single Worker won't be enough to complete all your Workflow or Activity Executions, as we’ll see below (in fact, best practice is to always run more than one). To keep up with the increased demand, you'll need to scale up your Workers. But, as always with capacity planning, the question is: what’s the minimum number of Workers you need?

Reduced ability of Workers to start new Tasks

Workflow schedule_to_start graph

Above: An example graph of the workflow_task_schedule_to_start_latency metric. At around timestamp 11:55 the number of incoming Workflows increased dramatically, causing a sizable (and growing) backlog in Workflow Tasks. At 12:05, the number of Workers was increased and the schedule_to_start latency started falling (this could also happen with an increase in the number of pollers).

Your Workflows only make progress when Tasks are dequeued by Workers and the associated work is run. The Schedule-To-Start metric represents how long Tasks are staying, unprocessed, in the Task Queues. If it’s increasing or high, and your hosts are at capacity, add Workers. If your Workflow Schedule-To-Start is increasing or high and your hosts still have capacity, increase the Workers’ max concurrent Workflow Task configuration.
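In the Go SDK, the number of pollers is a Worker option. A minimal sketch of the relevant knobs (the values here are illustrative, not recommendations):

w := worker.New(c, "task-queue", worker.Options{
	// Goroutines long-polling the Task Queue for Workflow Tasks (Go SDK default: 2).
	MaxConcurrentWorkflowTaskPollers: 4,
	// Goroutines long-polling the Task Queue for Activity Tasks (Go SDK default: 2).
	MaxConcurrentActivityTaskPollers: 4,
})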

Reduced ability of Workers to run Tasks

Activity Worker Task Slots Available Graph

Above: A contrived example where Activities have started arriving faster than the available Worker slots can handle. The Worker is configured with 5 slots, but availability drops to 0 under heavy load. (Note: this graph and the previous one are from different days and illustrate distinct scenarios; they are not occurring concurrently.)

Worker Task Slots represent how much capacity Workers have to actually do the work. When a Worker starts processing a Task, a Slot is filled. So when there are zero slots available, your Worker pool is unable to start new Tasks.

The worker_task_slots_available metric should always be >0. If it’s not, add more Workers or increase the max concurrent execution size, according to what’s reported in the metric’s worker_type label:

w := worker.New(client, "task_queue", worker.Options{
	MaxConcurrentActivityExecutionSize:     1000, // default in the Go SDK is 1k
	MaxConcurrentWorkflowTaskExecutionSize: 1000, // default in the Go SDK is 1k
})

ℹ️ Note: There’s a very remote coincidence that can occur where zero Task slots means you have a perfect balance between the Workers’ ability to complete work and the amount of incoming work. But even in this coincidental balance, Tasks are still waiting in the Task Queue to be picked up by Workers. Even if that wait is a minuscule amount of time, no extra Task processing slots are available, so your application is not resilient to even the slightest increase in load.

This is not an exhaustive list of metrics or configuration values to use for determining when more Workers are needed, only the most common. For more, see the Worker Performance page in the Dev Guide and the full list of SDK and Server metrics.

When to know you have too many Workers

Perhaps counterintuitively, it is possible to have too many Workers. Beyond simply paying for too much compute, it could mean your Workers’ requests to the Server for new Tasks get denied.

There are three things you should look at for determining if you have too many Workers:

  1. Polling success rate

This rate, defined as

(poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts)

represents how often Workers’ requests for Tasks succeed: that is, how often they receive either a Task or a “no Tasks in the queue” response rather than timing out.

An example PromQL for this rate is as follows:

  (
      sum by (namespace, taskqueue) (rate(poll_success[5m]))
    +
      sum by (namespace, taskqueue) (rate(poll_success_sync[5m]))
  )
/
  (
      sum by (namespace, taskqueue) (rate(poll_success[5m]))
    +
      sum by (namespace, taskqueue) (rate(poll_success_sync[5m]))
    +
      sum by (namespace, taskqueue) (rate(poll_timeouts[5m]))
  )

In a well-functioning system, this rate should be at or very close to 1.0, and consistently greater than 0.95. In reality, though, it's expected and normal for some percentage of Worker polls to time out, and in a high-load situation an acceptable success rate can be as low as 0.8.

  2. Schedule-To-Start latency

As mentioned earlier in this guide, the Workflow and Activity Schedule-To-Start latencies should be consistently very low, as they represent the Workers’ ability to keep up with the number of incoming Activity and Workflow Executions. The exception is if you are purposefully rate limiting Activities at a Worker or Task Queue granularity; in that situation it's expected, and not necessarily a sign to optimize your Workers, that Schedule-To-Start latencies will be high.

When these latencies are low, you likely have plenty of Workers. If, at the same time, some of their requests for Tasks are timing out, then not only is a portion of your Worker pool doing nothing, it’s inefficiently doing nothing. (In many cases this is okay, though; for example, over-provisioning to account for load spikes. As with all capacity provisioning, plan according to your specific load characteristics.)

  3. Host load

Many situations can lead to Task requests timing out, including an overloaded Worker host. An overloaded Worker host, in turn, is often a sign that you need more hosts.

But in the contrary case, where requests for Tasks are timing out while Worker host load is low (or has plenty of headroom), the Workers’ requests are likely being dropped. (These "Task requests" are the API calls the Workers make to the Temporal Server to get more work. Those specific calls are PollWorkflowTaskQueueRequest and PollActivityTaskQueueRequest.)

Worker Host Load Graphs

Above: A set of three graphs exemplifying a situation where there are too many Worker pollers. From left to right: (1) poll success rate is near 1.0 but starts to fall. (2) Workflow and Activity Schedule-To-Start latencies are, eventually, quite low (1-3ms). (3) CPU usage during the same time period is consistently much less than 20%. In this example, Workflow Schedule-To-Start latency started relatively high, and so the operator tried to add many more pollers to their Workers. This resulted in reduced Schedule-To-Start latency and no effect on host load, but also caused >80% of Worker polls to fail. They should reduce the number of pollers and/or Workers.

When all three of these factors are occurring, you likely have too many Workers. Consider reducing the number of pollers, the number of Workers, or the number of Worker hosts.

Summary: Identifying Poor Worker Performance

Note that the metrics listed below are only some of the indicators for measuring Worker performance. Depending on your situation and environment(s), you may find other SDK metrics or Cluster metrics useful.

Metric(s): workflow_task_schedule_to_start_latency and activity_task_schedule_to_start_latency
Metric type: SDK
What to look for: High and/or increasing. This latency should be consistently near zero.
Meaning: Schedule-to-Start latency is the time between when a Task is enqueued and when it is picked up by a Worker. If this time is long, your Workers can’t keep up: either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker.

Metric(s): worker_task_slots_available (filter this metric on the worker_type tag; possible values: "ActivityWorker" or "WorkflowWorker")
Metric type: SDK
What to look for: Both worker types (WorkflowWorker and ActivityWorker) should always report >0.
Meaning: task_slots represents the number of executors (threads or otherwise) a Worker has available to actually run your code. In order to have room for incoming Workflows and Activities, this number should always be greater than 0.

Metric(s): poll_success vs poll_timeouts
Metric type: Server
What to look for: Underutilized Worker hosts, plus low Schedule-to-Start latencies, plus a low (<90%) poll success rate.
Meaning: You may have too many Workers. You might also start to see RPC call failures and rate limiting from the Temporal Server. Either way, consider reducing the number of Workers.

Conclusion

This guide outlines one of the most fundamental pieces of Temporal: Workers. When you don’t have enough Workers, your Workflows can stall, slow down, or possibly never finish. When you have too many Workers, at best you’re paying too much for compute resources; at worst, your Workers are rate limited, denying them the ability to poll for new Tasks.

Either way, striking the balance of having the right number of Workers can be tricky. Hopefully this guide gives you pointers for finding that balance in your application.

Title image courtesy of Pop & Zebra on Unsplash.