Workflows as Actors: Is it really possible?
Fitz
Actors, Entities, and Temporal Workflows
System design patterns exist to make applications easier to write. Think MVC for user-facing applications, circuit breakers for traffic routing, CQRS for data management, or "12-factor apps" for, well, everything. Most of these patterns are "technology first" in that they're meant to alleviate and mitigate shortcomings in our infrastructure: no single server can handle an infinite number of requests, so we need design patterns like load balancing and circuit breaking and models like client-server or peer-to-peer architectures.
But other patterns and models do apply differently depending on the domain. One example is the Actor Model. Like the fundamental tenet of "everything is an object" in object-oriented programming, the Actor Model presumes that everything is an actor. At their most basic, Actors accept and respond to messages, maintain state, and can create other Actors.
This blog post will give a short primer on the Actor Model and then show how to build Temporal Workflows that act like Actors.
A Primer On Actors
My colleague Sergey wrote a blog post a couple of years ago imploring everyone to stop using the term "Actor," as it's a term that comes with "a minefield of conflations." But at the core, Actors are defined as being able to do three things. They:
- Send and receive messages
- Can create new Actors
- Maintain state
Those three features sound a whole lot like something else I know from programming: objects. Aside from inheritance, what are the main things that objects provide us as programmers? They:
- Have methods that can be called, and are able to call other objects' methods
- Can instantiate other objects
- Encapsulate state (fields) and computation (methods)
Compare the two lists. Notice anything?
Actors and objects are essentially the same. Looking back into the origins of OOP you'll find that passing messages was at the heart of that programming model, too. They may be different implementations of the same ideas, but both are used to make it easier to think about and reason about complex systems. Specifically, by isolating functionality (separating concerns, you might even say), both Actors and Objects make it easier to reason about behaviors in highly parallel and concurrent systems.
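To make the parallel concrete, here's a minimal Go sketch (the `Counter` type is purely illustrative) of an object exhibiting all three Actor properties: it responds to messages (method calls), maintains encapsulated state, and can create new instances:

```go
package main

import "fmt"

// Counter is an illustrative "object as Actor": it encapsulates state,
// responds to messages (method calls), and can create other objects.
type Counter struct {
	count int
}

// Increment handles an incoming "message" and replies with the new state.
func (c *Counter) Increment() int {
	c.count++
	return c.count
}

// Spawn creates a fresh Counter, just as an Actor can create other Actors.
func (c *Counter) Spawn() *Counter {
	return &Counter{}
}

func main() {
	parent := &Counter{}
	parent.Increment()
	child := parent.Spawn()
	// Each instance's state is isolated, like an Actor's.
	fmt.Println(parent.count, child.count) // 1 0
}
```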
What else does that remind you of?
Temporal Workflows as Actors
It's initially unintuitive to use something called a "workflow" to build things called "actors." The term workflow, in the most general sense, means a sequence of steps that takes work from an initial state to a final state. By contrast, while Actors may eventually cease to exist, they generally stick around so that they can continue to receive, reply to, and send messages.
As we move from concept to implementation, how would you build an Actor (or Object) that doesn't really do anything until a message is received, is able to receive a message at any time without actively consuming resources, and also maintains state that's distinct from other instances of the same type?
To me, this starts to sound like a Temporal Workflow in that they:
- Can "sleep" forever, waiting for messages (Signals) to arrive
- Can start child workflows (i.e., other Objects/Actors)
- Have their own state, and control their own behavior for future messages that is distinct and loosely coupled with other Workflow instances of the same type
These qualities of a Workflow—the ability to send and receive Signals, start other Workflows, and keep state around for what amounts to an infinite amount of time—make Temporal a perfect fit for an Actor implementation.
Let's take a look at incorporating each of these features into a Temporal Workflow.
If you're already a Temporal user, most of these will be pretty familiar to you, but by thinking about them together, you may start to see new design patterns arise.
And, if you're not already familiar with Temporal, then don't worry! I'll link to additional resources at the end to give you some more context and ways to start building your own Workflows as Actors.
I'll go through these in order, which roughly corresponds to increasing complexity.
Sending, Receiving, and Acting On Messages
Workflow Update
In Temporal, the most common way of sending messages to Workflows is via Signals. But they're one-way: a Signal cannot return data back to the caller. This is perfectly fine in many cases, such as when you just want to inform an Actor of something that's happened (e.g., a customer has done something to earn a loyalty point). But in most cases, the Workflow-as-Actor will want to return something back.
As of server version 1.21, Temporal includes a new feature (in preview) called Update that allows a Workflow to make a change to its state and return a response. Suppose you had a Workflow representing a customer's loyalty points. You may then want to have a message ("Update" in Temporal-ese) that increments the point counter and returns a different message depending on how many points they've accumulated. In Go, this looks like:
// addLoyaltyPoint takes no parameters and just assumes adding a single point
err := workflow.SetUpdateHandler(ctx, "addLoyaltyPoint", func(ctx workflow.Context) (string, error) {
	points += 1
	if points == 1000 {
		return "Welcome to Gold Status!", nil
	}
	return "Thanks for being a loyal customer!", nil
})
On the other end, sending an Update is a single call that both sends the request and waits for the response: one call with the Temporal Client (in the future, from a Workflow, too), after which you can extract the result from the response. Again in Go:
updateHandle, err := temporalClient.UpdateWorkflow(
	context.Background(),
	workflowId,
	"", // RunId if you need it; defaults to currently running workflow run
	"addLoyaltyPoint")
if err != nil { /* ... */ }
var updateResult string
err = updateHandle.Get(context.Background(), &updateResult)
if err != nil { /* ... */ }
log.Println("You've earned another point!")
log.Println(updateResult)
Using the new Workflow Update feature, sending and receiving things to and from a Workflow acts exactly like how you'd imagine calling a method on an object or sending a message to an Actor.
Signal-Signal and Signal-Query Pattern
As mentioned above, Signals are one-way: a Signal cannot return data back to the caller. That's fine when you just want to inform an Actor of something that's happened, but in most cases, the Workflow-as-Actor will want to return something back.
To get this call and response feature out of a Workflow—a must-have for it to act as an Actor—you'll need to have your Workflow-Actors set up to Signal each other.
Suppose you have a Workflow representing a customer's loyalty points. You'll probably have a message (Signal) that increments the point counter. Then, if the customer earned enough points to warrant being promoted to a new level, it might then send a Signal back to the caller with that information. In Go, this could look like:
// Set up the Signal handler for the incoming "add points" message
addPointsChannel := workflow.GetSignalChannel(ctx, "addLoyaltyPoint")
selector.AddReceive(addPointsChannel, func(c workflow.ReceiveChannel, _ bool) {
	var addPoints AddPointsSignalInfo
	c.Receive(ctx, &addPoints)
	customer.Points += addPoints.PointsToAdd
	if customer.Points > 1000 && addPoints.SenderID != "" {
		err := workflow.SignalExternalWorkflow(ctx,
			addPoints.SenderID,
			"",
			"notifyNewStatus",
			NotifyNewStatusInfo{
				CustomerID: customer.ID,
				Status:     "Gold",
			},
		).Get(ctx, nil)
		if err != nil { /* ... */ }
	}
})
notifyOtherStatusChannel := workflow.GetSignalChannel(ctx, "notifyNewStatus")
selector.AddReceive(notifyOtherStatusChannel, func(c workflow.ReceiveChannel, _ bool) {
	var notification NotifyNewStatusInfo
	c.Receive(ctx, &notification)
	logger.Info("Guess what? Your friend now has a new status!",
		"CustomerID", notification.CustomerID,
		"Status", notification.Status)
	// Do something with this info.
})
In the case where the sender of the Signal isn't another Workflow (e.g., an external client that displays account information to customers), you'll need to add a Query to retrieve this information. (As a side note, if the Signal handler above did anything asynchronous, like starting a new Activity, there's a good chance this Query wouldn't have the absolute latest information. The implementation above is synchronous and doesn't have that problem.) Again in Go:
// ... In the Client ... //
// Query to get any updated messages from the new points total.
response, err := temporalClient.QueryWorkflow(
	context.Background(),
	workflowId,
	"", // RunId if you need it; defaults to currently running workflow run
	"getLoyaltyMessage")
if err != nil { /* ... */ }
var message string
err = response.Get(&message)
if err != nil { /* ... */ }
log.Println(message)
// ... In the Workflow ... //
// Set up the Query handler for the response
err = workflow.SetQueryHandler(ctx, "getLoyaltyMessage", func() (string, error) {
	if customer.Points > 1000 {
		return "Welcome to Gold Status!", nil
	}
	return "Thanks for being a loyal customer!", nil
})
Using a combination of Signals and Queries (or just Signals, if messaging between two Workflows), sending and receiving things to and from a Workflow acts nearly the same as how you'd imagine calling a method on an object or sending a message to an Actor.
Creating Other Workflows
Like in OOP, it's trivial for a Workflow to create another Workflow. The main difference is that this sets up an explicit parent-child relationship (instead of, as in OOP, just holding a reference to the new object). A Workflow can have any number of Child Workflows (modulo the effect on Event History length; see discussion below) of any type that's able to be run by some Worker somewhere.
To maintain the Actor quality of being able to send messages, each parent and child must keep track of (or be able to calculate) the Workflow Ids of all of the others that it might need to send messages to.
For example, in Go, you might start (asynchronously) a Child Workflow and get its ID by doing:
var childWorkflow workflow.Execution
childExecutionFuture := workflow.ExecuteChildWorkflow(ctx, MyChildWorkflowDefinition)
err = childExecutionFuture.GetChildWorkflowExecution().Get(ctx, &childWorkflow)
if err != nil {
	workflow.GetLogger(ctx).Error("Failed to get child workflow")
	return err
}
childID := childWorkflow.ID
Then, in the Child Workflow, you can get its Parent Workflow ID by doing the following:
parentId := workflow.GetInfo(ctx).ParentWorkflowExecution.ID
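One way to keep those Ids "calculable" is to derive them deterministically from a domain identifier, so any Workflow that knows, say, the customer Id can address the child without having stored its Workflow Id. The naming convention below is a hypothetical sketch, not anything Temporal prescribes:

```go
package main

import "fmt"

// childWorkflowID derives a deterministic Workflow Id from a customer Id.
// Any Workflow (parent or sibling) that knows the customer Id can compute
// this and send Signals to the child without ever having stored its Id.
// The "loyalty-child-" prefix is a hypothetical naming convention.
func childWorkflowID(customerID string) string {
	return fmt.Sprintf("loyalty-child-%s", customerID)
}

func main() {
	fmt.Println(childWorkflowID("cust-42")) // loyalty-child-cust-42
}
```

In the parent, you'd assign this Id up front via the `WorkflowID` field of `workflow.ChildWorkflowOptions`, making the addressing scheme reproducible across runs.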
As discussed below, in order to get a Workflow that can run nearly forever, you'll need to use Continue-As-New. This creates a new run of the Workflow, and when that happens, if the Workflow has children, the default behavior is to terminate all of the Child Workflows. You can override this behavior by setting the ParentClosePolicy to ABANDON, which will keep the children running:
childWorkflowOptions := workflow.ChildWorkflowOptions{
	ParentClosePolicy: enums.PARENT_CLOSE_POLICY_ABANDON,
}
ctx = workflow.WithChildOptions(ctx, childWorkflowOptions)
// start Child Workflow as shown above
That way, you'll be able to both create Child Workflows (i.e., create new Actors) and keep being able to send and receive messages to/from them!
Building a Workflow That Can Practically Run Forever
One claim that you may have heard about Temporal is that Workflows can run forever. Empirically speaking, that's impossible to prove: try as we might, we can't predict the future. But practically speaking, the durability provided by Temporal can get you to effectively forever.
There's only one thing stopping you: the Event History.
Temporal's durability is best on display when things go wrong: Imagine you wrote something that really ought to be in production, but for whatever reason it's still running on your dev box under your desk. Naturally, you're in an office that's trying to conserve power, and some of the outlets shut off at night. If your dev machine is accidentally plugged into one of them, that process is gone and has to recover itself when things come back in the morning.
The way Temporal facilitates this is through the Event History: the server persists the things the Workflow has already done (Events) such that when power comes back, the Workers compare what's already happened with what they think should have happened (i.e., the Workflow Definition it has). If things are different, a Worker can't guarantee that it's running the same code, but if things are the same, the Worker will just start running whatever comes after the last Event in the history.
But if that process (Temporal calls it "Replay") takes a long time, it could mean that instead of being around and useful forever, your Workflow spends nearly all its time catching up with itself. Because of this, Temporal imposes some limits on Workflows' Event Histories: reach 50Ki (51,200) Events or a size of 50MB, and Temporal will terminate the Workflow.
To remedy this, Temporal Workflows can restart themselves with a clean Event History. This feature is called "Continue-As-New." Like recursion, it effectively amounts to a function calling itself, only dropping the call stack along the way. As such, there are two things you need to consider:
- Aside from the Event History itself, what's lost when starting a new run?
- If a Workflow run is restarted from the top of the function, what inputs would it need in order to actually function as a resumption of itself rather than just a brand new instance?
Avoiding Signal and Update Loss
For the first consideration, the main thing lost is any pending, unprocessed Signals and Updates. You'll need to make sure to handle all pending Signals and Updates before continuing as new. For example, in Python, you might have Signals added to a queue and process it until empty before calling continue_as_new():
# Imports, error handling, and some declarations omitted for brevity.
@workflow.defn
class MyWorkflow:
    def __init__(self) -> None:
        self._pending_signals: list[str] = []

    @workflow.run
    async def run(self) -> None:
        # ...
        # Do other things, like wait for signals or start activities
        # ...
        self.cleanup_before_continuing()
        workflow.continue_as_new()

    def cleanup_before_continuing(self) -> None:
        while len(self._pending_signals) > 0:
            logging.info(f"Got signal with contents: {self._pending_signals.pop(0)}")
            # Do whatever else is needed with the signal

    @workflow.signal
    async def submit_signal(self, name: str) -> None:
        self._pending_signals.append(name)
Or in Go, you might get Signal results (non-blocking) until there are none remaining:
// Imports, error handling, and some declarations omitted for brevity.
func MyWorkflow() error {
// ...
// Do other things, like wait for signals or start activities
// ...
cleanupBeforeContinuing(ctx)
return workflow.NewContinueAsNewError(ctx, MyWorkflow)
}
func cleanupBeforeContinuing(ctx workflow.Context) {
signalChannel := workflow.GetSignalChannel(ctx, "signal")
for {
var value string
ok := signalChannel.ReceiveAsync(&signalValue)
if !ok {
break;
}
workflow.GetLogger(ctx).Info("Received signal", "value", value)
// Do whatever else is needed with the signal
}
}
Note that there's an edge case where Signals arrive faster than your Workflow can drain the backlog. In that case, the snippets above could loop over incoming Signals non-stop until Temporal terminates the Workflow for exceeding the Event History limits. It's best to architect your system so that there are periodic quiet periods in incoming Signals, giving Continue-As-New a chance to happen.
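If you can't guarantee a quiet period, one mitigation is to cap how many backlogged messages a single run drains before continuing as new, carrying any remainder forward as input to the next run. This cap-and-carry scheme is a sketch of my own, not an SDK feature, shown here with a plain buffered Go channel standing in for Temporal's Signal channel:

```go
package main

import "fmt"

// drainWithCap receives at most maxDrain pending messages without blocking
// and returns the ones it handled. In a real Workflow, anything still queued
// would be passed along as an argument to Continue-As-New so the next run
// can pick it up.
func drainWithCap(pending chan string, maxDrain int) []string {
	var handled []string
	for len(handled) < maxDrain {
		select {
		case msg := <-pending:
			handled = append(handled, msg)
		default:
			return handled // backlog empty: safe to continue as new
		}
	}
	return handled // hit the cap: continue as new anyway, carrying the rest
}

func main() {
	pending := make(chan string, 10)
	for i := 0; i < 5; i++ {
		pending <- fmt.Sprintf("signal-%d", i)
	}
	fmt.Println(len(drainWithCap(pending, 3))) // 3 (capped)
	fmt.Println(len(drainWithCap(pending, 3))) // 2 (the remainder)
}
```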
Restarting but Continuing
Workflows that are Continuing-As-New likely need to consider their inputs.
Consider a Workflow-as-Actor that's keeping track of a customer's loyalty points. A fresh invocation of this function would have the loyalty points at zero: a brand-new customer hasn't taken any action to earn points. But a long-time customer would be quite angry if their points periodically reset to zero and they lost their Gold status.
So, you'll need to be sure the Workflow's initialization accounts for this and takes in the appropriate parameters. For example, a Python version of a customer loyalty Workflow might first check whether it's working with a new customer and send a welcome message:
@workflow.defn
class LoyaltyWorkflow:
    @workflow.run
    async def run(self, points: int) -> None:
        if points == 0:
            await workflow.execute_activity(
                send_welcome_email,
                CustomerInfo("Name", "email@email.com"),
                start_to_close_timeout=timedelta(seconds=10),
            )
        # wait for incoming signals or whatever else has to happen in this workflow
        # ... later, to Continue-As-New:
        workflow.continue_as_new(points)
When to Continue-As-New
Temporal has a built-in default Event History limit of 50Ki (51,200) Events or a total size of 50MB. A good best practice is to use the SDK's built-in history length API (e.g., workflow.GetInfo(ctx).GetCurrentHistoryLength() in Go, or workflow.info().get_current_history_length() in Python) and call Continue-As-New when that counter reaches about 20% of the limit (i.e., roughly 10K Events) so that there's plenty of buffer. In the future, the SDKs will also provide a similar API for history size. Follow this GitHub issue for details.
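That rule of thumb reduces to a small predicate. In a real Go Workflow you'd feed it the value of `workflow.GetInfo(ctx).GetCurrentHistoryLength()`; the constants below are just the default limit and the 20% heuristic from above:

```go
package main

import "fmt"

const (
	eventHistoryLimit = 51200                 // Temporal's default Event count limit
	historyThreshold  = eventHistoryLimit / 5 // ~20% of the limit, per the rule of thumb
)

// shouldContinueAsNew reports whether this run has used enough of its Event
// History budget that it should wrap up and Continue-As-New.
func shouldContinueAsNew(currentHistoryLength int) bool {
	return currentHistoryLength >= historyThreshold
}

func main() {
	fmt.Println(shouldContinueAsNew(500))   // false
	fmt.Println(shouldContinueAsNew(10240)) // true
}
```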
Conclusion
While the term "workflow" may at first feel like simply a series of steps (reliably done, thanks to Temporal), it's also a perfect fit for use as Objects or Actors. With just a little consideration to how your Workflow accepts and responds to messages, how child Workflows are managed, and, via Continue-As-New, being able to keep it running for a very long time, your Workflows can function cleanly as Actors.
Looking to learn more about Temporal as a result of this post? Check out the following resources:
- Documentation: docs.temporal.io
- Temporal courses: learn.temporal.io/courses
- Developer Guide: docs.temporal.io/dev-guide