Parallelism and Concurrency in a Distributed Event Loop
Fitz
When computing started, it was relatively easy to reason about things as a single series of computations.1 It didn't take long though before we introduced the ability to compute more than one thing at a time. These days, we take for granted computers' ability to multitask. We know that it's because we have multiple cores, CPUs, and servers. Yet somehow "single-threaded" things like JavaScript and Python are also able to "do more than one thing at a time."
How? There are two different concepts at play here, often at the same time, often confused, yet entirely distinct: Parallelism and Concurrency.
Quickly defined, they are:
Concurrency
: The ability to handle multiple things in an undefined order.
Parallelism
: More than one thing literally running at the exact same time.
Or, put another way: concurrency gives the illusion that multiple things are happening at the same time (as a result of how a program or runtime environment was designed). With parallelism, multiple things actually are happening simultaneously.
Let's take a deeper look. Then, I'll show you how Temporal provides you with reliable concurrency and parallelism with a durable, distributed event loop.
Concurrency
Concurrency is most easily understood through a concept formally called "time-division multiplexing",2 or more commonly known as "multitasking." I might spend a few minutes writing this post, then check Slack messages, then read some news, then come back to writing. All the tasks are making forward progress before any one of them is done,3 but I can't write and do the dishes at the same time. That's concurrency: I'm capable of pausing one thing and coming back to it later after doing something else in the meantime.
But just because I can't check email while also writing a blog post doesn't mean you couldn't be checking email while I'm writing these words. (That'd be parallelism.) You and I are still employing the same skills, though: writing a post and reading email. Similarly, a well-decomposed program might have one function for calculating a Fibonacci number and a different one for a fractal sequence. Said program could be designed in such a way as to enable pausing one function and allowing the other to run for some time. That, again, is concurrency.4
But, if we popped those functions into separate threads or forked one off to another process and ran on a multi-core machine,5 suddenly we'd have parallelism, too!
Parallelism
And so, parallelism might sometimes imply concurrency, but "concurrent" doesn't necessarily mean "parallel." The main difference: parallelism has things running literally at the same time, while concurrency has them run interleaved.
Parallelism is both trivially easy and wildly complex to achieve. Nearly every modern computing device has a multi-core processor,6 even incredibly small and cheap ones. With that, we do get some parallelism for free: Slack truly is running at the same time as the Google Docs tab I'm writing this in.
In one sense, if your entire computer were a single "program," then the work of building concurrency into it was done naturally as part of separately building out the different functions—chat, browser, terminal. It just happens to run in parallel because of the hardware it's running on.7
Parallelism and Concurrency
As you might imagine, things get complicated from here. Adding threads, hyperthreads, co-processors (e.g., GPUs or VPUs), or entire other machines gives us multiple levels of parallelism and/or concurrency.
And so knowing which one is happening within an overall system is often difficult to determine. (This is a big reason why they're confusing and often mixed up or interchanged.)
Consider a bunch of independent processes that you expect to take the same amount of time to run (as in, they run in a reasonable amount of time and do, in fact, finish). When you kick off the processes, and they mosey on down to your multicore CPU, eventually they complete and produce results.
Was the CPU running those processes concurrently, in parallel, both, or neither (serially and in a deterministically predictable order)? Does it even matter if everything happened successfully?
Consider instead that these processes run on different CPUs in different servers8, yet need to communicate with each other, we're now solidly in the realm of parallelism and concurrency in distributed systems.
If those processes can be run on any machine at any time, in any order, be able to communicate with each other (also at any time), and access common resources (like a database), you're suddenly opening up a whole new world of complexity.
On a single system, those complexities are handled by the operating system, which acts as the mediator between a massive crowd of resource-hungry processes and the resources they're yearning for. This OS mediator was good enough for most of our computing needs for a long time (and still entirely necessary), giving us APIs for accessing shared resources (like the filesystem or network) from any process on the system.
But as our applications have grown to be bigger than a single machine, the single-system OS model breaks down: among other challenges, the kernel on machine A has no idea what resources are available on machine B. And yet, processes need some kind of compute resource; they may not care if it's on machine A or B (as long as other properties like location, performance, etc. are maintained), and just need to be run on something.
Distributed Systems & Temporal
Therein lies the complication with a distributed system. Not only is it a highly parallel environment, there's nothing inherent in the system to mediate access to resources. Critically, this is important because hardware, networks, everything fails eventually and work gets lost unless it can be rescheduled on a "good" resource.
To mitigate this fact of distributed system life, we inevitably end up needing to build out something that allows for recovery from failure or is tolerant of those failures in the first place. System design patterns like microservices vs. monoliths, event sourcing and CQRS, or message queues and producers/consumers arise in large part to be able to tolerate and recover from a variety of failures.
Often, these patterns work by breaking an application down into smaller parts and persisting the state of what's already happened, thereby duplicating as little work as possible when something fails. (This is, albeit in a bit of a mental leap, a form of concurrency.9)
Those patterns yield their own complications, though.10 As the system's complexity skyrockets, debugging problems, maintenance, or even just adding features all get difficult and error-prone.
Temporal as a Concurrent (and Parallel) Distributed System
While Temporal doesn't solve all problems inherent in highly parallel systems, it does mitigate most of them. By building an application with Temporal, you are getting parallelism, concurrency, and fault tolerance for free.11
Consider an online food ordering and delivery application, where the main "business unit" is an order.12 It's mostly just a high-level conceptual view of the status of one item in the [distributed] system; each step in the order's lifecycle is independent of the others in that once it gets the work request, it can work without any additional communication until done. But the lifecycle as a whole is the fundamental concern of the customer ("where's my order right now?").
If we treat the order itself as a single software entity—say, a function—it might look something like this pseudo-JavaScript:
function orderStatus(order) {
paymentSucessful = await validateAndPay(order);
if (paymentSucessful) {
// [update status display for "✔︎ order placed"]
} else {
// [inform customer of failure]
}
foodCooked = await sendToRestaurant(order);
// [success/error handling as above]
pickedUp = await findDeliveryDriver(order);
// [success/error handling as above]
delivered = await deliverOrder(order);
// [success/error handling as above]
return "Success!"
}
Assuming the OS-level process running this function is perfectly reliable and never goes down,13 JavaScript is able to concurrently run many, many instances of this function. JavaScript, and many other languages, operate on a single-threaded non-blocking event loop model.14 Extremely briefly, this works as follows:
- Code runs until done or when it yields (via an await or a timer, e.g.,
setTimeout
in JavaScript). a. If yielding from an await or timer, the thing being awaited is added to a heap of tasks. - The main thread, aka the event loop, checks for ready-to-run tasks. This could be other functions, a timer whose time has come, or an awaited future/promise that now has resolved. a. If yes, run them.
- Repeat ad infinitum.
This allows two functions for, say, Order A and Order B, to run interleaved with each other and make independent progress, regardless of how long the other takes. By this model of concurrency, you get a very strong illusion of parallelism: assuming the restaurants and delivery drivers take a random amount of time to complete, sometimes A will make it all the way through findDeliveryDriver
before B even starts, and while A is waiting for deliverOrder
to resolve, B completes. Other times, A will finish before B. These two functions appear to run in parallel.
Temporal works almost exactly like this. In fact, Temporal’s SDKs always only run one thing at a time for a given Workflow, even when using ostensibly parallel things like the Go SDK's workflow.Go().
The fundamental difference though is that the queue of ready to run tasks is not kept in-memory, but in the Temporal Server. This way, you can have many, many "main threads, aka event loops" in the form of Temporal Workers.
So while the JavaScript event loop + task queue might look like this:
Temporal's version looks more like:
Caption: An abstract cloud labeled "Worker Fleet" contains numerous boxes, each of which is labeled "Event Loop" and contains an arrow looping back on itself. Outside of the cloud is a box labeled "Temporal Server", which contains a representation of an infinitely-growing "Task Queue". There are bi-directional arrows between many of the "Event Loop" boxes and the "Temporal Server" box.
(Note: The Temporal Server is actually a cluster of many different services, not just a Task Queue.)
With this, Temporal gives you parallelism through being able to have, practically speaking, as many Workers as you need. It gives you concurrency by those Workers running with a kind of event loop. And it gives you distributed computing by supporting those Workers being run not just in one process, but on any machines that you want (as long as they can still reach the Temporal Server).
It also largely solves the problem of a single process crashing, losing connectivity, and losing work as a result: If the "tasks" created by an await
are stored elsewhere, the processes running the event-loop can come and go with little-to-no impact on the function's progress. As a result, the manual effort of implementing producers/consumers, message queues, event sourcing, or other architectural patterns is almost entirely removed.
Summary
Parallelism and Concurrency are topics that are often conflated with each other. Most often this is because if a single program is effectively and correctly designed to take advantage of parallelism, then it is also concurrent.
However, "only a single thing at a time" environments like JavaScript or Temporal Workers also have a high degree of concurrency, but with no internal parallelism for running a single program. (With parallelism achieved only through running multiple JS VMs or Temporal Workers.)
Concurrency, therefore is the ability of a program to handle different components, interleaving with each other as they run. While parallelism is the environment's ability to literally run more than one thing at a time.
Whether you were comfortable with the distinction between these two concepts or not, hopefully this post has improved your sense of them and how they are involved in Temporal applications.
For more on Temporal, check out the following links:
- Documentation: docs.temporal.io
- Dev Guide: docs.temporal.io/dev-guide
- Temporal Python 1.0.0 – A Durable, Distributed Asyncio Event Loop
And the best way to learn Temporal is with our free courses.
Footnotes
Footnotes
-
But actually, and still, very hard. ↩
-
Originally used in sending signals over a wire; in computing/scheduling this is also known as "time slicing." ↩
-
What does "forward progress" even mean when reading an infinite stream of internet content? ↩
-
With the example given here, this is sometimes called Cooperative Multitasking. ↩
-
Just forking into multiple threads or processes doesn't automatically get you parallelism though; doing so on a single-core CPU is still just concurrency, just happening at the hardware level. ↩
-
Although, curiously, you can still buy single-core processors and supporting motherboards. ↩
-
I'm of course conveniently skipping over the complexities in OS Schedulers or the concurrency that occurs in interrupt-driven systems (e.g., single-core CPUs) running multiple processes. ↩
-
Let alone different datacenters or regions. ↩
-
Suppose I have a web app with a page composed of the results from multiple microservices. That architecture of splitting things out into those microservices does two things: (1) makes the page more resilient to failure (if one microservice is unavailable or slow, it doesn't necessarily affect the other components); and (2) allows those components to be built in any order, rather than strictly sequentially. ↩
-
Parallelism is hard. ↩
-
Minus the cost of, well, writing the application in the first place. ↩
-
Depending on your perspective and how you design this application, it could have additional "main" units, like a customer, driver, restaurant, or others. ↩
-
Wouldn't that be nice. ↩
-
Non-blocking mostly because of the aforementioned Cooperative Multitasking because bad (whether intentionally so or not) actors can block the thread. ↩