The Distributed Machine
Derek Wilson
… or, Temporal from the perspective of computer architecture
Temporal is a Game Changer
We've written and presented at length about the direct and significant practical impact that Temporal provides to developers writing distributed applications. It simplifies systems that would normally require sprawling microservices and makes strong guarantees around the reliability of business logic through durable execution.
We believe that Temporal is the next step in the natural progression of computing machines.
Our claims may seem bold and outrageous, but when engineers start using Temporal they consistently realize that we have something special. Let's ratchet things up a notch and see if we can get really ridiculous and still resonate with developers.
Today I want to take a brief look at the paradigm shift that came out of virtual machines, how code runs on machines, how operating systems manage programs running on a machine, and how container orchestration has replaced the role of the OS in the cloud. Then, we'll dive into how durable execution—and Temporal in particular—abstracts distributed processing away from applications and transforms the cloud. From a computer engineering perspective, the result is a new kind of machine beyond the physical or virtual: a distributed machine.
From Physical to Virtual Machines and Containers
It’s definition time! A machine is anything that executes the instructions of a program, a program specifies a set of instructions for a machine to execute, and a process describes an actively running execution of a program on a machine. Today there are two generally accepted types of machines: physical and virtual. Before we make the case for a new type of machine, let’s talk about what we have today.
Before virtual machines, we just had hardware. Our physical machines had operating systems installed on them and ran whatever software we needed. Multi-user operating systems didn't have very strong isolation, and any changes to the operating system installed on the hardware impacted all of the users and software running on the system. As technology advanced toward the level of flexibility in managing machines that exists today, we uncovered the value of decoupling the environment (where software runs) from the hardware we wanted to run it on.
For our purposes, we’re going to touch on complex topics and stop short of deeply illuminating them for the uninitiated. The goal here is to demonstrate patterns and paradigms in order to relate them to Temporal as enabling a distributed machine, not to teach computer architecture or operating system design. There are exceptions to every rule and leaks to every abstraction. Please give me some leeway to be a little general and hand wavy.
If these concepts pique your curiosity and you want a deeper dive, I highly recommend Computer Architecture: A Quantitative Approach by Hennessy and Patterson (just “Hennessy and Patterson”), The Design of the UNIX Operating System by Bach (I don’t know if this one has a nickname), and Engineering a Compiler by Cooper and Torczon (there’s a third edition that just came out last year I still need to check out). And it’s really got nothing to do with any of this, but while I’m recommending important books in computing I might as well throw in The C Programming Language by Kernighan and Ritchie (just “K&R”), and The Art of Computer Programming by Knuth (it’s long - I’m still working through it).
The virtual machine provided unprecedented flexibility in managing compute workloads on top of physical machines. Rather than reinstalling or rebuilding a physical machine, new virtual machines can be created simply by copying a file. Replacing hardware becomes much easier when you can pause a virtual machine and move it to new hardware. Containers further isolate programs in ways that divorce applications from the physical or virtual machine on which they run. Container orchestration platforms go further still, acting as something like a distributed process management system that manages containers not just within a single machine, but across many machines. These technologies make it easier and easier for us to treat processes and machines as cattle rather than pets (as the adage goes), which significantly improves the recoverability and maintainability of systems.
But while virtual machines and containers provide incredible flexibility, running coordinated software across many machines has proven quite difficult. To take advantage of a multi-machine environment, software engineers have had to wrestle with advanced techniques in computer science like distributed consensus, resource contention, and cache invalidation, or make use of heavyweight technologies to hide these problems. The downside of most off-the-shelf solutions is that they solve only a narrow slice of the distributed systems problem space, like shared storage or leader election. Developers still need to fill in all the gaps or glue a bunch of technologies together.
To this point, we’ve abstracted the idea of machines and processes with VMs and containers, but there’s something missing. When I run a program, I don’t really care which core or CPU or even machine it runs on (or, indeed, whether it all runs on just one machine) as long as it makes progress. If my code is running on a machine that dies, why can’t my running program just keep executing on another available machine, picking up from exactly the same line of code where the former machine stopped making progress?
Temporal provides a revolutionary, world-altering answer to this question: we say you can do exactly that. It lets you just run your program across arbitrarily many machines as if they were a single, higher-level distributed machine.
Rather than executing code on any one virtual or physical machine, Temporal enables executing code on a distributed execution system, thereby obviating the need to build application-specific distributed systems.
Let’s look a little deeper at machines and processes to talk about how this is possible.
Executing Code and Microservices
We don’t need a deep understanding of how compilers or CPUs work, but to solidify our case for a distributed machine it’s useful to have some hand wavy idea of a couple of things.
Developers don't write in machine code. We rely on high-level languages and use language features like functions, loops, and data structures to write programs that are translated into a sequence of instructions the machine can understand. To move from a high-level language to machine code, instructions are grouped into basic blocks (sequential instructions without branches) as part of a control flow graph. Breaking programs down in this way enables grouping and optimizing streams of instructions for running on the target machine.
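As a rough, hand-wavy illustration (the function and its helper are made up for this post), here's where a compiler would carve a small Go function into basic blocks:

```go
package main

import "fmt"

// fee is a stand-in helper so the example compiles.
func fee(amount int) int { return amount / 10 }

// settleInvoice is a hypothetical function annotated with where a compiler
// would split it into basic blocks: straight-line runs of instructions with
// a single entry and a single exit.
func settleInvoice(amount int, paid bool) int {
	total := amount + fee(amount) // block 1: runs straight through...
	if paid {                     // ...until this branch ends it
		total = 0 // block 2: the taken branch is its own block
	}
	return total // block 3: both paths rejoin here
}

func main() {
	fmt.Println(settleInvoice(100, false)) // prints 110
}
```

The control flow graph connects those blocks along their branch and fall-through edges, and that graph is the structure the compiler optimizes before emitting machine code.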
It’s also important to point out that if we just used this model of process management, it would be very difficult to coordinate work across multiple independent machines all running their own OS. One system would need to serialize the state of a running program, including CPU cache and registers, as well as main memory and whatever files and sockets might have been open, and send a copy to another system to take over running the program. Doing this in a fast and efficient way is an ongoing area of research, and Docker offers experimental support for checkpointing running containers via CRIU.
Until Temporal came along, we weren't able to simply write the end-to-end logic for our distributed systems in one program. As a workaround, we've learned to break down our larger programs into smaller parts and run them as microservices that communicate with one another. Breaking things apart like this enables us to run small pieces of our program in a replicated fashion to meet performance or reliability goals, but the burden on the programmer is fairly high.
Software That Runs Software
Unless you’re programming performance-sensitive software for microcontrollers, you’re probably not running your application directly on hardware. Operating systems do a lot of jobs. The three things we want to focus on here are:
- The abstraction of available hardware
- Enabling many programs to run at once by scheduling them on available resources
- Enabling communication and coordination between multiple programs
Most of us aren’t directly accessing hardware by writing to various addresses or setting registers on devices and dealing with sharing access with other programs; the operating system handles all of that for us. Our programs make use of language features or libraries that interact with hardware using drivers provided by the operating system. This significantly simplifies writing software. Language maintainers are able to provide higher-level features to programmers as well. These days we accept that we benefit quite significantly from running our software on software that manages the hardware.
Running your software alongside other applications is also something it’s easy to take for granted, but the operating system has to take an active role in scheduling programs on whatever CPUs and cores are available to the system. Processes get time slices to run based on various priorities, patterns, interrupts, and blocking operations. On a single core system, while applications appear to all be “running” at the same time, only one process at a time has its instructions actually making progress. On modern multi-core systems, the OS obviates the need to care about which core a process will execute on as well.
There are a variety of options that processes have to communicate with each other when they are running on the same OS. Unix-like systems have pipes, files, signals, and sockets to choose from. Some of these options are more readily available to processes that spawn subprocesses as they can easily set up those pipes and propagate signals to any child process they spawn.
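As a tiny sketch of that last point (assuming a Unix-like system with the sort utility on the PATH), a parent process can spawn a child and coordinate with it over a pipe:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Spawn a child process and wire a pipe to its stdin: the parent
	// writes unsorted lines, the child (sort) writes back sorted output.
	cmd := exec.Command("sort")
	cmd.Stdin = strings.NewReader("banana\napple\ncherry\n")

	out, err := cmd.Output() // reads the child's stdout through another pipe
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // apple, banana, cherry
}
```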
Running software on software (in this case, an operating system) provides a lot of functionality that reduces the amount of code we need to write to take advantage of the system we are running on, and coordinate with other processes running on the same system.
Distributed Processes vs Distributed Computing
Systems like Google’s Borg, Kubernetes (which was inspired by Borg), and Apache Mesos allow us to decouple the idea of running processes from the machines on which they run. This has given birth to what the industry has decided to call cloud native computing, ushering in an era where the abstraction layer on which we run software is no longer a single OS on a physical or virtual machine, but a container orchestration system which manages processes across many machines. The execution platform is the cloud itself, if you will.
While cloud native infrastructure has improved operations dramatically, writing cloud native applications still takes a lot of effort. The fundamental operational needs are handled: pulling images, running containers, service discovery, role-based access control (RBAC), networking, and so on, but developers still need to write distributed systems to accomplish business goals on top of container orchestration.
Rather than a single CPU executing a stream of low-level instructions, a distributed system's processing model is a higher-order unit: a loosely coupled collection of machines executes multiple programs that cooperate to achieve a business objective. The smallest unit of work we distribute is a process (think Docker container) or a group of processes (think Kubernetes pod).
Developers typically employ networked microservices with HTTP or RPC APIs, use queues, distributed data stores, replicated state machines, conflict-free replicated data types, leader election, or any number of other approaches to implement algorithms across the cloud to make progress on business logic. Such systems necessarily have a significant amount of complexity that exists only to handle things like consistency and durability in a distributed environment.
The key takeaway here is that just having a container orchestrator does not magically give you the ability to run a sequence of business logic across a variety of hardware. Managing distributed processes is necessary but not sufficient. Software currently must be coordinated as a distributed system which makes use of these distributed processes to achieve distributed computing.
What Does Any of This Have to Do with Temporal?
We’ve talked about the organization and execution of programs on hardware, the role of the OS in removing the complexity of running software on hardware, and the complexity of building application-specific distributed systems.
Temporal takes these single-machine focused concepts and elevates them to the cloud level.
Rather than handling code execution through context switching on hardware running a stream of instructions organized by a compiler, Temporal manages running your program as a series of higher-than-basic-block level tasks. We typically refer to these series of tasks as “Workflows,” but any program could be a “Workflow” because it’s also ultimately just a graph of instructions.
There are some key differences between a control flow graph of basic blocks and a Temporal Workflow as a graph of tasks. First, tasks are larger collections of instructions and they include control flow. This is important for performance and recoverability reasons as Temporal durably saves the result of every task (which can get expensive even in the best case scenario). Second, there are two types of tasks: deterministic (Workflow) and non-deterministic (Activity). The advantage of making this distinction is to enable restoration of program state. Deterministic operations can be re-executed in order to recover from failure, but non-deterministic operations need to either be retried or fail the execution of the Workflow.
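To make the two kinds of tasks concrete, here's a minimal sketch using the Temporal Go SDK. OrderWorkflow and ChargeCustomer are hypothetical names invented for this example: the Workflow is the deterministic part that Temporal can replay, and the Activity is the non-deterministic part whose result gets durably recorded (and retried on failure) rather than re-executed.

```go
package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// ChargeCustomer is an Activity: a non-deterministic task (it talks to the
// outside world), so Temporal records its result in the Workflow's history
// and retries it on failure rather than re-running it during replay.
func ChargeCustomer(ctx context.Context, orderID string) (string, error) {
	// ... call a payment provider here ...
	return "receipt-for-" + orderID, nil
}

// OrderWorkflow is deterministic code: if the Worker running it dies,
// Temporal replays it elsewhere and substitutes the recorded Activity results.
func OrderWorkflow(ctx workflow.Context, orderID string) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	var receipt string
	// Each ExecuteActivity call is a task whose result is durably saved.
	if err := workflow.ExecuteActivity(ctx, ChargeCustomer, orderID).Get(ctx, &receipt); err != nil {
		return "", err
	}
	return receipt, nil
}
```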
Really, this is just Temporal managing a higher-level graph where the nodes are a subset of the control flow graphs of basic blocks for the whole program. This is essentially the same kind of grouping of code that compilers do to optimize for a target machine but at a higher level. Due to the complexity of balancing performance, determinism, retries, and idempotency concerns, it is up to the programmer to decide how to break down tasks, rather than a compiler. This is probably the toughest part of using Temporal right now, but it’s way better than building your own distributed system.
The process scheduling that operating systems do for a single machine, Temporal does for an arbitrary collection of one or more machines and processes, regardless of where and how they run. The only requirement is to be able to communicate over gRPC with a Temporal Cluster. Instead of doing this at the single process or instruction level, Temporal handles executing the next task for a Workflow on any process on any machine that implements that Workflow.
When a Workflow executes, it’s just as if an entry point function were run in your program. Temporal schedules tasks for running Workflows on the available processes that implement that workflow (these are typically called Workers in Temporal parlance). Rather than having full control of CPUs and program contexts, Temporal inverts the problem and lets Workers that have the capacity to execute tasks ask for more work by polling. This is still very much scheduling programs on available resources though. And one Temporal server can manage many Workflows running on many Workers listening to many task queues across many namespaces, just as an OS can manage many processes running with various permissions for many users on a single machine.
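A minimal Worker sketch (assuming the hypothetical OrderWorkflow and ChargeCustomer from the earlier example live in the same package) looks something like this: the Worker connects to a Temporal Cluster over gRPC and polls a task queue for work when it has capacity.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to a Temporal Cluster over gRPC (defaults to localhost:7233).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// A Worker polls a task queue and asks for more work when it has capacity.
	w := worker.New(c, "order-task-queue", worker.Options{})
	w.RegisterWorkflow(OrderWorkflow)  // hypothetical Workflow from the earlier sketch
	w.RegisterActivity(ChargeCustomer) // hypothetical Activity from the earlier sketch

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

Run as many copies of this process as you like, wherever you like; Temporal hands the next task for a Workflow to whichever Worker asks for it.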
But Temporal isn’t really an OS for the cloud by itself. With container orchestration, processes can be run on whatever hardware is available, and Temporal uses those processes to make progress on a run of a program across one or more machines. Any task at any point of any given Workflow run can be executed by whatever container is available. The combination of cloud-native computing and durable execution gives us the beginnings of something that looks like a cloud OS. There are still plenty of missing parts, but the reality is that just by deploying some replicated containers we can start running single programs all over the cloud, rather than on whatever single machine a container happens to land on.
And this next-level-of-abstraction-sort-of-OS can’t be the same as an OS that runs a single physical or virtual machine (Plan 9 demonstrates why that doesn’t work even though it’s amazing). A different kind of OS is needed for a distributed machine, and both container orchestration and durable execution are core pieces of that puzzle.
Stop Building Distributed Systems and Start Building Distributed Applications
So Temporal makes programs designed to execute on the cloud easier to write and operate. We don’t have to break our logic down into distinct services, and we can just execute tasks instead of calling out to another service. We also don’t need to worry about saving the state of different running operations as they make their way through various microservices: Temporal takes care of that for us. But why does this mean that we can avoid building application-specific distributed systems?
The reality is that when code executes on Temporal (when Workflows run), those Workflows execute as a distributed system. Temporal makes strong guarantees about program state and communication between functions. You gain all of the advantages of running your program as if it were a distributed system just by being able to trust that your program state is the true state of your execution, despite the fact that it might have happened on some arbitrary set of processes running on some arbitrary set of machines.
It’s not a good use of Temporal, but you could imagine a distributed key-value store that is just a Workflow with an in-memory map and tasks for CRUD operations. The state of that map is durably backed by Temporal just as much as if each operation were implemented via Paxos, and the work can be carried out by any number of copies of that application running on any number of machines.
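A rough sketch of that toy (again, not a recommended use of Temporal, and swapping the CRUD "tasks" for the Go SDK's Signal and Query handlers, which are better suited to reads and writes from outside the Workflow) might look like this: writes arrive as Signals and reads are served by a Query handler, so the map's contents are part of the durable history.

```go
package app

import "go.temporal.io/sdk/workflow"

// KVWorkflow is a toy durable key-value store: the in-memory map survives
// process and machine failure because every write is a recorded Signal.
func KVWorkflow(ctx workflow.Context) error {
	store := map[string]string{}

	// Reads are served by a Query handler against the current map.
	if err := workflow.SetQueryHandler(ctx, "get", func(key string) (string, error) {
		return store[key], nil
	}); err != nil {
		return err
	}

	// Writes arrive as "put" Signals carrying [key, value] pairs.
	puts := workflow.GetSignalChannel(ctx, "put")
	for {
		var kv [2]string
		puts.Receive(ctx, &kv)
		store[kv[0]] = kv[1]
		// A real long-running Workflow would periodically Continue-As-New
		// to keep its history bounded.
	}
}
```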
Parting Thoughts
Hopefully this has been an interesting thought experiment. We’ve played with so many different ways to communicate what exactly Temporal is over the years, but for me the parallels with computer architecture have always resonated strongest since my background is in hardware.
As we continue to iterate on Temporal itself and on exactly how to help developers understand its value, I want to leave you with this thought: what if the entire internet were just one big computer? What if the code we write could just run in the cloud without anyone needing to figure out how or where to put anything (or maybe regardless of how or where we run something)? What if what Temporal enables really is a new kind of computing machine where the machine itself is a distributed system?
And what are you most excited about building to run on a distributed machine?