Time-Travel Debugging Production Code

In this post, I’ll give an overview of time-travel debugging (what it is, its history, how it’s implemented) and show how it relates to debugging your production code.

Normally, when we use debuggers, we set a breakpoint on a line of code, we run our code, execution pauses on our breakpoint, we look at values of variables and maybe the call stack, and then we manually step forward through our code's execution. In time-travel debugging, also known as reverse debugging, we can step backward as well as forward. This is powerful because debugging is an exercise in figuring out what happened: traditional debuggers are good at telling you what your program is doing right now, whereas time-travel debuggers let you see what happened. You can wind back to any line of code that executed and see the full program state at any point in your program’s history.

History and current state

It all started with Smalltalk-76, developed in 1976 at Xerox PARC. (Everything started at PARC 😄.) It had the ability to retrospectively inspect checkpointed places in execution. Around 1980, MIT added a "retrograde motion" command to its DDT debugger, which gave a limited ability to move backward through execution. In a 1995 paper, MIT researchers released ZStep 95, the first true reverse debugger, which recorded all operations as they were performed and supported stepping backward, reverting the system to the previous state. However, it was a research tool and not widely adopted outside academia.

ODB, the Omniscient Debugger, was a Java reverse debugger that was introduced in 2003, marking the first instance of time-travel debugging in a widely used programming language. GDB (perhaps the most well-known command-line debugger, used mostly with C/C++) added it in 2009.

Now, time-travel debugging is available for many languages, platforms, and IDEs, including:

Replay for JavaScript in Chrome, Firefox, and Node, and Wallaby for tests in Node
WinDbg for Windows applications
rr for C, C++, Rust, Go, and others on Linux
Undo for C, C++, Java, Kotlin, Rust, and Go on Linux
Various extensions (often rr- or Undo-based) for Visual Studio, VS Code, JetBrains IDEs, Emacs, etc.

Implementation techniques

There are three main approaches to implementing time-travel debugging:

Record & Replay: Record all non-deterministic inputs to a program during its execution. Then, during the debug phase, the program can be deterministically replayed using the recorded inputs in order to reconstruct any prior state.
Snapshotting: Periodically take snapshots of a program's entire state. During debugging, the program can be rolled back to these saved states. This method can be memory-intensive because it involves storing the entire state of the program at multiple points in time.
Instrumentation: Add extra code to the program that logs changes in its state. This extra code allows the debugger to step the program backwards by reverting changes. However, this approach can significantly slow down the program's execution.

rr uses the first (the rr name stands for Record and Replay), as does Replay. WinDbg uses the first two, and Undo uses all three (see how it differs from rr).

Time-traveling in production

Traditionally, running a debugger in prod doesn't make much sense. Sure, we could SSH into a prod machine and start the process handling requests with a debugger and a breakpoint, but once we hit the breakpoint, we're delaying responses to all current requests and unable to respond to new requests. Also, debugging non-trivial issues is an iterative process: we get a clue, we keep looking and find more clues; discovery of each clue is typically rerunning the program and reproducing the failure. So, instead of debugging in production, what we do is replicate on our dev machine whatever issue we're investigating and use a debugger locally (or, more often, add log statements 😄), and re-run as many times as required to figure it out. Replicating takes time (and in some cases a lot of time, and in some cases infinite time), so it would be really useful if we didn't have to.

While running traditional debuggers doesn't make sense, time-travel debuggers can record a process execution on one machine and replay it on another machine. So we can record (or snapshot or instrument) production and replay it on our dev machine for debugging (depending on the tool, our machine may need to have the same CPU instruction set as prod). However, the recording step generally doesn't make sense to use in prod given the high amount of overhead—if we set up recording and then have to use ten times as many servers to handle the same load, whoever pays our AWS bill will not be happy 😁.

But there are a couple scenarios in which it does make sense:

Undo only slows down execution 2–5x, so while we don't want to leave it on just in case, we can turn it on temporarily on a subset of prod processes for hard-to-repro bugs until we have captured the bug happening, and then we turn it off.
When we're already recording the execution of a program in the normal course of operation.

The rest of this post is about #2, which is a way of running programs called durable execution.

Durable execution

What's that?

First, a brief backstory. After Amazon (one of the first large adopters of microservices) decided that using message queues to communicate between services was not the way to go (hear the story first-hand here), they started using orchestration. And once they realized defining orchestration logic in YAML/JSON wasn't a good developer experience, they created AWS Simple Workfow Service to define logic in code. This technique of backing code by an orchestration engine is called durable execution, and it spread to Azure Durable Functions, Cadence (used at Uber for > 1,000 services), and Temporal (used by Stripe, Netflix, Datadog, Snap, Coinbase, and many more).

Durable execution runs code durably—recording each step in a database, so that when anything fails, it can be retried from the same step. The machine running the function can even lose power before it gets to line 10, and another process is guaranteed to pick up executing at line 10, with all variables and threads intact.¹ It does this with a form of record & replay: all input from the outside is recorded, so when the second process picks up the partially-executed function, it can replay the code (in a side-effect–free manner) with the recorded input in order to get the code into the right state by line 10.

Durable execution's flavor of record & replay doesn't use high-overhead methods like software JIT binary translation, snapshotting, or instrumentation. It also doesn't require special hardware. It does require one constraint: durable code must be deterministic (i.e., given the same input, it must take the same code path). So it can't do things that might have different results at different times, like use the network or disk. However, it can call other functions that are run normally ("volatile functions", as we like to call them 😄), and while each step of those functions isn't persisted, the functions are automatically retried on transient failures (like a service being down).

Only the steps that require interacting with the outside world (like calling a volatile function, or calling sleep('30 days'), which stores a timer in the database) are persisted. Their results are also persisted, so that when you replay the durable function that died on line 10, if it previously called the volatile function on line 5 that returned "foo", during replay, "foo" will immediately be returned (instead of the volatile function getting called again). While yes, it adds latency to be saving things to the database, Temporal supports extremely high throughput (tested up to a million recorded steps per second). And in addition to function recoverability and automatic retries, it comes with many more benefits, including extraordinary visibility into and debuggability of production.

Debugging prod

With durable execution, we can read through the steps that every single durable function took in production. We can also download the execution’s history, checkout the version of the code that's running in prod, and pass the file to a replayer (Temporal has runtimes for Go, Java, JavaScript, Python, .NET, and PHP) so we can see in a debugger exactly what the code did during that production function execution. Read this post or watch this video to see an example in VS Code.²

Being able to debug any past production code is a huge step up from the other option (finding a bug, trying to repro locally, failing, turning on Undo recording in prod until it happens again, turning it off, then debugging locally). It's also a (sometimes necessary) step up from distributed tracing.

💬 Discuss on Hacker News, Reddit, Twitter, or LinkedIn.

I hope you found this post interesting! If you'd like to learn more about durable execution, I recommend reading:

and watching:

Thanks to Greg Law, Jason Laster, Chad Retz, and Fitz for reviewing drafts of this post.