Higher throughput and lower latency: Temporal Cloud’s custom persistence layer

Latency drives decisions when evaluating managed services like Temporal Cloud. Many developers assume application latency will increase when they migrate from a self-hosted Temporal Cluster to Temporal Cloud. We see the opposite: Temporal Cloud provides better performance and lower latency than most self-hosted clusters.

These performance improvements stem from two things: how our team manages and scales Temporal Cloud, and Temporal Cloud’s custom persistence layer. This architecture lets our managed service handle high throughput with lower, more stable request latencies.

This post provides an overview of Temporal Cloud’s custom persistence layer. For more details, watch the webinar recording on this topic.

Why we built a custom persistence layer

Temporal Cloud is multi-tenant: multiple Namespaces share the same compute and persistence layer. This is cost-effective, because customers pay for what they consume instead of for entire sets of hardware. Multi-tenancy also means extra capacity is available to all customers during traffic spikes. The trade-off is the noisy neighbor problem: high-traffic tenants can consume excess resources and slow performance for other tenants.

The noisy neighbor problem is especially difficult to address in the database. Because databases are stateful and take longer to scale, capacity cannot be added quickly to handle spikes in load. Temporal’s workload is also write-heavy: changes in execution state are constantly written to the persistence layer, which is what lets Workflows execute durably even when failures occur. The database for Temporal Cloud must therefore sustain high throughput at low latency for many customers, concurrently and fairly.

To address these challenges, our team built a custom persistence solution on top of Temporal Cloud’s existing architecture. Our design includes three pillars:

Better sharding
Write-ahead log
Tiered storage of Workflow Event History

Better sharding in Temporal Cloud

The first thing we did to improve the scalability of Temporal Cloud was to shard the persistent state and store it across multiple databases. We can add databases dynamically and resize each one independently, depending on the needs of different Namespaces. This architecture lets Temporal Cloud handle high load day to day and scale Namespaces up for high-traffic events such as Black Friday.
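To make the idea concrete, here is a minimal sketch of shard routing in Python. It is not Temporal Cloud's implementation: the ShardRouter class, the fixed number of logical shards, and the database names are all hypothetical, chosen only to show how persistent state can be spread over several databases and how a new database can take over part of the load.

```python
import hashlib


class ShardRouter:
    """Hypothetical router: fixed logical shards mapped onto a changeable set of databases."""

    def __init__(self, num_shards: int, databases: list[str]) -> None:
        if not databases:
            raise ValueError("at least one database is required")
        self.num_shards = num_shards
        # Spread the logical shards round-robin over the initial databases.
        self.shard_to_db = {s: databases[s % len(databases)] for s in range(num_shards)}

    def shard_for(self, namespace: str, workflow_id: str) -> int:
        # A stable hash, so the same Workflow always lands on the same shard.
        key = f"{namespace}/{workflow_id}".encode()
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % self.num_shards

    def database_for(self, namespace: str, workflow_id: str) -> str:
        return self.shard_to_db[self.shard_for(namespace, workflow_id)]

    def add_database(self, name: str, shards_to_move: list[int]) -> None:
        # Only the listed shards change owner; every other shard stays put.
        for shard in shards_to_move:
            self.shard_to_db[shard] = name


router = ShardRouter(num_shards=8, databases=["db-a", "db-b"])
before = dict(router.shard_to_db)

# Add a third database and hand it two shards (for example, from a busy Namespace).
router.add_database("db-c", shards_to_move=[0, 1])
after = router.shard_to_db

moved = [s for s in range(router.num_shards) if before[s] != after[s]]
print(moved)  # [0, 1]
```

Because the shard-to-database mapping is a table rather than a formula, adding a database moves only the shards you choose. A real system would also have to copy the data for those shards before switching ownership, which this sketch leaves out.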

Write-ahead log in Temporal Cloud

Temporal’s write-heavy nature means every event must be written to the database. A high write rate can increase latency and requires a larger database to sustain it. We addressed this by building a write-ahead log (WAL).

Our write-ahead log stores writes in an append-only log. This allows the server to accumulate multiple updates in the WAL before writing a single aggregated update to the database. If a failure occurs before updates are written to the database, the updates can be read and recovered from the log.
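The pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Temporal Cloud WAL: the WriteAheadLog and BufferedStore classes are hypothetical, a local file stands in for the log, a dict stands in for the database, and "aggregation" is simplified to keeping the latest value per key.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log backed by a local file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, key: str, value: str) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record durable before acknowledging

    def replay(self) -> list[dict]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def truncate(self) -> None:
        open(self.path, "w", encoding="utf-8").close()


class BufferedStore:
    """Accumulates updates in the WAL, then writes one aggregated update to the database."""

    def __init__(self, wal: WriteAheadLog, database: dict) -> None:
        self.wal = wal
        self.database = database  # a dict stands in for the real database
        self.db_writes = 0
        self.pending: dict[str, str] = {}
        # Recovery: rebuild any updates that were logged but never flushed.
        for record in wal.replay():
            self.pending[record["key"]] = record["value"]

    def update(self, key: str, value: str) -> None:
        self.wal.append(key, value)  # log first
        self.pending[key] = value    # later values replace earlier ones (an assumption)

    def flush(self) -> None:
        if not self.pending:
            return
        self.database.update(self.pending)  # one database write for many updates
        self.db_writes += 1
        self.pending.clear()
        self.wal.truncate()  # safe only after the database write succeeded


with tempfile.TemporaryDirectory() as tmp:
    wal = WriteAheadLog(os.path.join(tmp, "wal.log"))
    database: dict[str, str] = {}

    store = BufferedStore(wal, database)
    store.update("wf-1", "started")
    store.update("wf-1", "activity scheduled")
    store.update("wf-1", "activity completed")

    # Simulate a crash before flush: a new process reads the same log.
    recovered = BufferedStore(wal, database)
    recovered.flush()
    result = (database["wf-1"], recovered.db_writes)

print(result)  # ('activity completed', 1)
```

Three logged updates reach the database as a single write, and the update survives the simulated crash because it was in the log. A production log would also need replication, ordering guarantees, and handling for partially written records.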

After implementing the write-ahead log, we saw an immediate impact: both latency and the size of the databases in Temporal Cloud dropped significantly.

Tiered storage of Workflow Event History

The Temporal persistence layer stores both the ongoing writes for every event in a Workflow and the Workflow Event History itself. The Event History is fundamental to the recovery and replay processes in Temporal. It also lets developers debug past executions or export data for compliance and further analysis. However, Event Histories consume storage in the database and can slow it down.

To address this, we built a system that moves Workflow Event History to an object store when the corresponding Workflow Execution completes. Customers can still access Event Histories as normal, while on the backend we reduce demands on the database, improving efficiency and lowering the latency of starting and running Workflows.
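A minimal Python sketch of this tiering follows. It is an assumption-laden illustration rather than the actual system: the TieredHistoryStore class and its method names are hypothetical, and plain dicts stand in for both the database and the object store.

```python
class TieredHistoryStore:
    """Hypothetical two-tier store for Workflow Event Histories."""

    def __init__(self) -> None:
        self.database: dict[str, list[str]] = {}      # hot tier: running Workflows
        self.object_store: dict[str, list[str]] = {}  # cold tier: completed Workflows

    def append_event(self, workflow_id: str, event: str) -> None:
        if workflow_id in self.object_store:
            raise ValueError(f"{workflow_id} has already completed")
        self.database.setdefault(workflow_id, []).append(event)

    def complete(self, workflow_id: str) -> None:
        history = self.database[workflow_id]
        # Copy to the cold tier first, then delete from the hot tier,
        # so the history is never missing from both.
        self.object_store[workflow_id] = list(history)
        del self.database[workflow_id]

    def get_history(self, workflow_id: str) -> list[str]:
        # Callers use one method; the tier lookup is hidden from them.
        if workflow_id in self.database:
            return list(self.database[workflow_id])
        return list(self.object_store[workflow_id])


store = TieredHistoryStore()
store.append_event("wf-1", "WorkflowExecutionStarted")
store.append_event("wf-1", "WorkflowExecutionCompleted")
store.append_event("wf-2", "WorkflowExecutionStarted")

store.complete("wf-1")

print(store.get_history("wf-1"))  # ['WorkflowExecutionStarted', 'WorkflowExecutionCompleted']
print(sorted(store.database))     # ['wf-2']
```

After completion, the database holds only the history of the Workflow that is still running, while reads for the completed one return the same events as before.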

The result: higher scalability and lower latency

We’ve helped many customers move from self-hosted Temporal Clusters to Temporal Cloud, and we regularly see latency decrease as a result of the custom persistence layer. If you’re running at large scale, we recommend evaluating Temporal Cloud for the best possible performance. At any scale, Temporal Cloud saves you time and resources and gives you a more reliable service.

For more details, watch our webinar recording: Custom Persistence Layer of Temporal Cloud. You can learn more about Temporal Cloud’s latency Service Level Objective (SLO) here.

Also, check out this talk from last year’s Replay Conference: What's cloud got to do with it? A novel persistence layer for Temporal Cloud.

This post is part of a series about Temporal Cloud. Check out the other posts below: