High availability and disaster recovery with Temporal Cloud

Temporal makes your applications more reliable. But from an operational perspective, any complex software is hard to run reliably at scale. In this post, we’ll give a brief overview of the challenges of self-hosting Temporal at scale and the ways in which Temporal Cloud provides high availability. For more details, you can watch our webinar recording on this topic.

Challenges of maintaining high availability when self-hosting Temporal

The core challenge of achieving high availability with Temporal is that the Service is composed of multiple independently scalable components. You must tune each one and keep it available:

A database, typically Cassandra or Postgres, which is usually sharded and deployed in a highly available way, preferably across multiple availability zones.
Four independent services that make up the Temporal Server (Frontend, History, Matching, and Worker). These services must be resourced properly so there are no bottlenecks in the critical path of serving requests.

As with any distributed system, failures are inevitable, and you need to understand how to operate under different failure conditions to keep the service stable and available at all times. Some failures are relatively easy to deal with (a machine going down), while others are subtle and require careful attention (a network partition).

Managing each of these services at smaller scales is straightforward. Running them at scale in production, however, takes deep operational expertise. That’s not to say it’s impossible: many developers successfully self-host Temporal. But they may have difficulty meeting high availability SLAs, and they often spend significant time and resources operating Temporal. For mission-critical applications and high-scale use cases, we always recommend evaluating Temporal Cloud.

High availability with Temporal Cloud

With Temporal Cloud, our team delivers Temporal-as-a-service. We properly tune the supporting database and services for your load, and ensure they’re highly available. Because our team has deep Temporal expertise and manages thousands of namespaces, we can provide better service reliability, higher availability, and lower latency, and we keep a larger buffer of resources in reserve for unexpected events.

As a Temporal Cloud customer, you're only responsible for deploying and managing your Workers and Workflows in your applications, and connecting your application to your managed Temporal Service.

Here are the details of the high availability guarantees Temporal Cloud provides:

Fault tolerance - Temporal Cloud namespaces are deployed across three availability zones (AZs) by default, so the failure of a single AZ is a non-event for your namespace.
99.99% service level objective (SLO) - Temporal Cloud regularly delivers four nines of availability, measured as the availability of the service endpoint.
99.9% service level agreement (Contractual SLA) - The Temporal Cloud Contractual SLA is based on the average number of gRPC service errors over five-minute intervals for the month. If we do not meet this commitment, we will issue Cloud credits based on the outage.
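
To put those percentages in perspective, here is a minimal Python sketch that converts an availability target into a downtime budget and averages error rates across five-minute intervals. The function names and the plain-average formula are illustrative assumptions, not the contractual definition; refer to the SLA itself for how errors are actually measured. Over a 30-day month, 99.99% leaves roughly 4.3 minutes of downtime and 99.9% roughly 43 minutes.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float,
                    period: timedelta = timedelta(days=30)) -> timedelta:
    """Maximum cumulative downtime an availability target allows over a period."""
    return period * ((100 - availability_pct) / 100)


def measured_availability(interval_error_rates: list[float]) -> float:
    """Availability percentage from per-interval error rates (0.0 to 1.0).

    Assumption: each entry is the fraction of failed gRPC requests in one
    five-minute interval, and the monthly figure is their plain average.
    """
    if not interval_error_rates:
        raise ValueError("need at least one interval")
    average_error_rate = sum(interval_error_rates) / len(interval_error_rates)
    return 100 * (1 - average_error_rate)


if __name__ == "__main__":
    for target in (99.99, 99.9):
        print(f"{target}% over 30 days allows {downtime_budget(target)} of downtime")

    # A 30-day month has 8,640 five-minute intervals; suppose 6 of them
    # (30 minutes in total) saw half of all requests fail.
    rates = [0.5] * 6 + [0.0] * 8634
    print(f"measured availability: {measured_availability(rates):.4f}%")
```

In this hypothetical month the averaged figure stays above 99.9% but falls short of 99.99%, which shows how much tighter the four-nines objective is than the three-nines commitment.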

For disaster recovery, Temporal Cloud provides the following:

RTO/RPO for availability zone failures: the RTO and RPO are zero for availability zone failures, because Temporal Cloud is replicated across multiple availability zones.
RTO/RPO for region failures: the RTO and RPO are at most eight hours, which corresponds to two backup periods for Temporal Cloud.
COMING SOON: Multi-Region Namespaces: currently in pre-release, this capability will provide failover to mitigate service outages caused by regional failures. It will also extend our contractual SLA to 99.99%. With Multi-Region Namespaces, your cloud service will be defined by a primary cloud region and a standby cloud region. History events are shipped to the standby region automatically and asynchronously. If the primary region fails, you can manually switch traffic to the standby region without disrupting ongoing Workflows. We recommend this capability if disruption of your Workflows would cause loss of revenue, a poor end-user experience, or issues with regulatory compliance.
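
As a rough aid for reasoning about these numbers, the following Python sketch encodes the recovery bounds listed above and checks them against what a workload can tolerate. It is an illustration only, not an official tool: the dictionary, the `tolerates` helper, and the scenario names are hypothetical. The four-hour backup period is derived purely from the statement that eight hours equals two backup periods.

```python
from datetime import timedelta

# Worst-case RTO/RPO per failure scenario, taken from the figures above.
# The scenario names and this layout are hypothetical.
RECOVERY_BOUNDS = {
    "availability_zone_failure": timedelta(0),
    "region_failure": timedelta(hours=8),
}

# Eight hours described as "two backup periods" implies a four-hour period.
BACKUP_PERIOD = RECOVERY_BOUNDS["region_failure"] / 2


def tolerates(scenario: str, max_acceptable: timedelta) -> bool:
    """True if the scenario's worst-case RTO/RPO fits what the workload accepts."""
    return RECOVERY_BOUNDS[scenario] <= max_acceptable


if __name__ == "__main__":
    print(f"implied backup period: {BACKUP_PERIOD}")
    # A workload that cannot accept more than one hour of loss or downtime.
    budget = timedelta(hours=1)
    for scenario in RECOVERY_BOUNDS:
        print(f"{scenario}: within budget = {tolerates(scenario, budget)}")
```

A workload whose tolerance is shorter than the regional bound is the kind of case the Multi-Region Namespaces recommendation is aimed at.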

This is just a brief overview of high availability in Temporal Cloud. For more details, we recommend watching the webinar: Availability and Disaster Recovery in Temporal Cloud.

This post is part of a series about Temporal Cloud. Check out the other posts below: