Control an outage by localizing the failures

Outages are inevitable, but we should design our architecture so that a single component going down does not lead to a complete outage.

What happened with GitHub?

GitHub saw a large number of failures in its Actions service, which delayed the processing of queued jobs. The root cause was an infrastructure error in the SQL layer.

Insights about their architecture

The outage reveals a couple of things about how GitHub Actions is built.

Synchronous dependency

Although GitHub Actions looks like a single feature, internally it consists of multiple microservices. Some of these have a synchronous dependency on the database, so when the database had a hiccup, the entire Actions feature was affected.
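To make the coupling concrete, here is a minimal sketch of a synchronous dependency. The names (`Database`, `queue_job`, `DatabaseUnavailable`) are hypothetical and not taken from GitHub's systems; the only point is that the caller cannot complete its request unless the database answers.

```python
class DatabaseUnavailable(Exception):
    """Raised by the hypothetical database client when the DB cannot answer."""


class Database:
    """Stand-in for a real database client (an assumption for illustration)."""

    def __init__(self):
        self.healthy = True
        self.rows = []

    def insert(self, row):
        if not self.healthy:
            raise DatabaseUnavailable("SQL layer is not responding")
        self.rows.append(row)


def queue_job(db, job_id):
    # Synchronous dependency: the request succeeds only if the DB write succeeds.
    # Any database error propagates straight to the caller.
    db.insert({"job": job_id})
    return f"job {job_id} queued"
```

While `db.healthy` is `True`, `queue_job` returns normally. Once it flips to `False`, every call raises, and every service that calls `queue_job` in its request path fails along with it.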

Zero trust communication

The service most affected in this outage handled authentication for communication between services. But why would services need authentication at all? After all, they are all internal to the infrastructure.

Microservices talk to each other, and that communication needs to be protected with authentication so that a rogue engineer or service cannot abuse the system in any capacity. Only authenticated and authorized services are allowed to take action.
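One way to picture this is a request check that first authenticates the caller and then authorizes the action. The sketch below uses HMAC signatures from Python's standard library; the service names, keys, and permission table are made up for illustration, and nothing here is meant to describe how GitHub actually implements it.

```python
import hashlib
import hmac

# Hypothetical shared secrets, one per calling service. A real deployment would
# load these from a secret manager, or use mTLS or signed tokens instead.
SERVICE_KEYS = {
    "job-scheduler": b"scheduler-secret",
    "log-collector": b"collector-secret",
}

# Hypothetical authorization table: which service may perform which action.
PERMISSIONS = {
    "job-scheduler": {"start_job", "cancel_job"},
    "log-collector": {"read_logs"},
}


def sign(service, action, key):
    """Return a hex HMAC-SHA256 signature over the caller name and action."""
    message = f"{service}:{action}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_allowed(service, action, signature):
    """Authenticate the caller, then check that it is authorized for the action."""
    key = SERVICE_KEYS.get(service)
    if key is None:
        return False  # unknown caller
    expected = sign(service, action, key)
    if not hmac.compare_digest(expected, signature):
        return False  # authentication failed
    return action in PERMISSIONS.get(service, set())
```

A caller that signs `start_job` with the `job-scheduler` key is accepted, while `log-collector` is rejected for the same action even with a valid signature, because it is authenticated but not authorized. The sketch leaves out replay protection (timestamps or nonces), which a production scheme would need.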

What about automatic failover?

Given that the outage happened on the database layer, why did the database not recover automatically? Automatic failover is a standard configuration, and it would have simply promoted a replica to be the new master.

Although it is a common config, during this outage the metrics did not show any issue with the database, and hence the auto-failover was never triggered. It took a long time to even understand the root cause and then start mitigation.

Long-Term Fixes

Update the automation scripts

The automation that reads the telemetry and decides whether to fail over needs to be updated so that failures like this one are detected and acted upon.
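As a rough illustration of what such an update could look like, the sketch below adds an active signal (a synthetic test query) next to a passive one (a host heartbeat), so that a database that looks healthy on its metrics but fails real queries can still trigger a failover. The field names and thresholds are assumptions made for this example, not GitHub's actual telemetry or fix.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    """Hypothetical signals collected for the primary database."""
    seconds_since_heartbeat: float  # passive: age of the last host heartbeat
    failed_probes: int              # active: consecutive failed test queries


def should_fail_over(t, max_heartbeat_age_s=30.0, max_failed_probes=3):
    """Decide whether to promote a replica.

    Thresholds are illustrative assumptions, not recommended values.
    """
    # Passive metrics can look healthy while queries are actually failing,
    # so the active probe result is checked independently of them.
    if t.failed_probes >= max_failed_probes:
        return True
    return t.seconds_since_heartbeat > max_heartbeat_age_s
```

With these defaults, `Telemetry(seconds_since_heartbeat=1.0, failed_probes=3)` triggers a failover even though the heartbeat is fresh, which is the case the passive metrics alone would miss.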

Localizing failures

An important long-term change that needs to be driven is to localize failures. In this outage, we saw how a hiccup in one database or service caused downtime for all dependent microservices. This shouldn't have happened, as microservices are supposed to solve this very problem.

A good way to keep the blast radius of an outage minimal is to ensure that failures are localized: when a service is down, only that service is affected, while everything else keeps functioning.

A common approach to achieving this loose coupling is to power inter-service communication through an asynchronous medium, such as a message queue, instead of synchronous API calls. Then, if something breaks, we can fix it and resume processing the pending messages.
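The idea can be sketched with an in-memory queue standing in for a durable message broker. `Broker` and `drain` are hypothetical names for this example; a real system would use an actual broker with persistence and acknowledgements.

```python
from collections import deque


class Broker:
    """In-memory stand-in for a durable message broker (an assumption)."""

    def __init__(self):
        self.messages = deque()

    def publish(self, message):
        # Producers only append; they never wait for the consumer.
        self.messages.append(message)


def drain(broker, handler):
    """Process messages in order; stop at the first failure and keep the rest."""
    processed = 0
    while broker.messages:
        message = broker.messages[0]   # peek, do not remove yet
        try:
            handler(message)
        except Exception:
            break                      # consumer is down; retry on the next run
        broker.messages.popleft()      # acknowledge only after success
        processed += 1
    return processed
```

Producers keep publishing while the consumer's dependency is down. The messages simply wait in the queue, and the next call to `drain` after the fix picks up where processing stopped, so the failure stays local to the consumer.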

Arpit Bhayani
