Dissecting GitHub Outage - Repository Creation Failed


Imagine trying to create a repository on GitHub and it simply does not work. This happened in April 2021, when GitHub users were unable to create new repositories.

The root cause of this outage was something that seems unrelated: Secret Scanning. That is what makes this outage so interesting to dissect.

What is Secret Scanning?

Our API servers need to talk to peripheral components like Databases, Cache, SaaS services, etc. This communication involves some sort of authentication and authorization through auth tokens, passwords, or secret keys.

Developers tend to commit secrets in settings or constants files and push them to GitHub. What if the repository content gets leaked? What if GitHub itself has a data breach and the attacker gets access to the private repositories?

If secrets like AWS access keys, auth tokens, and DB passwords are leaked, an attacker can dump the data and demand a ransom. They may even abuse the infrastructure to perform illegal activities or mine cryptocurrencies.

Hence, GitHub periodically runs a job that checks all the repositories for any secrets that are committed and warns the user about it.
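
To make the idea concrete, here is a minimal sketch of what a secret scan over file content could look like. The pattern names, the regular expressions, and the `scan_text` helper are assumptions made for illustration; they are not GitHub's actual rules or implementation.

```python
import re

# Hypothetical patterns for illustration only; not GitHub's actual rules.
SECRET_PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted value assigned to something called "password".
    "hardcoded_password": re.compile(
        r"\bpassword\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name) for every suspected secret."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings
```

A real scanner would likely cover many more token formats and walk the commit history, but the core loop of matching known secret shapes against content stays the same.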

Repository Creation Flow

When a repository is created, an entry is made in the Secret Scanning table, which is then used by a job that scans for potential secrets and notifies the owner.

What led to the outage?

The GitHub team ran a data migration in which they moved the Secret Scanning table from a common database to its own cluster allowing it to scale independently.

The GitHub team was unaware that repository creation depended on this table. Hence, once the table was migrated to a different database, creating a new repository started failing, leading to this outage. It is interesting to see that even such mature products have blind spots.

How did GitHub mitigate it?

GitHub's mitigation strategy was to roll back the migration. It is unclear from the incident report what exactly they did, but here are a few speculations:

  1. they could have quickly recopied the table to the old database
  2. they could have whitelisted the new database so that the applications could connect to it
  3. the old table could have been left intact, so they could have simply renamed it and made it active again

Again, this is pure speculation, given that we do not have any insider information and the report does not specify the steps. It would have been fun to go through their actual mitigation steps; we could have learned so much. Nonetheless, we did learn a few interesting insights from this outage.

Arpit Bhayani

  • © Arpit Bhayani, 2022