Dissecting GitHub Outage: Downtime due to an Edge Case

1173 views Outage Dissections

An edge case took down GitHub 🤯

GitHub experienced an outage where their MySQL database went into a degraded state. Upon investigation, it was found out that the outage happened because of an edge case. So, how can an edge case take down a database?

What happened?

The outage happened because of an edge case which lead to the generation of an inefficient SQL query that was executed very frequently on the database. The database was thus put under a massive load which eventually made it crash leading to an outage.

Could retry have helped?

Automatic retries always help in recovering from a transient issue. During this outage, retries made things worse. Automatic retries added the load on the database that was already under stress.

Fictional Example

Now, we take a look at a fictional example where an edge case could potentially take down a DB.

Say, we have an API that returns the number of commits made by a user in the last n days. The way, this API could be implemented is to get the start_date as an integer through the query parameter, and the API server could then fire a SQL query like

SELECT count(id) FROM commits
WHERE user_id = 123 AND start_time > start_time

In order to fire the query, we convert the string start_time to an integer, create the query, and then fire it. In the regular case, we get the correct input and then compute the number of commits and respond.

But as an edge case, what if we do not get the query parameter or we get a non-integer value; then depending on the language at hand we may actually use the default integer value like 0 as our start_time.

There is a very high chance of this happening when we are using Golang which uses 0 as the default integer value. In such a case, the query that gets executed would be

SELECT count(id) FROM commits
WHERE user_id = 123 AND start_time > 0

The above query when executed iterates through all the rows of the table for a particular user, instead of the rows for the last 7 days; making it super inefficient and expensive. The above query would put a huge load on the database and a frequent invocation can actually take down the entire database.

Ways to avoid such situations

  1. Always sanitize the input before executing the query
  2. Put guard rails that prevent you from iterating the entire table. For example: putting LIMIT 1000 would have made you iterate over 1000 rows in the worst case.

Arpit Bhayani

Arpit's Newsletter

CS newsletter for the curious engineers

❤️ by 17000+ readers

If you like what you read subscribe you can always subscribe to my newsletter and get the post delivered straight to your inbox. I write essays on various engineering topics and share it through my weekly newsletter.

Other essays that you might like

So, the outage is mitigated, now what?

500 views 24 likes 2022-07-08

Outages happen and in such a tense situation, the main priority is to get the system back up, but is that it? Is everyth...

Control an outage by localizing the failures

444 views 31 likes 2022-07-06

Outages are inevitable; but we should design our architecture such that if a component is down, it should not lead to a ...

Dissecting GitHub Outage - Multiple Leaders in Zookeeper Cluster

1059 views 58 likes 2022-07-01

Distributed Systems are prone to problems that seem very obscure. GitHub had an outage because a set of nodes in the Zoo...

GitHub Outage - How databases are managed in production

1165 views 81 likes 2022-06-29

So, how are databases managed in production? When the master goes down, how a replica is chosen and promoted to be the n...

Be a better engineer

A set of courses designed to make you a better engineer and excel at your career; no-fluff, pure engineering.

System Design Masterclass

A masterclass that helps you become great at designing scalable, fault-tolerant, and highly available systems.

Enrolled by 700+ learners

Details →

Designing Microservices

A free course to help you understand Microservices and their high-level patterns in depth.

Enrolled by 17+ learners

Details →

GitHub Outage Dissections

A free course to help you learn core engineering from outages that happened at GitHub.

Enrolled by 67+ learners

Details →

Hash Table Internals

A free course to help you learn core engineering from outages that happened at GitHub.

Enrolled by 25+ learners

Details →

BitTorrent Internals

A free course to help you understand the algorithms and strategies that power P2P networks and BitTorrent.

Enrolled by 42+ learners

Details →

Topics I talk about

Being a passionate engineer, I love to talk about a wide range of topics, but these are my personal favourites.

Arpit's Newsletter read by 17000+ engineers

🔥 Thrice a week, in your inbox, an essay about system design, distributed systems, microservices, programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.

  • v12.4.4
  • © Arpit Bhayani, 2022

Powered by this tech stack.