Dissecting GitHub Outage - Downtime due to creating an Index


Imagine you created an index on a table and, instead of boosting performance, it led to an outage 🤦‍♂️ GitHub ran a migration to reverse an index and it led to a 60-minute outage.

Note: The example we have taken is pure speculation; the official incident report had minimal information about the outage. Still, the write-up will make you aware of the challenges that can come up in such situations.

Reverse an index

Reversing the order of an index helps when we have a multi-column index and a query sorts its columns in different directions; for example, DESC on date and ASC on user_id.

For such a query to be executed optimally, the database needs an index whose entries are physically stored by date in descending order and by user_id in ascending order.

MySQL stores index entries in ASC order by default. Hence, in our speculative example, GitHub had to run a migration to reverse the order of the index and gain the boost.
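To make the effect of index order concrete, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than on MySQL, purely so that it is runnable anywhere; the events table, the event_date column and the index names are hypothetical. It compares the plan for the same ORDER BY query under an all-ascending index and under a mixed-order one.

```python
import sqlite3

# Hypothetical query: newest events first, ties broken by user_id ascending.
QUERY = (
    "SELECT user_id, event_date FROM events "
    "ORDER BY event_date DESC, user_id ASC LIMIT 10"
)


def plan_for_index(index_ddl):
    """Build a fresh in-memory table with one index and return the plan for QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, event_date TEXT)"
    )
    conn.execute(index_ddl)
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable detail string.
    return [row[3] for row in rows]


def needs_sort(plan):
    """SQLite reports an extra sorting step as 'USE TEMP B-TREE ...'."""
    return any("TEMP B-TREE" in detail for detail in plan)


asc_plan = plan_for_index(
    "CREATE INDEX idx_events_asc ON events (event_date, user_id)"
)
mixed_plan = plan_for_index(
    "CREATE INDEX idx_events_mixed ON events (event_date DESC, user_id ASC)"
)

print("ascending index, extra sort needed:", needs_sort(asc_plan))
print("mixed-order index, extra sort needed:", needs_sort(mixed_plan))
```

With the all-ascending index the engine still has to sort before it can return rows in the requested order, while the mixed-order index already stores them that way. The exact plan text differs between engines and versions, so treat the string check as an illustration only.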

What could go wrong?

Creating the reversed index requires a Full Table Scan, which puts load on the database. Also, when we change the order of an index, we may overlook another query that runs more frequently and was optimal only with the old order.

The database does its best to create an optimal execution plan, and it might not use the reverse index we just created. We can address this with Index Hints like USE INDEX and FORCE INDEX, which steer the optimizer toward our index when it evaluates the query.
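In MySQL a hint is written right after the table name, for example `SELECT ... FROM events FORCE INDEX (idx_events_mixed) WHERE ...`. The runnable sketch below uses SQLite's closest equivalent, INDEXED BY, as a stand-in; it is not the MySQL syntax, and the table, columns and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, event_date TEXT);
    CREATE INDEX idx_events_user ON events (user_id);
    CREATE INDEX idx_events_mixed ON events (event_date DESC, user_id ASC);
    """
)

# {hint} is where the index hint goes; an empty hint leaves the choice to the planner.
BASE = (
    "SELECT user_id, event_date FROM events {hint} "
    "WHERE user_id = 42 AND event_date >= '2022-01-01'"
)


def plan(sql):
    """Return the detail strings of EXPLAIN QUERY PLAN for the given statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


unhinted_plan = plan(BASE.format(hint=""))
hinted_plan = plan(BASE.format(hint="INDEXED BY idx_events_mixed"))

print("planner's own choice:", unhinted_plan)
print("with the hint:       ", hinted_plan)
```

Without the hint the planner is free to pick either index; with it, the plan is tied to idx_events_mixed. A hint overrides the optimizer's judgement, so it is worth checking the resulting plan rather than assuming the hinted index is the faster one.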

Cascading Effect

Because one of the queries was doing a Full Table Scan, it put load on the database. This had a cascading effect on the services depending on it and eventually propagated to the end user: the intermediate services time out, giving the user a degraded experience.

Key Takeaways

Never blindly trust the ORM

ORMs are designed to make our lives simpler, but they might not generate the most optimal queries. Hence, it is always better to periodically audit the queries they generate and ensure they are optimal.

Poorly generated queries put load on the database and choke overall performance.

Check the query execution plan

While updating a query or changing a schema, always check the query execution plan. We can get the execution plan for any query using the EXPLAIN statement.

The diff between the old and the new plan tells us whether any of our queries would now perform a full table scan.

Audit the queries and indexes

Keep an inventory of the queries we fire and the indexes they use during execution. Then, whenever we change an index, we can quickly run an audit and ensure zero regressions.
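One possible shape for such an inventory is a map from each named query to the indexes its plan mentions, which can then be looked up in reverse before touching an index. The sketch below assumes SQLite (via Python's sqlite3) as a stand-in for the production database, and the query names, table and indexes are all hypothetical.

```python
import re
import sqlite3

# Hypothetical inventory: a name for every query the application fires.
INVENTORY = {
    "recent_events": (
        "SELECT user_id, event_date FROM events "
        "ORDER BY event_date DESC, user_id ASC LIMIT 20"
    ),
    "oldest_events": (
        "SELECT user_id, event_date FROM events ORDER BY event_date ASC LIMIT 20"
    ),
    "events_by_user": "SELECT id FROM events WHERE user_id = 42",
}

# Matches the index name in SQLite plan text such as "USING COVERING INDEX idx".
INDEX_PATTERN = re.compile(r"USING (?:COVERING )?INDEX (\w+)")


def indexes_used(conn, sql):
    """Return the sorted names of the indexes that appear in the query's plan."""
    names = set()
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        names.update(INDEX_PATTERN.findall(row[3]))
    return sorted(names)


def affected_queries(conn, inventory, index_name):
    """List the inventory queries whose current plan relies on index_name."""
    return sorted(
        name
        for name, sql in inventory.items()
        if index_name in indexes_used(conn, sql)
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, event_date TEXT);
    CREATE INDEX idx_events_order ON events (event_date DESC, user_id ASC);
    CREATE INDEX idx_events_user ON events (user_id);
    """
)

print(affected_queries(conn, INVENTORY, "idx_events_order"))
print(affected_queries(conn, INVENTORY, "idx_events_user"))
```

Before changing idx_events_order, the first list tells us exactly which queries to re-check with EXPLAIN. The regular expression is tied to SQLite's plan wording and would need adapting for another database.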

Arpit Bhayani

  • © Arpit Bhayani, 2022
