894 views • Outage Dissections
Imagine you created an index on a table and instead of boosting the performance, it lead to an outage 🤦♂️ GitHub ran a migration to reverse an index and it lead to a 60 mins outage.
Note: The example we have taken is pure speculation, the official incident report had minimal information about the outage. But the write-up will make you aware of possible challenges that might come during such situations.
Reversing the order of the index is done when we have a multi-column index and the query requires different sorting orders on them; for example,
For a query to be optimally executed on the database, we would need an index that physically stores the index ordered by the
date in the descending order and
user_id in the ascending order.
MySQL by default stores any index in
ASC order and hence, GitHub had to run the migration to reverse the order of the index and gain a boost.
A reverse index would require a Full Table Scan during the creation putting a load on the database. Also, upon changing the order of the index, it is possible we overlook another query that is more frequent but optimal with the old order.
The database does its best to create an optimal execution plan and it might not use the reverse index we just created. We can solve this by specifying Index Hints like
USE INDEX and
FORCE INDEX, ensuring that it uses our index to evaluate the query.
Because one of the queries was doing a Full Table Scan, it put a load on the database which had a cascading effect on the service eventually propagating to the end user. All the intermediate services will timeout giving a degraded experience to the user.
ORMs are designed to make our lives simpler but they might not generate the most optimal queries, and hence it is always better to periodically audit the queries and ensure they are optimal.
Poorly generated queries will put a load on the database choking the entire performance.
While updating a query or changing a schema always check the query execution plan. We can get the execution plan for any query using the
The diff in the plan would give tell us if any of our queries would perform a full table scan.
Keep an inventory of the queries we fire and the indexes it uses during execution. So, whenever we change any index, we can quickly run an audit and ensure zero regressions.
If you like what you read subscribe you can always subscribe to my newsletter and get the post delivered straight to your inbox. I write essays on various engineering topics and share it through my weekly newsletter.
444 views • 31 likes • 2022-07-06
Outages are inevitable; but we should design our architecture such that if a component is down, it should not lead to a ...
1059 views • 58 likes • 2022-07-01
Distributed Systems are prone to problems that seem very obscure. GitHub had an outage because a set of nodes in the Zoo...
1165 views • 81 likes • 2022-06-29
So, how are databases managed in production? When the master goes down, how a replica is chosen and promoted to be the n...
A set of courses designed to make you a better engineer and excel at your career; no-fluff, pure engineering.
Being a passionate engineer, I love to talk about a wide range of topics, but these are my personal favourites.
Arpit's Newsletter read by 17000+ engineers
🔥 Thrice a week, in your inbox, an essay about system design, distributed systems, microservices, programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.
Powered by this tech stack.