1986 views • Outage Dissections
On the 5th of May, 2020, GitHub experienced an outage because of this very reason. One of their shared table having an auto-incrementing ID column hits its max limit. Let’s see what could have been done in such a situation.
GitHub used 4 bytes signed integer as their ID column, which means the value can go from
2147483647. So, now when the ID column hits
2147483647 and tries to get the next value, it gets the same value again, i.e.,
So, when it tries to insert the row with ID
2147483647, it gets the Duplicate Key Error given that a row already exists with the same ID.
A situation like this is extremely critical given that the database is not allowing us to insert any row in the table. This typically results in a major downtime of a few hours, and it depends on the amount of data in the table. There are a couple of ways to mitigate the issue.
Quickly fire the
ALTER table and change the data type of the ID column to
UNSIGNED INT or
BIGINT. Depending on the data size, an ALTER query like this will take a few hours to a few days to execute. Hence this approach is suitable only when the table size is small.
The idea here is to create an empty table with the same schema but a larger ID range that starts from
2147483648. Then rename this new table to the old one and start accepting writes. Then slowly migrate the data from the old table to this new one. This approach can be used when you can live without the data for a few days.
Although mitigation is great, it is better to place a monitoring system that raises an alert when the ID reaches 70% of its range. So, write a simple DB monitoring service that periodically checks this by firing a query on the database.
If you like what you read subscribe you can always subscribe to my newsletter and get the post delivered straight to your inbox. I write essays on various engineering topics and share it through my weekly newsletter.
444 views • 31 likes • 2022-07-06
Outages are inevitable; but we should design our architecture such that if a component is down, it should not lead to a ...
1059 views • 58 likes • 2022-07-01
Distributed Systems are prone to problems that seem very obscure. GitHub had an outage because a set of nodes in the Zoo...
1165 views • 81 likes • 2022-06-29
So, how are databases managed in production? When the master goes down, how a replica is chosen and promoted to be the n...
A set of courses designed to make you a better engineer and excel at your career; no-fluff, pure engineering.
Being a passionate engineer, I love to talk about a wide range of topics, but these are my personal favourites.
Arpit's Newsletter read by 17000+ engineers
🔥 Thrice a week, in your inbox, an essay about system design, distributed systems, microservices, programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.
Powered by this tech stack.