Be excited when production goes down

Arpit Bhayani

curious, tinkerer, and explorer


I get excited whenever there is a production outage, because I know I will be learning something new and interesting very soon. I always hop on the call, even when I am not on call :)

Being part of stressful situations and seeing how seniors fixed the issues and operated with a calm head made me a better engineer. Reading old postmortem documents (RCAs) of outages at all my workplaces became a habit.

If you are an engineer early in your career, do not shy away from being on-call, and do spend some time reading RCAs. They are filled with interesting, practical, real-world insights about systems, coding blunders, tuning parameters, etc.

To be very honest, you will learn more from being on-call than literally anything out there.

By the way, I have a massive playlist on my YouTube channel. I dissected 18 production outages at GitHub, Atlassian, Spotify, etc., and went deeper into details not even mentioned in their blogs.

One interesting outage at GitHub happened because their primary key went beyond its max value. The write-up had a ton of details on how they fixed it, including how to run data migrations without downtime.
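To make the failure mode concrete, here is a minimal sketch of the kind of headroom check that can warn you before a primary key runs out of range. It is not how GitHub detected or fixed their issue. The column width (a signed 32-bit integer), the function names, and the sample numbers are all assumptions made up for illustration.

```python
# Hypothetical sketch: how close is an auto-increment primary key to its limit?
# Assumes a signed 32-bit integer column; swap in INT64_MAX for a 64-bit column.

INT32_MAX = 2**31 - 1  # 2,147,483,647
INT64_MAX = 2**63 - 1


def fraction_used(current_max_id: int, type_max: int = INT32_MAX) -> float:
    """Return the share of the key space already consumed (0.0 to 1.0)."""
    return current_max_id / type_max


def days_until_exhaustion(
    current_max_id: int, ids_per_day: float, type_max: int = INT32_MAX
) -> float:
    """Estimate the days left, assuming a constant insert rate."""
    remaining = type_max - current_max_id
    if remaining <= 0:
        return 0.0  # the key space is already exhausted
    return remaining / ids_per_day


if __name__ == "__main__":
    # Made-up numbers: 2.0 billion rows so far, 1 million new rows per day.
    current, rate = 2_000_000_000, 1_000_000
    print(f"used: {fraction_used(current):.1%}")
    print(f"days left: {days_until_exhaustion(current, rate):.0f}")
```

In practice, `current_max_id` would come from the database itself, and a number like "days left" is what you would alert on well before inserts start failing.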

Arpit Bhayani

Creator of DiceDB, ex-Google Dataproc, ex-Amazon Fast Data, ex-Director of Engg. SRE and Data Engineering at Unacademy. I spark engineering curiosity through my no-fluff engineering videos on YouTube and my courses