Peeking into assembly code to understand why count++ is not atomic

Watch the video explanation ➔

Write an engineering article with a concrete timeline in Markdown format from the following transcript.

so when two threads try to update the same variable the final value may go in an inconsistent State we all know this but why does that happen in this video we'll go deep and understand why a really simple operation like count plus plus is not Atomic in nature why it is not thread safe right so to set a context let's say I have a global variable called count count is set to account is initially set to Value 0. I have a function called increment which increments it by 1 using the using plus plus operator now when I have two threads thread one and thread 2 both are trying to but both are invoking the same increment function which is incrementing the value by 1 I would expect the final value to be 2 but it is possible that my final value is not 2 but instead it is 1. so we have just fired one operation but it is still not Atomic why is this happening for us to understand this it's important to go deep into the Assembly Language code and see what exactly is happening behind the scene right so for us to go deeper and understand why count plus plus is not threat safe we would have to generate the machine language code or the machine code that is generated also called the assembly code which is generated out of our C code right now for us to do that obviously we would have to do a bunch of setup locally but instead I'm just trying to save some hustles and using an online ID called replit repeat is great where you can actually create projects in any language per se here I've just created a very simple replete on which is using CS my language I will just use this to do whatever operation that I would want to so we'll start with writing this exact this really simple code on include stdio.h and Main and count equal to zero and oh sorry not equal to zero and I would do count plus plus simple code which just runs there's nothing fancy over here right so here what we have we are just having this variable called count which is a global variable and then I have a main function which is doing count plus plus now if we run this code it would it would not print anything anyway but what we want to do is we are not interested in running this code we are interested in seeing what happens when we do count plus plus so we would want to generate the assuming language code out of this C code how do we generate that I don't know so replied gives you this very interesting thing called Ghostwriter replied folks have trained their own large language model or which is very specific to programming and computer science and engineering related stuff and it gives really good answers for that so I'll go to plus type Ghostwriter and I'll ask him how to generate machine or assembly language code from C using GCC and it did spit out that hey you have this file and the best part is it gave me in the context of my replet that because I created this replica file called main.c it gave me in context of this file that hey run this command GCC hyphen capital S main dot C this will produce a file name main.s in the same directory as main.c which contains a generated machine assembly code so let's go to console and fire that we see main.s got generated now we click that and to see what's there in main.s okay this is the Assembly Language code that is generated out of this really simple C code that we wrote now here there's a bunch of jargon which we typically don't really need to understand the Crux is we would want to know what's happening by when we do count plus plus count plus plus is part of main we'll look at where the main definition is this is where the main definition is and then here is written right so this is the typical code that is executing within main as part of count plus plus so here what we are doing we have some move Q instruction some Addle some xrl some return we could see count we could see dollar one typically we are doing count plus plus Loop adding one to that and then we are doing xor of something and then returning so these four instructions seems to be the one that is that will execute when we execute the main function with count plus plus in it so all the secrets and all the magic slots should be within this right when I ask Ghostwriter about hey what each of these lenders it gives really beautiful output I would highly recommend you to test that out the best part being that it's very contextual on how this executes in context of your replit it gives really amazing uh it gives really amazing suggestions and descriptions about the things that you would want to know as an engineer so pretty dope there so here let me walk you through on what are these commands that are generated that all of them are doing we see these four commands getting generous when I run count plus plus so let's Understand Each of this command and what it's trying to do the first thing oh basically FYI uh the machine language code that is generated depends on the underlying computer architecture that you have this is as per the underlying architecture that replit runs on if you're using Ubuntu with AMD you might see something like something else as an output but the explanation would remain exactly the same there's no hiccups here or there okay so let's see the four commands that we saw in the replied as an output what they are doing first is move Cube count at the rate got PCR El percentage rip comma percentage RX basically it says this move queue it's basically using the global offset table this got this got is basically the global offset table and from which we are reading count and loading it in Rax RX is register ax right so it means that in this register ax load the address of variable count in that right it's loading the address of count variable not the value of count variable add L dollar one percentage array X which means that whatever RX is pointing to add 1 at that location right so add 1 to the value pointed by add R pointed by the address stored in the RX register then xor percentage e x percentage e x when you xor same value with the same value the output is zero right so this is where we are doing 0 and scoring it in e x and then we are doing red return if you look carefully the return value of the main function is what 0. right so that's why this is how it is returning 0 from there okay so now when we run these two things as two threads at the same time what words could happen these instructions would interleave among themselves so let's see if it leads to an so what we know is we would need a register let's say I create this register called Rax this is my register Rax and then what we want is we have a global variable called count that is stored over here this is your count variable right okay so what we first did first thread came in it loaded rip which is an instruction pointer count Global offset table and load the address of it in Rax so now this Rax contains the address of my Global variable count right okay this is my first thread then we say that this thread lets a context switched with another thread doing executing the same thing what happens when we do a context switch of threads so in as part of context switch all the values of all the registers are stored in the thread State on the CPU this Wayward happens is that when your CPU is doing context with it is taking all the thread state which means all the values are registered and storing it so when you want to resume it it just offloads all of those things on the register and now it can exactly start resume it would exactly resume from the point where it stopped right this is typically what happens when you are doing thread uh when you're doing context switch between threads so when the context switch happened this RX and this so basically RX value got stored somewhere else and then another thread got scheduled which loaded its own RX like on the same register RX it ordered the variable the address of C address of count it did that right so they're not replacing each other kind of writing it there X which happens it stores and then when it is scheduled again it replaces the existing value that is there on the resistor right so both of them have loaded the address of count in their respective RX then the next instruction is ADD L dollar one percentage RX which means whatever RX is pointing to go and increment it by one value of count was 0 then at L dollar one percentage RX would do what it would increment whatever is pointed by RX by one so it did that right then another thread came in which ran at Dollar one percentage RX whatever is pointed by Rax go to that location and do a plus plus so given that even if the context which has happened the value of count is not getting stored it's not for each so each thread the count is initialized the count is global variable so the for the value stored in the RX for thread 2 will be exactly the same so it would go to that location and store this and like increment this but if you look carefully the final value even after this heavy interleaving is still correct it's still true so then why are we saying that count plus plus is not atomic which is where we made a big assumption that the Assembly Language code that is generated is atomic in nature which means each of this instruction is atomic in nature but that's not true because Assembly Language itself is an abstraction which means that the commands the Assembly Language that we are generating over here that we see that we typically see is a human readable Assembly Language code which is generated as part of GCC GCC after this Assembly Language code that we just saw it uses assembler to come to actually create the Machine level code which is then executed on the underlying CPU which is where the instruction like add L which is for human readable it makes it really simple that add this value to this register like the address pointed by this section but internally what is happening is this add L command that we have this Addle command actually splits into three micro operations this one Addle instruction that we see in the Assembly Language output is actually consisting of three micro operations where the first micro operation out of this three and these three micro operations that are executing on the CPU they are Atomic in nature so move L RDI temp so every thread has its own temporary set of resistors when they are executing so now what happens is they are assigned that a register so now they are loading the actual value of count from memory into this temperature into this temporary register then running add L on that one on the temporary register and then move well on that temporary register to percentage RDI so writing it to that memory location so this Addle that simplified output that we got in the Assembly Language is actually these three micro instructions now these three micro instructions on their own are Atomic in nature so a CPU would not context switch when executing mobile while executing Adel while executing mobile but it can contact switch between these instructions which is where this is where we say that they both loaded the old value then they incremented in their own space and then wrote it well this is where it comes from exactly that's the reason why count plus plus is not Atomic but now if you exactly know why that is happening the temporary registers so they read the value in the temporary register even though you are not seeing it in the assembly code that is generated but behind the scene Matrix running it is running in the micro operation so those are Atomic but one adult instruction is not Atomic in nature which is that it's it expands into three instructions micro operations as we may call it and which is why your thread or your count plus plus is not Atomic in nature now you see why this is where when people say that loading the value updating it their own place and then writing it back this is what is happening behind the scenes engineering is so beautiful computer architecture is so beautiful and yeah now you know now you precisely know why count plus plus is not Atomic in nature this was my third video in the concurrency In-Depth series I hope I sparked some engineering curiosity in you try this out I have linked the replit the GitHub repo where all these codes are pushed in the description down below do check that out do try this on your own see what Assembly Language code your compiler generates for your computer architecture you'll have quite fun implementing it just five lines of code right so yeah that's all what I wanted to cover in this one I hope you found it interesting hope you found it amazing that's it for this one I'll see in the next one thanks again [Music]

Here's the video ⤵

Courses I teach

Alongside my daily work, I also teach some highly practical courses, with a no-fluff no-nonsense approach, that are designed to spark engineering curiosity and help you ace your career.


System Design Masterclass

A no-fluff masterclass that helps SDE-2, SDE-3, and above form the right intuition to design and implement highly scalable, fault-tolerant, extensible, and available systems.


Details →

System Design for Beginners

An in-depth and self-paced course for absolute beginners to become great at designing and implementing scalable, available, and extensible systems.


Details →

Redis Internals

A self-paced and hands-on course covering Redis internals - data structures, algorithms, and some core features by re-implementing them in Go.


Details →


Writings and Learnings

Knowledge Base

Bookshelf

Papershelf


Arpit's Newsletter read by 100,000 engineers

Weekly essays on real-world system design, distributed systems, or a deep dive into some super-clever algorithm.