Bloom Filters - Live Coding

Watch the video explanation ➔

Write an engineering article with a concrete timeline in Markdown format from the following transcript.

and we are like nice little desktop superb maybe wait for a couple of minutes before we start hi there was going good going good man we pull it won't be as good as yours I saw us yesterday oh I was so happy to see you influented interest but yeah uh and uh definitely I actually went through yours great great okay by the way I'll be making a lot of mistakes so uh but I've planned something few few very interesting optimizations that are very recently stumbled upon so I'll try to cover most of them great we'll get started uh so by the first of all thank you so much for tuning in Friday evening I know it's pretty hard time but I had to do something for some people I and and I and I always wanted to live stream shut I'll create my ripple and uh by the way I thought of setting up GoGo locally vs code few people were not happy with the font so I'm like hey screw it let me use replete for this I just love this platform for some reason uh nose hassles no setup thing does and public reflect God and I'll just link this in the description so in case you would want to copy paste uh basically follow along you can do that saved and I'll just copy paste my screen link as well this is stupid things cool and uh very streaming and script number copies I'm gonna get started uh I'll just read to it that I'm going live just in case anyone else would want to you know wait going like I am like Chuck let's get started uh I'll choose golai I thought CEC let's do c but then I like no no no go live it will be when the man I'll just so basically while I'm doing this I also talk about the thought process that I'm going through so that uh in case uh like and and I'm sure it would help you all to understand how to operate spring structure and uh I'll make mistakes bear with me uh so I don't have superpowers that I can code seamlessly fine plus I cannot speak and talk uh so speak and code at the same time so I'll do my level best I'll just see if my replicate is property set up it is indeed and five okay so yeah uh first of all I'm using the replate as my ID so that I uh has so I don't have a lot of hassles in setting up IDs and all I could switch any language from one to another seamlessly so one of the reasons why I love using replica like for most of my prototypes like this I've linked it in the description down below you may want to uh hop along see what dot you can do that okay fine so Bloom filters I hope most of you are aware what Bloom filters are uh in case not just a quick refresher uh it's a privacy data structure um basically whenever you hear the term filter it is mostly a set of bits or rather an array of bits and what it does is it does it gives you false positives but it definitely does not gives you false negative it's a very it's a it's a very theoretical way to say that it is possible like when I add something to the bloom filter and I ask that hey does this item exist or not bloom filter might say yes even though the item is not there right so that's the false positive case that it throws at us while the false negative case being if it is not there then it would it would never say it is there right so it would it would never have false negatives so what does a bloom filter have and uh by the way replied has this Great Ghost Writer that I will use it's like hyper optimized for basically coding things all things computer science so I love that uh so I'll just keep asking uh Ghost Writer bunch of questions hoping it would answer those things correctly and while I could okay so Bloom filter Bloom filter it's a array now here is the first uh first like really interesting choice over here is uh Bloom filter it's an array of bytes what we want to do is I've seen and for some reason I've seen a lot of implementations of Bloom filter recently everyone seems to be using array of booleans for example what they would do is they would create a filter they would create a filter which is array of booleans now let's say if I want to create a Boolean array let me do that I'll do make area of Boolean next I create I want to have like 10 000 booleans because at 10 what we'll do is we'll take the value we will pass it through the hash function and we will set a corresponding bit to true or false over here in the bloom filter so a lot of people just use now the key thing but Boolean is not one bit Adder where we say Bloom filter we take the value that we would want to put into the bloom filter and when we say we are putting it into the blue filter we are setting that corresponding bit to 1 or 0. so when you're setting the corresponding bit to 1 if you use a Boolean array that Boolean is definitely not one bit let's ask Ghostwriter how big is a pool in Colac I'm not sure how big it is if it's one bit then that's why it's one byte so it says that it's one byte so if I use array uh if I want to have a bloom filter of length 1000 which means I could have 1000 booleans it would take me 1000 bytes to store it will do hyper optimized version of this stuff okay we'll not use error Boolean because each because all we need is to store one and zero right so by ways to seven bits out of it so let's do it this way let's go for a bite buffer let's go bite buffer of size will start small we'll start with 16. and we'll start with a small uh Bloom filter and what I would want to do is we'll take some test cases or other keys that we would want to put in Keys array of string I'll say a b and c In Bloom filter what we do is we take for key in range piece I write over the key and add it to bloom filter so Bloom dot add and key K all right I'll add key now let's define blue blue is struck blue or other Capital plume filter type straw Bloom filter and I'll have to filter mail store filter as some element I'll store filter as array of byte and I'll have an init method so uh suggestion scary man it it actually very accurately suggested what we should write new Bloom filter size ain't start Bloom with her I got AI will take over oh yeah that's insane okay fine uh okay so when we Define the blue filter we definitely need to tell how big the blue filter should be so what we'll do is we'll tell that hey I want to unlock a doom fitter that would hold these many bytes so I'll pass that as an argument as recommended and it automatically did a make of it but we'll write a little beautiful code over here so uh okay so this is how we will initialize the bloom filter I did not write Ghost Writer did what uh fancy stuff and and I hope and I'm sure basically GitHub profiler does pretty similar things but yeah it understood I'm going to create a blue filter and water then it just did it fine so I want to have an add method on top of this as well so Bloom filter gave me a bloom filter object I'll have an add method on it and on Bloom filter I'll have an add method which takes a screen key oh it takes a key of type string and does nothing with it and In Bloom filter I would want to do call add of it so I'll change this function to add and Bloom is equal to new Bloom filter let's say create of size 60. so we'll go with limited seating there Bloom filter I've created a new blue filter of size 16 and I'm calling add on it okay now structurally it should we have almost everything function unreported and used and I'll just quickly run and check if syntactically is all that natural it's good fine okay now we have the basic structure in place and now what I'll also do is I'll also like assume we have added something I'll just iterate over the keys and I'll also check if it exists or not so exist and fmt println to say key and if it exists or not so what I typically do is I have a habit of I first clear up all the inputs and outputs thing and then I just solely focus on business logic so that's what I'm doing so I just created a small uh structural thing there that hey I have a structure where I am storing some filter I'll have an add function and now I'll have an exists function that returns me a bull which says that hey does this key exists in the bloom filter or not let's say return false okay and then I will have this I will do formatting I want to import fmtf empty it's not imported get and I have done okay five uh a false B Falls C false okay fine now it's all business logic fine sure now when I'm trying to add something to the bloom filter what I need to do is we take the key yeah don't write code because writer is writing code for me so I want to write code but uh it's pretty fascinating how much code AI can write for us but more of more often than not it just it just uh basically makes us quite a bit productive there fine uh what next what I was going to do huh adding adding to the blue filter fine so we what you want to do is we take the key we want to pass it through the hash function okay we'll take the key pass it through the hash function and uh we'll get in some number will have to mod it by size and that corresponding bit needs to be set okay now now interesting thing starts to creep up okay fine so let's start with the easy implementation then we'll make it hyper optimized so instead of doing uh size of or us or an array of bytes let's do Arab Boolean so that first we have a working Bloom filter and then we'll optimize it so let me just have array of Boolean instead of Arrow byte and what we'll do is we want to have a hash function let's ask Ghostwriter which is the best hash function for computer which has function should be used for blue filter and it spits out murmur hash fine reduce murmur hash okay any goal line Library yeah it's pretty good code again man quite interesting fine uh so Marvel hash let's quickly check what's the key property of multiple hash non cryptographic hash function suitable for General hash Pace lookup created by someone Smasher what's the property of it like why not sha like that's my question let's ask why not sharp or Bloom filter graphic hash function and so such big answer you should consider Shah as the hash function if you have stricter security requirements okay oh that seems to be an answer they are also relatively slow to compute ah now my cryptographic hash function like sha have a good Collision oh sorry you folks might talk this wait wait I'll pull this out over here most of it will be there okay fine I've just moved console down for a minute okay while cryptographic hash functions like Shah have good Collision resistance property they are also relatively slow to compute compared to non-cryptographic Hash function like oh it's about speed so we can still use sharp but shy is slow right I didn't know that interesting so marmar hash is most about speed just going through it seems complicated that's why we don't need to know the details for now we'll focus on implementation there seems to be a library that gives us so I'll add it and what I'll do is I'll create a utility funk instead of okay let's do this uh I'll write a function vermer hash which would marble hash which takes in a key and it takes in the size the the maximum bound within which because obviously rash would be spitting in some huge range like in tourism like 2 32 or 260 something I'm not sure what the range is but it would be a big enough range so we would have to bring it within the pounds so I'll just change it to the size that we are playing with and it would spit out an integer in that range so I'll just copy paste a bunch of code find idle I don't care how this works let's first get this running and then we'll figure it out so I shall dot right copy and paste result hashtra dot mean copy and paste okay I'll sorted out what does the each of this mean so first we need to install this Library so I'll go to here go get installing and okay some seed is there so what did they pass in seed oh this you and 32 they're passing something okay so we'll Define our own in it oh we can pass it in init function of uh Bloom filter itself new Bloom filter so new Bloom filter can initialize the hash function itself for us and sort it out okay installation done we have this so we'll start with initializing uh the hasher number hash for us so I move it over here I'm not sure if I'm naming it correctly but that's fine and I'll set oh this is the type sure and it was hash Dot Dot that's what it is pretty hot it actually to the speed time dot now I'll seed it with that so when we by the way when we see the hash function like any time you are you want to send it a hash function by default you should pass some seed because uh random number generators are not actual random number that's pseudo random number so they need c to start with so which is what we provided a seed and I'll import time as well fine and hash 32 and this becomes r m hasher fine we did M hashar and we initialized our amp hasher to something and m hasher dot write data and okay then we have result hasher dot sum 32 I'm not sure what is this computed this I think there has to be way to clear it up as well pretty sad fine okay so what will we do we will write key to this hasher and ask it to create a sum 32 which returns as a result and then we would reset so every time we are invoking it we would be got it okay so what we are doing with this murmur hash will take us key and a size and it would spit out value in that range so mod size and now this is the result so the result colon equal to this #and hasher and we would return result okay so the mermaid hash function for us writes the key to the hasher then generates the and basically generates a 32-bit hash for us mod size we get it over here we reset it so that we can reuse it for the next iteration otherwise when you do write it would be appending continuously most probably I cannot fit into this code but that's what I guess it would do right okay and then we did reset and we returned result Okay so we have a hash function handy we have this now we can write add so in add what we'll do we'll pass the key through the marble hash and we will mod it by size what's the size of the bloom filter we do we have it we have function number hash Define afterwards so filter this filter this here we have size we have passed it let's also keep it as a variable so that we can refer it right so filter and size is equal to size so that I can do here D dot size fine so we took the num we have we are passing the key string and we would get the index and what we'll do is B of B dot filter of idx is equal to true that and while we do that uh in exist all we have to do is just b b dot filter off idx now what's idx be dot filter of idx now we'll compute idx once again whatever key we have Etc what's the error I don't think we did go started it for some time now so I'm just move it up okay and in my fewer arguments where not enough arguments undefined function hash main dot go line number line okay by that what was this hash whose title where did you get this hashtag or import hash right I import hash I don't have any problem I'll run and cannot use time dot now value of time dot time oh uh it needs to be integer so dot Unix and run and run of type N64 unit in 32. oh it gives n64. I converted to internet will check for will improve the correctness after this and cannot use in 32 as Type U and 32. [Music] to this so okay we did this now what main dot go azure.com32 mod size and hash 32 mod size invalid operation what is valid in that you in 32 and size is int U in 32. we can safely Typecast into U and 32. rather in this case because it would always be positive and result will be in 32 will convert it to it 32. I'm just being safe over but I'm just trying to get this thing running first optimization of transitional sort it out later okay uh line number 36 here also copy paste will do that so whatever hash of kids in 32 and this is end so let's change let's change size 2 and 32 that would solve most of the problems because anyway why would we have Bloom filter more than 4 billion interests that's fine so in 32 we'll change it to in 32 I'm just simplifying this so that we don't have to change a lot of stuff there so by key updated updated updated ah finally not enough argument to color that's fine my line number 36. we also passed in size so we don't have to do this we can just pass in size over here brilliant because in the marble hash function we also pass size that size over here and it should be not size but P dot size okay it's still throw in some error hash takes in key and [Music] um everywhere if it should be interested now it's gone and the final nail in the coffin instead of key marbara hashtags in so we convert string to this and pass in B dot size and done what else now it should run or not cut up cannot use wide Arrow string while you're white as type string in marmar hash hey there you go I see okay tend to do something to forget something else and we got something we inserted a we inserted B we inserted C we got a true B true C true okay our Bloom filter is of size 16. now let me walk you through the code that we have written up until now so uh yeah exactly are enough we cannot go below one byte in golang that's why we would change this implementation to byte array to optimize the amount of data that we would have so what did we do we defined a hash function now did this we defined a marble hash so we defined a function name armor hash we takes it a string and a size to define the bounds of it uh everything returns n32 for us we wrote it over here we then took the random armor hash on the key we found out uh in 32 value we modded by the size that we passed in we got some Index this would be the actual index at which we would be setting it true or false and In Bloom filter we defined a filter which is an area of bull and a size in 32 we will optimize this to do let me add it to do optimize this okay then we Define the function Bloom filter uh that size initialize it with this one add basically takes Keys passes the hash function get gives me the index I set at index corresponding to true then I check exist in exist function we pass we get the same index for the key we check whatever the value is true to true false or false by default everything is false the corresponding values that we pass in are set to true so we did a b c and it gave us true to true but what would happen I'm just trying to mimic a collision a b internet brilliant okay so from a to m I have I have everything let me run it gave me true for everything okay so let's do this I'll insert a b and c what we want to know is at which index did it right so in the add function let me print print Ln that uh wrote what's the key key at index is idx so this way we would know at which index are we writing so out of 16 it wrote at 15 5 and 11. so let me pass in ABC is done let me pass in D and when I pass in d what do I get it wrote 12 11 11 a here we see B and C has a collision right so B and C has a collision now if I just I'll just Define a function boom printable which would print the bullet Bloom filter entirely so I'll just do fmt dot println b Dot filter fine fmt Dot P dot print sorry okay here we see all the values from zeroth index one index we saw a going at 1 B going at 7 C going at 2 D going at 15 oh there is no oh here we see so uh here we see pseudo random number generation in action so what happened is last time we saw Collision this time we did dot because uh the hash function that was initialized we passed in a seed seed value over here and because in the seed value we pass in current time so at every iteration every time we are running the function it is getting a new value as its seed so random number will always generate with that c so for our testing purpose it is better if we just hard code this to something so U Win 32 I'll hard code it to let's say 10. right so that we get very consistent like we get basically deterministic values such rigid rules okay oh yeah we did not see coalition I'm just trying to get a collision there it would become easier for us okay ah we have this so we have defined the bloom filter of length 16 we wrote four keys to it a b c and d a was written at index 0 VC 0th index is true B was written at index 5 we see index 5 as true C was written at index 8 5 6 7 8 we see index 8 is true V Dot B at 0 we see 0 as true and we see A and D both go at zero right so this is what it's probability we take very less space to store this exact same information now we can we are checking for existence now let's try to find a case where we are looking for a key because this is a Sure Shot thing where now if I say that fmt println let's say if I want to check if uh e exists or not so let me do Bloom dot exist and I'll say e exist over let's see let's if we print if we exist or not he got index at uh I need to oh I don't want to write that another function to see with what index would e go into uh okay let's do this Bloom dot exist can return me for now for now as a hack it can return me an integer and a Boolean it's just a hack so that I can see at which index it is checking just a Hackman and ID access what it is what type it is in 32 . so we cannot print this Gap arrived even has it something so we'll print three things then key index and this is just a thing up until the time we get it up and running okay fine so wrote a at 0 be at five C at 8 d h 0 you see this corresponding e indexed at one we checked one is false it is false it says e is present no it's e is not present and indeed e is not present because we only wrote four now let me check for f and now when we check for f F goes to 7 7 is also not set it is giving us false now we do G and we run it G goes to 2 2 is also false very good I'm looking for Collision h a now you got a collision fine so we wrote only four Keys Bloom dot add we only wrote Four Keys a b c and d we check for a b c d over here through this Loop and e f g h so here we clearly see the Collision happening where a stored at 0 like the corresponding index like the corresponding bit is set to true it is the the bullet is set to true then B at 5 C at eight and yet zero correct then when we checked we could very clearly see that uh the bits 1 7 and 2 were false one seven and two were false and it indeed gave us false which means the values are indeed not present but when it gave us true the for H we never wrote H to our Bloom filter but H got indexed at 0 because of our marble hash 0 was true so it said H is present but in reality H is not present because we only wrote A B C and D this is the false positive case and this is where Bloom filter is an approximate data structure right so we see the collisions happening we see Bloom filter having false positives where the value is not present but it still says it is present but it is indeed helping us do things in a very efficient way because we are not storing actual keys anywhere all we are storing is just area of Boolean in this implementation will optimize it now uh we are doing it at that time uh so huge like very less space imagine this one bit imagine this as one bit we just have 16 elements that we are storing so 16 bits of information but we are able to hold a b c D4 characters in reality if you would put this in a set you would require pointers to construct a tree because assume that set is implemented with a tree you can implement it with other ways but the actual data needs to be stored the pointers needs to be stored very expensive but here what just happened is we stored the existing system information of the bloom filter in the filter right so that just makes it much much much much easier for us right or rather much efficient for us now we optimize this how big is Boolean in goli it said the Boolean in data type in goli is one byte but if you look carefully All We Care is to sorry is to set one bit of it why are we doing it the entire byte why are we wasting seven bits so let's optimize this let's play around with bits to optimize it and then we would also see how it would uh how the number of hash functions would vary the false positivity rate would plot graphs and all to understand how this varies okay fine it's a very nice exercise I should do it more often okay so first thing first what we would change so let's change the hash function reducing the size would increase the chances of quality enough says reducing the size would increase the chances of collision yes yes it would increase the chances of collision a lot reducing the size multiples of force and zero something to do let's see yeah I have just hard coded a seed Shelly I just had coated the seed for now just to induce the Collision there but now let's have some fun now instead first of all what we'll do is we will mimic the entire load so randomized will have huge huge uh size we would play with so let's say we take a bloom filter of size let's say 1000 and we would now induce random seeds or sorry uh random values so let me take uuid foreign random strings in golang but how would we know if it exists or not hey this is a huge function and how big how should we go about it just thinking it through so what I'm trying to do is I'm trying to mimic the load of because what I want to see is how false positivity how false positivity rate increases as we keep adding values to it right so as we keep adding values to it our false positivity rate should increase because now for a lot of keys ah we'll use the classic data science thing that we do okay so let's do this so the way we would test or the way we would measure the false positivity rate of Bloom filter so what we need to see is how for a limited size of Bloom filter that we have when we keep adding values to the bloom filter with fewer Keys it should be almost zero false positivity rate but as we keep adding lot of values in the bloom filter because we have limited number of bits almost all of them would be filled up then the false positivity rate should go up right so let us quickly do that thing so what I'll do is uh I'll create a test data set let me create a data set and data set would be a string data set because that's what we have been playing with so I'll create a test data set of let's say 1000 random words so let me take guid as words generate uuid in golang so I'm just above so what I care is I just want uh I just want a simple uuid to work with like uuid would become my string here go get and I'll do copy paste so what I'm doing is I'm preparing a data set for I equal to 0 I is less than 1000 I plus plus and uh for I'll generate a uid uid dot new and U Dot string so what I'm doing is I am I will store this in a map just so that we can see what exists and what does not and what does not so that we can compare the false positivity rate because what we aim is that if I have a bloom filter of size 1000 if I have a bloom filter of size 1000 how would my false positivity rate increase over time right we need to know the inflection point because initially we expect that false positivity rate should be zero but over time it should increase right because we are keeping on ingesting the the keys in the bloom filter right so map of string of Bool because this is just my data set so data set is equal to this and wear this and what I would do is I would keep adding data set data set of is equal to okay so I just prepared my random data set over here and now I would pick few keys from this data set paste uid here okay so I have prepared my data set now what I want to do is I want to iterate to first 10 and let me do it in list why why this complication foreign but how will I check if it is present or not uh string dot contains okay let me store it at two places then it's becoming far too complex than I thought right actually it's not a bloom filter which is complex the other it's effect unnecessary complications okay so let me have a look let me Converge on one so add off string I'll keep data set is equal to append so I did this I stored it in a string I'll do data set is equal to make uh string comma this and I'll pass in the size of 0 I'll keep on adding so now I have my data set but I'll also check to make it uh to make the lookup efficient in the data set I'll set data set map where I'll have make map of string good of U dot string again equal to 5. coolant so now we have prepared our data set now what we want to do is okay what we want to do is we want to see that for the bloom filter that we have created now blue filter we have created of size 1000 we have prepared our data set which contains 1000 Keys one thousand random keys now what we will do is for each one of them I'll keep inserting so what I would want to do is for every 10 keys that I insert I would measure what's the false positivity rate that I have so I'll split this data set into test into train and test so train implies basically what I'm putting into the bloom filter the existence case while half of them would not exist so uh how should I go about it for um for fixed Bloom filter size as the number of keys increase fine so for I equal to 0 for I equal to 0 I is less than lenov data set and I plus equal to 2. so what we'll do is for every time we would add one thing into the data set we would want to add something into the bloom filter so Bloom dot add and I would add data set of I so I'm adding element into data set of file and then I would want to check if that key exists or not okay let me converge I want to plot this I want to plot this such that now we don't treat this this is the false positivity rate of this how should I go about it I'm just trying to figure out an easy way to do so Hardware there are far too many things [Music] a basic blue filter is anyway working I'm just trying to what I want to do is I want to plot the false positivity rate of it opportunity I have the data set and spread it into this I'll check for the key that exists check for the Creator does not exist okay let's just do that fine so I'll set half I'll insert half the half of the keys I'll insert half of the keys and I'll check for all the keys and what we should see is we should see the number of correct and number of incorrect so Bloom Dot exist and data sets of I fine okay okay so where count equal to so these are count of positives uh false positivity rate is what the keys that that the bloom filter says exist but in reality does not exist right so which are those keys so what I've done over here is we would want to know the keys that exist and the keys that does not exist so what I'll do is count of uh instead of this huh yeah so we have inserted data set added 1000 removed it so what I'll do is I'll just insert data set uh exist so I'll do it for 500 and I'll change it to so what I'm doing is I'm just splitting my data set into two half over here itself so these are the ones that exist and then I am generating again and I'll generate half of them again which I'll say not exist and I'm just creating two I in my data set I have 1000 but I am splitting it into two so data set exists data set not exist data set exists data set not exist it will just make a like much simpler okay so I've split my data set into two half 500 here 500 there and now what we'll do is we have uuid dot new we have inserted by third time and insert it insert into bloom filter and now we would want to see the size of it I never thought it would be this this huge I thought it would be much simpler for us to implement it to test it but either it's pretty interesting now and far far into this problem for how do we iterate a map in go for hey I seriously forgot how to write a range hearing system for uh e comma something we don't want that range and data set exist and I'll do Bloom dot add Loom Dot add and I'll pass in key okay done and now for key so all the key that should exist we have added it into the Chrome filter now we would iterate over all of them now we will maintain cow count or what Bloom filter says exist uh what actually exists count of no we'll just count this so what do you want to count we want to count for key or underscore key in the range bloom uh data set so for all the keys that exist in the data set all of them we would see that what Bloom filter says about it so we say is what Bloom filter says about this key if it exists or not and we would see if it exists on data set exist of key uh okay so now if our if it exists in the data set and Bloom filter says it exists it is true positive right so true positive plus plus uh uh if Bloom filter says exist okay what we care we care if Bloom filter says exist and it actually exists then we do true positive plus plus right and if uh Bloom filter says exist and R and the key actually does not exist right if the bloom filter says exist sorry if the bloom filter says exist but in reality the key does not exist because we have not added the data set underscore not exist car keys in this then we do not true positive we do false positive plus plus hey we just need this because we know how many keys we are testing we are testing these many we just will print after this iteration done so what we want is false positive we will set false positive equal to zero we will do false positive divided by lenov data set let's see what do we get we got direct line 79 if okay not okay exist Keys faulty plus plus index error cannot use underscore okay equal to data search dot existence value hey isn't that a syntax map of string of pool how to check if he exists in a map and not JavaScript yeah I thought it would give me in this context in map in go lab wow that's what I did that is something no underscore okay colon equal to map of Key false positive plus plus syntax error cannot use underscore okay colon equal to data set not exist of key value Keys value what am I missing insert fifty percent of keys you have to add colon oh yeah thanks man thanks thanks thanks thanks thanks okay oh God it wrote so many things I have to remove the logging thing where did I wrote that array flow division I have to convert it to float 64. and I'll convert this into float 64. maybe float 64. 0.188 is the false positivity rate which means 18 percent pretty high okay so now what we want to do is we want to see how it increases over time so as we keep on uh insert so what we just saw is our false positivity rate is 18 I'll just multiply it by 100 also it would just me pretty simple so every time we run we see around 2019 19 percent of false positivity rate over here right pretty interesting so we inserted half of the keys we took 100 we instead of half of the keys over there but now if I increase the fraction my Bloom filter size is 1000 but let me add 1500 Keys into it right definitely my false positivity rate should go like out of the place it should be 100 percent if it is not then something's wrong then there is very high Collision rate eh what happened why is false positivity rate not increasing I have a bloom filter size 100 he did something funny okay that's interesting is reducing the size of the blue filter pump so it's 60. false positive great huh quite interesting who have lots of learning lots of learning I equal to zero I is less than two five zero zero I am inserting inserting inserting into data set and I'm checking for non-existent I am inserting all of them in my Bloom filter setting Bloom filter F2 filter size 16. 60 I've set it to one still it is 16. let me print let me print my Bloom filter Bloom dot print let's see it's 60. but what we are checking is if it is not existing oh okay huh number of false positives this is where the problem is so because we are checking let's say number of false positives is 500 all of them and we are dividing it by length of data set if false positive it is 500 hey what is length of data God this is pretty interesting see length of data set is three thousand five hundred divided by ah 16.6 500 divided by three thousand okay how much is yes the test needs test yeah okay did we do what did we do wrong false positive zero editing false positive let me get rid of this hundred not sure if it is doing anything funny it's definitely not it's still doing that so 500 over here let me change it to 500 see devil always now it gave 0.5 which makes sense oh it's oh it's simply dividing it by that which means in for the first case oh now it gave 0.5 now let me change Bloom filter to size 100 or 1000. uh now 1.75 now we got it now now we got it now it's working but it was actually working fine as well there was no harm in it we are just missing like we are just doing something funny over here uh false positive plus plus so key Bloom filter says it exists and data set not exist and it is actually not existing in our data set then it means that it is a false positive inviting false positive plus plus but false positive divided by what I think our division is wrong because we can we should not divide it by the entire data set we should divide it by what because it would always give me because no but we are iterating on the data set right we are attending on the retained on 500 and then test on one thousand now they are correct uh basic 500 divided by 3000 is 16 that's true and which is why we are getting constantly 16 17 range ticket which means it's correct so false positivity rate seems corrective so just to confirm a few things if I increase which means that bloom filter is getting really heavily filled if I increase the size of the bloom filter false positive rate should decrease let's see ah now it's zero ah nah now it's there right okay let's do this okay let's now that we have converged Bloom filter size well 0.5 our division seems wrong because it should like for half of the keys it says true oh now half present half not present it would be point five and now if I increase my Bloom filter size to 2 it should have some effect it's so small that it's all true ah this is exactly what we would what we should plot right we are just testing for 1000 Keys fixed we would change the length of the bloom filter and we would see how it varies so what I'll do let let's just let's just put this in a loop for uh J is equal to let's start with 10. J is less than ten thousand oh ten ten thousand like because initial values would always be 0 0.5 10 000 J plus equal to 100 I'm just iterating it by 100 at a time and printing the false positivity rate good now let's see what we get point five point five point five point five point five point five point five point five common it's all 0.5 throughout my bad Jake are they five in 32 yes it is decreasing yes yes yes okay we see it started with 0.5 because everything got collided there and then it sharply decreased see point five point five point four point three point three two point two seven two five six two three eight one nine one seven one we see as the length of the Bloomfield so for the fixed number of keys where we put half in exist and half not exist we see the false positivity rate decreasing to almost 0.03 so it started from 0.5 to it rained it so from from 50 percent it dropped to three percent for 1000 keys and so for 1000 keys and ten thousand as my uh Bloom filter well if I change it to let's say 20 000 we should see almost almost zero I'm just trying to bring in that zero because there are far fewer collisions that are happening there is enough space now point zero point zero one three so it went to 1.3 percent right but it's pretty decent but now here one one beautiful thing that we see is that even though we increase the length by two like it we make 2x like from 10 000 to 20 000 there is not much difference here if we see carefully there is not much difference in this value so here we see even if we still see big enough value so what this tells us is that you don't really need to have a massive Bloom filter because after a certain point in time you would not see huge enough Improvement because here it was 0.5 and then here it is point zero seven point zero five point zero three the thing is you stuck with the value like you stay in the value that suits you the best you don't have to go a really high range so the point is how should you estimate the size of a bloom filter so this is exactly what we do so very recently I had to Implement a bloom filter or rather I use the blue filter by redis uh and what I did is I just did this number crunching where I knew my injection rate I knew the quote unquote maximum number of values that I would fit in then I did something very similar of an analysis there is a Formula you can find it on the Wikipedia page that gives us how big a bloom filter should be depending on the false positivity rate that you want right so here we see that false positivity rate does not change much after a certain range and it tells us that be happy be content with what false positivity rate that you need and you size it out according to that great great great great we Face a lot of works but we solve it while it is we doubted our implication was correct all around all all across but that's why that's the beauty of Life query you get some you lose some you learn something new okay but now we still want to do one final thing like I want to do one one final thing so that's just one optimization one for by the way thanks folks for staying for for bearing with me for this but uh let's do that one optimization where uh we want to do bit manipulation right so bit manipulation uh here we see that uh the size of the Boolean is one byte what we were setting is we are actually setting that one byte here or there like true or false uh what we should do is we should not be wasting the seven bits of that one byte because all we care is zero or one so can we change this to byte which is where we would have something really interesting so now what we do we broke the code now we have to fix it so uh we'll still keep the size of the bloom filter but if you look carefully uh our iteration the way we are doing for 100 to 10 000 the iteration that we had they are all divisible by eight so basically what we would do is when we are assigning the size of the bloom filter we would do what we are doing is our Bloom filter is an array of bytes each byte is one bit uh does golang have an INT 8 I'm not sure if it has intake let me check it and it hasn't it better better better better better you eat it nice okay so I set my Bloom filter to uint8 which means its integer of it's an unsigned integer of length 8 Bits right so what we'll do is when we set a value we get an index right now we would go to that bit in this area of integers and set that bit to 1. so we are making it space efficient because we earlier for us to store 100 booleans we were using 100 sorry for us to store 1000 booleans we were using 1000 bytes but now instead of that we would use 1000 divided by eight bytes which is equal to 125 so from 1000 to 120 125 bytes so 8X reduction in size is what we are getting okay so we did is the code breaks now we have to fix it now the few things that would change is when we write it the marble hash function that we wrote it gives us an index but now what we know is each element of my array is eight bits within that 8888 so if my if the murmur hash gives me let's say three I don't have to do Bloom filter of 3 because I have each element contains eight bits and out of that I would ideally want to update the third bit in my gigantic bit array for us that bit array is actually array of integer so I need to find out the index in the array and then the bit within that array right so what's the index in that array so if my murmur hash function gives me three I three third bit is in the 0th element and the third bit within that so if it gives me 9 then it would be in the first uh first element of it if it gives me 17 it would be in the second right so it's multiplied by it so my actual array index so array index would be uh whatever idx gave me divided by 8 hey no man is it listening hey it also gave me modded hey man AI the more are you oh God why why am I even writing the code it's generating the code for us it's exactly what we should it's exactly what we should be like we wanted to do and it's scary okay fine so aidx is the index of the element that needs to be manipulated and within that which width is B idx so bit index a index is the index in the array D Index is the index of the bit within that eight bits of that we have so it will be mod 8 and now what we'll do is we would change in this element I would want to set that corresponding bit to true how do we set bit to true is I do an R of it or uh and I want to set that to one bit to true so yes that's exactly what I want to do and uh I'll set that element is equal to so element is equal to element bitwise or one left shoot by those many bytes so bit is zero it would not be shifting by anything one done done done okay done so this would say dust corresponding bit to true okay so setting is done now what do you want to do hey is it really done all right listen uh now a get we get is also similar now get means we want to get that corresponding bit out of it so how do we get it I didn't believe it until like I use this this much of Ghostwriter sub code okay fine a serious serious mode okay so first what we want same con uh same computation AI dxb idx now what we want we want to get from that element which is B dot filter uh aidx and from that we want to get get that bit how do we get that bit that bit is equal to and and of that corresponding bit I ended with that bit but I want to discard all other bits so it should be all other bit 0 and this bit as one will this work I'm not sure uh B of filter of a of idx that and one right one left should be idea it would work right oh it would work so that's it because now I would just want to convert it to Boolean if it works I'm not sure let's see so uh if it's greater than zero or equal to if it is greater than zero it's true otherwise false okay let me just let me just counter validate that b dot filter of a of idx so for that element and uh I want to add all other bit 0 that corresponding bit to 1 so if it is one then it is one everything else is zero would mask it out that corresponding bit value would propocet if it's zero it's zero if it's one it's one it would work let's run 0.5 point five point five match okay let's run the loop and see if it uh if our value still matches we ran it in ten thousand let's do it to ten thousand yes yes yes yes yes yes Bloom filter working it ended at 0.03 which is what we saw the last time as well we also optimize the space now great fun huh AI scared us today isn't it okay it's a simple right so we just optimize the space by a factor of 8 which is great but we had to do a lot of bit manipulations fine so again as you see you get some you lose some you had a large space you could directly access that index and set it to true or false but here what you did is uh uh you did a bunch of bit manipulation to get it done a bunch a bunch of more CPU required but you get some gurusa right if space is more important you do this if CPU is more important you do that right but we did very fancy stuff with blue filter so we see again we see that same pattern we optimize it over here now should we do one more there is one more thing effect of hash functions right so up until now we just use one hash function right can we use multiple hash functions so what we'll do is we'll not do this entire uh thing once again so what we would do is we would run the same experiment with multiple hash functions now so we'll take this one value let's say we take this one value which is ten thousand we know for 10 000 the false positive rate false positivity rate is 0.02 if I run it again 0.020 where did we add some Randomness there I thought it should give us the same thing on this oh uuid uh strings are different sorry sorry strings are different right strings are doing the same strings are different okay so we know the false positivity rate is around 0.02 so we'll use this same number sorry to do this and but now what I want to change as a parameter is number of hash functions so up until now we have used one hash function uh now let's use three hash functions so I'll use three hash functions one with seed value 100 one bit seed value let's say some random number of some random thing m hasher one m hasher two M Azure a a people would stop calling me engineer if I do this let me put it all inside an array so that I can change the size of it uh Tic Tac tuck what I'll do is I'll initialize it with the random values because what we want to do is we want to compare the effect of hash functions now so I'll put hash FNS like this and I'll change this hash function M hasher changes to array of hashes and I'll keep it as fracture equal to this and okay so what I'm doing is I'm creating an area hash functions so that we can see the effect of I'm just typing some random number to simulate and in a really good news okay we have how many hash functions one two three four five six seven eight nine ten okay we have 10 H functions now now what will we do is we want to see the effect of hash function on the bloom filter so we would use the same Bloom filter the same data set that we went with thousand but we'll keep the size of the bloom filter same but I'll just change it to 10 000 but what we will iterate on is when iterate on hash function right so we'll see the number of hash functions that we would use and then we would see how it affects the false positivity rate of my Bloom filter right we'll start with one and we'll go to 10. so for I equal to 0 is less than Len of hash function say what's the variable name hash sure sure pressure for this okay so what will we do we'll change it to line of hash functions I plus plus so we want to start with at least one so I equal to 1 and uh sticker that we'll see what happens in case it breaks it takes so now uh few things that we would want to change is when we are adding we should pass how many hash functions are we considering right earlier we were just passing keys but now because that's a variable for us that's why we are doing it so we'll pass how many hash functions we are passing so let's say we are passing I number of hash functions right and what we'll do is we'll print false positivity rate for each one of them so every time okay let me just sort it out so every time we are creating a new Bloom filter every time we are doing add we are passing the number of hash functions we would want to use to store it inside the filter so in add functions we would change that the number of hash functions that we would want to do it with and here we would apply the formula okay for I equal to 0 I is lesson number of hash functions I plus plus and I would put all of this inside of for Loop and instead of just using murmur hash I would use hash functions of I oh no I have this marble hash function okay so okay we can pass hash function index okay hash FN idx so instead of using m hasher it uses hash FNS of hash function index to write it inside disk and then it resets that okay done so we compute marble hash for that index we'll get a lot of Errors now okay so where else our murmur hash okay here is our normal hash add may exist where this add function add okay so for each of this hash function will pass in the same Barber hash key B dot size and I right so which hash function should we use for each hash function so what we are doing is for the number of hash functions let's start with 1 2 3 4 and so on and so forth for each of the hash function we are Computing for the same result set what's the false positivity rate and we are iterating we are getting that murder hash we are using that specific hash function we are putting it inside the bloom filter like setting the corresponding bit and Computing fine let me throw and see a bunch of Errors first main dot go line number 14 will first sort it out okay what else broke Main Line once align 61. when we use multiple hash functions we need to check that if all of them okay In Bloom filter I would say the key is present if all if the values of if for all the hash functions that I'm using if all of them says yes then I would say that it exists otherwise I would say it would it it does not right so that's the idea so I would need to keep an eye on if all of them are true or not right if all of them are true so if I'm using three Ash functions I would use where I'm storing I would take my key pass it through the first hash function get the index set it corresponding bit second hash function same key another index same as thought as function same key third index and setting it right now I am checking I would do the exact opposite with all the three as functions would use all the three s functions check for each of the hash function the generated index go to that index see if it exists or not for a second see if it exists or not third if it exists or not if all the three says exist then I would say it exists otherwise I would say it does not right so I'm doing I would have to check if all of them are true or not so a nice implementation of it is let me take a exist as an array I know it's not a good good but that's fine uh where exist what do we want we want array of Boolean and I would do a make of it error Boolean length will be equal to number of hash functions because we would need those many to be there and I would check these exist and exist equal to up at exist exist return value remains same the idea being if any one of them is false then I'd say it's false now correct if any one of the hash if any one of the index given by any one of the hash function is false I would say it's false so I don't need this I Would by default say exist equal to false no X is equal to true exist equal to true and if any of the hash function says if exist is false which means any of the hash function says it's false I can immediately return I don't need to check for other hash functions anyway I can immediately return that this is false otherwise I don't need this I would say it's true simple I would say it's true and for now I don't need index because we don't need it and exist may I also need to pass the number of hash functions I'll pass it here and B dot size remains as is I need to pass in which hash function I'm talking about and done is it it okay let's just quickly go through it for all the hash functions that I have I am checking what the index for that corresponding hash function is checking the bit if it says it exists or not if it says it does not I short circuit and I immediately return false because if any one of them is false I would say key does not exist but if all of them are true okay let's write hey line number 99 oh God okay exist I need to pass the index I for all the hash functions do we have a bloom exists or not hey fine exists do we need to pass hash functions you need to pass all we need to pass the length of the hashtag this way let out the hash functions uh but land of the hash functions is okay let's run hey Tran see for one hash function it spit out 0.026 but it spit out the last time we are doing 10 000 keys ten thousand yeah we are doing uh length of the bloom filter ten thousand number of keys as one thousand for one hash function it spit out 0.026 as false positivity rate and as the number of hash functions increased it went till zero it's pretty interesting because now what it is happening is that because you have Diversified your hash functions you are using all of at the same time now it's very low probability that all of those keys I like all of those corresponding bits are set for two different keys with that exact same sequence very difficult right so that's why it went out but now what it means is that because I'm using multiple hash functions when I'm putting it in a bloom filter I am filling out bits very quickly which means that if I increase the hash functions to even let's say let's make it 20. let's go beyond that right if I increase it then it should be getting worse okay all the all this starts okay no repetition there is some notification let's see if it increases it's still zero zero zero hey but you see it it took it took far long to compute because the number of hash functions increased it took far long to compute ah because now it has to do so much of computation while it is inserting and not and now we see is still zero after some time you see some correlations happening over here for this use case but uh it totally depends on the random that we have yeah let me write a code for this I don't know let's change the number of hash functions to uh uh uh uh uh for I equal to 0 I is less than what is less than 100 I plus plus okay so I'll just do this time dot now dot Unix right I'm just setting it with this and hash functions is equal to hash function is equal to make uh hash function as 32 and of length 0 for I equal to 0 I is less than I'm just defining the number of hash functions that I have let's say I start I go with 100 and I just keep adding to that I'm just lazy there are a few mosquitoes you don't need this I am definitely sure in this time let's run hey what happened Lane of hash functions in it Lane of hash function is equal to appendix appendix Apprentice okay what did we do wrongly hey then we'll do is this running it is running uh hash functions equal to append time dot now I am adding 100 hash functions u in 32 and I am inserting it in time perfect and setting it to initial seeds oh oh oh oh it's a pretty Niche mistake I get it wait wait wait this is very interesting I make this mistake more often than I would want to admit so here what happened is here I defined the global variable hash FRS here I am doing a colon equal to so it creates a local local variable with name hash FNS ah now it work now it worked now it worked but it's bitter the same thing every time now what happened oh yeah I don't want to debug it again uh what did we do rock so number of hash functions when it main function what we are doing we are hitting lenov hash functions I plus plus hey hey did I add did I not add it to the uh function s as functions as last 10 minutes folks that we love off 100. it's a silly mistake somewhere we initialize it to zero seed issue I'm not sure if it's seed issue uh time dot now because they ah yes yes yes yes yes yes yes yes yes it is seed issue because uh you in 32 will convert it to Unix Unix gives in seconds within one second oh god really interesting issue oh brilliant so now what is happening is uh Unix gives in seconds and this entire iteration is finishing within one second so they're all getting the exact same value okay let's change this to a random integer oh God never expected to run into this issues but I'm so happy by the way thanks uh who who said that uh thanks 2K IQ you have a 2K IQ int 64. uh no 832 I'll just remove this fluid 32 over here now let's see if it works please please please work please ah zero zero zero zero ah it increased yes yes yes it increase again finally we saw what I wanted to see okay see how beautiful this is and we could see a short circuit evaluation reduction as well so we not really we saw very interesting issue by the way thanks again 2K IQ for this yeah you two live up to your name man uh okay so uh we see this uh starting with 0.25 false positivity rate it decreased to zero for a very long time and then when your hash function increase till let's say uh 30 something then my false positivity rate again increased it again increased and it kept on increasing see it it's increasing and increasing and increasing 15 percent 20 percent and 23 percent if I would have added more hash monsters it would have definitely if imagine if I have 1000 hash functions for thousand keys in the Thousand Bloom filter false positivity that will be 0.5 with respect to our example because 50 50 case right it would have very high pulse positivity because every day it would be setting all the bits to zero right but it did increase so this shows us that having large number of hash functions is good but having far too many hash functions will make your program go slow number one plus after some point of time you would see elevated false positivity rate and if you look carefully see the execution pattern it's slow slow and then now see it would pick up speed why is it happening because the way we checked for existence is if any of if the index in my Bloom filter is 0 or is uh for that index is false which means it does not exist for that corresponding index for that corresponding hash function where immediately exiting short circuit evaluation right because of the short circuiting it is not requiring to iterate through all the hash functions when it encounters one of them does not exist and that's why we see very Speedy completion when my hash functions are very large because of the baby rotar code over here if you change this logic and we evaluated for all and then we see if all of them are true or not then it would have taken equal amount of time or rather the time would have increased as with the hash function but because of a short circuit evaluation editor so this is interesting but the most interesting thing 2K I give thanks for that I would have never guessed it it would have taken me uh at least hours to debug what would have got what would have happened there but uh that was fascinating issue that we saw uh with that seed because time.miss we just blindly use it as a random seed every time but see how impactful that is that we saw quite quite a few interesting things there right so very interesting things that we covered very interesting issues that we tackled we saw impact of hash functions we saw impact of Bloom filter size uh we measured the false positivity rate we wrote a basic Loop filter with a Boolean array and then we change it to bit array oh sorry uh byte byte array where we did bit manipulation to set that bit and all we slash the space usage by a by factor of eight Finally Friday night and right thanks thanks folks thanks folks for tuning in I don't have anything more to do but uh this was quite part by the way you can hunt on my replit and check this thing out we did we did quite a few things there so very interesting stuff thanks for tuning in folks this was fine but up do check out replicated in case you want I have my joining link there in case you are interested Hopper lock I I love not having Hazard to set it up out of V what shot and a bit manipulation is ready hey humans don't write good with manipulation I understood we wanted to do mod eight and like uh divided by it and more that one love hurt that fine but uh at least helping us be productive there great folks thanks thanks so much uh thanks folks for today again very uh thanks again Friday evening uh have some fun I have a few meetings to catch up with an office so I'll jump on to those meetings now but a quite interesting discussion maybe next time I'll pick up something new I have five now I'm thinking because I I had fun doing this I'll definitely do it more often even though I make hundreds of mistake it's fine it is basically getting our hands dirty seeing these interesting issues makes us better Engineers anyway great thanks thanks folks for today again I'll end the live stream now uh yeah you go and you play with it and see thanks thanks thank you so much have a great evening I'll stop The Stream and stream and streaming stream yeah yeah that's for sure I'm I I'm I loved it 100 percent uh because I could be me right that's the best part I could be me rather than having that fully recorded stuff I could do thanks folks

Here's the video ⤵

Courses I teach

Alongside my daily work, I also teach some highly practical courses, with a no-fluff no-nonsense approach, that are designed to spark engineering curiosity and help you ace your career.


System Design Masterclass

A no-fluff masterclass that helps SDE-2, SDE-3, and above form the right intuition to design and implement highly scalable, fault-tolerant, extensible, and available systems.


Details →

System Design for Beginners

An in-depth and self-paced course for absolute beginners to become great at designing and implementing scalable, available, and extensible systems.


Details →

Redis Internals

A self-paced and hands-on course covering Redis internals - data structures, algorithms, and some core features by re-implementing them in Go.


Details →


Writings and Learnings

Knowledge Base

Bookshelf

Papershelf


Arpit's Newsletter read by 100,000 engineers

Weekly essays on real-world system design, distributed systems, or a deep dive into some super-clever algorithm.