You’re working on an experimental social network feature at Meta that tracks “friend clusters”: groups of friends who frequently interact together. For a test dataset with a varying number of users, you need to analyze group interactions.

The data is stored as follows: whenever users interact in a group (through group chat, group video calls, or group tags), you record a score of 1 for each pair of users in that group for that day. For example, if users A, B, and C have a group video call, you record the interactions (A,B), (B,C), and (A,C).

After collecting 30 days of data, you notice that your server’s memory usage is extremely high. Upon investigation, you find that most users interact with fewer than 50 other users per month, despite having hundreds of friends.
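To make the pair expansion above concrete, here is a tiny illustration (my own sketch, not the reference code) using itertools.combinations to turn one group interaction into its pair-wise records:

from itertools import combinations

# A group video call between A, B, and C produces one interaction record
# for every unordered pair of participants on that day.
group = ["A", "B", "C"]
print(list(combinations(group, 2)))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]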
Your task is to consider both time and space complexity in your solution, as this will be deployed in production. The inputs for which the approach should be compared against a simple, naive one are:
(10000, 5000, 5)
(10000, 5000, 10)
(100000, 50000, 5)
(100000, 50000, 10)
(100000, 100000, 10)
Each of the above tuples is (n_users, n_groups, group_size), where:

n_users: The total number of users in the social network. For example, 1000 means there are 1000 unique users who could potentially interact.

n_groups: The number of group interactions that will be simulated. For example, 500 means there will be 500 different group gatherings/interactions. Each group interaction creates multiple pair-wise connections between its members.

group_size: The number of people in each group interaction. For example, 5 means each group activity involves 5 people. For each group of size 5, we create (5 * 4) / 2 = 10 pair-wise interactions.

Use this information to generate random data, build an efficient solution, and measure the efficiency and trade-offs.
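One way to generate the random test data is sketched below (the function name, seeding, and return shape are my own assumptions, not the reference code): each group samples group_size distinct users and is expanded into all of its pair-wise connections.

import random
from itertools import combinations

def generate_interactions(n_users, n_groups, group_size, seed=42):
    # Each group interaction samples `group_size` distinct users and is
    # expanded into (group_size * (group_size - 1)) / 2 pair-wise records.
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        members = rng.sample(range(n_users), group_size)
        groups.append(list(combinations(members, 2)))
    return groups

# e.g. 5000 groups of 5 users each -> 10 pairs per group
data = generate_interactions(10000, 5000, 5)
print(len(data), len(data[0]))  # 5000 10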
Here’s the code for reference and some notes on the solution below.
Sparse matrices are the key concept here because most entries of the user-to-user interaction matrix are 0.
Now, register the interactions data in a sparse matrix vs. a regular matrix, and use the heapq.nlargest function to find the top 3 users who interact most frequently with a given user.
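Here is a rough sketch of how the two implementations might look. The class names and the .interactions attribute follow the memory-measurement snippet further down; everything else is my own assumption, not the exact reference code. The naive tracker keeps nested dicts of counters, while the sparse tracker accumulates counts in a scipy lil_matrix and converts it to CSR; heapq.nlargest then pulls the top 3 neighbours for a user.

import heapq
import numpy as np
from collections import defaultdict
from scipy.sparse import lil_matrix

class SimpleInteractionTracker:
    # Naive approach: nested dicts, user -> {other_user: interaction_count}.
    def __init__(self, n_users):
        self.interactions = defaultdict(dict)

    def record(self, a, b):
        self.interactions[a][b] = self.interactions[a].get(b, 0) + 1
        self.interactions[b][a] = self.interactions[b].get(a, 0) + 1

    def top_friends(self, user, k=3):
        return heapq.nlargest(k, self.interactions[user].items(), key=lambda kv: kv[1])

class SparseInteractionTracker:
    # Sparse approach: accumulate counts in a lil_matrix (cheap incremental
    # writes), then convert to CSR for compact storage and fast row reads.
    def __init__(self, n_users):
        self._lil = lil_matrix((n_users, n_users), dtype=np.int32)
        self.interactions = None  # CSR matrix, built by finalize()

    def record(self, a, b):
        self._lil[a, b] += 1
        self._lil[b, a] += 1

    def finalize(self):
        self.interactions = self._lil.tocsr()

    def top_friends(self, user, k=3):
        row = self.interactions.getrow(user)
        return heapq.nlargest(k, zip(row.indices, row.data), key=lambda kv: kv[1])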
Generate random data for the given inputs and measure the time and memory usage of the solution by comparing it against a simple naive approach. Insertion time and query time are easy to measure. To measure memory, we can use the following code snippet:
# Measure memory usage
if isinstance(impl, SimpleInteractionTracker):
    memory = sum(len(v) for v in impl.interactions.values()) * 12  # Approximate
else:
    memory = impl.interactions.data.nbytes + impl.interactions.indptr.nbytes + impl.interactions.indices.nbytes
Doing this for all the inputs and comparing the results gives a good picture of sparse matrices vs. a regular representation: how much memory sparse matrices save, and what the trade-offs in insertion and query time are.
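A minimal benchmark harness along these lines might look as follows (a sketch that assumes the generate_interactions helper and the tracker classes from the earlier sketches are in scope; timings via time.perf_counter):

import time

def benchmark(impl_cls, n_users, n_groups, group_size):
    groups = generate_interactions(n_users, n_groups, group_size)
    impl = impl_cls(n_users)

    start = time.perf_counter()
    for pairs in groups:
        for a, b in pairs:
            impl.record(a, b)
    if hasattr(impl, "finalize"):
        impl.finalize()  # build the CSR matrix for the sparse tracker
    insert_time = time.perf_counter() - start

    start = time.perf_counter()
    for user in range(0, n_users, max(1, n_users // 1000)):  # query ~1000 users
        impl.top_friends(user)
    query_time = time.perf_counter() - start
    return insert_time, query_time

inputs = [(10000, 5000, 5), (10000, 5000, 10), (100000, 50000, 5),
          (100000, 50000, 10), (100000, 100000, 10)]
for params in inputs:
    for cls in (SimpleInteractionTracker, SparseInteractionTracker):
        print(cls.__name__, params, benchmark(cls, *params))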