Question
You’re working on an experimental social network feature at Meta that tracks
“friend clusters” - groups of friends who frequently interact together. For a test
dataset with a varying number of users, you need to analyze group interactions.
The data is stored as follows: whenever users interact in a group
(through group chat, group video calls, or group tags), you record a score of 1
for each pair of users in that group for that day. For example,
if users A, B, and C have a group video call, you record the interactions (A,B), (B,C), and (A,C).
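A minimal sketch of this expansion, assuming the pairs are generated with itertools.combinations (the variable names here are illustrative):

```python
from itertools import combinations

group = ["A", "B", "C"]  # participants in one group video call

# Record a score of 1 for every unordered pair in the group
pairs = list(combinations(group, 2))
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```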
After collecting 30 days of data, you notice that your server’s memory usage is
extremely high. Upon investigation, you find that most users interact with fewer
than 50 other users per month, despite having hundreds of friends.
Your task is to:
- Design an efficient data structure to store this interaction data
- Write a function that can quickly find the top 3 users who interact most frequently with a given user
- Estimate the memory usage of your solution for different numbers of users (as given below)
- Compare your solution’s memory usage with a simple naive approach
Consider both time and space complexity in your solution, as this will be deployed in production. The inputs on which your approach should be compared against a simple naive one are:
(10000, 5000, 5)
(10000, 5000, 10)
(100000, 50000, 5)
(100000, 50000, 10)
(100000, 100000, 10)
Each of the above tuples is (n_users, n_groups, group_size), where:
- n_users: the total number of users in the social network. For example, 1000 means there are 1000 unique users who could potentially interact.
- n_groups: the number of group interactions that will be simulated. For example, 500 means there will be 500 different group gatherings/interactions. Each group interaction creates multiple pairwise connections between its members.
- group_size: the number of people in each group interaction. For example, 5 means each group activity involves 5 people. For each group of size 5, we create (5 × 4) / 2 = 10 pairwise interactions.
Use this information to generate random data, build an efficient solution, and measure the efficiency and trade-offs.
Solution
Here’s the code for reference, along with some notes on the solution below.
Sparse matrices are the key concept here because most entries are 0.
- At the largest input size we have 100,000 users (100K × 100K = 10 billion potential pairwise entries)
- Each user interacts with fewer than 50 other people, so almost every entry is 0
- This is a classic sparse matrix scenario in which more than 99.95% of entries are 0
Now, record the interaction data both in a sparse matrix and with the simple naive approach, and use the heapq.nlargest
function to find the top 3 users who interact most frequently with a given user.
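Here is a minimal sketch of the two trackers being compared, assuming scipy.sparse for the sparse version and a dict-of-dicts for the naive one. The class names (SimpleInteractionTracker, SparseInteractionTracker) and method names are chosen to match the measurement snippet further below, not taken from a finished implementation:

```python
import heapq
from collections import defaultdict
from itertools import combinations

import numpy as np
from scipy.sparse import lil_matrix


class SimpleInteractionTracker:
    """Naive baseline: {user: {partner: score}} built from plain dicts."""

    def __init__(self, n_users):
        # n_users is unused; kept so both trackers share a constructor signature
        self.interactions = defaultdict(lambda: defaultdict(int))

    def add_group(self, members):
        for u, v in combinations(members, 2):
            self.interactions[u][v] += 1
            self.interactions[v][u] += 1

    def top3(self, user):
        # Highest-scoring partners for this user
        return heapq.nlargest(3, self.interactions[user].items(),
                              key=lambda kv: kv[1])


class SparseInteractionTracker:
    """Sparse version: symmetric n_users x n_users score matrix."""

    def __init__(self, n_users):
        # LIL format supports cheap incremental writes
        self.interactions = lil_matrix((n_users, n_users), dtype=np.int32)

    def add_group(self, members):
        for u, v in combinations(members, 2):
            self.interactions[u, v] += 1
            self.interactions[v, u] += 1

    def finalize(self):
        # CSR is compact and fast for row queries; it is also the format
        # the memory measurement below expects (data/indptr/indices)
        self.interactions = self.interactions.tocsr()

    def top3(self, user):
        row = self.interactions.getrow(user)
        # row.indices holds partner ids, row.data the matching scores
        return heapq.nlargest(3, zip(row.indices, row.data),
                              key=lambda kv: kv[1])
```

Writes into a LIL matrix are slower than dict updates; the payoff comes after tocsr(), which gives compact storage and fast row slicing for queries.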
Generate random data for the given inputs and measure the time and space complexity of the solution, comparing it against the simple naive approach. Insertion time and query time are easy to measure. To measure memory, we can use the following code snippet:
# Measure memory usage
if isinstance(impl, SimpleInteractionTracker):
    # Naive dict-of-dicts: roughly 12 bytes per stored pair (approximate)
    memory = sum(len(v) for v in impl.interactions.values()) * 12
else:
    # CSR sparse matrix: sum the sizes of its three underlying arrays
    memory = (impl.interactions.data.nbytes
              + impl.interactions.indptr.nbytes
              + impl.interactions.indices.nbytes)
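A sketch of a driver that generates random groups and times both implementations follows; run_benchmark and its internals are illustrative names under the class sketch above, not from the original code:

```python
import random
import time


def run_benchmark(n_users, n_groups, group_size, impl):
    # Build random group interactions and time insertion
    t0 = time.perf_counter()
    for _ in range(n_groups):
        members = random.sample(range(n_users), group_size)
        impl.add_group(members)
    if hasattr(impl, "finalize"):
        impl.finalize()  # convert LIL -> CSR for the sparse tracker
    insert_time = time.perf_counter() - t0

    # Time top-3 queries for a sample of users
    t0 = time.perf_counter()
    for user in random.sample(range(n_users), 100):
        impl.top3(user)
    query_time = time.perf_counter() - t0
    # Plug in the memory-measurement snippet above here as well
    return insert_time, query_time


for n_users, n_groups, group_size in [(10_000, 5_000, 5),
                                      (10_000, 5_000, 10),
                                      (100_000, 50_000, 5),
                                      (100_000, 50_000, 10),
                                      (100_000, 100_000, 10)]:
    for impl in (SimpleInteractionTracker(n_users),
                 SparseInteractionTracker(n_users)):
        print(n_users, n_groups, group_size, type(impl).__name__,
              run_benchmark(n_users, n_groups, group_size, impl))
```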
Doing this for all the inputs and comparing the results gives a good picture of sparse versus regular matrices: how much memory sparse matrices save, and what the trade-offs are.
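As a rough back-of-envelope check (an estimate under the assumptions above, not measured output): a dense int32 matrix for 100,000 users would need 100,000² × 4 B ≈ 40 GB, while a CSR matrix with at most ~50 partners per user (per the problem statement) needs about 100,000 × 50 nonzeros × 8 B (a 4 B value plus a 4 B column index) plus a few hundred KB of row pointers, i.e. roughly 40 MB - a saving of about three orders of magnitude. The dict-of-dicts baseline lands in between, since Python dict entries carry substantial per-entry overhead.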