Arpit's Newsletter read by 70000+ engineers
Weekly essays on real-world system design, distributed systems, or a deep dive into some super-clever algorithm.
To handle trillions of data points and petabytes of data every single day, Discord needs a simple yet robust Data Platform.
Here’s a quick overview of their arch and key design decisions 👇
A data platform comprises a set of services that replicate data from various databases, put it in central storage, and make it available for consumption by internal teams and services.
Discord needs to analyze the data to
For example, firing queries across orders (MongoDB) and payments (MySQL) to get the items that generated the most revenue.
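As a rough illustration of that cross-database question, here is a minimal sketch in Python. The records, field names, and the `revenue_by_item` helper are all hypothetical; they only stand in for what orders pulled from MongoDB and payments pulled from MySQL might look like once they sit side by side.

```python
from collections import defaultdict

# Hypothetical records, shaped as they might look after leaving each store.
orders = [  # e.g. documents from MongoDB
    {"order_id": 1, "item": "mug"},
    {"order_id": 2, "item": "t-shirt"},
    {"order_id": 3, "item": "mug"},
]
payments = [  # e.g. rows from MySQL; amounts kept in integer cents
    {"order_id": 1, "amount_cents": 1200},
    {"order_id": 2, "amount_cents": 1500},
    {"order_id": 3, "amount_cents": 1200},
]


def revenue_by_item(orders, payments):
    """Join payments to orders on order_id and total the revenue per item."""
    item_of = {o["order_id"]: o["item"] for o in orders}
    totals = defaultdict(int)
    for p in payments:
        item = item_of.get(p["order_id"])
        if item is not None:  # skip payments with no matching order
            totals[item] += p["amount_cents"]
    # Highest-revenue items first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


top_items = revenue_by_item(orders, payments)
```

Doing this join by hand for every question does not scale, which is the motivation for landing both datasets in one warehouse.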
Discord uses Google’s BigQuery as its Data Warehouse (a place to keep and query large volumes of data). The data is stored, processed, and consumed across 3 layers: Transactional, Core, and Derived.
Let’s understand each layer in detail.
The transactional layer comprises the transactional databases used in powering the microservices. These databases typically act as a source of truth for the services.
Microservices are free to choose the flavor of database, SQL or NoSQL, that best powers their use case.
The core layer holds the series of tables in BigQuery that are populated using the transactional layer.
Data pipelines replicate the data from various transactional databases, like MongoDB, MySQL, etc., into a set of structured core tables, which become the input for the subsequent Derived Layer.
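One step such a pipeline likely needs is turning a nested document into a flat row with a fixed set of columns. The sketch below shows that idea only; the `to_core_row` helper, the underscore naming scheme, and the sample document are assumptions for illustration, not Discord's actual pipeline.

```python
def to_core_row(doc, columns):
    """Flatten a nested document into a flat row limited to the given columns.

    Nested keys are joined with "_" (user.id -> user_id). Columns missing
    from the document are filled with None so every row has the same shape.
    """
    flat = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, inner in value.items():
                walk(f"{prefix}_{key}" if prefix else key, inner)
        else:
            flat[prefix] = value

    walk("", doc)
    return {column: flat.get(column) for column in columns}


# Hypothetical document and core-table schema.
order_doc = {"_id": "o1", "user": {"id": 7, "country": "IN"}, "total": 500}
core_columns = ["_id", "user_id", "user_country", "total", "coupon"]
row = to_core_row(order_doc, core_columns)
```

Fixing the column list up front is what makes the core table "structured": documents with extra or missing fields still produce rows of the same shape.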
Derived Tables are the actual consumable tables created from a set of core tables. Each team can create its own set of derived tables by joining a set of core tables as per their need.
Each derived table is essentially a SQL query on core tables. The specified SQL query is fired periodically to join and replicate the unprocessed data into a derived table.
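To make "a derived table is a SQL query on core tables" concrete, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for BigQuery. The table names, columns, and the `refresh_derived` helper are hypothetical, and for simplicity it rebuilds the whole derived table on each run rather than processing only the unprocessed data.

```python
import sqlite3

# The derived table's definition: a plain SQL query over core tables.
DERIVED_SQL = """
SELECT o.item AS item, SUM(p.amount_cents) AS revenue_cents
FROM core_orders AS o
JOIN core_payments AS p ON p.order_id = o.order_id
GROUP BY o.item
"""


def refresh_derived(conn, table, sql):
    """Materialize the query result as a table, rebuilding it from scratch."""
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} AS {sql}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE core_orders (order_id INTEGER, item TEXT);
CREATE TABLE core_payments (order_id INTEGER, amount_cents INTEGER);
INSERT INTO core_orders VALUES (1, 'mug'), (2, 't-shirt'), (3, 'mug');
INSERT INTO core_payments VALUES (1, 1200), (2, 1500), (3, 1200);
""")

# In a real setup a scheduler would call this periodically.
refresh_derived(conn, "item_revenue", DERIVED_SQL)
rows = conn.execute(
    "SELECT item, revenue_cents FROM item_revenue ORDER BY revenue_cents DESC"
).fetchall()
```

Consumers then query `item_revenue` directly and never need to know how the join is written.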
Each derived table has its own configuration file that holds
The YAML file also specifies a replication strategy, which determines whether the output of the SQL query should be appended to, merged with, or replace the existing derived data.
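The three strategies can be sketched on plain in-memory rows. This is an illustration under assumptions: the `apply_strategy` helper and the strategy names as strings are hypothetical, and "merge" is interpreted here as an upsert on a key column, which the source does not spell out.

```python
def apply_strategy(existing, new_rows, strategy, key=None):
    """Combine new query output with existing derived rows.

    append  - keep existing rows and add the new ones after them
    replace - discard existing rows and keep only the new ones
    merge   - upsert by `key` (assumed meaning): new rows overwrite
              existing rows with the same key, other rows are kept
    """
    if strategy == "append":
        return list(existing) + list(new_rows)
    if strategy == "replace":
        return list(new_rows)
    if strategy == "merge":
        if key is None:
            raise ValueError("merge needs a key column")
        merged = {row[key]: row for row in existing}
        for row in new_rows:
            merged[row[key]] = row
        return list(merged.values())
    raise ValueError(f"unknown strategy: {strategy}")


existing = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
fresh = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]

appended = apply_strategy(existing, fresh, "append")
replaced = apply_strategy(existing, fresh, "replace")
merged = apply_strategy(existing, fresh, "merge", key="id")
```

Append suits event-like data that only grows, replace suits small tables that are cheap to recompute, and merge suits rows that change over time.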
A separate Kubernetes (K8s) pod runs for each derived table, ensuring isolated, continuous data replication into that table.
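The one-pod-per-table idea could look roughly like the sketch below, which builds a minimal Kubernetes Pod manifest for each derived table. The `pod_manifest` helper, the image name, the `--table` argument, and the label key are all made up for illustration; only the manifest's top-level shape (`apiVersion`, `kind`, `metadata`, `spec.containers`) follows the standard Pod format.

```python
def pod_manifest(table_name, image="example.com/derived-table-runner:latest"):
    """Build a minimal Pod manifest dedicated to one derived table."""
    # Pod names cannot contain underscores, so swap them for hyphens.
    pod_name = "derived-" + table_name.lower().replace("_", "-")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name, "labels": {"derived-table": table_name}},
        "spec": {
            "restartPolicy": "Always",
            "containers": [
                {
                    "name": "runner",
                    "image": image,  # hypothetical image
                    "args": ["--table", table_name],  # hypothetical flag
                }
            ],
        },
    }


# One manifest, and therefore one isolated pod, per derived table.
manifests = [pod_manifest(t) for t in ["item_revenue", "daily_active_users"]]
```

Because each table gets its own pod, a slow or failing query for one table does not hold up replication for the others.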
Thus, each team can define its own set of derived tables using just a SQL query, enabling teams to make data-driven decisions.