In this essay, we will model taxonomy on top of a relational database, and as a specific example, we will try to build Udemy’s Taxonomy. The primary focus of this essay is to understand how to design taxonomy on top of SQL based relational DB, define and write queries that are computationally efficient along with deciding indexes on the designed tables.
In the process, we will also understand a very interesting SQL construct like Window Functions that helps us solve seemingly complex use-cases with a single SQL query.
Udemy’s Taxonomy
Udemy’s Taxonomy is very simple; it features top-level categories - like Software Engineering, Arts, and Business - each category has multiple sub-categories - like Programming Languages, Databases, Sketching - and each sub-category has niche topics like - Python, Javascript, MySQL, etc.
To keep things simpler, we restrain that one topic can be part of only one sub-category and one sub-category can belong to only one top-level category; and that makes the maximum levels in this taxonomy as 3
.
Database Design
Out of our intuition, we can have one table for categories, one for holding sub-categories, and one for topics, and a bunch of foreign keys that weaves them together. But is this the best we can come up with? A few issues with this design is
- all the 3 tables will have an identical schema
- if we were to introduce a new level, say
concept
that sits between sub-category and topic, we will have to create a new table to accommodate it, making this design cumbersome to future features and extensions. - what if for a few topics we want it to be a child of a category, leaving out sub-categories altogether; handling this with this design will be very tricky.
So, we need a better design, that is robust and extensible and hence we go for a single table called topics
that holds categories, sub-categories, and topics differentiated with a column called type
distinguishing between the 3. The schema of this table topics
would be
Now that we have the table topics
ready, we see how the following two topics are stored
- Software Engineering > Programming Languages > Python
- Software Engineering > Programming Languages > Javascript
Indexes on topics
Picking the right set of indexes is one of the most critical decisions that you will be taking while designing this system. A good set of indexes boosts the overall performance of the system, while poor and/or missing ones will put your database under a terrible load, especially at scale.
But how do we pick which indexes do we want on topics
? The answer here is very simple, it depends on the kind of queries we have to support. So, let’s list down queries that we will need and then determine indexes to make them efficient.
Get topic by ID
The most common query that we’d need is getting a topic by its id
and this is very well facilitated by making id
as a primary key of the table.
SELECT * FROM topics WHERE id = 531;
Get the topic path
Getting a topic path is an interesting use case. While rendering any category, sub-category, or topic page we would need to render breadcrumbs that hold the path of it in the taxonomy. For example, for Python’s page, we will need to render a path like
Software Engineering > Programming Languages > Python
This path helps users explore and discover new categories, sub-categories, or topics. So, with our current schema, how could we compute the topic path for a given topic id.
Doing it on the application side is the first approach that comes to mind but it is a poor one because we would be making n
selects for n
levels. In the case of our current system, we will be making 3
selects to compute the topic path; with the application pseudocode looking something like this
def get_topicpath(topic_id):
path = []
topic = get_topic_by_id(topic_id)
path.append(topic)
while topic.parent_id:
topic = get_topic_by_id(topic.parent_id)
path.append(topic)
return path
We can do a lot better than this. Since we know that the hierarchy has at max 3 levels, we can just do this in one SQL query with minor NULL
handling on the application side.
The SQL query to get the topic path would have to join 3
instances of the topics
table, each one handling one level in the hierarchy and joining with its parent on parent_id
. The SQL query would fetch the id
and the name
of the topics in the topic path.
SELECT topics_level1.id, topics_level1.name,
topics_level2.id, topics_level2.name,
topics_level3.id, topics_level3.name
FROM topics AS topics_level3
LEFT JOIN topics AS topics_level2
ON topics_level2.id = topics_level3.parent_id
LEFT JOIN topics AS topics_level1
ON topics_level1.id = topics_level2.parent_id
WHERE topics_level3.id = 610;
In the SQL query above we fetch the topic path for topic id 610
. We join table topics
twice (3 instances of topics table) each handling a distinct level. Since we are using JOIN, if a parent_id
is NULL
and the join parameter would not match anything which would result NULL
selects for those columns. These NULL
values come in very handy when we compute the topic path for sub-categories and categories.
If the topic with 610
id is of type topic
then
topics_level1.id
,topics_level1.name
will be categorytopics_level2.id
,topics_level2.name
will be sub-categorytopics_level3.id
,topics_level3.name
will be topic
If the topic with 610
id is of type sub-category
then
topics_level1.id
,topics_level1.name
will beNULL
topics_level2.id
,topics_level2.name
will be categorytopics_level3.id
,topics_level3.name
will be sub-category
If the topic with 610
id is of type category
then
topics_level1.id
,topics_level1.name
will beNULL
topics_level2.id
,topics_level2.name
will beNULL
topics_level3.id
,topics_level3.name
will be category
So, in the application code, we still access all the selected columns but we create the topic path skipping the NULL
values accordingly.
To support this query, our table only requires Primary Key on id
and Foreign Key on parent_id
.
Get all the children of a category or a sub-category
Getting all the children of a category or a sub-category will be heavily used to drive the “Browse and Explore” page, where users would want to drill down and explore the kind of topics Udemy covers. SQL Query for this has to support pagination and will be required to output all children for a given parent, in order of score
such that more popular children are returned first.
SELECT * FROM topics WHERE parent_id = 123 ORDER BY score DESC;
The SQL query above fetches all the child topics of a given parent topic with id
= 123
. Since we are ordering by score
, for this query to be efficient we create a composite index on (parent_id, score)
.
Get category hierarchy
Udemy, on its home page, puts out all the categories under a dropdown menu enabling users to explore top categories and topics in a glimpse.
One peculiar behavior of this is it shows all categories and top k
sub-categories within each. Once we hover upon a sub-category it makes a network call to fetch top topics within that sub-category. This means we need to write a query that fetches all categories and k
sub-categories within each category from the entire topics
table.
Although it looks very complicated at first, it is very easy to do with a single SQL query.
SELECT t1_id, t1_name, t2_id, t2_name, t2_score
FROM (
SELECT topics1.id AS t1_id, topics1.name AS t1_name,
topics2.id AS t2_id, topics2.name AS t2_name,
ROW_NUMBER() OVER (PARTITION BY topics1.id) row_num
FROM topics AS topics1
LEFT JOIN topics AS topics2 ON topics1.id = topics2.parent_id
WHERE topics1.type = 1 and topics2.type = 2
ORDER BY topics1.score DESC, topics2.score DESC
) t
WHERE row_num <= 10;
Above SQL query picks all categories and top 10
sub-categories from each category and returns it as part of SELECT
. It uses a very interesting SQL construct called Window Functions, specifically ROW_NUMBER
and PARTITION BY
.
We perform the usual join on topics
once where the left operand is categories (topics with type = 1
) and the right one is a sub-category (topics with type = 2
). We then partition this join by category id
and then compute ROW_NUMBER
for sub-categories within it.
The row numbers are computed for each partition separately so it goes as 1, 2, 3, ..., n
for n
rows within each category. We then apply a simple WHERE
clause check on this row number to be <= k
which then typically matches the first k
row within each partition i.e category.
Note: to get “top” k
sub-categories we just apply for an additional ORDER BY
on score
that sorts the sub-categories ensuring top sub-categories are fetched first. This way the first k
rows we consider from the partition are essentially the top sub-categories within the category.
To make this SQL query efficient we would need a foreign key on parent_id
and an index on score
to make ORDER BY
efficient.
Summary of indexes we need on topics
- Primary Key on
id
- Foreign Key on
parents_id
- Index on
type
- Composite Index on
(parent_id, score)
Explore more
Although we covered quite a bit of this DB design there is always something interesting in exploring something new around this topic; so
- explore Nested Set Model to design Taxonomy on relational databases
- explore how DB engines behave when there are no indexes, you can use
EXPLAIN
to understand the behavior - find if there could be a better alternative to paginate results apart from
LIMIT/OFFSET
Thus we designed a neat Taxonomy on top of SQL-based relational databases like MySQL, Postgres, etc; wrote queries for some common scenarios, and determined the indexes to make taxonomy efficient.