My first change in a massive open source CPython codebase

Watch the video explanation ➔

Open-source projects are a treasure trove of knowledge and a great opportunity for engineers to enhance their skills. However, delving into a complex codebase can be overwhelming, especially when trying to find the starting point. In this article, I will guide you through the process I follow to navigate massive open-source code bases and locate the initial entry point. To demonstrate the methodology, I will use CPython as a reference, which is the C implementation of Python.

Setting up the Code Base

To begin, clone the CPython repository, which contains the entire source code of Python. The clone may take some time because of the size of the code base, but the effort is worthwhile. The repository's README usually contains the setup instructions. For CPython, the commands needed to set up the project locally are ./configure followed by make. This configure-then-make pair is common among C-based open-source projects.

Running make builds the Python executable from the source tree you checked out, so the resulting python binary reflects exactly that version of the code. Execute ./python (on macOS the binary is named python.exe) and check that the version it reports matches the one mentioned in the README.

Finding the Entry Point

When working with a large code base, understanding where the execution begins is crucial. Since CPython is implemented in C, we know that it should have a main function, as execution typically starts there. To locate the main function, a common approach is to use a text search within the repository.

By running a case-sensitive search for main and keeping only the matches where main is followed by an opening parenthesis, as in main(, we can narrow down the results. Scrolling through the list of files, we can pick out the most relevant ones. In the case of CPython, the main.c file under the Modules folder looks promising.

Examining main.c, we find the Py_Main and Py_BytesMain functions, which serve as the entry points on Windows and UNIX platforms respectively. Following the references to Py_Main leads to the python.c file, where the actual main function is defined. The conditional compilation there shows that Py_Main is invoked on Windows, while Py_BytesMain is used on UNIX.

Making Changes and Building the Binary

To make our first change, let's add a printf statement that announces this is our own version of CPython. We add the statement in main.c, right after the opening brace of the entry function we located (Py_BytesMain on UNIX, Py_Main on Windows). Alternatively, we could have added it in python.c, where the main function itself is defined.

After making the modification, we rebuild the binary by running make again. This takes a few seconds and produces the updated python executable. Running ./python now shows our custom printf message as the first line of output, confirming that the change made it into the build.

Embracing the Open Source Journey

Navigating and understanding a massive open-source code base may seem daunting at first. However, with time and familiarity, it becomes an achievable task. By gradually exploring the code, locating the entry point, and making small modifications, you can gain confidence and deepen your understanding of the project.

Now that you have successfully grasped the process of navigating a large code base like CPython, you can use this knowledge to explore other open-source projects. Remember, it takes time and persistence to become comfortable with complex code bases. Embrace the opportunity to learn from and

