Persistent Data Structures - An Introduction


Ordinary data structures are ephemeral implying that any update made to it destroys the old version and all we are left with is the updated latest one. Persistent Data Structures change this notion and allow us to hold multiple versions of a data structure at any given instant of time. This enables us to go back in "time" and access any version that we want.

In this essay, we take a detailed look into the world of Persistent Data Structures and see the basics of its implementation along with where exactly we can find them in action. This essay is meant to act as an exhaustive introduction to the topic and in future essays, we will dive deeper into the specifics of each data structure.

This essay is loosely based on the iconic paper published in 1986 titled Making Data Structures Persistent by James R. Driscoll, Neil Sarnak, Daniel D. Sleator, and Robert E. Tarjan.

Persistence

Persistent Data Structures preserve previous versions of themselves allowing us to revisit and audit any historical version. Depending on the operations allowed on the previous versions, persistence is classified into three categories

Partially Persistent

Partially Persistent Data Structures allows access to all the historical versions but allows modification to only the newest one. This typically makes historical versions of the data structure immutable (read-only).

https://user-images.githubusercontent.com/4745789/107117960-e4a66480-68a3-11eb-97a0-c8c412527471.png

Fully Persistent

Fully Persistent Data Structures allows access and modification to all the historical versions. It does not restrict any modifications whatsoever. This means we can typically revisit any historical version and modify it and thus fork out a new branch.

https://user-images.githubusercontent.com/4745789/107117958-e112dd80-68a3-11eb-971b-58034e693f44.png

Confluently Persistent

Confluently Persistent Data Structures allow modifications to historical versions while also allowing them to be merged with existing ones to create a new version from the previous two.

https://user-images.githubusercontent.com/4745789/107117954-da846600-68a3-11eb-9c34-b9489c170710.png

Applications of Persistent Data Structures

Persistent Data Structures find their applications spanning the entire spectrum of Computer Science, including but not limited to - Functional Programming Languages, Computational Geometry, Text Editors, and many more.

Functional Programming Languages

Functional Programming Languages are ideal candidates for incorporating Persistent Data Structures as they forbid, while some discourage, the mutability of underlying structures. These languages pass around states within functions and expect that they do not update the existing one but return a new state. Programming languages like Haskell, Clojure, Elm, Javascript, Scala have native Persistent implementations of data structures like Lists, Maps, Sets, and Trees.

Computational Geometry

One of the fundamental problems in Computational Geometry is the Point Location Problem which deals with identifying the region where the query point lies. A simpler version of the problem statement is to find if a point lies within or outside a given polygon. A popular solution to determine a solution to the Point Location problem statement uses Persistent Red-Black Trees.

Text and File Editing

The most common operation required by any Text or File editing tool is Undo and Redo and having persisted all historical versions through a persistent data structure makes these most frequent operations very efficient and a breeze.

Implementing Partial Persistence

There are a few generic techniques that help in implementing Partial Persistence. Just to reiterate, partial persistence allows access to all the historical versions but allows modification to only the newest one, to create a newer updated copy of data.

Copy on Write Semantics

A naive way to implement Partial Persistence is by utilizing the Copy-on-Write semantics and naively creating a deep copy upon every update. This technique is inefficient because upon every writes the entire data structure is deep copied.

There are certain methods built upon certain storage paradigms that copy only what matters making the entire CoW efficient. I have already gone in-depth of Copy-on-Write semantics and I encourage you to check that out.

The Fat Node Method

Instead of creating copies of the entire data structure, the Fat Node method suggests that each cell holding the value within the data structure is modified to hold multiple values (one for each version) making it arbitrarily fat. Each value node thus holds a value and a version stamp.

class Node:
    def __init__(self, value: object):
        # references to other Nodes creating a Node topology
        # that forms the core of the data structure
        # this could be `next` and `prev` pointers in a linked list
        # or `left`, `right` in case of a binary search tree.
        self.refs: List[] = []

        # holding all the values against the version (key)
        self.values: Dict[int, object] = {}

The Node-Copying Method

The Node-Copying method eliminates all the problems with the Fat Node method. It allows each node to hold only a fixed number of values in it. Upon exhaustion of space, a new copy of Node is created and it holds only the newest values in it. The old node also holds pointers to the newly created node allowing browsing.

Having this structure helps in making update operation slightly efficient by reducing the number of nodes to be copied during writes, considering the in-degrees to each node is bounded.

Path-Copying Method

The path-copying method copies all the nodes coming in the path from the root to the node being modified. This way it tries to minimize the copy and promotes reusing some of the unmodified data. This method comes in super handy in Linked Data Structures like Lists and Trees.

https://user-images.githubusercontent.com/4745789/107144006-33b0d000-695e-11eb-9e13-959eaeba44f4.png

Implementation of Full Persistence is a mammoth of a topic on its own and deserves its own essay. Hence skipping it from the scope of this one.

Reducing the memory footprint

Since persistent data structures thrive on high memory usage, they require some garbage collection system to prevent memory leaks. Algorithms like Reference Counting or Mark and Sweep serves the purpose pretty well.

Thus when a historical version is not referenced anymore in the program space, the corresponding objects and nodes are freed up.

References


Arpit Bhayani

Arpit's Newsletter

CS newsletter for the curious engineers

❤️ by 17000+ readers

If you like what you read subscribe you can always subscribe to my newsletter and get the post delivered straight to your inbox. I write essays on various engineering topics and share it through my weekly newsletter.




Other essays that you might like


Leaderless Replication

444 reads 2022-01-16

In this essay, we take a look into a different way of replication, called Leaderless Replication, that comes in handy in...

Master-Replica Replication

350 reads 2021-08-07

In this essay, we talk about everything we should know about Master-Replica replication pattern....

Bitcask - A Log-Structured Fast KV Store

574 reads 2020-07-19

Bitcask is a Key-Value store that persists its data in append-only log files and still reaps super-performant read-write...

The RUM Conjecture

670 reads 2020-05-31

While designing any storage system the three main aspects we optimize for are Reads, Updates, and auxiliary Memory. RUM ...


Be a better engineer

A set of courses designed to make you a better engineer and excel at your career; no-fluff, pure engineering.


System Design Masterclass

A masterclass that helps you become great at designing scalable, fault-tolerant, and highly available systems.

800+ learners

Details →

Designing Microservices

A free playlist to help you understand Microservices and their high-level patterns in depth.

17+ learners

Details →

GitHub Outage Dissections

A free playlist to help you learn core engineering from outages that happened at GitHub.

67+ learners

Details →

Hash Table Internals

A free playlist to help you understand the internal workings and construction of Hash Tables.

25+ learners

Details →

BitTorrent Internals

A free playlist to help you understand the algorithms and strategies that power P2P networks and BitTorrent.

42+ learners

Details →

Topics I talk about

Being a passionate engineer, I love to talk about a wide range of topics, but these are my personal favourites.




Arpit's Newsletter read by 17000+ engineers

🔥 Thrice a week, in your inbox, an essay about system design, distributed systems, microservices, programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.



  • v12.7.8
  • © Arpit Bhayani, 2022

Powered by this tech stack.