These are my notes based on the paper SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL.
TL;DR
SQL has long been the dominant language for structured data processing, through this paper GoogleSQL team introduced a new pipe-structured data flow syntax that significantly improves SQL’s readability, extensibility, and ease of use.
The approach involves adding pipe operators (|>
) to SQL which essentially breaks down complex queries into steps, making them easier to understand and maintain. This syntax enables operations to be composed arbitrarily, in any order, drastically simplifying complex queries and improving readability. Three things I found interesting were
- pipe syntax that makes SQL more linear, extensible, and readable
- ”prefix property” that allows running partial queries to see intermediate results, aiding in debugging
- experimental debugging operators (
ASSERT
,LOG
,DESCRIBE
)
Three interesting things
Continuing the discussion from the above TL;DR, here are three things that I found interesting in this paper and some quick details about each.
Notes and a quick explanation
SQL, the de facto standard for structured data processing, has stood the test of time for 50+ years. Google proposed an extension to SQL and introduced pipe-structured data flow syntax to improve its usability and extensibility.
The Fundamental Issues with Standard SQL
SQL’s rigid clause order (SELECT ... FROM ... WHERE ... GROUP BY
) doesn’t reflect the actual data flow, which begins with table scans in the FROM
clause. This structure also complicates the process of extending the language with new features This disconnect leads to several issues, like
- SQL uses duplicate clauses (
WHERE
,HAVING
,QUALIFY
) to work around rigid clause order - many simple operations require subqueries, leading to deeply nested, hard-to-read code
- SQL’s structure makes tracing logic difficult, especially in large queries
- adding new query operations is challenging and there is an over-reliance on reserved keywords
The Pipe Syntax Solution
In traditional SQL, queries are typically written in a single, monolithic statement. GoogleSQL’s new pipe syntax allows for a sequential approach, where the output of one operation is “piped” as the input to the next. This approach is intuitive and also aligns with how data is processed.
With GoogleSQL’s pipe syntax, queries can have zero or more pipe operators as a suffix, delineated with ”|>“. Here’s a quick example of how a SQL query would look with pipe syntax
SELECT column1, column2
FROM table1
WHERE condition1
|> JOIN table2 ON table1.id = table2.id
|> SELECT column3
|> ORDER BY column3 DESC;
Each pipe operator is a unary relational operation, taking one table as input and producing one table as output. Intermediate results in SQL are tables with one or more columns and zero or more rows. This structure ensures composability and allows operators to be applied in any order, any number of times. Each pipe operator is self-contained, seeing only its input table and arguments. This isolation makes pipe operators naturally composable.
Extensibility
The pipe syntax dramatically improves SQL’s extensibility with Table-Valued Functions (TVFs). The CALL
operator allows invoking TVFs directly without nested subqueries. Here’s a quick example
SELECT '<text>' AS input, 7 AS rating
|> CALL ML.PREDICT(MODEL `my_project.nnlm_embedding_model`)
|> CALL ML.PREDICT(MODEL `my_project.imdb_classifier`)
Built-in Operator Extensions
Pipe syntax simplifies the addition of new built-in operators. For example, the PIVOT
operator, which is awkwardly implemented in standard SQL, becomes a naturally composable pipe operator:
FROM customer JOIN nation ON c_nationkey = n_nationkey
|> SELECT n_name, c_acctbal AS bal, c_mktsegment
|> PIVOT(SUM(bal) AS bal FOR n_name IN ('PERU', 'KENYA', 'JAPAN'))
Experimental Debugging Operators
GoogleSQL also introduces debugging operators that leverage pipe syntax
ASSERT
: adds assertions to SQL queriesLOG
: logs intermediate result tables for debuggingDESCRIBE
: returns schema information for intermediate tables
Deep Observations on Syntax Flexibility
A good thing about this new syntax is its flexibility in accommodating both existing SQL operations and potential future extensions. Hence the existing features of SQL like complex joins, subqueries, and aggregations will not be abandoned and could be expressed in a piped syntax.
Pipe syntax also supports user-defined functions (UDFs) in a more natural and readable manner. This feature is crucial where data transformations often require custom operations that are not natively supported by SQL. The modular nature of the pipe syntax ensures that these UDFs can be incorporated without breaking the flow of the query.
Implementation and Adoption
To get this adopted, the GoogleSQL team designed the pipe syntax as an extension of the existing SQL syntax, rather than a replacement. This allows users to gradually adopt the new syntax in their queries, without needing to fully commit to it from the outset.
The implementation produces the same algebra as standard SQL, requiring minimal work for query engines to enable the feature. This approach makes pipe syntax immediately a first-class feature in supporting query engines.
GoogleSQL implemented pipe syntax as a reusable component shared across multiple query engines, including F1, BigQuery, Spanner, and Procella. This shared language analysis component allowed for a single implementation to be enabled across multiple engines.
The content presented here is a collection of my notes and explanations based on the paper. You can access the full paper SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL . This is by no means an exhaustive explanation, and I strongly encourage you to read the actual paper for a comprehensive understanding. Any images you see are either taken directly from the paper or illustrated by me .