Does the SQL Language Need a Pipe (|) Operator?

SQL remains a steadfastly declarative language. The Apache Spark™ DataFrame API draws its inspiration from SQL's declarative nature: you declare what to do with the data, not how to do it. Yet SQL queries do get quite complex, and developers often struggle with deeply nested subqueries. As a result, readability suffers.

So the simple answer is YES, we do! Let's examine why.

People love Spark's declarative DataFrame API for two reasons. First, it lets us cobble together a string of logical, sequential expressions, from start to finish. Second, most queries begin with a table name, followed by a string of operations on the data in that table.

Consider this simple example: imagine you want to write a Spark DataFrame query by chaining a few simple operations on data in a table named customers:

from pyspark.sql.functions import col, desc

result = spark.table("customers") \
    .filter(col("region") == "North") \
    .join(spark.table("orders"), col("customers.id") == col("orders.customer_id")) \
    .select("customer_name", "order_date", "amount") \
    .orderBy(desc("order_date")) \
    .limit(10)        
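For readers without a Spark session handy, the same start-to-finish flow can be mimicked in plain Python. This is only a sketch: the sample rows below are made up for illustration, and each step consumes the previous step's output, just as the chained DataFrame calls do.

```python
# Made-up sample rows standing in for the customers and orders tables.
customers = [
    {"id": 1, "customer_name": "Ada", "region": "North"},
    {"id": 2, "customer_name": "Bob", "region": "South"},
    {"id": 3, "customer_name": "Cy", "region": "North"},
]
orders = [
    {"customer_id": 1, "order_date": "2024-05-01", "amount": 120.0},
    {"customer_id": 3, "order_date": "2024-05-03", "amount": 75.5},
    {"customer_id": 2, "order_date": "2024-05-02", "amount": 30.0},
]

# The same sequence of stages as the DataFrame chain above:
north = [c for c in customers if c["region"] == "North"]           # filter
joined = [
    {**c, **o}
    for c in north
    for o in orders
    if c["id"] == o["customer_id"]                                 # join
]
projected = [
    {k: row[k] for k in ("customer_name", "order_date", "amount")}
    for row in joined                                              # select
]
result = sorted(projected,
                key=lambda r: r["order_date"],
                reverse=True)[:10]                                 # orderBy + limit
print(result)
```

Each intermediate name (north, joined, projected) corresponds to one stage of the chain, which is exactly the mental model the pipe syntax below encodes.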

Traditional SQL without the Pipe (|) operator would look as follows. Note that the table name is buried after the projection, which is the opposite of the flow in the Python code above.

SELECT customer_name, order_date, amount 
FROM customers 
JOIN orders ON customers.id = orders.customer_id 
WHERE region = 'North' 
ORDER BY order_date DESC 
LIMIT 10         

But if you use the Pipe (|) operator, the code is much simpler and more readable, and it matches the flow, format, and declarative style of the DataFrame code above: specify the table, then the filter, join, projection, ordering, and limit.

Using the general Pipe (|) operator syntax of the form:

FROM T
| WHERE A < 100
| SELECT B + C AS D
| ....        

We can translate our PySpark DataFrame code into SQL using the Pipe (|) operator-based syntax:

FROM customers 
| WHERE region = 'North' 
| JOIN orders ON customers.id = orders.customer_id 
| SELECT customer_name, order_date, amount 
| ORDER BY order_date DESC 
| LIMIT 10        

Take another, more complex example of a traditional SQL subquery:

[Figure: Traditional SQL complex subquery without the Pipe operator]

Now consider the same query written with the Pipe (|) operator syntax:

[Figure: Improved with the Pipe syntax operator]
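Since the figures are not reproduced here, a hypothetical example sketches the kind of rewrite they show. The tables reuse the customers and orders names from earlier, the query itself is made up, and the AGGREGATE stage follows the pipe-syntax form described in the Databricks blog (the exact keywords and operator spelling may differ in your Spark version).

```sql
-- Hypothetical query: regions with more than 1000 in total order amount.
-- Traditional form, with the aggregation buried in a subquery:
SELECT region, total
FROM (
    SELECT c.region, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
    GROUP BY c.region
) t
WHERE total > 1000
ORDER BY total DESC;

-- The same query in pipe syntax reads top to bottom, one stage per line:
FROM customers AS c
| JOIN orders AS o ON c.id = o.customer_id
| AGGREGATE SUM(o.amount) AS total GROUP BY c.region
| WHERE total > 1000
| ORDER BY total DESC
```

The nested version forces you to read inside-out, starting from the innermost subquery; the piped version reads in execution order.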

The Pipe (|) operator affords a few worthy merits. In a Databricks blog, SQL Gets Easier: Announcing New Pipe Syntax, Daniel Tenedorio introduces this feature in the upcoming Apache Spark 4.0 release; I summarize its key benefits:

  • It preserves SQL's core benefit as a declarative language: focusing on "what" rather than "how."
  • Readability improves when composing complex SQL queries with the Pipe (|) operator.
  • The Pipe (|) operator offers a more intuitive approach by enabling sequential, chained transformations, similar to the DataFrame API.
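That last point, sequential chaining, can be sketched in a few lines of Python: each stage is a function from rows to rows, applied in order, exactly how each pipe stage consumes the previous stage's output. The pipeline helper and sample data below are hypothetical, for illustration only.

```python
from functools import reduce

# Hypothetical helper: apply a sequence of table-to-table stages in order,
# mirroring how each pipe stage consumes the previous stage's output.
def pipeline(rows, *stages):
    return reduce(lambda acc, stage: stage(acc), stages, rows)

orders = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 30.0},
    {"region": "North", "amount": 75.5},
]

result = pipeline(
    orders,
    lambda rows: [r for r in rows if r["region"] == "North"],  # WHERE
    lambda rows: sorted(rows, key=lambda r: r["amount"]),      # ORDER BY
    lambda rows: rows[:1],                                     # LIMIT
)
print(result)  # -> [{'region': 'North', 'amount': 75.5}]
```

Reading the stages top to bottom gives the same experience as reading a piped SQL query or a chained DataFrame expression.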

To read more about this new Pipe (|) operator syntax, check the Databricks blog post mentioned above and the Apache Spark documentation.

More articles by Jules Damji