Does the SQL Language Need a Pipe (|) Operator?
SQL remains a steadfastly declarative language. The Apache Spark™ DataFrame API draws its inspiration from SQL's declarative nature: you declare what to do with the data, not how to do it. Yet SQL queries can get quite complex, and developers often struggle with deeply nested subqueries. As a result, readability suffers.
So the simple answer is YES, we do! Let's examine why.
People love the Spark DataFrame declarative API because it lets us cobble together a chain of logical, sequential expressions from start to finish. Moreover, most queries begin with a table name, followed by a string of operations on the data in that table.
Consider this simple example: imagine you want to write a Spark DataFrame query by chaining some simple operations on data in a table named customers:
from pyspark.sql.functions import col, desc

# Alias both tables so the qualified column references resolve in the join
result = spark.table("customers").alias("customers") \
    .filter(col("region") == "North") \
    .join(spark.table("orders").alias("orders"),
          col("customers.id") == col("orders.customer_id")) \
    .select("customer_name", "order_date", "amount") \
    .orderBy(desc("order_date")) \
    .limit(10)
Traditional SQL without the Pipe (|) operator looks as follows. Notice that the table name is buried after the projection and join, the opposite of the Python code above.
SELECT customer_name, order_date, amount
FROM customers
JOIN orders ON customers.id = orders.customer_id
WHERE region = 'North'
ORDER BY order_date DESC
LIMIT 10
But if you use the Pipe (|) operator, the code is simpler and more readable, and it matches the flow, format, and declarative style of the DataFrame code above: specify the table, then the filter, join, projection, ordering, and limit.
The general Pipe (|) operator syntax takes the form:
FROM T
| WHERE A < 100
| SELECT B + C AS D
| ....
We can translate our PySpark DataFrame code into SQL using the Pipe (|) operator-based syntax:
FROM customers
| WHERE region = 'North'
| JOIN orders ON customers.id = orders.customer_id
| SELECT customer_name, order_date, amount
| ORDER BY order_date DESC
| LIMIT 10
Take another, more complex example of a traditional SQL subquery:
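As an illustrative sketch (the query itself is hypothetical, reusing the customers and orders tables from the earlier example), a report of top-spending customers requires nesting the aggregation inside a subquery in traditional SQL, so the reader must work inside-out:

```sql
-- Hypothetical example: customers whose total spend exceeds 1000,
-- built with a nested subquery in traditional SQL
SELECT customer_name, total_spent
FROM (
  SELECT c.customer_name, SUM(o.amount) AS total_spent
  FROM customers c
  JOIN orders o ON c.id = o.customer_id
  GROUP BY c.customer_name
) AS totals
WHERE total_spent > 1000
ORDER BY total_spent DESC
LIMIT 5;
```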
Now consider the same query written with the Pipe (|) operator syntax:
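A sketch of that same hypothetical top-spenders query in pipe syntax: each step flows into the next, top to bottom, with no nesting. (This assumes the pipe syntax's AGGREGATE ... GROUP BY clause, which performs the grouping and aggregation in a single piped step.)

```sql
FROM customers AS c
| JOIN orders AS o ON c.id = o.customer_id
| AGGREGATE SUM(o.amount) AS total_spent GROUP BY c.customer_name
| WHERE total_spent > 1000
| ORDER BY total_spent DESC
| LIMIT 5
```

Note that the WHERE clause after the aggregation filters on the aggregated value directly, where traditional SQL would need HAVING or an outer query.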
The Pipe (|) operator affords a few worthy merits. In a Databricks blog, SQL Gets Easier: Announcing New Pipe Syntax, Daniel Tenedorio introduces this feature in the upcoming Apache Spark 4.0 release; I summarize its key benefits:
To read more about this new Pipe (|) operator syntax check the following resources: