PySpark SQL Code Examples

PySpark SQL is a module in the Apache Spark ecosystem that provides a programming interface for handling structured and semi-structured data with SQL (Structured Query Language).

It facilitates the easy integration of SQL queries with PySpark applications, hence easing the analysis and manipulation of structured data in a distributed computing environment. PySpark SQL is a popular tool for data exploration, querying, and ETL activities, and is especially useful for data scientists and engineers working with large-scale, structured datasets.

To run SQL queries, you must first register a DataFrame as a temporary table or view. Once registered, the view can be queried anywhere in the SparkSession using sql(). These views and tables are scoped to the SparkSession that created them: they are dropped from memory when the session ends, whether it is stopped deliberately or the Spark application shuts down.


from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Your application name") \
    .getOrCreate()        

  • pyspark.sql.SparkSession is a foundational class in the Apache Spark ecosystem that provides a single entry point for working with structured data in Spark, including DataFrames, SQL, and Datasets.
  • SparkSession.builder creates a builder for configuring your Spark session.
  • .appName("Your application name") sets the name of your Spark application.
  • .getOrCreate() returns an existing Spark session or creates a new one if none exists.


Querying a DataFrame

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder.appName("FabioCarquiAnalysisSQL").getOrCreate()
# Read a CSV file
df = spark.read.options(delimiter=';').csv("/FileStore/shared_uploads/fabiocarqui@gmail.com/01_sales-1.csv", header=True, inferSchema=True)
# Select columns
sql_query = df.select("Product_Category", "Revenue")
sql_query.show()
# Filter rows
filtered_data = sql_query.filter(sql_query.Revenue > 2)
filtered_data.show()

TempView

Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates.

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.


# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder.appName("FabioCarquiAnalysisSQL").getOrCreate()
# Read a CSV file
df = spark.read.options(delimiter=';').csv("/FileStore/shared_uploads/fabiocarqui@gmail.com/01_sales-1.csv", header=True, inferSchema=True)
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("tv_products")
sqldf = spark.sql("SELECT * FROM tv_products WHERE Revenue = 2")
sqldf.show()

More Details

Important classes of Spark SQL and DataFrames include:

  • pyspark.sql.SparkSession – the main entry point for DataFrame and SQL functionality.
  • pyspark.sql.DataFrame – a distributed collection of data grouped into named columns.
  • pyspark.sql.Column – a column expression in a DataFrame.
  • pyspark.sql.Row – a row of data in a DataFrame.
  • pyspark.sql.GroupedData – aggregation methods, returned by DataFrame.groupBy().
  • pyspark.sql.DataFrameNaFunctions – methods for handling missing data (null values).
  • pyspark.sql.DataFrameStatFunctions – methods for statistics functionality.
  • pyspark.sql.functions – built-in functions available for DataFrames.
  • pyspark.sql.types – the available data types.
  • pyspark.sql.Window – window functions.

______________________________________________________________________________________

Your Opinion Is Priceless

Positive and constructive criticism are both forms of feedback essential to progress. As I work to bring useful content to these editions, please share your ideas, insights, and even points of confusion with me. This feedback loop helps future editions better suit your needs and goals.

Do you have a pressing concern or topic?

