PySpark SQL Code Examples
PySpark SQL is a module in the Apache Spark ecosystem that provides a programming interface for handling structured and semi-structured data with SQL (Structured Query Language).
It integrates SQL queries with PySpark applications, easing the analysis and manipulation of structured data in a distributed computing environment. PySpark SQL is a popular tool for data exploration, querying, and ETL work, and is especially useful for data scientists and engineers working with large-scale, structured datasets.
To use SQL queries, you must first create a temporary table or view on a DataFrame. Once created, this table can be queried anywhere in the SparkSession through sql(). The scope of these views and tables is limited to the SparkSession that created them: temporary views are dropped from memory when the SparkSession ends, whether deliberately or by shutting down the Spark application.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Your application name") \
.getOrCreate()
Query from DataFrame
# Import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# Create a SparkSession object
spark = SparkSession.builder.appName("FabioCarquiAnalysisSQL").getOrCreate()

# Read a semicolon-delimited CSV file
df = spark.read.options(delimiter=';').csv("/FileStore/shared_uploads/fabiocarqui@gmail.com/01_sales-1.csv", header=True, inferSchema=True)

# Select columns
selected = df.select("Product_Category", "Revenue")
selected.show()

# Filter rows
filtered_data = selected.filter(selected.Revenue > 2)
filtered_data.show()
TempView
Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates.
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
# Import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# Create a SparkSession object
spark = SparkSession.builder.appName("FabioCarquiAnalysisSQL").getOrCreate()

# Read a semicolon-delimited CSV file
df = spark.read.options(delimiter=';').csv("/FileStore/shared_uploads/fabiocarqui@gmail.com/01_sales-1.csv", header=True, inferSchema=True)

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("tv_products")

sqldf = spark.sql("SELECT * FROM tv_products WHERE Revenue = 2")
sqldf.show()
More Details
Important classes of Spark SQL and DataFrames:
- pyspark.sql.SparkSession – the entry point for DataFrame and SQL functionality
- pyspark.sql.DataFrame – a distributed collection of data grouped into named columns
- pyspark.sql.Column – a column expression in a DataFrame
- pyspark.sql.Row – a row of data in a DataFrame
- pyspark.sql.GroupedData – aggregation methods, returned by DataFrame.groupBy()
- pyspark.sql.functions – built-in functions available for DataFrames
- pyspark.sql.types – the data types available for DataFrame schemas
- pyspark.sql.Window – window function support
______________________________________________________________________________________
Your Opinion Is Priceless
Positive and constructive criticism are both forms of feedback essential to progress. As I work to bring useful content to these editions, I encourage you to share your ideas, insights, and even points of confusion with me. This feedback loop will make future editions better suited to your needs and goals.
Do you have a pressing concern or topic?