PySpark SQL Code Examples

PySpark SQL is a module in the Apache Spark ecosystem that provides a programming interface for handling structured and semi-structured data with SQL (Structured Query Language).

It facilitates the easy integration of SQL queries with PySpark applications, hence easing the analysis and manipulation of structured data in a distributed computing environment. PySpark SQL is a popular tool for data exploration, querying, and ETL activities, and is especially useful for data scientists and engineers working with large-scale, structured datasets.

To run SQL queries, you must first register a DataFrame as a temporary table or view. Once registered, the view can be queried anywhere in the SparkSession using sql(). These views and tables are scoped to the SparkSession that created them: they are dropped from memory when the session ends, whether it is stopped deliberately or the Spark application shuts down.


from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Your application name") \
    .getOrCreate()        

  • pyspark.sql.SparkSession is a foundational class in the Apache Spark ecosystem that provides a single entry point for working with structured data in Spark, including DataFrames, SQL, and Datasets.
  • SparkSession.builder creates a builder for configuring your Spark session.
  • .appName("Your application name") sets the name of your Spark application.
  • .getOrCreate() returns an existing Spark session or creates a new one if none exists.


Querying a DataFrame

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder.appName("FabioCarquiAnalysisSQL").getOrCreate()
# Read a CSV file
df = spark.read.options(delimiter=';').csv("/FileStore/shared_uploads/fabiocarqui@gmail.com/01_sales-1.csv", header=True, inferSchema=True)
# Select columns
sql_query = df.select("Product_Category", "Revenue")
sql_query.show()
# Filter rows
filtered_data = sql_query.filter(sql_query.Revenue > 2)
filtered_data.show()

TempView

Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates.

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.


# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder.appName("FabioCarquiAnalysisSQL").getOrCreate()
# Read a CSV file
df = spark.read.options(delimiter=';').csv("/FileStore/shared_uploads/fabiocarqui@gmail.com/01_sales-1.csv", header=True, inferSchema=True)
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("tv_products")
sqldf = spark.sql("SELECT * FROM tv_products WHERE Revenue = 2")
sqldf.show()

More Details

Important classes of Spark SQL and DataFrames include:

  • pyspark.sql.SparkSession – the main entry point for DataFrame and SQL functionality.
  • pyspark.sql.DataFrame – a distributed collection of data grouped into named columns.
  • pyspark.sql.Column – a column expression in a DataFrame.
  • pyspark.sql.Row – a row of data in a DataFrame.
  • pyspark.sql.GroupedData – aggregation methods, returned by DataFrame.groupBy().
  • pyspark.sql.DataFrameNaFunctions – methods for handling missing data (null values).
  • pyspark.sql.DataFrameStatFunctions – methods for statistics functionality.
  • pyspark.sql.functions – built-in functions available for DataFrames.
  • pyspark.sql.types – the available data types.
  • pyspark.sql.Window – window functions.

______________________________________________________________________________________

Your Opinion Is Priceless

Positive and constructive criticism are both forms of feedback essential to progress. As I work to bring useful content to these editions, please share your ideas, insights, and even points of confusion with me. This feedback loop helps future editions better suit your needs and goals.

Do you have a pressing concern or topic?

