
Pyspark101

This post is all about getting started with PySpark and looking at basic commands and functionality.

Spark in itself is vast and can look intimidating when you start learning, but the best approach is to start somewhere, and this page is all about that.

Before we get started, first make sure Spark and Python are installed in your environment.

SparkContext

SparkContext is the entry point to any Spark functionality. The most important step of any Spark driver application is to create a SparkContext. It allows your Spark application to access the Spark cluster with the help of a resource manager (YARN/Mesos).

Now, let’s dive into coding.

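Something along these lines will do the job (the app name here is just an assumption):

```python
from pyspark.sql import SparkSession

# create a new Spark session, or return the existing one if it is already running
spark = SparkSession.builder.appName('Pyspark101').getOrCreate()
```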

The above code will create a new Spark session or return the existing one.

dataset: https://www.kaggle.com/fedesoriano/heart-failure-prediction

Reading Data

Next, we will read a CSV file.

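For example, a sketch of the read variants discussed below (the file path is an assumption, and the explicit schema is truncated to two columns):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# 1. header option only -- every column is read as a string
df = spark.read.option('header', 'true').csv('heart.csv')

# 2a. let Spark infer the column types (this triggers an extra Spark job)
df = spark.read.csv('heart.csv', header=True, inferSchema=True)

# 2b. provide the schema up front (no inference job is run)
schema = StructType([
    StructField('Age', IntegerType(), True),
    StructField('Sex', StringType(), True),
    # ... remaining columns of the dataset
])
df = spark.read.csv('heart.csv', header=True, schema=schema)
```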

While it might sound very straightforward, there are several ways to carry out this task, and each works in a different manner:

  1. read.option(‘header’, ’true’): this will read all the data along with the header row, but every column will be of type string.
  2. read.csv(): this method takes several parameters; the first way is to pass a ‘schema’, and the second is to use ‘inferSchema’.

So, the difference is: by specifying ‘schema’ we tell Spark the columns and their data types, whereas with ‘inferSchema’ Spark has to work that out itself. The best way to see the difference is to run both pieces of code one after the other and check the Spark UI. When you run the ‘inferSchema’ code you will see a Spark job being executed, which does not happen when you specify a schema. For large files it is always better to provide a schema, since no extra job is executed and the read is faster.

So, when you run the ‘inferSchema’ code, the Spark UI will show a job being executed for the schema inference step.

Writing Data

  1. df.write.format(): this method will simply create a file and write to it.
  2. df.write.format().mode(“overwrite”): this method will overwrite the existing file.

Both functions will execute a Spark job.
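For example, a minimal sketch (the output path and format are assumptions):

```python
# 1. write the DataFrame out (fails if the path already exists)
df.write.format('csv').option('header', 'true').save('output/heart')

# 2. overwrite whatever is already at that path
df.write.format('csv').option('header', 'true').mode('overwrite').save('output/heart')
```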

Get Schema

To get the column names, data types, and whether columns accept null values, we can use:

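A minimal sketch, assuming printSchema() is used:

```python
# prints each column's name, data type, and whether it accepts nulls
df.printSchema()
```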

This will print each column’s name, data type, and whether it is nullable.

Change column data type

First, import the target data type; then, using the ‘withColumn’ command along with ‘cast’, we can change a column’s data type.

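A minimal sketch, assuming the ‘Age’ column is being cast to an integer:

```python
from pyspark.sql.types import IntegerType

# cast the Age column to an integer type, replacing the original column
df = df.withColumn('Age', df['Age'].cast(IntegerType()))
```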

Remove a column

This can be done using df.drop(‘Age’)

Summary Statistics

To get summary statistics about your data, select the columns of interest, then call ‘describe’ and finally ‘show’:

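A minimal sketch (the selected columns are assumptions):

```python
# summary statistics (count, mean, stddev, min, max) for the selected columns
df.select('Age', 'Cholesterol').describe().show()
```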

This command will return the count, mean, standard deviation, minimum, and maximum of the selected columns.

Deal with NA

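A sketch of the six approaches explained below (the column names and imputer output columns are assumptions):

```python
from pyspark.ml.feature import Imputer

# 1. drop rows that contain any NA values
df.na.drop()

# 2. drop rows where all values are NA
df.na.drop(how='all')

# 3. keep only rows that have at least 2 non-NA values
df.na.drop(thresh=2)

# 4. drop rows where a specific column is NA
df.na.drop(subset=['Cholesterol'])

# 5. fill missing values in a specific column with '?'
df.na.fill('?', subset=['Sex'])

# 6. impute nulls in numeric columns with the column mean
imputer = Imputer(
    inputCols=['Age', 'Cholesterol'],
    outputCols=['Age_imputed', 'Cholesterol_imputed'],
).setStrategy('mean')
df = imputer.fit(df).transform(df)
```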

  1. Drop all rows that contain any NA values
  2. Drop all rows where all values are NA
  3. Drop all rows that have fewer than 2 non-NA values (i.e. keep rows with at least 2 non-NA values)
  4. Drop all rows where any value at specific column(s) is NA
  5. Fill missing values in a specific column with a ‘?’
  6. A less crude way is to use the Imputer method. First initialize it with the columns you wish to fill nulls in as the input and output columns, then set a replacement strategy such as the mean of the column (it can also be mode and so on). Finally, fit the imputer to the data and transform it.

Filtering Data

There are many methods to filter data based on some condition

For example:

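Assuming a simple condition on the ‘Age’ column, this might look like:

```python
# keep only the rows where Age is greater than 50
df.filter(df['Age'] > 50).show()
```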


Other methods-

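A sketch of the variants explained below (the columns and thresholds are assumptions):

```python
# 1. filter with a string that contains the column and condition
df.filter("Cholesterol > 200").show()

# 2. where is just an alias for filter
df.where("Cholesterol > 200").show()

# 3. pass a column reference instead of a string
df.filter(df['Cholesterol'] > 200).show()

# 4. combine conditions with & (and) or | (or)
df.filter((df['Cholesterol'] > 200) & (df['Age'] < 50)).show()

# 5. negate a condition with ~
df.filter(~(df['Cholesterol'] > 200)).show()
```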

  1. Using the filter command with a string that contains the column and the condition
  2. ‘where’ is just an alias for filter and yields the same output
  3. Instead of a string, you can pass df[col] and apply a condition to it
  4. For applying multiple conditions, we can use ‘&’ and ‘|’
  5. Using ‘~’ we can pick all the rows that don’t meet the condition.

String Expression

Sometimes it is easier to filter or add a column based on a string expression; this can be done using ‘expr’.

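A minimal sketch (the expressions themselves are assumptions):

```python
from pyspark.sql.functions import expr

# filter rows with a SQL-style string expression
df.filter(expr("Age > 50 AND Cholesterol > 200")).show()

# add a new column defined by a string expression
df.withColumn('HalfAge', expr("Age / 2")).show()
```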


Group by

The ‘groupBy’ command allows you to divide your data into groups based on some column; here we have used ‘Age’.

Then we can calculate statistics for each group; we need to select the columns to calculate statistics for.

To sort the data we need to import desc/asc, save the ‘groupBy’ result in a variable, and apply ‘orderBy()’ as shown below.

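A sketch of the pattern (the aggregated column is an assumption):

```python
from pyspark.sql.functions import desc

# group the rows by Age and compute the mean Cholesterol for each group
grouped = df.groupBy('Age').mean('Cholesterol')

# sort the groups by the aggregated column, highest first
grouped.orderBy(desc('avg(Cholesterol)')).show()
```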

Using SQL commands

We can even run SQL commands on the data.

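A sketch of the pattern (the view name and queries are assumptions):

```python
# register the DataFrame as a temporary SQL view
df.createOrReplaceTempView('heart')

# filter data from the view with spark.sql
spark.sql("SELECT * FROM heart WHERE Age > 50").show()

# combine .sql with .select to pick columns from the query result
spark.sql("SELECT * FROM heart WHERE Age > 50").select('Age', 'Cholesterol').show()
```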

First, we need to create a temporary view of the dataframe. Then, we can filter data from it using ‘spark.sql’.

We can also combine ‘.select’ and ‘.sql’ and choose columns in SQL syntax.


Surely, there are a lot more commands to learn, like pivot, combining multiple commands, and even machine learning using PySpark.

All these topics are covered in this post--

There is a lot more to learn, but this will surely get you going.
