
Pyspark101

This post is all about getting started with PySpark and looking at basic commands and functionality.

Spark in itself is vast and can look intimidating when you start learning, but the best approach is to start somewhere, and this page is all about that.

Before we get started, first make sure Spark and Python are installed in your environment.

SparkContext

SparkContext is the entry point to any Spark functionality. The most important step of any Spark driver application is to create a SparkContext. It allows your Spark application to access the Spark cluster with the help of a resource manager (YARN/Mesos).

Now, let’s dive into coding.

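Something along these lines will do the job (the app name here is just an assumption):

```python
from pyspark.sql import SparkSession

# create a new Spark session, or return the existing one if it is already running
spark = SparkSession.builder.appName('Pyspark101').getOrCreate()
```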

The above code will create a new Spark session or return the existing one.

dataset: https://www.kaggle.com/fedesoriano/heart-failure-prediction

Reading Data

Next, we will read a CSV file.

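For example, a sketch of the read variants discussed below (the file path is an assumption, and the explicit schema is truncated to two columns):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# 1. header option only -- every column is read as a string
df = spark.read.option('header', 'true').csv('heart.csv')

# 2a. let Spark infer the column types (this triggers an extra Spark job)
df = spark.read.csv('heart.csv', header=True, inferSchema=True)

# 2b. provide the schema up front (no inference job is run)
schema = StructType([
    StructField('Age', IntegerType(), True),
    StructField('Sex', StringType(), True),
    # ... remaining columns of the dataset
])
df = spark.read.csv('heart.csv', header=True, schema=schema)
```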

While it might sound very straightforward, there are several ways to carry out this task, and each works in a different manner:

  1. read.option(‘header’, ’true’): this will read all the data along with the header row, but every column will be of type string.
  2. read.csv(): this method takes several parameters; the first way is to pass a ‘schema’, and the second is to use ‘inferSchema’.

So, the difference is: by specifying ‘schema’ we tell Spark the columns and their data types, whereas with ‘inferSchema’ Spark has to work that out itself. The best way to see the difference is to run both pieces of code one after the other and check the Spark UI. When you run the ‘inferSchema’ code you will see a Spark job being executed, which does not happen when you specify a schema. For large files it is always better to provide a schema, since no extra job is executed and the read is faster.

So, when you run the ‘inferSchema’ code, the Spark UI will show a job being executed for the schema inference step.

Writing Data

  1. df.write.format(): this method will simply create a file and write to it.
  2. df.write.format().mode(“overwrite”): this method will overwrite the existing file.

Both functions will execute a Spark job.
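For example, a minimal sketch (the output path and format are assumptions):

```python
# 1. write the DataFrame out (fails if the path already exists)
df.write.format('csv').option('header', 'true').save('output/heart')

# 2. overwrite whatever is already at that path
df.write.format('csv').option('header', 'true').mode('overwrite').save('output/heart')
```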

Get Schema

To get the column names, data types, and whether columns accept null values, we can use:

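A minimal sketch, assuming printSchema() is used:

```python
# prints each column's name, data type, and whether it accepts nulls
df.printSchema()
```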

This will print each column’s name, data type, and whether it is nullable.

Change column data type

First, import the target data type; then, using the ‘withColumn’ command along with ‘cast’, we can change a column’s data type.

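A minimal sketch, assuming the ‘Age’ column is being cast to an integer:

```python
from pyspark.sql.types import IntegerType

# cast the Age column to an integer type, replacing the original column
df = df.withColumn('Age', df['Age'].cast(IntegerType()))
```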

Remove a column

This can be done using df.drop(‘Age’)

Summary Statistics

To get summary statistics about your data, select the columns of interest, then call ‘describe’ and finally ‘show’:

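A minimal sketch (the selected columns are assumptions):

```python
# summary statistics (count, mean, stddev, min, max) for the selected columns
df.select('Age', 'Cholesterol').describe().show()
```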

This command will return the count, mean, standard deviation, minimum, and maximum of the selected columns.

Deal with NA

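A sketch of the six approaches explained below (the column names and imputer output columns are assumptions):

```python
from pyspark.ml.feature import Imputer

# 1. drop rows that contain any NA values
df.na.drop()

# 2. drop rows where all values are NA
df.na.drop(how='all')

# 3. keep only rows that have at least 2 non-NA values
df.na.drop(thresh=2)

# 4. drop rows where a specific column is NA
df.na.drop(subset=['Cholesterol'])

# 5. fill missing values in a specific column with '?'
df.na.fill('?', subset=['Sex'])

# 6. impute nulls in numeric columns with the column mean
imputer = Imputer(
    inputCols=['Age', 'Cholesterol'],
    outputCols=['Age_imputed', 'Cholesterol_imputed'],
).setStrategy('mean')
df = imputer.fit(df).transform(df)
```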

  1. Drop all rows that contain any NA values
  2. Drop all rows where all values are NA
  3. Drop all rows that have fewer than 2 non-NA values (i.e. keep rows with at least 2 non-NA values)
  4. Drop all rows where any value at specific column(s) is NA
  5. Fill missing values in a specific column with a ‘?’
  6. A less crude way is to use the Imputer method. First initialize it with the columns you wish to fill nulls in as the input and output columns, then set a replacement strategy such as the mean of the column (it can also be mode and so on). Finally, fit the imputer to the data and transform it.

Filtering Data

There are many methods to filter data based on some condition

For example:

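Assuming a simple condition on the ‘Age’ column, this might look like:

```python
# keep only the rows where Age is greater than 50
df.filter(df['Age'] > 50).show()
```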


Other methods-

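A sketch of the variants explained below (the columns and thresholds are assumptions):

```python
# 1. filter with a string that contains the column and condition
df.filter("Cholesterol > 200").show()

# 2. where is just an alias for filter
df.where("Cholesterol > 200").show()

# 3. pass a column reference instead of a string
df.filter(df['Cholesterol'] > 200).show()

# 4. combine conditions with & (and) or | (or)
df.filter((df['Cholesterol'] > 200) & (df['Age'] < 50)).show()

# 5. negate a condition with ~
df.filter(~(df['Cholesterol'] > 200)).show()
```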

  1. Using the filter command with a string that contains the column and the condition
  2. ‘where’ is just an alias for filter and yields the same output
  3. Instead of a string, you can pass df[col] and apply a condition to it
  4. For applying multiple conditions, we can use ‘&’ and ‘|’
  5. Using ‘~’ we can pick all the rows that don’t meet the condition.

String Expression

Sometimes it is easier to filter or add a column based on a string expression; this can be done using ‘expr’.

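A minimal sketch (the expressions themselves are assumptions):

```python
from pyspark.sql.functions import expr

# filter rows with a SQL-style string expression
df.filter(expr("Age > 50 AND Cholesterol > 200")).show()

# add a new column defined by a string expression
df.withColumn('HalfAge', expr("Age / 2")).show()
```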


Group by

The ‘groupBy’ command allows you to divide your data into groups based on some column; here we have used ‘Age’.

Then we can calculate statistics for each group; we need to select the columns to calculate statistics for.

To sort the data we need to import desc/asc, save the ‘groupBy’ result in a variable, and apply ‘orderBy()’ as shown below.

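A sketch of the pattern (the aggregated column is an assumption):

```python
from pyspark.sql.functions import desc

# group the rows by Age and compute the mean Cholesterol for each group
grouped = df.groupBy('Age').mean('Cholesterol')

# sort the groups by the aggregated column, highest first
grouped.orderBy(desc('avg(Cholesterol)')).show()
```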

Using SQL commands

We can even run SQL commands on the data.

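A sketch of the pattern (the view name and queries are assumptions):

```python
# register the DataFrame as a temporary SQL view
df.createOrReplaceTempView('heart')

# filter data from the view with spark.sql
spark.sql("SELECT * FROM heart WHERE Age > 50").show()

# combine .sql with .select to pick columns from the query result
spark.sql("SELECT * FROM heart WHERE Age > 50").select('Age', 'Cholesterol').show()
```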

First, we need to create a temporary view of the dataframe. Then, we can filter data from it using ‘spark.sql’.

We can also combine ‘.select’ and ‘.sql’ and choose columns in SQL syntax.


Surely, there are a lot more commands to learn, like pivot, combining multiple commands, and even machine learning using PySpark.

All these topics are covered in this post--

There is a lot more to learn, but this will surely get you going.
