PySpark 101
This post is all about getting started with PySpark and looking at basic commands and functionality.
Spark is vast and can look intimidating when you start learning, but the best approach is to start somewhere, and this page is all about that.
Before we get started, make sure Spark and Python are installed in your environment.
SparkContext
SparkContext is the entry point for using any Spark functionality. The most important step of any Spark driver application is to create a SparkContext; it allows your Spark application to access the Spark cluster with the help of a resource manager (YARN/Mesos).
Now, let’s dive into coding.
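A minimal sketch of creating a session (the app name here is just a placeholder):

```python
from pyspark.sql import SparkSession

# create a new SparkSession, or return the one that already exists
spark = SparkSession.builder.appName("Pyspark101").getOrCreate()
```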
The above code will create a new Spark session or return an existing one. A SparkSession wraps the SparkContext described earlier, which is available as spark.sparkContext.
Reading Data
Next, we will read a CSV file. While it might sound very straightforward, there are several ways to carry out the task, and each works in a different manner.
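A minimal sketch of both approaches, assuming a hypothetical people.csv with Name and Age columns:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# option 1: let Spark infer the schema by scanning the file
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# option 2: provide the schema explicitly
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
])
df = spark.read.csv("people.csv", header=True, schema=schema)
```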
So, the difference is: by specifying ‘schema’ we give Spark the columns and their data types up front, while with ‘inferSchema’ Spark has to work that out itself by scanning the data. The best way to see the difference is to run both pieces of code one after the other and check the Spark UI. When you run the ‘inferSchema’ code you will see a Spark job being executed, which does not happen when you specify a schema. For large files it is always better to provide a schema, since no extra job is executed and the read is faster.
If you run the ‘inferSchema’ code and open the Spark UI, you will see that extra job listed there.
Writing Data
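A minimal sketch of two common ways to write a dataframe out as CSV (the output paths are hypothetical):

```python
# option 1: the csv convenience method
df.write.csv("output/people_csv", header=True, mode="overwrite")

# option 2: the generic format/save API
df.write.format("csv").option("header", True).mode("overwrite").save("output/people_csv2")
```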
Both functions will execute a Spark job, since writing data is an action that forces evaluation.
Get Schema
To get column names, data types, and whether columns accept null values, one option is ‘printSchema’.
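A minimal sketch, with the output you would see for the hypothetical Name/Age schema used above:

```python
df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: integer (nullable = true)
```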
Change column data type
First, import the target data type; then, using the ‘withColumn’ command along with ‘cast’, we can change the data type of a column.
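A minimal sketch, assuming an ‘Age’ column that was read in as a string:

```python
from pyspark.sql.types import IntegerType

# cast the Age column from string to integer
df = df.withColumn("Age", df["Age"].cast(IntegerType()))
```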
Remove a column
This can be done using df.drop('Age'), which returns a new dataframe without that column.
Summary Statistics
To get summary stats about your data, select the columns of interest, then call ‘describe’ and finally ‘show’.
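A minimal sketch (the ‘Salary’ column is hypothetical):

```python
# summary statistics for the columns of interest
df.select("Age", "Salary").describe().show()
```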
This command will return the count, mean, standard deviation, min, and max for each selected column.
Deal with NA
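A minimal sketch of the common approaches, using the dataframe’s ‘na’ helpers (the fill value is an arbitrary choice):

```python
# drop rows that contain any null values
df.na.drop().show()

# drop rows only when every value in the row is null
df.na.drop(how="all").show()

# replace nulls with a fixed value (0 here, for numeric columns)
df.na.fill(0).show()
```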
Filtering Data
There are many methods to filter data based on some condition: ‘filter’ accepts a column expression, ‘where’ is an alias that behaves the same way, and conditions can be combined, as in the sketch below.
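A minimal sketch of these methods (column names are hypothetical):

```python
# filter with a column expression
df.filter(df["Age"] > 30).show()

# 'where' is an alias for 'filter'
df.where(df["Age"] > 30).show()

# combine conditions with & (and) and | (or); wrap each in parentheses
df.filter((df["Age"] > 30) & (df["Salary"] > 50000)).show()
```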
String Expression
Sometimes it is easier to filter or add columns based on a string expression; this can be done using ‘expr’.
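A minimal sketch (column names are hypothetical):

```python
from pyspark.sql.functions import expr

# filter using a SQL-style string expression
df.filter(expr("Age > 30")).show()

# add a new column computed from a string expression
df.withColumn("AgeNextYear", expr("Age + 1")).show()
```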
Group by
The ‘groupby’ command allows you to divide your data into groups based on some column; here we have used ‘age’. Then we can calculate statistics for each group; we just need to select the columns to aggregate. To sort the result, import desc/asc, save the ‘groupby’ result in a variable, and apply ‘orderBy()’ as shown below.
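A minimal sketch (the ‘Salary’ column is hypothetical):

```python
from pyspark.sql.functions import avg, count, desc

# group by 'age' and compute per-group statistics
grouped = df.groupBy("age").agg(
    count("*").alias("count"),
    avg("Salary").alias("avg_salary"),
)

# sort the groups by age in descending order
grouped.orderBy(desc("age")).show()
```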
Using SQL commands
We can even run SQL commands on the data.
First we need to create a temporary view of the dataframe. Then we can filter data from it using ‘spark.sql’.
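A minimal sketch (the view name ‘people’ is hypothetical):

```python
# register the dataframe as a temporary view
df.createOrReplaceTempView("people")

# filter data from the view with a SQL query
spark.sql("SELECT * FROM people WHERE Age > 30").show()
```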
We can also combine ‘.select’ and ‘.sql’ and choose columns in SQL syntax.
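One way to get this effect is ‘selectExpr’, which accepts column choices written in SQL syntax; a minimal sketch:

```python
# choose and transform columns using SQL syntax
df.selectExpr("Name", "Age + 1 AS AgeNextYear").show()
```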
Of course, there are a lot more commands to learn, like pivot, chaining multiple commands together, and even machine learning with PySpark.
All these topics are covered in this post--
There is a lot more to learn, but this will surely get you going.