EDA using Apache Spark
Apache Spark provides a number of methods for performing Exploratory Data Analysis (EDA). In this article we are going to explore some of them.
Dataset Method: describe
This is meant for quick EDA. It computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
def describe(cols: String*): DataFrame
See the example below –
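The following is a minimal sketch, assuming the January 2022 Yellow Taxi trip records (see References) have been downloaded locally; the file path is illustrative, and the column names follow the standard Yellow Taxi schema.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-eda")
  .master("local[*]")
  .getOrCreate()

// Illustrative path – adjust to wherever the trip records file is stored.
val df = spark.read.parquet("yellow_tripdata_2022-01.parquet")

// Computes count, mean, stddev, min, and max for the given columns.
df.describe("passenger_count", "trip_distance", "fare_amount").show()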
What if the DataFrame has a column of struct type? describe will not include such columns in the output DataFrame; it works only for numeric and string columns.
Dataset Method: summary
Computes the specified statistics for numeric and string columns. Available statistics include count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 25%, 50%, 75%).
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
def summary(statistics: String*): DataFrame
See the example below, applying summary on selected columns –
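A sketch reusing the df loaded earlier; the chosen columns and requested statistics are illustrative.

df.select("trip_distance", "fare_amount")
  .summary("count", "min", "25%", "50%", "75%", "max")
  .show()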
Class DataFrameStatFunctions
Statistic functions for DataFrames.
The Dataset class has a stat method, which gives us an instance of the DataFrameStatFunctions class.
def stat: DataFrameStatFunctions
Returns a DataFrameStatFunctions object for statistical functions support.
Once we have an org.apache.spark.sql.DataFrameStatFunctions object, we can chain its methods as per our requirement. There are a number of statistics functions available in this class.
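As a quick sketch (again reusing the earlier df), one such method is approxQuantile, which returns approximate percentiles of a numeric column within a given relative error:

// Median and 90th percentile of fare_amount, within 1% relative error.
val quantiles = df.stat.approxQuantile("fare_amount", Array(0.5, 0.9), 0.01)
println(quantiles.mkString(", "))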
stat.freqItems
Finding frequent items for columns, possibly with false positives.
def freqItems(cols: Seq[String]): DataFrame
Example:
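A sketch reusing the earlier df; the column choices are illustrative. By default, freqItems reports items with a support of at least 1%, and each result column contains an array of the frequent values.

// Find values appearing in at least ~1% of rows (the default support).
val freq = df.stat.freqItems(Seq("passenger_count", "payment_type"))
freq.show(truncate = false)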
stat.cov
Calculate the sample covariance of two numerical columns of a DataFrame.
def cov(col1: String, col2: String): Double
Example:
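A sketch reusing the earlier df:

// Sample covariance between trip distance and fare amount.
val covariance = df.stat.cov("trip_distance", "fare_amount")
println(s"cov(trip_distance, fare_amount) = $covariance")

A positive value here would suggest that longer trips tend to have higher fares.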
There are many more methods available in the DataFrameStatFunctions class; please refer to the Spark documentation.
References:
Dataset Used for Analysis: 2022 January - Yellow Taxi Trip Records (PARQUET)