EDA using Apache Spark
Apache Spark provides a number of methods for performing Exploratory Data Analysis (EDA). In this article we are going to explore some of them.
Dataset Method: describe
This is meant for quick EDA. It computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
def describe(cols: String*): DataFrame
See the example below –
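The following is a minimal sketch, assuming the January 2022 Yellow Taxi trip records (see References) have been downloaded locally; the file path is illustrative, and the column names follow the standard Yellow Taxi schema.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-eda")
  .master("local[*]")
  .getOrCreate()

// Illustrative path – adjust to wherever the trip records file is stored.
val df = spark.read.parquet("yellow_tripdata_2022-01.parquet")

// Computes count, mean, stddev, min, and max for the given columns.
df.describe("passenger_count", "trip_distance", "fare_amount").show()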
What if the DataFrame has a column of struct type? describe will not include such columns in the output DataFrame; it works only for numeric and string columns.
Dataset Method: summary
Computes the specified statistics for numeric and string columns. Available statistics include count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 25%, 50%, 75%).
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
def summary(statistics: String*): DataFrame
See the example below, applying summary on selected columns –
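A sketch reusing the df loaded earlier; the chosen columns and requested statistics are illustrative.

df.select("trip_distance", "fare_amount")
  .summary("count", "min", "25%", "50%", "75%", "max")
  .show()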
Class DataFrameStatFunctions
Statistic functions for DataFrames.
The Dataset class has a stat method, which gives us an instance of the DataFrameStatFunctions class.
def stat: DataFrameStatFunctions
Returns a DataFrameStatFunctions object for statistical functions support.
Once we have an org.apache.spark.sql.DataFrameStatFunctions object, we can chain its methods as per our requirement. There are a number of statistics functions available in this class.
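As a quick sketch (again reusing the earlier df), one such method is approxQuantile, which returns approximate percentiles of a numeric column within a given relative error:

// Median and 90th percentile of fare_amount, within 1% relative error.
val quantiles = df.stat.approxQuantile("fare_amount", Array(0.5, 0.9), 0.01)
println(quantiles.mkString(", "))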
stat.freqItems
Finding frequent items for columns, possibly with false positives.
def freqItems(cols: Seq[String]): DataFrame
Example:
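A sketch reusing the earlier df; the column choices are illustrative. By default, freqItems reports items with a support of at least 1%, and each result column contains an array of the frequent values.

// Find values appearing in at least ~1% of rows (the default support).
val freq = df.stat.freqItems(Seq("passenger_count", "payment_type"))
freq.show(truncate = false)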
stat.cov
Calculate the sample covariance of two numerical columns of a DataFrame.
def cov(col1: String, col2: String): Double
Example:
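A sketch reusing the earlier df:

// Sample covariance between trip distance and fare amount.
val covariance = df.stat.cov("trip_distance", "fare_amount")
println(s"cov(trip_distance, fare_amount) = $covariance")

A positive value here would suggest that longer trips tend to have higher fares.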
There are many more methods available in the DataFrameStatFunctions class; please refer to the Spark documentation.
References:
Dataset Used for Analysis: 2022 January - Yellow Taxi Trip Records (PARQUET)