EDA using Apache Spark

There are a bunch of methods available in Apache Spark for performing Exploratory Data Analysis (EDA). In this article, we are going to explore some of them.

Dataset Method: describe

This is meant for quick EDA. It computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.

def describe(cols: String*): DataFrame
        

What if we have a column of struct type in the DataFrame? describe will not return that column in the output DataFrame; it only works for numeric and string columns.
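
As a concrete sketch, here is how describe could be applied to the taxi dataset used in this article (the file path and the local master are assumptions for illustration; the column names follow the NYC Yellow Taxi schema):

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; on a cluster the master would differ
val spark = SparkSession.builder()
  .appName("eda-describe")
  .master("local[*]")
  .getOrCreate()

// Hypothetical path to the January 2022 Yellow Taxi trip records
val trips = spark.read.parquet("yellow_tripdata_2022-01.parquet")

// Basic statistics (count, mean, stddev, min, max) for selected columns
trips.describe("trip_distance", "fare_amount").show()
```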

Dataset Method: summary

Computes the specified statistics for numeric and string columns. Available statistics are: count, mean, stddev, min, max, and arbitrary approximate percentiles (e.g. 25%, 50%, 75%).

If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.


def summary(statistics: String*): DataFrame
        

Class DataFrameStatFunctions

Statistic functions for DataFrames.

The Dataset class has a method called stat, which gives us an instance of the DataFrameStatFunctions class.


def stat: DataFrameStatFunctions
Returns a DataFrameStatFunctions instance for statistic function support.
        

Once we have an org.apache.spark.sql.DataFrameStatFunctions object, we can chain methods as per our requirement. There are a bunch of statistics functions available in this class.

stat.freqItems

Finding frequent items for columns, possibly with false positives. 


def freqItems(cols: Seq[String]): DataFrame
        

stat.cov

Calculate the sample covariance of two numerical columns of a DataFrame.


def cov(col1: String, col2: String): Double
        

There are many more methods available in the DataFrameStatFunctions class; please refer to the Spark documentation.

References:

https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/Dataset.html

https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Dataset Used for Analysis: 2022 January - Yellow Taxi Trip Records (PARQUET)
