DataFrame and DataSet in Spark

Spark introduced the DataFrame in version 1.3 to store structured data in distributed mode, just like RDDs. A DataFrame is essentially an extension of the existing RDD with named columns. The concept of the DataFrame was borrowed from Python (pandas) and R.

A DataFrame is a distributed collection of data organized like a table in a relational database, with internal optimization. DataFrames share the RDD features of immutability, in-memory computation and lazy evaluation, and they enable Spark to manage the schema of the data. The DataFrame API provides several methods for transformations and actions, which makes an engineer's life easier.

Key points about DataFrame

  • It can scale from kilobytes of data on a single machine to petabytes on a large cluster.
  • Optimization and code generation are handled by the Spark SQL Catalyst optimizer.
  • It can be constructed from various data sources, such as:
  1. Existing RDDs
  2. Structured data files (Avro, CSV, JSON)
  3. Hive tables
  4. External databases
  • DataFrame APIs are available in Scala, Java, Python and R.
  • It is not compile-time type-safe.
  • After an object has been transformed into a DataFrame, the original object cannot be reconstructed from it.
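As a quick illustration of the first construction path in the list above, here is a minimal sketch that builds a DataFrame from an existing RDD. The SparkSession settings, column names and sample data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object RddToDf extends App {
  // A local SparkSession for demonstration; the app name is arbitrary.
  val spark = SparkSession.builder().appName("rdd-to-df").master("local[*]").getOrCreate()
  import spark.implicits._

  // Start from an existing RDD of tuples ...
  val rdd = spark.sparkContext.parallelize(Seq(("Alice", 30), ("Bob", 25)))

  // ... and convert it to a DataFrame with named columns.
  val df = rdd.toDF("name", "age")
  df.printSchema()

  spark.stop()
}
```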

Under the hood, Spark's execution engine applies Project Tungsten for memory management and performance optimization. Tungsten eliminates much of the JVM garbage-collection overhead by storing objects off-heap in a binary format.

DataFrames achieve their performance optimization through the Catalyst optimizer, which builds a query plan in the four phases below.

  1. Analysis of the logical plan
  2. Logical plan optimization
  3. Physical planning
  4. Code generation, compiling parts of the query to Java bytecode
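You can inspect these phases yourself by calling `explain(true)` on a DataFrame, which prints the parsed, analyzed, optimized and physical plans. A minimal sketch, assuming a local SparkSession and made-up sample data:

```scala
import org.apache.spark.sql.SparkSession

object CatalystDemo extends App {
  val spark = SparkSession.builder().appName("catalyst-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

  // extended = true prints all plan stages: Parsed Logical Plan,
  // Analyzed Logical Plan, Optimized Logical Plan and Physical Plan.
  df.filter($"age" > 26).select("name").explain(true)

  spark.stop()
}
```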

DataSet

Spark introduced DataSet in the 1.6 release to overcome the limitations of DataFrame. DataSet is a type-safe extension of the DataFrame API: a collection of strongly typed JVM objects, represented in tabular form through an encoder. It carries over all the features of an RDD, so DataSets are also immutable and support lazy evaluation and in-memory computation.

Like DataFrames, DataSets process structured and unstructured data efficiently and can read and write data in various file formats. A DataSet can also be converted back into an RDD.

DataSet APIs are available in Scala and Java and provide several methods for DataSet operations.
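A minimal sketch of a typed DataSet, assuming a local SparkSession; the `Person` case class and the sample data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object DatasetDemo extends App {
  val spark = SparkSession.builder().appName("dataset-demo").master("local[*]").getOrCreate()
  import spark.implicits._ // brings the encoders for case classes into scope

  // A strongly typed collection: the compiler knows each element is a Person.
  val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

  // Typed operations are checked at compile time ...
  val adults = people.filter(_.age >= 18)
  adults.show()

  // ... and the DataSet can be converted back to an RDD if needed.
  val rdd = people.rdd

  spark.stop()
}
```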

Unified Apache Spark API

In the Spark 2.0 release, Spark merged the DataFrame API with the DataSet API. The DataSet API now has two flavors: a strongly typed API and an untyped API. DataFrame is an alias for a collection of generic objects, DataSet[Row], whereas DataSet denotes a collection of strongly typed JVM objects and is available in Scala and Java.

Since Python and R do not provide compile-time safety, they support only the untyped API, i.e. DataFrame.
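The relationship between the two flavors can be sketched as follows, assuming a local SparkSession and a made-up `Person` case class: a DataFrame is literally `Dataset[Row]`, and `as[T]` turns it into a typed DataSet:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Long)

object UnifiedApiDemo extends App {
  val spark = SparkSession.builder().appName("unified-api").master("local[*]").getOrCreate()
  import spark.implicits._

  // DataFrame is defined as Dataset[Row] -- the untyped flavor.
  val df: Dataset[Row] = Seq(("Alice", 30L), ("Bob", 25L)).toDF("name", "age")

  // as[Person] attaches an encoder, giving the strongly typed flavor.
  val ds: Dataset[Person] = df.as[Person]

  spark.stop()
}
```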

DataFrame Operations (Untyped DataSet Operations)

SparkSession is the entry point to all Spark functionality. Prior to the Spark 2.0 release, we needed to create a SQLContext to use the Spark SQL functionality.

val sparkSession = SparkSession.builder().appName("rating").getOrCreate()

Creating a DataFrame in Scala by reading a CSV file:

val df = sparkSession.read.csv("C:\\SparkCourse\\sample.csv")

Now you can perform various operations on the DataFrame for data processing and analysis using the Spark SQL API. For example, call printSchema() to view the schema of the DataFrame.

df.printSchema()

It will display the schema of DataFrame as shown below:

root

 |-- _c0: string (nullable = true)

 |-- _c1: string (nullable = true)

You can supply other required parameters, such as the header property and delimiters, through the option() method while creating the DataFrame.

val df = sparkSession.read.option("header", true).csv("C:\\SparkCourse\\sample.txt")

Now df.printSchema() will display the below output:

root

 |-- name: string (nullable = true)

 |-- age: string (nullable = true)

Use the show() method to see the DataFrame's data.

df.show()

To operate on a particular column of the DataFrame, Spark provides the select() method.

df.select("name").show()
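Beyond select(), a few other common untyped operations can be sketched against hypothetical name/age data standing in for sample.txt (the SparkSession and values here are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DfOpsDemo extends App {
  val spark = SparkSession.builder().appName("df-ops").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data with the same columns as sample.txt.
  val df = Seq(("Alice", 30), ("Bob", 19), ("Carol", 30)).toDF("name", "age")

  // Rows where age is greater than 21.
  df.filter(col("age") > 21).show()

  // Number of rows per age value.
  df.groupBy("age").count().show()

  spark.stop()
}
```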



