Spark Structured Streaming by Example
Spark Structured Streaming: The idea behind Structured Streaming is to treat a live data stream as a table that is being continuously appended to. It is built on Spark SQL and the Dataset/DataFrame APIs, which replace the older RDD and DStream APIs.
Spark SQL is a Spark module for structured data processing.
Dataset is a distributed collection of data. It is an interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.
DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
Advantages of using structured streaming: Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.
The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Python, and R.
Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning).
Key Differentiator:
The primary difference between the computation models of Spark SQL and Spark Core is Spark SQL's relational framework for ingesting, querying, and persisting (semi-)structured data. Structured queries can be expressed either in SQL (with HiveQL-like features) or through the high-level, SQL-like, declarative Dataset API (also known as the Structured Query DSL).
Example in Python:
Steps: 1) read a JSON file, 2) print its schema, 3) use DataFrame operations to display and transform the contents.
from pyspark.sql import SparkSession

# SparkSession replaces the older SQLContext as the entry point
spark = SparkSession.builder.getOrCreate()

# Create the DataFrame from a JSON file
df = spark.read.json("examples/src/resources/reference.json")
# Show the content of the DataFrame
df.show()
## +----+-----+
## | age| name|
## +----+-----+
## |null| Judy|
## |  30| Andy|
## |  19|Mindy|
## +----+-----+
# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)
# Select only the "name" column
df.select("name").show()
## +-----+
## | name|
## +-----+
## | Judy|
## | Andy|
## |Mindy|
## +-----+
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
## +-----+---------+
## | name|(age + 1)|
## +-----+---------+
## | Judy|     null|
## | Andy|       31|
## |Mindy|       20|
## +-----+---------+
# Select people older than 21
df.filter(df['age'] > 21).show()
## +---+----+
## |age|name|
## +---+----+
## | 30|Andy|
## +---+----+
# Count people by age
df.groupBy("age").count().show()
## +----+-----+
## | age|count|
## +----+-----+
## |null|    1|
## |  19|    1|
## |  30|    1|
## +----+-----+