Spark Structured Streaming by Example
Spark Structured Streaming: The idea behind Structured Streaming is to treat a live data stream as a table that is being continuously appended to. It is built on Spark SQL and the Dataset/DataFrame APIs, which replace the older RDD and DStream APIs.
Spark SQL is a Spark module for structured data processing.
Dataset is a distributed collection of data. It is an interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.
DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
Advantages of using structured streaming: Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.
The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Python, and R.
Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning).
Key Differentiator:
The primary difference between the computation models of Spark SQL and Spark Core is Spark SQL's relational framework for ingesting, querying, and persisting (semi-)structured data. Structured queries can be expressed either in SQL (with HiveQL-like features) or through the high-level, SQL-like, declarative Dataset API (also known as the Structured Query DSL).
Example in Python:
Steps: 1) read a JSON file, 2) print its schema, 3) use DataFrame operations to display and transform the contents.
from pyspark.sql import SparkSession

# SparkSession replaces the older SQLContext as the entry point
spark = SparkSession.builder.getOrCreate()

# Create the DataFrame from a JSON file
df = spark.read.json("examples/src/resources/reference.json")
# Show the content of the DataFrame
df.show()
## +----+-----+
## | age| name|
## +----+-----+
## |null| Judy|
## |  30| Andy|
## |  19|Mindy|
## +----+-----+
# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)
# Select only the "name" column
df.select("name").show()
## +-----+
## | name|
## +-----+
## | Judy|
## | Andy|
## |Mindy|
## +-----+
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
## +-----+---------+
## | name|(age + 1)|
## +-----+---------+
## | Judy|     null|
## | Andy|       31|
## |Mindy|       20|
## +-----+---------+
# Select people older than 21
df.filter(df['age'] > 21).show()
## +---+----+
## |age|name|
## +---+----+
## | 30|Andy|
## +---+----+
# Count people by age
df.groupBy("age").count().show()
## +----+-----+
## | age|count|
## +----+-----+
## |null|    1|
## |  19|    1|
## |  30|    1|
## +----+-----+