Window Function with Spark SQL

Use Baby Names Dataset, create a simple visualization that shows what is the most popular first letters baby names to start within each year.

This dataset is from a website referenced by Data.gov. It lists baby names used in the state of NY from 2007 to 2012.

Step1: Copy the data into json file.

import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

val tmpFile = new File("/tmp/rows.json")

FileUtils.copyURLToFile(new URL("https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"), tmpFile)

Step 2: Use explode function to flatten JSON multiline and store it in temp table = newdata

%python
from pyspark.sql.functions import explode

spark.read.json('/tmp/rows.json', multiLine=True).select(explode("data").alias("data")).createOrReplaceTempView("newdata")

# display and store data in temp table information for babies

sqlContext.sql("SELECT data[0] AS sid, data[1] AS id, data[2] AS position, data[3] AS created_at, data[4] AS created_meta, data[5] AS updated_at, data[6] AS updated_meta, data[7] AS meta, data[8] AS year, data[9] AS first_name, data[10] AS country, data[11] AS sex, data[12] AS name_count FROM newdata").createOrReplaceTempView("babynames")


display(sqlContext.sql("select distinct year from babynames order by year desc limit 5"))

Step 3: Using the tables you created in Step 2 create a simple visualization that shows what is the most popular first letters baby names to start within each year.

%sql
drop table if exists popTab;
-- create table with namss with sum of name count
create table if not exists popTab as
(select  SUBSTR(first_name, 0, 1) name, sum(name_count) cnt ,year from babynames group by name,year) 

Step 4: using window function , select the most popular baby name for each year.

%sql

-- use windowfunction to pick top 1 for most popular name for each year

SELECT
 name,
 cnt,
 year
FROM (
  SELECT
     name,
 cnt,
 year,
    dense_rank() OVER (PARTITION BY year ORDER BY cnt DESC) as rank
  FROM popTab ) tmp 
WHERE
  rank == 1
order by year


To view or add a comment, sign in

More articles by Sunita Sharma

  • Data Lake Vs Data Warehouse

    FYI : Just a refresher with important data points to know. Data Lake stores all data irrespective of the source and its…

    1 Comment
  • Spark Structured Streaming by Example

    Spark Structured Streaming: The idea behind Structured Streaming is to treat a live data stream as a table that is…

  • Object Storage vs. Block Storage

    What is Object Storage? Object storage, or object-based storage packages the data and metadata into objects. Metadata…

    1 Comment
  • How to use Web call in Spark and Flatten complex Data Structure

    The example shows making a web call to health data from the public website https://health.data.

  • Joining datasets using RDD API

    Joining 2 datasets with Spark SQL is pretty simple, It gets tricky when you use Spark RDD. Example uses TPCH dataset…

Explore content categories