Window Function with Spark SQL

Sunita Sharma

Published Mar 24, 2020

Use Baby Names Dataset, create a simple visualization that shows what is the most popular first letters baby names to start within each year.

This dataset is from a website referenced by Data.gov. It lists baby names used in the state of NY from 2007 to 2012.

Step1: Copy the data into json file.

import java.net.URL

import java.io.File
import org.apache.commons.io.FileUtils

val tmpFile = new File("/tmp/rows.json")

FileUtils.copyURLToFile(new URL("https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"), tmpFile)

Step 2: Use explode function to flatten JSON multiline and store it in temp table = newdata

%python
from pyspark.sql.functions import explode

spark.read.json('/tmp/rows.json', multiLine=True).select(explode("data").alias("data")).createOrReplaceTempView("newdata")

# display and store data in temp table information for babies

sqlContext.sql("SELECT data[0] AS sid, data[1] AS id, data[2] AS position, data[3] AS created_at, data[4] AS created_meta, data[5] AS updated_at, data[6] AS updated_meta, data[7] AS meta, data[8] AS year, data[9] AS first_name, data[10] AS country, data[11] AS sex, data[12] AS name_count FROM newdata").createOrReplaceTempView("babynames")


display(sqlContext.sql("select distinct year from babynames order by year desc limit 5"))

Step 3: Using the tables you created in Step 2 create a simple visualization that shows what is the most popular first letters baby names to start within each year.

%sql
drop table if exists popTab;
-- create table with namss with sum of name count
create table if not exists popTab as
(select  SUBSTR(first_name, 0, 1) name, sum(name_count) cnt ,year from babynames group by name,year)

Step 4: using window function , select the most popular baby name for each year.

%sql

-- use windowfunction to pick top 1 for most popular name for each year

SELECT
 name,
 cnt,
 year
FROM (
  SELECT
     name,
 cnt,
 year,
    dense_rank() OVER (PARTITION BY year ORDER BY cnt DESC) as rank
  FROM popTab ) tmp 
WHERE
  rank == 1
order by year

To view or add a comment, sign in

Window Function with Spark SQL

Sunita Sharma

More articles by Sunita Sharma

Explore content categories

More articles by Sunita Sharma

Data Lake Vs Data Warehouse

Spark Structured Streaming by Example

Object Storage vs. Block Storage

How to use Web call in Spark and Flatten complex Data Structure

Joining datasets using RDD API

Explore content categories