Window Function with Spark SQL
Use Baby Names Dataset, create a simple visualization that shows what is the most popular first letters baby names to start within each year.
This dataset is from a website referenced by Data.gov. It lists baby names used in the state of NY from 2007 to 2012.
Step1: Copy the data into json file.
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils
val tmpFile = new File("/tmp/rows.json")
FileUtils.copyURLToFile(new URL("https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"), tmpFile)
Step 2: Use explode function to flatten JSON multiline and store it in temp table = newdata
%python
from pyspark.sql.functions import explode
spark.read.json('/tmp/rows.json', multiLine=True).select(explode("data").alias("data")).createOrReplaceTempView("newdata")
# display and store data in temp table information for babies
sqlContext.sql("SELECT data[0] AS sid, data[1] AS id, data[2] AS position, data[3] AS created_at, data[4] AS created_meta, data[5] AS updated_at, data[6] AS updated_meta, data[7] AS meta, data[8] AS year, data[9] AS first_name, data[10] AS country, data[11] AS sex, data[12] AS name_count FROM newdata").createOrReplaceTempView("babynames")
display(sqlContext.sql("select distinct year from babynames order by year desc limit 5"))
Step 3: Using the tables you created in Step 2 create a simple visualization that shows what is the most popular first letters baby names to start within each year.
%sql drop table if exists popTab; -- create table with namss with sum of name count create table if not exists popTab as (select SUBSTR(first_name, 0, 1) name, sum(name_count) cnt ,year from babynames group by name,year)
Step 4: using window function , select the most popular baby name for each year.
%sql
-- use windowfunction to pick top 1 for most popular name for each year
SELECT
name,
cnt,
year
FROM (
SELECT
name,
cnt,
year,
dense_rank() OVER (PARTITION BY year ORDER BY cnt DESC) as rank
FROM popTab ) tmp
WHERE
rank == 1
order by year