Bureau of Labor Statistics: ETL and Analysis Project with Python and Power BI
Data Extraction, Analysis and Visualization of Bureau of Labor Statistics data using the BLS Public API, Python and Power BI
By: Grant Arbuckle
THE PROJECT
For those living in the U.S., I'm sure you've either had or heard conversations regarding the state of the economy in 2023, such as rising prices and minimal wage increases that don't really seem to be keeping up with the inflation rate.
While there are plenty of opinions and projections on these topics, I wanted to dig into and visualize economic data from a reputable source to see these trends for myself and identify correlations between them.
I decided to build a simple #ETL pipeline for this data as well, to automate the extraction, transformation and loading of new data into my visualization as new data became available.
Initial business questions I sought to answer:
ACCESSING THE DATA
I set out to connect to live data through an API so I could simply "refresh" to pull in new data as it becomes available, and visualize this data in PowerBI.
After some research, I found that The Bureau of Labor Statistics (BLS) was a good source of economic data. The Bureau of Labor Statistics measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making.
After looking at the data tools on their site, I found that the BLS has its data available in a public API, which provides a way to pull data collected from a variety of surveys and categories available on their website through automated scripts, rather than manually choosing file parameters and downloading files of each survey type one by one.
It took me just a couple minutes to register and receive a registration key to use for the API. They have documentation on their site that explain how to send an API request in Python:
I also knew I wanted to pull data for multiple different series, and their site showed which API signature I needed to use to do so:
The data series on the website each have their own unique id, and while the site states that there is no catalog of series ids, they have several pages like this page that show the most popular series downloads for specific categories and their corresponding series ids.
I popped open a Jupyter Notebook to make my API request using the above syntax and parse through the JSON response I received. I found through my parsing that there are different fields depending on the category of the survey/series on the BLS website, so I decided to break out my Python API request scripts by series. For example, the Unemployment Rate series contains fields like "demographic_age" and "demographic_race", and CPI does not.
Because of this, I decided to pull the CPI series (CUUR0000SA0) by itself to start with.
I began with my Python imports and API signature:
I then identified the different parts of the JSON array I needed; "catalog", which contained series-level information, and "data", which contained the actual data point for each period.
After identifying the different elements I needed within the array, I defined my dataframe and wrote some for loops to iterate over the elements in the JSON array and convert it to a pandas dataframe:
I ran df.to_excel() to confirm the dataframe output looked good and the data was ready to be ingested by PowerBI:
I followed a similar process for the other series I wanted, with slight changes made in the scripts to account for differences in available fields.
This may seem like a lot of work to get the data - but as mentioned before, following this process provides a way to get all new data with a simple click of the refresh button in PowerBI, rather than downloading individual files to excel and importing to PowerBI every time new data becomes available!
Once I finished writing the Python scripts for each of the data categories, I was then ready to import the data in PowerBI, transform it, and conduct my analysis.
TRANSFORMING THE DATA
In order to connect to the API and apply my Python upon opening PowerBI desktop, I clicked on "Get Data" at the top left and selected "Python Script" from the options.
Recommended by LinkedIn
From here, it should have been as simple as pasting my python script in the popup box in PowerBI, but instead I received a lengthy "import error". After reviewing the error message in depth and scouring the internet, I finally found this PowerBI community forum page that explained that the error was actually a bug in PowerBI that was causing it to look at the wrong installation path of Python. After reconfiguring my environment variables and making the necessary changes in the PowerBI options, I was able to get the script to run and return the data I was expecting.
The next step was to transform my data using Power Query and perform some calculations to gain better insights. I'll continue using the CPI series as an example.
I applied some steps to calculate a percent change from the previous period, 3 periods ago, and 6 periods ago. This involved first adding an index and two custom columns: the first column to concatenate the year and period together, and the second to pull previous period values while accounting for rows with no previous period.
The next set of steps were performed to create an additional custom column to calculate a percent change from the previous period and change the data type to a percentage. I performed the same set of steps to calculate and add columns for 3-period percentage changes and 6-period percentage changes.
After applying all my steps in PowerQuery, my table now had additional data points to use for my analysis, and these applied steps could be added to my other BLS series I planned to pull in so I didn't have to repeat the process:
VISUALIZATION AND ANALYSIS
Next came the fun part - visualizing the data. I revisited my initial business questions and added KPIs for the CPI series data with business question 1 in mind ("how has the average CPI changed over the past 3 years?"), as well as some other visualizations to further identify and understand the data trends. I added a table with some conditional formatting to identify where large percentage increases began occurring, a line chart for a good visual tracking the CPI changes, and KPI cards for the 3 most recent year CPI averages:
For the visualizations, I made sure to configure the filters to behave in a dynamic manner and continue showing the desired date ranges of data when new data comes available in the future. For example, I used a "Top N" filter on the line chart to always show the 3 most recent years (or 36 periods) of CPI data:
As we can see, the CPI began increasing at a more rapid rate over the past 3 years or so, with a difference of ~12 in average CPI from 2020 to 2021 and a difference of ~21 from 2021 to 2022. This was interesting to see just where exactly the CPI started increasing more rapidly, and begged the question of if the sharp CPI increase trend will continue or if it will stagnate as with the past couple of months. This insight also made me even more interested in how the trends in other economic factors affect CPI.
I have provided a link to the full-size images of all my PowerBI dashboards from this project here and at the bottom of this article.
I followed the same general ETL process for the additional BLS series; wrote python scripts to extract the data from the API, transformed the data within the PowerQuery editor, and loaded the data into tables to visualize in PowerBI.
For my business question number 2, "Which common products increased in price the most over the past year? (Top 5)", it was simple enough to create a line chart with the product types in the legend to show the change in average product prices over time. However, I had to perform the % change calculations differently since each product needed to be calculated separately.
Rather than perform the calculations in Power Query as with the CPI series, I created a calculated measure which I called "% Change, Year" and used PowerBI DAX to write a dynamic formula that would bring back the percent change for the most recent year of data by product type:
Using DAX in this way ensured that anytime I refreshed my data and new periods were pulled in, the percentage change calculation was always up-to-date and correct. This is what my completed Product Price dashboard looked like:
This showed that the 5 common products with the largest percentage increases in price over the past year were: Eggs, with a 70.66% price increase; Oil, 66.75%; Gasoline, 33.99%; Flour, 22.33%; and Chicken, at a 17.95% increase.
Finally, I reviewed business questions 3-5:
These 3 questions could all be answered with the employment and labor data series. After pulling them in to my data model and applying the same steps in Power Query as with the CPI data series, I built out visualizations to answer the questions and provide some additional insight. The single line chart showed the downward trend of the unemployment rate from 2012 until a spike in 2020 (the pandemic), followed by a sharp decline shortly thereafter. The two KPI cards show that the unemployment rate has dropped by -32% over the past year, and -54.9% over the past 10 years - definitely an encouraging insight.
The first bar/line combination chart showed no apparent correlation between the unemployment rate and CPI increases. However, the second bar/line chart showed that while the average hourly earnings increased mostly in line with the CPI from 2012-2020, the CPI increased at a disproportionally higher rate than hourly earnings from 2020-2022. This was interesting, as this shed some light on why the buying power of the dollar seems to have been dwindling recently! These findings also made me glad to have this live data connection, so I can see if the CPI increases level off and begin trending alongside the average wage rate changes again, similar to what the chart shows prior to 2020, in the coming months.
CONCLUSION
There are so many more insights to be uncovered in the BLS data, but this gave me a glimpse of some of the factors at play in our economy and provided a good visual for tracking economic changes over time. Through this project I learned a great deal about communicating with APIs, using Power Query, and creating calculated measures with DAX in PowerBI.
Thank you for checking out my project! Please connect with me Grant Arbuckle and check out my other projects on my portfolio site or my featured section!
Peter McNally
Thank you for the shout-out!
So glad to see you being successful. Great Work!!