Convert College Scorecard Files to Tableau Hyper Format using pypeds
A few weeks back, there was a request to generate Tableau versions of the College Scorecard datasets.
I previously supported the Scorecard datasets in the Python library pypeds via the datasets module, but noticed that those features broke when the data moved to data.ed.gov. What a great excuse to update the library in order to support Jon's request!
Today I am announcing an update to pypeds that now supports the updated College Scorecard data while also providing a simple pathway to generate the updated files as Tableau hyper files.
But wait, I don't know Python! That's ok, I created a Google Colab notebook for you. The link can be found below:
All that you have to do is select Runtime > Run All from the menu at the top of the notebook. Google Colab will install the pypeds library for you, download the College Scorecard files, and save the hyper files to the notebook's file browser. No Python coding needed! Once the code has completed, right-click a file in the file browser to download the .hyper file to your machine for use in Tableau!
The resulting .hyper files are large (almost 3,000 columns!), so please be mindful that Tableau may slow down a bit as a result. If you are comfortable with Python, it is also worth noting that the code stores each dataset under its own key in the scorecard dictionary and shows how to write each one out. Each dataset is a pandas DataFrame, of course. For those of you wondering how this all works, pypeds is using the excellent pantab library under the hood.
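For the curious, here is a minimal sketch of that idea: looping over a dictionary of pandas DataFrames and writing each one to its own .hyper file with pantab. This is not the pypeds code itself. The helper names (hyper_path_for, write_hyper_files), the writer parameter, and the example keys in the usage note are hypothetical and only there for illustration; the one real library call is pantab.frame_to_hyper.

```python
from pathlib import Path


def hyper_path_for(key, out_dir="."):
    """Build the output path for one dataset, e.g. 'merged' -> ./merged.hyper."""
    return Path(out_dir) / f"{key}.hyper"


def write_hyper_files(datasets, out_dir=".", writer=None):
    """Write every DataFrame in `datasets` (a dict of name -> DataFrame)
    to its own .hyper file and return a dict of name -> output path.

    `writer` is a hypothetical hook so the loop can be tried without
    Tableau's dependencies; by default it is pantab.frame_to_hyper.
    """
    if writer is None:
        # Imported lazily so this module loads even if pantab is missing.
        import pantab  # pip install pantab

        writer = pantab.frame_to_hyper

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = {}
    for key, df in datasets.items():
        path = hyper_path_for(key, out_dir)
        # The dictionary key doubles as the table name inside the file.
        writer(df, path, table=key)
        written[key] = path
    return written
```

With pantab installed, a call might look like write_hyper_files({"merged": merged_df}, out_dir="hyper"), where merged_df stands in for whichever DataFrame you pulled from the scorecard dictionary. Using the dictionary key as both the file name and the table name keeps the output easy to match back to the source dataset.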
That's it! If there are common data cleaning tasks (removing certain variables, recoding values, renaming columns, etc.) that you always perform with the Scorecard data, please do let me know, as pypeds really aims to standardize the extraction and collection of education datasets. If anything does not behave the way that you expected, please don't hesitate to submit bug reports or feature requests on the GitHub repo for pypeds.
Let me know how it goes!
Brock Tibert - did you remove the scorecard data from the library? When trying to run the notebook, I get an error: 'pypeds.datasets' has no attribute 'scorecard_merged', and checking the GitHub repo, the datasets file doesn't seem to have it. I'm relatively inexperienced in Python, though, so I may be missing it.
Very cool, Brock Tibert. Great use of Colab. Thanks for sharing and for continuing to create data sets that are broadly helpful to the higher ed data community.