SQL in the Big Data Exam Room
As part of a growing list of health maintenance duties, I recently found myself on the receiving end of a routine, decennial medical procedure. Yeah, that one. I was prepped, procedurally explored (ahem), and given a report that (dare I say) visualized the findings.
I was struck by how similar the experience was to a typical data analysis. There were similar steps of preparation, exploration, and visualization. But for data analysis, it’s not me that gets prepped, but the data. Instead of me being scoped, it’s algorithms that do the exploring, and Tableau is great for visualizing the results. And, for each step, SQL is the indispensable instrument in the examination room.
For the last 40 years, SQL has been the workhorse of business intelligence reporting, the lingua franca of data. But although it remains a staple for generating reports, data preparation and exploration have increasingly fallen outside SQL’s reach with the proliferation of new data types, growing volumes of data, and predictive analytics like machine learning algorithms.
This is changing with the emergence of data compute engines, where SQL can be used for all three phases: data preparation, exploration, and visualization.
To illustrate, let’s put U.S. federal government contracts on the exam table. We’ll prepare the data with standard SQL and cluster it with the well-known KMeans unsupervised learning algorithm, itself implemented in SQL. The resulting clusters can be visualized in Tableau below, using an ODBC data connection. Read on for the invasive details.
Data Prep
Data preparation involves connecting to data sources, cleaning out extraneous noise, and normalizing the data. Connecting is as simple as running SQL statements directly against the federal contract files on Amazon Web Services (AWS). And because we don’t want to cluster on all 225 columns in the original data, a SQL query restricts them to a view containing just the four features we’re interested in. Below is a sample of the data in the view:
Note that we map Congressional Districts, of the form CA25, to a state and then to a region, as defined by the Bureau of Economic Analysis. These fact tables are also stored on AWS and are easily accessible by a query.
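To make the prep concrete, here’s a minimal local sketch using SQLite. The column names, sample rows, and table layouts are illustrative assumptions on my part; the real 225-column schema, the Quest engine, and the AWS connection details aren’t shown in this post.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
    -- Stand-in for the wide contracts table; column names are illustrative.
    CREATE TABLE FedContractsRaw (
        id INTEGER PRIMARY KEY,
        district TEXT,          -- Congressional District, e.g. 'CA25'
        contract_value REAL,
        employees INTEGER,
        contract_type TEXT
        -- ...the real data has roughly 220 more columns...
    );

    -- Fact table mapping states to BEA regions (two sample rows).
    CREATE TABLE StateRegion (state TEXT, region TEXT);
    INSERT INTO StateRegion VALUES ('CA', 'Far West'), ('NY', 'Mideast');

    INSERT INTO FedContractsRaw VALUES
        (1, 'CA25', 1.2e6, 150000, 'Fixed Price'),
        (2, 'NY12', 3.4e5, 900,    'Cost Plus');

    -- The view keeps only the features used for clustering, and maps each
    -- district to its state (the first two letters) and its BEA region.
    CREATE VIEW FedContracts AS
    SELECT f.id,
           SUBSTR(f.district, 1, 2) AS state,
           s.region,
           f.contract_value,
           f.employees,
           f.contract_type
    FROM FedContractsRaw f
    JOIN StateRegion s ON s.state = SUBSTR(f.district, 1, 2);
""")

for row in cur.execute("SELECT id, state, region FROM FedContracts"):
    print(row)   # (1, 'CA', 'Far West') then (2, 'NY', 'Mideast')
```

The point is that the entire prep step is ordinary SQL: a view, a string function, and a join against a fact table.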
Algorithmic Exploration
Now that our data is prepped, we can explore it with KMeans clustering. Running KMeans is a one-liner:
RUN kmeans(SELECT 'FedContracts', 'id', 10, 20, 'KMeansCentroids', 'KMeansAssignments');
where
- FedContracts is our data;
- id is the unique index column;
- 10 is the number of clusters we want KMeans to find;
- 20 is the maximum number of steps to iterate;
- KMeansCentroids is a table of cluster centroids; and
- KMeansAssignments is a table that assigns each data point to a cluster.
kmeans is a Quest pipeline that implements the details of the KMeans algorithm in just six SQL queries. Drop me a note if you’re interested in the details.
Visualizing Results
The Tableau visualization at the top of the post offers several ways to interpret the resulting KMeans clusters. A few observations:
- The Region Map tab shows that cluster C1 represents the West, C2 & C3 the Northeast, C4 the Southeast, C8 the Plains, C9 the Northwest, and C10 the Midwest.
- Cluster C5 represents big contracts across the country.
- Cluster C6 doesn’t seem to readily cluster into any region. However, its box plot, under the Contract Value tab, has the smallest contract value range. This may imply that smaller contracts tend to be spread evenly across congressional districts.
- Not sure what cluster C7 is saying. It’s regionally dispersed, and its Contract Value range contains bigger deals (a higher upper quartile), though not significantly bigger. Sometimes clusters don’t make intuitive sense and require further digging. For example, the cluster count of 10 may be too small, so that C7 absorbs points that belong in additional clusters. Or maybe there’s noise in the data that’s getting captured by C7.
- The Employees tab shows that federal contractors tend to be of similar size, with more than 100,000 employees.
- The Contract Types & Avg Value tab shows which types of contract vehicle tend to attract the average contract value.
Prognosis
Since the early 1970s, SQL has proven essential for business intelligence reporting. But it can do so much more. SQL is being reimagined for big data preparation and data exploration with the latest machine learning algorithms. With data compute engines, like Quest, SQL will be scoping data for decades to come.