SQL in the Big Data Exam Room
As part of a growing list of health maintenance duties, I recently found myself on the receiving end of a routine, decennial medical procedure. Yeah, that one. I was prepped, procedurally explored (ahem), and given a report that (dare I say) visualized the findings.
I was struck by how similar the experience was to a typical data analysis. There were similar steps of preparation, exploration, and visualization. But for data analysis, it’s not me that gets prepped, but the data. Instead of me being scoped, it’s algorithms that do the exploring, and Tableau is great for visualizing the results. And, for each step, SQL is the indispensable instrument in the examination room.
For the last 40 years, SQL has been the workhorse of business intelligence reporting, the lingua franca of data. But although it remains a staple for generating reports, data preparation and exploration have increasingly fallen outside SQL’s reach with the proliferation of new data types, growing volumes of data, and predictive analytics like machine learning algorithms.
This is changing with the emergence of data compute engines, where SQL can be used for all three phases: data preparation, exploration, and visualization.
To illustrate, let’s put U.S. federal government contracts on the exam table. We’ll prepare the data with standard SQL and cluster it with the well-known KMeans unsupervised learning algorithm, itself implemented in SQL. The resulting clusters can be visualized in Tableau below, using an ODBC data connection. Read on for the invasive details.
Data Prep
Data preparation involves connecting to data sources, cleaning out extraneous noise, and normalizing the data. Connecting is as simple as running SQL statements directly against the federal contract files on Amazon Web Services (AWS). And because we don’t want to cluster on all 225 columns in the original data, a SQL query restricts them to a view containing just the four features we’re interested in. Below is a sample of the data in the view:
Note that we map Congressional Districts, of the form CA25, to a state and then to a region, as defined by the Bureau of Economic Analysis. These fact tables are also stored on AWS and are easily accessible by a query.
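To make the prep concrete, here’s a minimal local sketch using SQLite. The column names, sample rows, and table layouts are illustrative assumptions on my part; the real 225-column schema, the Quest engine, and the AWS connection details aren’t shown in this post.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
    -- Stand-in for the wide contracts table; column names are illustrative.
    CREATE TABLE FedContractsRaw (
        id INTEGER PRIMARY KEY,
        district TEXT,          -- Congressional District, e.g. 'CA25'
        contract_value REAL,
        employees INTEGER,
        contract_type TEXT
        -- ...the real data has roughly 220 more columns...
    );

    -- Fact table mapping states to BEA regions (two sample rows).
    CREATE TABLE StateRegion (state TEXT, region TEXT);
    INSERT INTO StateRegion VALUES ('CA', 'Far West'), ('NY', 'Mideast');

    INSERT INTO FedContractsRaw VALUES
        (1, 'CA25', 1.2e6, 150000, 'Fixed Price'),
        (2, 'NY12', 3.4e5, 900,    'Cost Plus');

    -- The view keeps only the features used for clustering, and maps each
    -- district to its state (the first two letters) and its BEA region.
    CREATE VIEW FedContracts AS
    SELECT f.id,
           SUBSTR(f.district, 1, 2) AS state,
           s.region,
           f.contract_value,
           f.employees,
           f.contract_type
    FROM FedContractsRaw f
    JOIN StateRegion s ON s.state = SUBSTR(f.district, 1, 2);
""")

for row in cur.execute("SELECT id, state, region FROM FedContracts"):
    print(row)   # (1, 'CA', 'Far West') then (2, 'NY', 'Mideast')
```

The point is that the entire prep step is ordinary SQL: a view, a string function, and a join against a fact table.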
Algorithmic Exploration
Now that our data is prepped, we can explore it with KMeans clustering. Running KMeans is a one-liner:
RUN kmeans(SELECT 'FedContracts', 'id', 10, 20, 'KMeansCentroids', 'KMeansAssignments');
where
- FedContracts is our data;
- id is the unique index column;
- 10 is the number of clusters we want KMeans to find;
- 20 is the maximum number of steps to iterate;
- KMeansCentroids is a table of cluster centroids; and
- KMeansAssignments is a table that assigns each data point to a cluster.
kmeans is a Quest pipeline that implements the details of the KMeans algorithm in just six SQL queries. Drop me a note if you’re interested in the details.
Visualizing Results
The Tableau visualization at the top of the post offers several ways to interpret the resulting KMeans clusters. A few observations:
- The Region Map tab shows that cluster C1 represents the West, C2 & C3 the Northeast, C4 the Southeast, C8 the Plains, C9 the Northwest, and C10 the Midwest.
- Cluster C5 represents big contracts across the country.
- Cluster C6 doesn’t seem to readily cluster into any region. However, its box plot, under the Contract Value tab, has the smallest contract value range. This may imply that smaller contracts tend to be spread evenly across congressional districts.
- Not sure what cluster C7 is saying. It’s regionally dispersed, and its Contract Value range contains bigger deals (a higher upper quartile), though not significantly bigger. Sometimes clusters don’t make intuitive sense and require further digging. For example, the cluster count of 10 may be too small, so that C7 absorbs points that belong in additional clusters. Or maybe there’s noise in the data that’s getting captured by C7.
- The Employees tab shows that federal contractors tend to be of similar size, with more than 100,000 employees.
- The Contract Types & Avg Value tab shows which types of contract vehicle tend to attract the average contract value.
Prognosis
Since the early 1970s, SQL has proven essential for business intelligence reporting. But it can do so much more. SQL is being reimagined for big data preparation and data exploration with the latest machine learning algorithms. With data compute engines, like Quest, SQL will be scoping data for decades to come.