Data Science Quick Tip #003: Using Scikit-Learn Pipelines!


Hello again, lovely people! We’re back this week with another data science quick tip, and this one is sort of a two-parter. In this first part, we’ll cover how to use Scikit-Learn pipelines with Scikit-Learn’s built-in transformers, and in the next part, I’ll teach you how to use your own custom data transformers within this same pipeline framework. (Stay tuned for that post!)

Before getting into things, let me share my GitHub for this post in case you want to follow along more closely. I’ve also included the data we’ll be working with as well. Check it all out at this link.

As always, let’s start with the intuition on why you would want to leverage something like this. I’m assuming that you’re already familiar with the concept that predictive models are often exported into binary pickle files. These binaries are then later imported for use elsewhere in things like APIs. That way when you receive data through said API, your deserialized pickle can perform the prediction and ship the results back to your eager, smiling user.

But your data won’t always arrive at your API ready to go! In many cases you’ll have to do a little preprocessing work before it can be run through the model for prediction. I’m talking about things like one-hot encoding, scaling, imputing, and more. The Scikit-Learn package offers a number of transformers for these tasks, but if you do NOT use a pipeline, you’ll have to serialize each individual transformer. You could easily end up with six or seven separate pickle files. Not ideal!

Fortunately, this is where Scikit-Learn’s Pipelines come to the rescue. With a pipeline, your data passes through every transformation it needs and gets pushed through the model at the end, all in one place. The result is a single serialized pickle file that performs all the appropriate transformations for you. Handy! Pipelines can be a little tricky the first time you see them, but once you get the hang of things, you can do a lot of powerful stuff with them.

So for our project here today, we’ll be making use of the Titanic dataset from Kaggle. This is a commonly used dataset for learning purposes as it is generally easy to understand. As you might be guessing, yes, we are indeed talking about the same Titanic made into the famous movie starring Leo DiCaprio and Kate Winslet. (Yes, he lets me call him Leo.)

The dataset contains a mix of features all about each individual passenger aboard the ship and whether or not they survived. Most people use this dataset as a means to learn about supervised learning since we have a clearly defined target variable (“Survived”). While we’re interested in learning how to use Scikit-Learn’s pipelines, I wouldn’t say we’re overly interested in getting highly accurate results right now. In fact, we’ll be omitting so many features from this dataset that you definitely shouldn’t bank on our model being any better than a coin toss.

To start off the project, we’ll be importing all the libraries we’ll be using as well as the training dataset. We’ll also go ahead and do a split on the training dataset into training and validation sets just so you can see how inference will work later on. Don’t worry, it might seem daunting that we’re importing all these things, but relatively speaking, each piece is playing a small role here.

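The setup described above might look something like the following sketch. Since the original code screenshot isn’t available, the toy DataFrame here is a stand-in for Kaggle’s `train.csv` (swap in `pd.read_csv("train.csv")` if you’re following along with the real data), and the variable names are my own assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for Kaggle's train.csv; replace with pd.read_csv("train.csv").
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"] * 25,
    "Age": [22, 38, 26, 35] * 25,
    "Survived": [0, 1, 0, 1] * 25,
})

# Separate the target ("Survived") from the features.
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Hold out 20% of the training data as a validation set for later inference.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```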

Alright, as mentioned before, we’re going to keep things pretty simple for this project. For our pipeline, we’re going to have three basic steps: data preprocessing, data scaling, and predictive modeling using a random forest classifier.

First up, let’s tackle data preprocessing with a very basic transformation on a single feature. The “Sex” (or gender) feature houses a binary “male” or “female” characteristic for each person in the dataset, and since we know our model needs to work with numerical data, we’ll perform a simple one-hot encoding on this feature. Let’s set up our data preprocessor to do just this.

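A minimal sketch of that preprocessor follows. I’m assuming a `ColumnTransformer` wrapping Scikit-Learn’s `OneHotEncoder`, since that’s the standard way to one-hot encode a single column while leaving room to add more transformers later; the name `data_preprocessor` is my own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the "Sex" column; drop everything else for this toy example.
data_preprocessor = ColumnTransformer(
    transformers=[
        ("onehot_sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ],
    remainder="drop",
)

# Quick sanity check on a tiny frame: two categories -> two output columns.
toy = pd.DataFrame({"Sex": ["male", "female", "male"]})
encoded = data_preprocessor.fit_transform(toy)
```

Adding more transformers later is just a matter of appending entries to the `transformers` list.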

Again, this is a pretty poor thing to do in the real world since we’re making use of only a single feature, but the reason we’ve set up our transformer this way is that we can later add custom transformers to this same data preprocessor to do some extra fun things with more features. I don’t want to muddy the waters by doing too much in one post, so we’ll cover that in the next one.

Generally speaking, the data preprocessing step is the most complex, which is why we separated it out into its own object. The other steps are pretty straightforward, so now that we’ve created our data preprocessor, we can bundle it with our data scaler and RandomForestClassifier into a unified pipeline. The Pipeline runs the data through these steps in the order you list them, so make sure your steps are ordered correctly! (Don’t worry about the model hyperparameters. I just copy/pasted them from another thing I once worked on.)

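Here’s a sketch of what that unified pipeline could look like. The step names and the RandomForestClassifier hyperparameters are placeholders of my own choosing, not the exact values from the original screenshot.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data_preprocessor = ColumnTransformer(
    transformers=[("onehot_sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"])],
    remainder="drop",
)

# Steps run top to bottom: preprocess, then scale, then model.
pipeline = Pipeline(steps=[
    ("preprocessor", data_preprocessor),
    ("scaler", StandardScaler(with_mean=False)),  # with_mean=False also works if output is sparse
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
```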

And friends, that’s pretty much it. This new Pipeline object is easy to interact with because it exposes the methods of its final step, the estimator. In our case, that means we fit it just as we would a RandomForestClassifier not housed in a pipeline. You can also see in the output each step the Pipeline went through while fitting the data.

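Fitting looks like the single call below. This is a self-contained sketch with toy training data standing in for the real Titanic split, so the variable names are assumptions rather than the author’s exact code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the training split from earlier.
X_train = pd.DataFrame({"Sex": ["male", "female"] * 20})
y_train = pd.Series([0, 1] * 20)

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        [("onehot_sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"])]
    )),
    ("scaler", StandardScaler(with_mean=False)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# One call fits every transformer and the final model, in order.
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_train)
```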

Now that we’ve fit the pipeline to the data, it’s ready to be serialized and used for inference. Just to show that this works, I dumped the model into a single serialized pickle file, deserialized that file into a new model object, and ran the new object against our validation set.

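The dump-and-reload round trip can be sketched as below. Toy data again stands in for the real training and validation splits, and the filename `model.pkl` is an assumption; the point is that one pickle carries the whole pipeline, transformations and all.

```python
import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the earlier training / validation splits.
X_train = pd.DataFrame({"Sex": ["male", "female"] * 20})
y_train = pd.Series([0, 1] * 20)
X_val = pd.DataFrame({"Sex": ["male", "female"] * 5})
y_val = pd.Series([0, 1] * 5)

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        [("onehot_sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"])]
    )),
    ("scaler", StandardScaler(with_mean=False)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Serialize the entire fitted pipeline to one pickle file, then load it back.
with open("model.pkl", "wb") as f:
    pickle.dump(pipeline, f)
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# The reloaded object handles raw validation rows end to end.
val_preds = loaded_model.predict(X_val)
acc = accuracy_score(y_val, val_preds)
auc = roc_auc_score(y_val, loaded_model.predict_proba(X_val)[:, 1])
```

Note that loading a pickle elsewhere (say, inside an API) requires the same scikit-learn version to be installed, which is a practical reason to pin dependencies for serving.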

Interestingly enough, our accuracy and ROC AUC scores weren’t too shabby, but I still know for a fact this is an awful model. Although if memory serves me correctly, women were given priority for rescue off the Titanic, so perhaps gender alone isn’t such a bad predictor after all. Still, I would never throw my eggs into the single-feature basket.

And that wraps up this post! Hope you all got some value out of learning about Scikit-Learn’s pipeline feature. As mentioned above, stay tuned next time for a post that takes this a step further to use custom transformers within this pipeline framework. Then we’ll really be cooking with gas!


More articles by David Hundley
