How to Identify Duplicate Sales Data Using Python
It's exciting to see the Digital Analytics industry start to build some momentum around programmatic analysis techniques. As more conversations happen around the topic, one of the questions that is most often asked is 'yeah but how do you actually use this in the real world?'
So to help answer that question, I've started to build a series of posts focused on using Python to solve real world business problems.
------------------------------------------------
Data Preparation is one of those critical tasks that most digital analysts take for granted as many of the analytics platforms we use take care of this task for us or at least we like to believe they do so. With that said, Data Preparation should be a task that every good analyst completes as part of any data investigation.
Wes McKinney, author of Python for Data Analysis, defines Data Preparation as “cleaning, munging, combining, normalizing, reshaping, slicing, dicing, and transforming data for analysis.”
In this post, I am going to walk you through a real world example, focusing on Data Preparation, of how Python can be a very powerful tool for business focused data analysis.
The Problem
We have been asked to analyze the company’s sales performance for 2015 and have been provided a 300MB Excel Workbook containing millions of rows of sales transactions. We imported the Workbook into Python and while completing a initial round of Data Preparation, flagged several sales transactions that appeared to be duplicates.
Before we move on, we are going to investigate this duplicate issue further to understand how widespread it is and the potential impact it could be having on key metrics such as items sold and total company revenue.
The Dataset
Our dataset contains every order transaction for 2015. The data is structured in such a way that each item purchased, in an order, is a unique row in the data. If an order contained three unique product SKUs, that one order would have three rows in the dataset.
NOTES:
order_id: A unique id that serves as the key to group line items into a single order
order_item_cd: Product SKU
order_item_quantity: The number each SKU purchased for an order
order_item_cost_price: The individual unit price of a SKU purchased
An order that contains duplicated order line items would have a SKU that appears in more than one row of data, for a given order_id, as shown below.
If you are interested in the solution, head over to the 33 Sticks Blog where I detail out the solution as well as provide the full Python code that you are free to download and use, update, change, improve, etc.