How to Identify Duplicate Sales Data Using Python

jason thompson

Published Jan 18, 2016

It's exciting to see the Digital Analytics industry start to build some momentum around programmatic analysis techniques. As more conversations happen around the topic, one of the questions that is most often asked is 'yeah but how do you actually use this in the real world?'

So to help answer that question, I've started to build a series of posts focused on using Python to solve real world business problems.

------------------------------------------------

Data Preparation is one of those critical tasks that most digital analysts take for granted as many of the analytics platforms we use take care of this task for us or at least we like to believe they do so. With that said, Data Preparation should be a task that every good analyst completes as part of any data investigation.

Wes McKinney, author of Python for Data Analysis, defines Data Preparation as “cleaning, munging, combining, normalizing, reshaping, slicing, dicing, and transforming data for analysis.”

In this post, I am going to walk you through a real world example, focusing on Data Preparation, of how Python can be a very powerful tool for business focused data analysis.

The Problem

We have been asked to analyze the company’s sales performance for 2015 and have been provided a 300MB Excel Workbook containing millions of rows of sales transactions. We imported the Workbook into Python and while completing a initial round of Data Preparation, flagged several sales transactions that appeared to be duplicates.

Before we move on, we are going to investigate this duplicate issue further to understand how widespread it is and the potential impact it could be having on key metrics such as items sold and total company revenue.

The Dataset

Our dataset contains every order transaction for 2015. The data is structured in such a way that each item purchased, in an order, is a unique row in the data. If an order contained three unique product SKUs, that one order would have three rows in the dataset.

NOTES:
order_id: A unique id that serves as the key to group line items into a single order
order_item_cd: Product SKU
order_item_quantity: The number each SKU purchased for an order
order_item_cost_price: The individual unit price of a SKU purchased

An order that contains duplicated order line items would have a SKU that appears in more than one row of data, for a given order_id, as shown below.

If you are interested in the solution, head over to the 33 Sticks Blog where I detail out the solution as well as provide the full Python code that you are free to download and use, update, change, improve, etc.

To view or add a comment, sign in

How to Identify Duplicate Sales Data Using Python

jason thompson

The Problem

The Dataset

More articles by jason thompson

Others also viewed

A smart Hack for Correlation Analysis in Power BI

Handling Missing Data In Spark Using Python

Why SAS is No Longer the Go-To for Market Research: Embracing Python, R, and Modern Platforms

Titanic Dataset Analysis: Exploring Data with Python and Pandas

My 277 Movies: a Data Analysis

Stepping up my SQL game

🔍 What is EDA (Exploratory Data Analysis)?

From Raw Data to Insights using Python Pandas

Visualizing Data

Data Cleaning with Pandas: A Beginner's Journey

Explore content categories

The Problem

The Dataset

More articles by jason thompson

8 Lesson From 8 Years Building an Analytics Agency

Sell Like a Farmer

If You Want To Lead, You Must Learn How To Follow

Greatness Looks Effortless

Do It Your Way

Retaining Talent by Killing the Billable Hour

Doing Your Own Thing Isn’t As Scary As You Think

Thinking Strategically About Adobe DTM

Are Web Analysts Becoming the New IT?

Anatomy of a Poorly Executed Email Campaign

Others also viewed

A smart Hack for Correlation Analysis in Power BI

Handling Missing Data In Spark Using Python

Why SAS is No Longer the Go-To for Market Research: Embracing Python, R, and Modern Platforms

Titanic Dataset Analysis: Exploring Data with Python and Pandas

My 277 Movies: a Data Analysis

Stepping up my SQL game

🔍 What is EDA (Exploratory Data Analysis)?

From Raw Data to Insights using Python Pandas

Visualizing Data

Data Cleaning with Pandas: A Beginner's Journey

Similar topics

How to Use Python for Real-World Applications

Sales Performance Analysis with Data Metrics

Python Tools for Improving Data Processing

Explore content categories