How to Identify Duplicate Sales Data Using Python

How to Identify Duplicate Sales Data Using Python

It's exciting to see the Digital Analytics industry start to build some momentum around programmatic analysis techniques. As more conversations happen around the topic, one of the questions that is most often asked is 'yeah but how do you actually use this in the real world?'

So to help answer that question, I've started to build a series of posts focused on using Python to solve real world business problems. 

------------------------------------------------

Data Preparation is one of those critical tasks that most digital analysts take for granted as many of the analytics platforms we use take care of this task for us or at least we like to believe they do so. With that said, Data Preparation should be a task that every good analyst completes as part of any data investigation.

Wes McKinney, author of Python for Data Analysis, defines Data Preparation as “cleaning, munging, combining, normalizing, reshaping, slicing, dicing, and transforming data for analysis.”

In this post, I am going to walk you through a real world example, focusing on Data Preparation, of how Python can be a very powerful tool for business focused data analysis.

 

The Problem

We have been asked to analyze the company’s sales performance for 2015 and have been provided a 300MB Excel Workbook containing millions of rows of sales transactions. We imported the Workbook into Python and while completing a initial round of Data Preparation, flagged several sales transactions that appeared to be duplicates.

Before we move on, we are going to investigate this duplicate issue further to understand how widespread it is and the potential impact it could be having on key metrics such as items sold and total company revenue.

 

The Dataset

Our dataset contains every order transaction for 2015. The data is structured in such a way that each item purchased, in an order, is a unique row in the data. If an order contained three unique product SKUs, that one order would have three rows in the dataset.

NOTES:
order_id: A unique id that serves as the key to group line items into a single order
order_item_cd: Product SKU
order_item_quantity: The number each SKU purchased for an order
order_item_cost_price: The individual unit price of a SKU purchased

An order that contains duplicated order line items would have a SKU that appears in more than one row of data, for a given order_id, as shown below.

 

If you are interested in the solution, head over to the 33 Sticks Blog where I detail out the solution as well as provide the full Python code that you are free to download and use, update, change, improve, etc. 

To view or add a comment, sign in

More articles by jason thompson

  • 8 Lesson From 8 Years Building an Analytics Agency

    This month marks 8 years building 33 Sticks. Lots of ups, a fair share of downs, many mistakes along the way, and some…

    11 Comments
  • Sell Like a Farmer

    For the last several years, I have been actively building the business of my dreams. Over those years, one thing has…

    4 Comments
  • If You Want To Lead, You Must Learn How To Follow

    For as long as I can remember, I have loved the mountains and I am very grateful for the lessons that the mountains…

    2 Comments
  • Greatness Looks Effortless

    Imagine Michael Phelps in the pool. Imagine Usain Bolt on the track.

  • Do It Your Way

    I had lunch with a friend yesterday at a small deli, that is tucked into the far corner of a local coffee shop. If you…

    7 Comments
  • Retaining Talent by Killing the Billable Hour

    Many people commented on my last post, about retaining analytics talent, that a key to retaining talented employees is…

    9 Comments
  • Doing Your Own Thing Isn’t As Scary As You Think

    One of the questions I get asked the most is, “isn’t doing your own thing scary?” The answer is NO, it’s not scary…

    5 Comments
  • Thinking Strategically About Adobe DTM

    If you attended Adobe Summit, or happened to be anywhere around the event, chances are pretty good you heard something…

    5 Comments
  • Are Web Analysts Becoming the New IT?

    I started out my career working in the IT department for a large networking company and during that time I watched IT…

    7 Comments
  • Anatomy of a Poorly Executed Email Campaign

    By Jason Thompson, co-founder, 33 Sticks I recently received an email from a large analytics vendor introducing a new…

Others also viewed

Explore content categories