Customer Segmentation and Recommendation in Python

Customers are not one giant monolith, all thinking the same way and making the same decisions. Inside any customer base there are usually different factions, each with their own preferences. An advertisement or promotion that is received favorably by one faction may have no effect on the others. Additionally, with the quantity of data and customers, it generally isn't feasible to consider every potential new customer and manually decide which "bucket" that customer falls into. By automatically segmenting customers into different groups based on data, we can better understand the kinds of customers that constitute the customer base and attempt to cater to each group's specific interests, increasing customer satisfaction (and sales). To achieve this, we will:

  1. Explore the data and clean it up if necessary.
  2. Extract some features from the data that can potentially help us identify customer segments.
  3. Implement unsupervised machine learning techniques to segment our data for us and interpret the results.
  4. Recommend new products to customers based on clustering.

Please feel free to look at my notebook for this post on GitHub.

Explore the data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

df = pd.read_csv("online_retail.csv", encoding="ISO-8859-1")
df.info()
        
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
        

This output shows that we have 8 columns of data with over 540,000 rows, but not everything is ready to go. There might be duplicate entries. The InvoiceDate column should probably be a datetime object, but it's not. Also, there are over 100,000 null values for CustomerID and over 1,000 null values for Description. To address this, we will remove duplicate rows, remove the rows with missing CustomerID, convert the InvoiceDate column to datetime, and check to see if we can just use the StockCode column in place of the Description column. I'm throwing out the missing CustomerID rows because the whole point is to segment our customers and the data is not useful for this purpose if we do not know which customer it belongs to.

# remove duplicate rows
df.drop_duplicates(inplace=True)
#  remove rows with missing Customer ID
df = df.dropna(subset=['CustomerID'])
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# look at rows with stock codes and non-null descriptions
stock_description = df[['StockCode', 'Description']].dropna(subset=['Description'])
# True if this (stock code, description) pair duplicates an earlier row,
# False the first time each unique pair appears
dup = stock_description.duplicated()
dup.value_counts()
True     397688
False      3916
Name: count, dtype: int64        

This output indicates that there are almost 4,000 unique stock code/description pairs, which means some stock codes appear with more than one description. If the descriptions for the same stock code are fairly close, then the item is probably the same, but the description was updated. If the descriptions are wildly different then we might have an error with some of the stock codes. Let's take a closer look at some of the stock codes that have different descriptions.

# select rows that were not duplicated and only the stock code column
# then count how many times each stock code appears in the resulting series
non_unique_stock_code = stock_description.loc[~dup,'StockCode'].value_counts()
# for the first five stock codes
for stock_code in non_unique_stock_code.index[:5]:
    print(f"\nStock code: {stock_code}")
    # print the descriptions associated with the stock code
    # and determine how many times they appear in the data
    print(df[df['StockCode'] == stock_code]['Description'].value_counts())
        
Stock code: 23236
Description
STORAGE TIN VINTAGE DOILY      172
DOILEY STORAGE TIN             118
DOILEY BISCUIT TIN              13
STORAGE TIN VINTAGE DOILEY       1
Name: count, dtype: int64

Stock code: 23196
Description
VINTAGE LEAF MAGNETIC NOTEPAD         218
RETRO LEAVES MAGNETIC NOTEPAD          19
RETO LEAVES MAGNETIC SHOPPING LIST      3
LEAVES MAGNETIC  SHOPPING LIST          2
Name: count, dtype: int64

Stock code: 23209
Description
LUNCH BAG VINTAGE DOILY      556
LUNCH BAG DOILEY PATTERN     471
LUNCH BAG VINTAGE DOILEY       4
Name: count, dtype: int64

Stock code: 23535
Description
WALL ART BICYCLE SAFETY     122
BICYCLE SAFTEY WALL ART      39
WALL ART BICYCLE SAFTEY       3
Name: count, dtype: int64

Stock code: 23240
Description
SET OF 4 KNICK KNACK TINS DOILY       328
SET OF 4 KNICK KNACK TINS DOILEY      182
SET OF 4 KNICK KNACK TINS  DOILEY       1
Name: count, dtype: int64
        

So, it looks like there was some disagreement over how to spell doily, but other than that the descriptions appear to be slight variations in describing the same product, so we can proceed to trust the stock codes in this data. It would be interesting to see if changing the product description had any measurable effect on sales, like "Retro" vs. "Vintage" for stock code 23196, but that will be a topic for another post.

We have established that we can trust the stock codes, but what if there are stock codes that represent something other than individual items in the store? Using those could increase error in our clustering. First let's see how long each stock code string is.

# add a column for stock code string length to the data
df['StockCodeLength'] = df['StockCode'].str.len()
# Look at all the unique stock codes
unique_stock_code_index = df['StockCode'].drop_duplicates().index
# determine how many strings of each length exist, looking only at rows with unique stock codes
df.loc[unique_stock_code_index, 'StockCodeLength'].value_counts()        
StockCodeLength
5     2798
6      877
4        3
1        2
7        1
2        1
12       1
3        1
Name: count, dtype: int64        

The majority of stock codes are five characters long, followed by six characters, then just a few other stock codes that have a different number of characters. We will look at the first few stock codes grouped by string length for more insight.

# Look at rows for unique stock codes only,
# and only include stock code, description, and stock code length columns
(df.loc[unique_stock_code_index, ['StockCode', 'Description', 'StockCodeLength']]
    # sort this view of the data by stock code length, then by stock code
    .sort_values(['StockCodeLength', 'StockCode'])
    # group this data by stock code length
    .groupby('StockCodeLength')
    # display up to the first five rows of each group
    .head())
        
        | StockCode    | Description                  |   StockCodeLength 
-------:|:-------------|:-----------------------------|-----------------:
    141 | D            | Discount                     |                 1 
   2239 | M            | Manual                       |                 1 
   1423 | C2           | CARRIAGE                     |                 2 
 317507 | DOT          | DOTCOM POSTAGE               |                 3 
 317508 | CRUK         | CRUK Commission              |                 4 
 157195 | PADS         | PADS TO MATCH ALL CUSHIONS   |                 4 
     45 | POST         | POSTAGE                      |                 4 
     31 | 10002        | INFLATABLE POLITICAL GLOBE   |                 5 
 103332 | 10080        | GROOVY CACTUS INFLATABLE     |                 5 
   5452 | 10120        | DOGGY RUBBER                 |                 5 
    817 | 10125        | MINI FUNKY DESIGN TAPES      |                 5 
    741 | 10133        | COLOURING PENCILS BROWN TUBE |                 5 
   5451 | 10123C       | HEARTS WRAPPING TAPE         |                 6 
  12492 | 10124A       | SPOTS ON RED BOOKCOVER TAPE  |                 6 
   3973 | 10124G       | ARMY CAMO BOOKCOVER TAPE     |                 6 
  40154 | 15044A       | PINK PAPER PARASOL           |                 6 
   1097 | 15044B       | BLUE PAPER PARASOL           |                 6 
    132 | 15056BL      | EDWARDIAN PARASOL BLACK      |                 7 
   4406 | BANK CHARGES | Bank Charges                 |                12         

From this view, it looks like stock codes with fewer than 5 characters represent administrative entries (discounts, manual adjustments, postage, fees) rather than products, so they will not be relevant to customer purchasing behavior. Everything else seems to indicate individual products, or color variants of individual products. Lastly, BANK CHARGES looks like it can also be removed. Let's remove the non-product stock codes.

df = df[df['StockCodeLength'].between(5,7)]
        

Next we will look at InvoiceNo to see if there are abnormalities with that as well.

df['InvoiceNo'].str.len().value_counts()        
InvoiceNo
6    391183
7      8506
Name: count, dtype: int64        

So our invoice numbers have either 6 or 7 characters. Let's see if we can spot any differences.

df[df['InvoiceNo'].str.len() == 6].head()        
[table: sample rows with six-character invoice numbers]
df[df['InvoiceNo'].str.len() == 7].head()        
[table: sample rows with seven-character invoice numbers]

It appears that the seven-character invoice numbers are all for cancelled transactions. This is also indicated by a negative quantity. Understanding patterns in cancellation could be helpful for our use case, so we will add a column to the data to indicate if the transaction was cancelled.

# 1 if the quantity is negative (a cancellation), 0 otherwise
df['Cancelled'] = (df['Quantity'] < 0).astype(int)
        

Extracting Features

Now that our data is cleaned up, we need to use it to describe each unique customer. One of the industry-standard ways of doing that is by looking at how recent each customer's last purchase was, how frequently the customer makes purchases, and how much money the customer has spent. This is known as RFM (recency, frequency, monetary value).
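As a compact illustration of the RFM idea, here is a sketch on a tiny hypothetical transaction log (not the retail data above), computing all three features in one `groupby`:

```python
import pandas as pd

# Hypothetical mini transaction log, just to show the RFM shape
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-09', '2011-11-30']),
    'Total': [20.0, 35.0, 10.0],
})

# Use the latest purchase in the data as the reference ("snapshot") date
snapshot = toy['InvoiceDate'].max()

rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=('InvoiceNo', 'nunique'),                            # number of distinct invoices
    Monetary=('Total', 'sum'),                                     # total amount spent
)
print(rfm)
```

The sections below build the same three kinds of features (plus a few extras) step by step on the real data.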

First, we will look at how long it has been between the last purchase of each customer and the last purchase in the dataset.

# Create new customer features dataframe that holds the most recent purchase date
# for each customer
customer_features = df.groupby('CustomerID')['InvoiceDate'].max().reset_index()
# make the index of this dataframe the customerID
customer_features.set_index('CustomerID', inplace=True)
# rename the latest purchase column
customer_features.rename(columns={'InvoiceDate': 'LatestPurchase'}, inplace=True)

# get the most recent purchase time in the dataset
max_latest_purchase = df['InvoiceDate'].max()

# Determine how much time has passed since the customer's last purchase
customer_features['TimeSinceLastPurchase'] = max_latest_purchase - customer_features['LatestPurchase']
# show us the first few and last few customers sorted by time
customer_features.sort_values('TimeSinceLastPurchase')
        
[table: customers sorted by TimeSinceLastPurchase]

Having completed R, let's move on to F. We will determine how long the customer has been making purchases, as well as how many transactions each customer has made. From these data points we can calculate how many purchases per day each customer makes.

# Find the earliest purchase for each customer ID
earliest_purchase = df.groupby('CustomerID')['InvoiceDate'].min()

# Find how many purchases each customer has had
num_purchases = df.groupby('CustomerID')['InvoiceNo'].nunique()

# How many days has each customer been making purchases?
days_purchasing = (max_latest_purchase - earliest_purchase).dt.days
# calculate purchases per day
purchases_per_day = num_purchases / days_purchasing

# Add columns to customer features dataframe
customer_features['EarliestPurchase'] = earliest_purchase
customer_features['NumberOfPurchases'] = num_purchases
customer_features['PurchasesPerDay'] = purchases_per_day
# Fix the few customers whose only purchase was on the dataset's final day
# (zero days of history produces a division by zero, i.e. inf)
customer_features['PurchasesPerDay'] = customer_features['PurchasesPerDay'].replace(np.inf, 1)
customer_features
        
[table: customer_features with purchase-frequency columns]

We continue to monetary considerations. Useful data points in this category are the total amount of money spent and the average amount of money per transaction.

df['TotalCost'] = df['UnitPrice'] * df['Quantity']
amount_spent = df.groupby('CustomerID')['TotalCost'].sum()
average_spent = amount_spent / customer_features['NumberOfPurchases']

customer_features['Amount Spent'] = amount_spent
customer_features['Average Spent'] = average_spent
customer_features
        
[table: customer_features with spending columns]

Another interesting feature to consider is how many different products each customer has purchased and how frequently they cancel purchases.

# get number of unique products
unique_products = df.groupby('CustomerID')['StockCode'].nunique()

# get number of cancellations
total_cancelled = df.groupby('CustomerID')['Cancelled'].sum()
cancellation_rate = total_cancelled / customer_features['NumberOfPurchases']

# Add to customer features
customer_features['UniqueProducts'] = unique_products
customer_features['CancellationRate'] = cancellation_rate
customer_features
        
[table: customer_features with UniqueProducts and CancellationRate]

Our customer feature set is now built out and we can prepare it for segmentation! First we need to identify outliers to make sure they don't skew the data. We will use sklearn's Isolation Forest for this example. Unfortunately, this model does not work with datetime data types so we will have to adjust our data to comply.

customer_features.sort_values('PurchasesPerDay',ascending=False).head()        
[table: customers sorted by PurchasesPerDay]
# We need days since last purchase, not time
customer_features['DaysSinceLastPurchase'] = customer_features['TimeSinceLastPurchase'].dt.days
model = IsolationForest(random_state=2024)
outlier_score = model.fit_predict(customer_features.loc[:,'NumberOfPurchases':'DaysSinceLastPurchase'])
customer_features['OutlierScore'] = outlier_score
customer_features['OutlierScore'].value_counts()
customer_features_clean = customer_features.loc[customer_features['OutlierScore'] > 0, 'NumberOfPurchases':'DaysSinceLastPurchase'].copy()
        

The model identified 338 / 4025 customers as outliers, tagging them with the value of -1. We can exclude outliers (or choose to include them) moving forward.
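The -1/+1 tagging convention is worth sanity-checking before discarding anyone. Below is a minimal sketch on synthetic data (not the retail features above) showing that Isolation Forest tags planted extreme points with -1 and typical points with +1:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2024)
# Synthetic stand-in: a tight cluster of typical customers plus two extreme spenders
typical = rng.normal(loc=100, scale=10, size=(200, 2))
extreme = np.array([[1000.0, 900.0], [1200.0, 1100.0]])
X = pd.DataFrame(np.vstack([typical, extreme]), columns=['Spend', 'Purchases'])

model = IsolationForest(random_state=2024)
scores = model.fit_predict(X)  # -1 = outlier, +1 = inlier
X['OutlierScore'] = scores

# Compare group means to see what got flagged
print(X['OutlierScore'].value_counts())
print(X.groupby('OutlierScore')[['Spend', 'Purchases']].mean())
```

With the real features, the same `groupby('OutlierScore').mean()` comparison makes it easy to verify that the flagged customers really are extreme before excluding them.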

Next we will look to see if we have correlations in the data which could negatively impact clustering.

corr = customer_features_clean.corr()

sns.heatmap(corr, annot=True, center=0, cmap='RdBu')
plt.savefig('Feature Correlation.png', bbox_inches='tight')
plt.show()
        
[figure: feature correlation heatmap]

It looks like some features are correlated, like number of purchases and amount spent. PCA is a good tool to use to account for this, but both PCA and KMeans clustering can be skewed by unscaled data, so we will properly scale our data now.

scaler = StandardScaler()

scaled_customer_features = pd.DataFrame(scaler.fit_transform(customer_features_clean), columns=customer_features_clean.columns, index=customer_features_clean.index)
scaled_customer_features
        
[table: scaled customer features]

Now that the data is scaled, we can apply PCA in an attempt to reduce the number of features by reducing collinearity between features.

pca = PCA().fit(scaled_customer_features)
cum_variance = np.cumsum(pca.explained_variance_ratio_)

plt.figure()
plt.plot(np.arange(len(cum_variance))+1, cum_variance, 'o--')
plt.xlabel('Number of components')
plt.ylabel('Cumulative Explained Variance')
plt.savefig('PCA variance.png', bbox_inches='tight')
plt.show()
        
[figure: cumulative explained variance vs. number of components]

Looking at the figure above, the first PCA component explains about 40% of the variance in the data, but there is an elbow in this graph indicating that 4 components is probably best, so we will make a new PCA model with 4 components.

pca = PCA(n_components=4)
customer_features_pca = pca.fit_transform(scaled_customer_features)
components = pd.DataFrame(pca.components_, columns=scaled_customer_features.columns)
sns.heatmap(components.T, annot=True, center=0, cmap='BrBG')
plt.savefig('PCA heatmap.png', bbox_inches='tight')
        
[figure: PCA component loadings heatmap]

The heatmap above shows how much of a contribution each feature makes (rows) to the new PCA components (columns). Larger magnitudes indicate a larger effect.
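If a heatmap is unwieldy, the same loadings can be summarized textually by naming the highest-magnitude feature per component. A sketch on synthetic data (two correlated features plus an independent one, with hypothetical column names echoing ours):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two strongly correlated columns plus one independent column
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
X = pd.DataFrame({'NumberOfPurchases': a, 'AmountSpent': b, 'CancellationRate': c})

pca = PCA(n_components=2).fit(X)
components = pd.DataFrame(pca.components_, columns=X.columns)

# For each component, name the feature with the largest absolute loading
top_features = components.abs().idxmax(axis=1)
print(top_features)
```

Here the first component captures the correlated purchase/spend direction and the second picks up the independent cancellation feature, mirroring how PCA folds correlated features together.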

Now we are ready for clustering!

Clustering

In this example we are using K-Means clustering. This is an unsupervised machine learning algorithm, which means that we don't tell the model what the "correct" answers are. The algorithm tries to make groups of data points that are close together. Let's figure out how many clusters would be a good fit for our data.

In the cell below we will cluster our data into 2 groups, then 3 groups, up to 11 groups.

c = np.arange(2,11)
sum_squares_within_cluster = []
silhouette = []
for k in c:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init='auto', random_state=2024).fit(customer_features_pca)
    sum_squares_within_cluster.append(kmeans.inertia_)
    silhouette.append(silhouette_score(customer_features_pca, kmeans.labels_))
        

Now we need a way to compare the groupings to see if our data is better described by 3 or 6 clusters, for example. To achieve this we will consider two metrics: the elbow method using the within cluster sum of squares, and the silhouette method.

First, the elbow method looks at how far away each data point is from the center of its cluster and adds those distances together. If all data points are very close to their respective cluster centers, the model fits the data well and the sum is low. When this sum is plotted as a function of the number of clusters, there will be an "elbow" in the graph at the point where adding another cluster no longer brings the sum of squares down by much. Let's take a look with our data.

plt.figure()
plt.plot(c, sum_squares_within_cluster, 'o--')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of Squares Within Cluster')
plt.savefig('WCSS.png', bbox_inches='tight')
plt.show()
        
[figure: within-cluster sum of squares vs. number of clusters]

It looks like there might be an elbow at 6 clusters? Perhaps it's more pronounced at 7 clusters? At any rate, there is no distinct elbow here. We should look at another metric to see if it can help us decide.

The silhouette method compares, for each point, its average distance to the other points in its own cluster with its average distance to points in the nearest neighboring cluster. Every data point is assigned a value between -1 and 1, where values close to +1 mean that the point sits well within its own cluster and values close to -1 mean the point is more similar to points in a neighboring cluster. The average value over all points gives us the silhouette score.

plt.figure()
plt.plot(c, silhouette, 'o--')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.savefig('Silhouette.png', bbox_inches='tight')
plt.show()
        
[figure: silhouette score vs. number of clusters]

The silhouette score has a local maximum at 6, also indicating that 6 clusters might be best for this data, so we will move forward with 6 clusters!

# Perform k-means clustering
kmeans = KMeans(n_clusters=6, init='k-means++', n_init='auto', random_state=2024).fit(customer_features_pca)
# Add cluster label to data
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(customer_features_pca[:,0], customer_features_pca[:,1], customer_features_pca[:,2], c=kmeans.labels_, cmap='viridis')
ax.set_xlabel('PCA 0')
ax.set_ylabel('PCA 1')
ax.set_zlabel('PCA 2')
plt.savefig('Clustering.png', bbox_inches='tight')
plt.show()
        
[figure: 3D scatter of the customers in PCA space, colored by cluster]

Recommendations

We have clustered our data! Now we can use these clusters to recommend products. Here's how we can implement the recommendations:

  1. Identify which customer IDs belong in each cluster
  2. Add the cluster ID to each transaction
  3. Find top selling products grouped by cluster
  4. For every customer, determine which top-selling products in their cluster they have not purchased

Let's start by matching customer IDs with clusters.

cluster = pd.Series(kmeans.labels_, index=customer_features_clean.index, name='Cluster')
cluster.value_counts()
        
Cluster
0    1722
2     977
5     574
4     291
3     239
1     222
Name: count, dtype: int64        

Next we need to go back to the transaction data and add the cluster information.

# add cluster data
rec_df = pd.merge(left=df, right=cluster, how='inner', # only keep rows if we have a cluster associated with the customer id
                  on='CustomerID')

rec_df.groupby('Cluster').head(3)        
[table: first three transactions from each cluster]

Now that we know the cluster of every purchase in the data, we can determine the top-selling products in each cluster.

# Group products by cluster and find total quantity of each stock code sold
top_products = rec_df.groupby(['Cluster', 'StockCode'])['Quantity'].sum().reset_index()
# Sort data by cluster, then by total quantity sold, with higher quantities at the top
top_products = (top_products.sort_values(['Cluster', 'Quantity'], ascending=[True, False])
                # Group sales by cluster
                .groupby('Cluster'))
top_products.head(3)
        
[table: top three products per cluster]

Now we determine which products each customer has purchased so we can recommend things that they don't have already.

recs = {'CustomerID': [],
        'Cluster': [],
        'StockCode0': [],
        'StockCode1': [],
        'StockCode2': []}
# group purchases by customer so we don't need to do this inside the loop
customer_purchases = rec_df.groupby('CustomerID')

for id in cluster.index:
    recs['CustomerID'].append(id)
    this_cluster = cluster.loc[id]
    recs['Cluster'].append(this_cluster)
    # get the stock codes this customer has purchased
    purchases = customer_purchases.get_group(id)['StockCode'].drop_duplicates()
    # get top selling products for this cluster
    cluster_products = top_products.get_group(this_cluster).iloc[:10,:]
    # find top selling products that the customer has not purchased
    suggested_products = cluster_products[~cluster_products['StockCode'].isin(purchases)].reset_index()
    # Handle cases where customer has purchased almost all of the recommended products
    for i in range(3):
        try:
            recs[f'StockCode{i}'].append(suggested_products.loc[i,'StockCode'])
        except KeyError:
            recs[f'StockCode{i}'].append(0)

customer_recs = pd.DataFrame(recs).set_index('CustomerID')
customer_recs.head()
        
[table: top three recommended stock codes per customer]

We've done it! We now have stock codes to recommend to each customer based on what other customers in their group have purchased.
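For presentation, the recommended stock codes can be translated back to product names. Since we saw earlier that one stock code can carry several description spellings, a reasonable approach is to map each code to its most frequent description. A sketch on toy data (the real version would build `code_to_desc` from `df`):

```python
import pandas as pd

# Toy transaction log: one stock code has two description spellings
toy = pd.DataFrame({
    'StockCode': ['23236', '23236', '23236', '10002'],
    'Description': ['STORAGE TIN VINTAGE DOILY',
                    'DOILEY STORAGE TIN',
                    'STORAGE TIN VINTAGE DOILY',
                    'INFLATABLE POLITICAL GLOBE'],
})

# Map each stock code to its most frequent description
code_to_desc = (toy.groupby('StockCode')['Description']
                   .agg(lambda s: s.mode().iloc[0]))

# Translate a list of recommended stock codes into readable names
recs = pd.Series(['23236', '10002'])
print(recs.map(code_to_desc))
```

Applying `.map(code_to_desc)` to each `StockCode{i}` column of `customer_recs` would give a customer-facing recommendation table.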
