Federated Learning for Anomaly Detection in E-commerce: A ML- SQL Approach
E-commerce companies own multiple web properties—like online storefront or retailers that manage numerous physical stores, face numerous challenges in understanding their customers. They turn to Federated Learning as a powerful approach to harness decentralized data while preserving privacy across their business challenges.
Federated Learning (FL) is a machine learning paradigm where multiple clients (e.g., stores, data silos, or edge nodes) collaboratively train a model while keeping their data decentralized. Instead of sharing raw data , clients train models locally and share only model updates (like gradients or weights), which aggregates these updates to improve a global model.
Federate learning is preferred as it ensures :
In essence, federated learning is preferred because it leverages the distributed nature of multi-store retailers or multi-web-property companies, turning a potential challenge (scattered data) into a strength (local insights with global benefits). It’s a privacy-first, cost-effective, and scalable solution tailored to their operational reality.
Understanding Federated Learning
Federated learning is a distributed machine learning approach where a global model is trained across multiple clients (e.g. retail stores, or web properties) without centralizing their data. Instead, each client trains a local model, and only the model updates are aggregated to improve the global model. Here’s how it works step-by-step:
Why It works for E-commerce
For a company with multiple stores or web properties, each D could represent local customer data. The math ensures that the global model learns from all locations without ever needing to see their raw data, balancing local relevance with global accuracy.
This is the core of federated learning—distributed optimization with a privacy-first twist!
Potential uses case in E-Commerce for Federated learning
These use cases can use federated learning’s ability to balance personalization and efficiency with privacy, making it particularly valuable in e-commerce where customer trust and data security are paramount.
In this blog lets build an Anomaly detection use case for an e-commerce web property.
Federated learning for Anomaly Detection
In our e-commerce company, customers interact through multiple web storefronts, each representing a distinct web property. All data from these storefronts are kept in separate dataset locally or in some web properties in the same region consolidated into a single dataset, with a specific field—storeCode—indicating which web property recorded the activity, whether it’s a page view, cart transaction, or other interactions. This structure allows us to analyze customer behavior in the context of web properties while maintaining data granularity.
Let's assume we run an E-commerce company called FitFashionMart that has 2 web store-fronts namely (ComfortWearCo, SportStyleShop).
We will build our example using Machine Learning (ML) in SQL. ML-SQL bridges the gap between Data Engineers and Data Scientist so one can focus on solving business problems without the burden of managing complex infrastructure, distributed computing challenges, ML pipeline orchestration, governance, or scalability concerns. This streamlined approach enhances productivity, ensures data integrity, and strengthens security, all while providing seamless access to both data and ML models.
For our Model training we will leverage the ML-SQL extension in Data Distiller, though any SQL interface with machine learning capabilities can be used to achieve similar results.
Building features
From our local event source data for each store we are going to generate the following set of features per customer:
WITH customer_features AS (
SELECT
identityMap['ECID'][0].id AS customer_id,
web.webPageDetails.storeCode AS store_id,
AVG(COALESCE(productListItems.priceTotal[0], 0)) AS avg_order_value,
COUNT(COALESCE(productListItems.quantity[0], 0)) AS total_orders,
DATEDIFF(MAX(timestamp), MIN(timestamp)) AS customer_lifetime,
MAX(timestamp) AS last_purchase_date
FROM hbi_web_events_v1
WHERE EXISTS( productListItems, value -> value.priceTotal >0)
AND commerce.`order`.purchaseID IS NOT NULL
AND timestamp between $start_date and $end_date
GROUP BY identityMap['ECID'][0].id, store_id
),
mean_stddev AS (
SELECT
store_id,
AVG(avg_order_value) AS mean_avg_order_value,
STDDEV(avg_order_value) AS stddev_avg_order_value,
AVG(total_orders) AS mean_total_orders,
STDDEV(total_orders) AS stddev_total_orders,
AVG(customer_lifetime) AS mean_customer_lifetime,
STDDEV(customer_lifetime) AS stddev_customer_lifetime,
AVG(DATEDIFF(DAY, last_purchase_date, $end_date)) AS mean_days_since_last_purchase,
STDDEV(DATEDIFF(DAY, last_purchase_date, $end_date)) AS stddev_days_since_last_purchase
FROM customer_features
GROUP BY store_id
),
standardized_features AS (
SELECT
cf.customer_id, cf.store_id,cf.last_purchase_date,
DATEDIFF(DAY, cf.last_purchase_date, $end_date) as days_since_am_purchase,
(cf.avg_order_value - ms.mean_avg_order_value) / ms.stddev_avg_order_value AS standardized_avg_order_value,
(cf.total_orders - ms.mean_total_orders) / ms.stddev_total_orders AS standardized_total_orders,
(cf.customer_lifetime - ms.mean_customer_lifetime) / ms.stddev_customer_lifetime AS standardized_customer_lifetime,
(DATEDIFF(DAY, cf.last_purchase_date, $end_date) - ms.mean_days_since_last_purchase) / ms.stddev_days_since_last_purchase
AS standardized_days_since_last_purchase
FROM
customer_features cf, mean_stddev ms
),
store_fl_feature AS (
SELECT store_id,
CONCAT('churn_prediction_lr_', store_id) AS model_name,
COLLECT_LIST(
NAMED_STRUCT(
'customer_id', customer_id,
'standardized_avg_order_value', standardized_avg_order_value,
'standardized_total_orders', standardized_total_orders,
'standardized_customer_lifetime', standardized_customer_lifetime,
'standardized_days_since_last_purchase', standardized_days_since_last_purchase,
'churn_label',
CASE
WHEN standardized_days_since_last_purchase > 1 THEN 1 ELSE 0
END
)
) AS store_features
FROM standardized_features
GROUP BY store_id
)
SELECT * FROM store_fl_feature;
$start_date AND $end_date are parameters we used to select our time range (say 2 months of data) of events that we are interested in using for training.
These features are managed locally at the store level and will be used for local training.
Localized Machine Learning for Anomaly detection per store
Now that we have our features per web store-front or local store. We want to use fast, scalable machine learning that can create clusters based on customer behavior and identify an anomaly cluster with feature characteristics that identifies the customer(s) as outliers.
We will use Bisecting K-Means as it is a good choice for anomaly detection. This ML algorithm uses a hierarchical clustering approach that naturally isolates outliers into small clusters or identifies them as points far from centroids. In an e-commerce context, this can help detect fraudulent transactions, system faults, or unusual user behavior.
Local training for web-storefront - ComfortWearCo
Assuming storeCode = ‘comfort_wear_store_view
Step 1: Train the model for store front ComfortWearCo
-- Anamoly detection using bisecting_kmeans comfort_wear_store_view
CREATE MODEL anomaly_comfort_wear_store_view
TRANSFORM (
vector_assembler(array(
standardized_avg_order_value,
standardized_total_orders,
standardized_customer_lifetime,
standardized_days_since_last_purchase
)) features
)
OPTIONS (
MODEL_TYPE = 'bisecting_kmeans', NUM_CLUSTERS = 3, SEED = 2
)
AS
SELECT sf.standardized_avg_order_value as standardized_avg_order_value,
sf.standardized_total_orders as standardized_total_orders,
sf.standardized_customer_lifetime as standardized_customer_lifetime,
sf.standardized_days_since_last_purchase as standardized_days_since_last_purchase
FROM
(select explode(store_features) as sf
FROM src_for_fedlrn where store_id = 'comfort_wear_store_view');
Step 2: Evaluate the local model for store front ComfortWearCo
select * from model_evaluate (anomaly_comfort_wear_store_view, 1,
SELECT sf.standardized_avg_order_value as avg_order_value,
sf.standardized_total_orders as avg_total_orders,
sf.standardized_customer_lifetime as avg_customer_lifetime,
sf.standardized_days_since_last_purchase as avg_days_since_last_purchase
FROM
(SELECT explode(store_features) AS sf
FROM src_for_fedlrn where store_id = 'comfort_wear_store_view' ) );
model_evaluate the silhouette scores is 0.52
Step 3: Predict using the trained model (anomaly_comfort_wear_store_view) for store front ComfortWearCo
Doing a model_predict on the trained model gives us the following interesting observation
--Predict local model comfort_wear_store_view
create temp table lmodel_comfort_wear_store_view as
select * from model_predict ( anomaly_comfort_wear_store_view, 1,
SELECT sf.standardized_avg_order_value as standardized_avg_order_value,
sf.standardized_total_orders as standardized_total_orders,
sf.standardized_customer_lifetime as standardized_customer_lifetime,
sf.standardized_days_since_last_purchase as standardized_days_since_last_purchase
FROM
(select explode(store_features) as sf
FROM src_for_fedlrn where store_id = 'comfort_wear_store_view') );
SELECT prediction, count(1) AS cluster_size,
avg(standardized_avg_order_value) AS avg_order_value,
avg(standardized_total_orders) AS avg_total_orders,
avg(standardized_customer_lifetime) AS avg_customer_lifetime,
avg(standardized_days_since_last_purchase) AS avg_days_since_last_purchase
FROM lmodel_comfort_wear_store_view
GROUP BY prediction;
Anomalous customer behavior:
Cluster 1 likely represents a group of highly engaged, frequent shoppers who make purchases often and consistently over a long period. Their behavior deviates significantly from the more moderate patterns seen in other clusters 0 and 2, making this group an anomaly.
2. Local training for web-storefront - SportStyleShop
Assuming storeCode = ‘sporty_style_store_view)
Step 1: Train the model for store front SportStyleShop
Recommended by LinkedIn
-- Anamoly detection using bisecting_kmeans sporty_style_store_view
CREATE MODEL anomaly_sporty_style_store_view
TRANSFORM (
vector_assembler(array(
standardized_avg_order_value,
standardized_total_orders,
standardized_customer_lifetime,
standardized_days_since_last_purchase
)) features
)
OPTIONS (
MODEL_TYPE = 'bisecting_kmeans', NUM_CLUSTERS = 3, SEED = 2
)
AS
SELECT sf.standardized_avg_order_value as standardized_avg_order_value,
sf.standardized_total_orders as standardized_total_orders,
sf.standardized_customer_lifetime as standardized_customer_lifetime,
sf.standardized_days_since_last_purchase as standardized_days_since_last_purchase
FROM
(select explode(store_features) as sf
FROM src_for_fedlrn where store_id = 'sporty_style_store_view');
Step 2: Evaluate the local model for store front SportStyleShop
select * from model_evaluate (anomaly_sporty_style_store_view, 1,
SELECT sf.standardized_avg_order_value as avg_order_value,
sf.standardized_total_orders as avg_total_orders,
sf.standardized_customer_lifetime as avg_customer_lifetime,
sf.standardized_days_since_last_purchase as avg_days_since_last_purchase
FROM
(SELECT explode(store_features) AS sf
FROM src_for_fedlrn where store_id = 'sporty_style_store_view'));
model_evaluate the silhouette scores is 0.33
Step 3: Predict using the trained model (anomaly_sporty_style_store_view) for store front SportStyleShop
Doing a model_predict on the trained model gives us the following interesting observation
create temp table lmodel_sporty_style_store_view as
select * from model_predict ( anomaly_sporty_style_store_view, 1,
SELECT sf.standardized_avg_order_value as standardized_avg_order_value,
sf.standardized_total_orders as standardized_total_orders,
sf.standardized_customer_lifetime as standardized_customer_lifetime,
sf.standardized_days_since_last_purchase as standardized_days_since_last_purchase
FROM
(select explode(store_features) as sf
FROM src_for_fedlrn where store_id = 'sporty_style_store_view') );
SELECT prediction, count(1) AS cluster_size,
avg(standardized_avg_order_value) AS avg_order_value,
avg(standardized_total_orders) AS avg_total_orders,
avg(standardized_customer_lifetime) AS avg_customer_lifetime,
avg(standardized_days_since_last_purchase) AS avg_days_since_last_purchase
FROM lmodel_sporty_style_store_view
GROUP BY prediction;
Anomalous customer behavior:
Cluster 2 likely represents a group of customers who make unusually high-value purchases more recently than the general customer base. This pattern could indicate anomalous or fraudulent behavior, especially considering the small size and the extreme deviation in average order value compared to other clusters.
The two web store-fronts have different clusters categorized as anomalous. And each web store-fronts has also detected a local reason for being anomalous. This is a real life scenario that the global model design will have to take into consideration.
Global Machine Learning for Anomaly detection
The global model will be trained on aggregated features from each local store. The step by step process is
SELECT
prediction as local_prediction,
cast ('comfort_wear_store_view' as string) as store_id,
count(1) AS cluster_size,
avg(standardized_avg_order_value) AS avg_order_value,
avg(standardized_total_orders) AS avg_total_orders,
avg(standardized_customer_lifetime) AS avg_customer_lifetime,
avg(standardized_days_since_last_purchase) AS avg_days_since_last_purchase
FROM lmodel_comfort_wear_store_view
GROUP BY local_prediction
union all
SELECT
prediction as local_prediction,
cast ('sporty_style_store_view' as string) as store_id,
count(1) AS cluster_size,
avg(standardized_avg_order_value) AS avg_order_value,
avg(standardized_total_orders) AS avg_total_orders,
avg(standardized_customer_lifetime) AS avg_customer_lifetime,
avg(standardized_days_since_last_purchase) AS avg_days_since_last_purchase
FROM lmodel_sporty_style_store_view
GROUP BY local_prediction;
Collect Local Store Outputs:
The query above generates aggregated features for each cluster in each store (ComfortWearCo and SportStyleShop in our case), using the predictions from anomaly_comfort_wear_store_view and anomaly_sporty_style_store_view.
Combines the aggregated data from both stores:
The UNION ALL ensures that data from both stores is combined, including the relevant store information and the features for each cluster.
Use the aggregated data for all clusters from all stores as input to train a global anomaly detection model.Apply the same clustering technique Bisecting K-Means to these aggregated features. This will allow you to identify global clusters and potentially flag any outliers or anomalies based on the centroid distance across all stores
-- global meta-model
CREATE MODEL global_anomaly_fedmodel
TRANSFORM (
vector_assembler(array(
avg_order_value,
avg_total_orders,
avg_customer_lifetime,
avg_days_since_last_purchase
)) features
)
OPTIONS (
MODEL_TYPE = 'bisecting_kmeans', NUM_CLUSTERS = 3, SEED = 42
)
AS
SELECT
prediction as local_prediction,
cast ('comfort_wear_store_view' as string) as store_id,
count(1) AS cluster_size,
avg(standardized_avg_order_value) AS avg_order_value,
avg(standardized_total_orders) AS avg_total_orders,
avg(standardized_customer_lifetime) AS avg_customer_lifetime,
avg(standardized_days_since_last_purchase) AS avg_days_since_last_purchase
FROM lmodel_comfort_wear_store_view
GROUP BY local_prediction
union all
SELECT
prediction as local_prediction,
cast ('sporty_style_store_view' as string) as store_id,
count(1) AS cluster_size,
avg(standardized_avg_order_value) AS avg_order_value,
avg(standardized_total_orders) AS avg_total_orders,
avg(standardized_customer_lifetime) AS avg_customer_lifetime,
avg(standardized_days_since_last_purchase) AS avg_days_since_last_purchase
FROM lmodel_sporty_style_store_view
GROUP BY local_prediction;
global model evaluation:
Storefront ComfortWearCo silhouette scores is 0.65108090556
SELECT * from model_evaluate (global_anomaly_fedmodel, 1,
SELECT sf.standardized_avg_order_value as avg_order_value,
sf.standardized_total_orders as avg_total_orders,
sf.standardized_customer_lifetime as avg_customer_lifetime,
sf.standardized_days_since_last_purchase as
avg_days_since_last_purchase
FROM
( SELECT explode(store_features) AS sf
FROM src_for_fedlrn where store_id = 'comfort_wear_store_view'));
Storefront SportStyleShop silhouette scores is 0.843186308516
select * from model_evaluate (global_anomaly_fedmodel, 1,
SELECT sf.standardized_avg_order_value as avg_order_value,
sf.standardized_total_orders as avg_total_orders,
sf.standardized_customer_lifetime as avg_customer_lifetime,
sf.standardized_days_since_last_purchase as avg_days_since_last_purchase
FROM
(SELECT explode(store_features) as sf
FROM src_for_fedlrn where store_id = 'sporty_style_store_view'));
Storefront ComfortWearCo
-- comparing local and gloabl models for anamoly cluster
CREATE TEMP TABLE gmodel_comfy_wear_predict as
SELECT * FROM model_predict (global_anomaly_fedmodel, 1,
SELECT sf.standardized_avg_order_value as avg_order_value,
sf.standardized_total_orders as avg_total_orders,
sf.standardized_customer_lifetime as avg_customer_lifetime,
sf.standardized_days_since_last_purchase as avg_days_since_last_purchase
FROM
(SELECT explode(store_features) as sf
FROM src_for_fedlrn
WHERE store_id = 'comfort_wear_store_view')
)
;
SELECT prediction,
count(1) AS cluster_size,
avg(avg_order_value) AS avg_order_value,
avg(avg_total_orders) AS avg_total_orders,
avg(avg_customer_lifetime) AS avg_customer_lifetime,
avg(avg_days_since_last_purchase) AS avg_days_since_last_purchase
FROM gmodel_comfy_wear_predict
GROUP BY prediction;
Storefront SportStyleShop
CREATE TEMP TABLE gmodel_sporty_style_predict as
SELECT * FROM model_predict (global_anomaly_fedmodel, 1,
SELECT sf.standardized_avg_order_value as avg_order_value,
sf.standardized_total_orders as avg_total_orders,
sf.standardized_customer_lifetime as avg_customer_lifetime,
sf.standardized_days_since_last_purchase as avg_days_since_last_purchase
FROM
(SELECT explode(store_features) as sf
FROM src_for_fedlrn
WHERE store_id = 'sporty_style_store_view')
);
SELECT prediction,
count(1) AS cluster_size,
avg(avg_order_value) AS avg_order_value,
avg(avg_total_orders) AS avg_total_orders,
avg(avg_customer_lifetime) AS avg_customer_lifetime,
avg(avg_days_since_last_purchase) AS avg_days_since_last_purchase
FROM gmodel_sporty_style_predict
GROUP BY prediction;
Global Model Consistency: The global model clustering pattern is consistent with the local models:
Potential Anomaly Clusters:
Anomaly Cluster Matching: The global model aligned well with local model anomaly clusters:
What did we learn
By implementing federated learning using the Machine Learning SQL extension, E-commerce businesses can process large-scale, decentralized data efficiently while maintaining data privacy. This approach ensures robust and scalable machine learning solutions, helping drive data-driven decision-making and enhancing overall business ROI.
The ML-SQL extension in Data Distiller or similar SQL extension in other distributed computing frameworks seamlessly bridges the gap between data engineers and data scientists, offering a trusted, efficient, and scalable environment to integrate machine learning directly within SQL.
Big shoutout to the Data Distiller team for driving this innovation and making machine learning more accessible within SQL-based workflows!
Interesting take! Gave us something to think about.