Building an Attribution Feature Store: Solving the Data Context Problem in Marketing Modeling
Hello everyone,
This story is probably familiar to data folks working in Retail or E-commerce. One fine day, the C-suite books a team meeting and issues a directive: starting next month, the company will switch to Marketing Mix Modeling (MMM) or Multi-Touch Attribution (MTA) using Machine Learning to optimize ad budgets.
So the entire Data Engineering team works tirelessly to build the pipeline. We pull data from every source imaginable, from the Facebook Ads API, Google Ads, and TikTok to CRM and Web Tracking, pouring it all into the Warehouse. After two months of "sleeping on thorns and tasting gall," the Data Quality dashboard finally reports all green. Schema is standard, Data Types are correct, Null rate is under 1%. Technically, everything seems perfect.
But when the Data Scientist runs the model for the first time, the results are so surreal that no one dares to believe them. The model says to turn off all Google Search Brand ads, even though everyone knows this is the most critical bottom-funnel channel. Or worse, it suggests doubling the budget on October 15th, the day the server crashed: users kept reloading the page, causing a spike in phantom views while revenue was zero.
The problem is that we Data Engineers usually only check Technical Validity (correct format, row counts, correct columns). But to do AI/ML for Marketing, what we desperately need is Semantic Validity: correctness with respect to business context. Data that is clean in code but soulless in semantics is the most dangerous kind of trash; it is the "silent killer" that destroys AI projects right in the cradle.
When the pipeline is green but the model is still wrong
Why is the pipeline green but the model wrong? Because Machine Learning models are naive: they accept whatever data we feed them. They know nothing about the external events that traditional rule-based ETL often ignores.
Let's imagine an Out of Stock scenario. The Marketing team forgets to turn off ads for a best-selling SKU that has run out of stock. Money is still burning, Clicks are still high, but Conversion is 0. What will the model learn? It learns that this ad channel is terrible: users click but don't buy. In reality, the channel is excellent; it's the supply chain that broke.
Another case is a payment gateway incident. The Gateway is down for 2 hours during peak time, so users add to cart aggressively but cannot check out. The model will evaluate users from this campaign as mere "window shoppers" with no conversion value, leading to an incorrect assessment of traffic quality. Not to mention signal loss from iOS privacy updates, which can push the discrepancy between Ad Platforms and internal data to 30-40%. If we blindly feed raw data in, the model will be heavily biased, undervaluing mobile traffic and overvaluing desktop.
Attribution Feature Store Architecture and Context Injection
The solution is to never ETL straight from the Raw Layer to the Model. Build an intermediate layer, a Feature Store or Enriched Gold Layer, to handle semantics.
The core mindset here is Context Injection. We don't just clean data; we inject context into every row. Instead of deleting error rows, we flag them so the Model knows how to handle them. We need flags like is_out_of_stock to know whether the SKU was out of stock at the time of the click, is_major_promo to separate baseline sales from uplift sales during Mega Sale days, and system_health_score to identify when the payment gateway is flickering.
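To make this concrete, here is a minimal sketch of that enrichment step. The context tables and columns (inventory_daily, promo_calendar, system_health_daily, health_score) are illustrative assumptions, not a prescribed schema; the pattern is what matters: LEFT JOIN context onto the fact table and emit flags instead of deleting rows.
-- Context Injection sketch: enrich performance rows with flags instead of filtering them.
-- All context table and column names below are illustrative assumptions.
SELECT
  f.date,
  f.campaign_id,
  f.sku_id,
  f.clicks,
  f.conversions,
  -- Flag, don't delete: the Model decides how to weight these rows
  COALESCE(inv.stock_quantity, 0) = 0 AS is_out_of_stock,
  promo.promo_id IS NOT NULL AS is_major_promo,
  COALESCE(health.health_score, 1.0) AS system_health_score
FROM `silver.marketing_performance` f
LEFT JOIN `warehouse.inventory_daily` inv
  ON f.date = inv.date AND f.sku_id = inv.sku_id
LEFT JOIN `marketing.promo_calendar` promo
  ON f.date BETWEEN promo.start_date AND promo.end_date
LEFT JOIN `ops.system_health_daily` health
  ON f.date = health.date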
Show Me The Code
Theory is easy to talk about. To convince skeptical engineers, let me go into technical detail with three SQL checks that I always require in the Silver/Gold layer before data touches the Model.
First is the Cross-Platform Discrepancy check to catch Tracking Loss. We compare the figures reported by the Ad Network API against our internal Web Analytics. If the discrepancy is too large, data for that day must be labeled "Low Confidence" immediately.
WITH ad_network_stats AS (
  SELECT
    date,
    campaign_id,
    SUM(clicks) AS platform_clicks
  FROM `raw.facebook_ads`
  GROUP BY 1, 2
),
internal_tracker_stats AS (
  SELECT
    date,
    utm_campaign AS campaign_id,
    COUNT(DISTINCT session_id) AS tracked_sessions
  FROM `raw.clickstream_events`
  WHERE source = 'facebook'
  GROUP BY 1, 2
)
SELECT
  a.date,
  a.campaign_id,
  a.platform_clicks,
  -- COALESCE matters: a day with no tracked sessions at all is total tracking loss,
  -- and must not slip through as HIGH_CONFIDENCE via a NULL comparison
  COALESCE(b.tracked_sessions, 0) AS tracked_sessions,
  -- Calculate % discrepancy between the ad platform and our own tracker
  SAFE_DIVIDE(
    ABS(a.platform_clicks - COALESCE(b.tracked_sessions, 0)),
    a.platform_clicks
  ) AS discrepancy_rate,
  CASE
    -- If discrepancy > 20% and volume is significant -> data is semantically dirty
    WHEN SAFE_DIVIDE(
           ABS(a.platform_clicks - COALESCE(b.tracked_sessions, 0)),
           a.platform_clicks
         ) > 0.2
      AND a.platform_clicks > 100 THEN 'LOW_CONFIDENCE'
    ELSE 'HIGH_CONFIDENCE'
  END AS data_quality_flag
FROM ad_network_stats a
LEFT JOIN internal_tracker_stats b
  ON a.date = b.date
  AND a.campaign_id = b.campaign_id
Next is Context Awareness, where we teach the Model: "I was out of stock back then, don't blame me." This is the code block connecting Marketing Data and Supply Chain Data, calculating the availability rate of each SKU.
-- Feature Engineering: Calculate Availability Rate
WITH daily_inventory AS (
  SELECT
    date,
    sku_id,
    -- If the warehouse snapshots hourly, availability = hours in stock / 24h
    COUNTIF(stock_quantity > 0) / 24.0 AS availability_rate
  FROM `warehouse.inventory_hourly_snapshot`
  GROUP BY 1, 2
),
campaign_product_mapping AS (
  SELECT campaign_id, sku_id
  FROM `marketing.campaign_config`
)
SELECT
  m.date,
  m.campaign_id,
  m.impressions,
  m.conversions,
  -- Average availability across all SKUs the campaign promotes
  AVG(i.availability_rate) AS avg_sku_availability
FROM `silver.marketing_performance` m
JOIN campaign_product_mapping map
  ON m.campaign_id = map.campaign_id
JOIN daily_inventory i
  ON m.date = i.date
  AND map.sku_id = i.sku_id
GROUP BY 1, 2, 3, 4
Finally, the Silent Failure Detector uses a Z-Score to catch system errors, like site crashes or broken registration forms, that standard ETL pipelines never flag as red.
WITH daily_stats AS (
  SELECT
    date,
    source,
    sessions,
    transactions,
    SAFE_DIVIDE(transactions, sessions) AS cvr
  FROM `gold.traffic_summary`
),
stats_window AS (
  SELECT
    *,
    -- Rolling mean & stddev over the previous 14 days (excluding today)
    AVG(cvr) OVER (
      PARTITION BY source ORDER BY date
      ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
    ) AS moving_avg_cvr,
    STDDEV(cvr) OVER (
      PARTITION BY source ORDER BY date
      ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
    ) AS moving_std_cvr
  FROM daily_stats
)
SELECT
  *,
  -- Z-Score: (current value - rolling average) / rolling stddev
  (cvr - moving_avg_cvr) / NULLIF(moving_std_cvr, 0) AS z_score,
  CASE
    -- A drastic CVR drop (below -3 stddev) -> likely a system outage
    WHEN (cvr - moving_avg_cvr) / NULLIF(moving_std_cvr, 0) < -3 THEN 'POTENTIAL_OUTAGE'
    ELSE 'NORMAL'
  END AS system_health_tag
FROM stats_window
The ROI Problem: The True Value of Contextual Data
To see clearly why we must struggle through these Feature Engineering steps, consider a typical 11.11 Flash Sale scenario: the Hero product is Out of Stock from 10 AM to 2 PM, but Marketing forgets to turn off the ads.
A naive Model running on Raw Data sees high Spend and high Clicks but 0 Conversions during that window. Its conclusion: Facebook Ads at noon are extremely wasteful. Its next action: propose cutting the Facebook Ads budget entirely for the upcoming 12.12 noon slot. The consequence is losing massive potential revenue in the next sale, when stock is full but the ads have been wrongly turned off.
Conversely, a Model backed by the Feature Store sees the same high Spend and 0 Conversions, but it also sees that the is_out_of_stock flag is TRUE. Its conclusion: Conversion is low because of the stockout, while the ad channel's efficiency is still stable. It keeps the Attribution Score intact and suggests the Supply Chain team prepare stock better. The result is preserved ROI and intact operational processes.
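For intuition, here is a hedged sketch of how a downstream consumer can use that flag; gold.marketing_features and its columns are assumptions on my part. Computing the naive CVR and the in-stock CVR side by side makes the stockout damage visible instead of letting it silently drag down the channel's score.
-- Illustrative comparison (assumed table/columns): judge the channel only on
-- the demand it actually had a chance to convert.
SELECT
  campaign_id,
  SAFE_DIVIDE(SUM(conversions), SUM(clicks)) AS naive_cvr,
  -- CVR measured only while stock was available
  SAFE_DIVIDE(
    SUM(IF(is_out_of_stock, 0, conversions)),
    SUM(IF(is_out_of_stock, 0, clicks))
  ) AS in_stock_cvr
FROM `gold.marketing_features`
GROUP BY campaign_id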
Clearly, Data Engineering here is not just building a data pipeline; it is protecting data context.
Implementing a Circuit Breaker Pattern for the Data Pipeline
Sometimes data from third-party APIs is severely incorrect, for example Facebook returning Cost multiplied by 100 due to a currency-unit error. If we let this sneak into training, it will poison the model permanently, also known as Model Poisoning.
Let's install a Data Circuit Breaker, an automatic switch right before exporting data to the Model Training pipeline. The mechanism is very simple: if the metric discrepancy or the data quality score exceeds the allowed threshold, the breaker trips and the pipeline stops immediately. We absolutely DO NOT overwrite the Model Store with the new data; instead we use a Fallback Strategy, serving yesterday's data or an older Model version for temporary forecasting, while a PagerDuty alert wakes the Data Engineer up to fix it.
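As a minimal sketch, assuming a BigQuery scripting job and the data_quality_flag produced by the discrepancy check above (the table names and the 5% threshold are my assumptions), the breaker can be a simple ASSERT that aborts the script before the export step:
-- Data Circuit Breaker sketch as a BigQuery scripting step.
-- Table names and the 5% threshold are assumptions; tune them to your pipeline.
DECLARE low_confidence_ratio FLOAT64;

SET low_confidence_ratio = (
  SELECT SAFE_DIVIDE(COUNTIF(data_quality_flag = 'LOW_CONFIDENCE'), COUNT(*))
  FROM `gold.marketing_features`
  WHERE date = CURRENT_DATE()
);

-- Trip the breaker: a failed ASSERT aborts the script, so the export below never runs
-- and the Model Store keeps yesterday's snapshot (the Fallback Strategy).
ASSERT COALESCE(low_confidence_ratio, 1.0) < 0.05
  AS 'Circuit breaker tripped: too many LOW_CONFIDENCE rows today. Keeping the previous feature snapshot.';

-- Only reached when the breaker holds: publish today's features to the Model Store.
CREATE OR REPLACE TABLE `model_store.training_features` AS
SELECT * FROM `gold.marketing_features`;
Because the failed ASSERT kills the job before the CREATE OR REPLACE runs, the Model Store simply keeps its last good snapshot, and the job failure itself is what fires the PagerDuty alert through whatever alerting the scheduler already has wired up.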
My philosophy is that it's better to let the Model use stale data than to feed it toxic data.
Garbage In, Disaster Out
With traditional BI reports, a boss looking at a chart that is 10% wrong can mentally "adjust" based on experience. But with MMM, the Model is a naive black box. It treats every piece of data we feed it as the absolute truth.
Data Quality for AI must be 10 times stricter than Data Quality for BI. Don't use Dashboard standards to do Data Prep for a Model. Start by reviewing your pipeline tomorrow and ask yourself: are we tracking inventory status and system status in parallel with the clickstream? If not, start right away. Good luck, everyone.
Happy Engineering & Happy New Year!
The real risk isn't "bad data" that gets caught. It's "plausible" data that passes checks and quietly misleads decisions. Adding context plus circuit breakers is how you keep MMM outputs decision-grade.