Building an Attribution Feature Store: Solving the Data Context Problem in Marketing Modeling
Hello everyone,
This story is probably familiar to data folks working in Retail or E-commerce. One fine day, the C-suite books a team meeting and issues a directive: starting next month, the company will switch to Marketing Mix Modeling (MMM) or Multi-Touch Attribution (MTA) using Machine Learning to optimize ad budgets.
So the entire Data Engineering team works tirelessly to build the pipeline. We pull data from every source imaginable, from the Facebook Ads API, Google Ads, and TikTok to CRM and Web Tracking, pouring it all into the Warehouse. After two months of "sleeping on thorns and tasting gall," the Data Quality dashboard finally reports all green. Schema is standard, Data Types are correct, Null rate is under 1%. Technically, everything seems perfect.
But when the Data Scientist runs the model for the first time, the results are so surreal that no one dares to believe them. The model says to turn off all Google Search Brand ads, even though everyone knows this is the most critical bottom-funnel channel. Or worse, it suggests doubling the budget on October 15th, the day the server crashed: users kept reloading the page, causing a spike in phantom views while revenue was zero.
The problem is that we Data Engineers usually only check Technical Validity (correct format, row counts, correct columns). But to do AI/ML for Marketing, what we desperately need is Semantic Validity: correctness with respect to business context. Data that is clean in code but soulless in semantics is the most dangerous kind of trash; it is the "silent killer" that destroys AI projects right in the cradle.
When the pipeline is green but the model is still wrong
Why is the pipeline green but the model wrong? Because Machine Learning models are naive: they accept whatever data we feed them. They know nothing about the external events that traditional rule-based ETL often ignores.
Let's imagine an Out of Stock scenario. The Marketing team forgets to turn off ads for a best-selling SKU that has run out of stock. Money is still burning, Clicks are still high, but Conversion is 0. What will the model learn? It learns that this ad channel is terrible: users click but don't buy. In reality, the channel is excellent; it's the supply chain that broke.
Another case is a payment gateway incident. The Gateway is down for 2 hours during peak time, so users add to cart aggressively but cannot check out. The model will evaluate users from this campaign as mere "window shoppers" with no conversion value, leading to an incorrect assessment of traffic quality. Not to mention signal loss from iOS privacy updates, which can push the discrepancy between Ad Platforms and internal data to 30-40%. If we blindly feed raw data in, the model will be heavily biased, undervaluing mobile traffic and overvaluing desktop.
Attribution Feature Store Architecture and Context Injection
The solution is to never ETL straight from the Raw Layer to the Model. Build an intermediate layer, a Feature Store or Enriched Gold Layer, to handle semantics.
The core mindset here is Context Injection. We don't just clean data; we inject context into every row. Instead of deleting error rows, we flag them so the Model knows how to handle them. We need flags like is_out_of_stock to know whether the SKU was out of stock at the time of the click, is_major_promo to separate baseline sales from uplift sales during Mega Sale days, and system_health_score to identify when the payment gateway is flickering.
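To make this concrete, here is a minimal sketch of that enrichment step. The context tables and columns (inventory_daily, promo_calendar, system_health_daily, health_score) are illustrative assumptions, not a prescribed schema; the pattern is what matters: LEFT JOIN context onto the fact table and emit flags instead of deleting rows.
-- Context Injection sketch: enrich performance rows with flags instead of filtering them.
-- All context table and column names below are illustrative assumptions.
SELECT
  f.date,
  f.campaign_id,
  f.sku_id,
  f.clicks,
  f.conversions,
  -- Flag, don't delete: the Model decides how to weight these rows
  COALESCE(inv.stock_quantity, 0) = 0 AS is_out_of_stock,
  promo.promo_id IS NOT NULL AS is_major_promo,
  COALESCE(health.health_score, 1.0) AS system_health_score
FROM `silver.marketing_performance` f
LEFT JOIN `warehouse.inventory_daily` inv
  ON f.date = inv.date AND f.sku_id = inv.sku_id
LEFT JOIN `marketing.promo_calendar` promo
  ON f.date BETWEEN promo.start_date AND promo.end_date
LEFT JOIN `ops.system_health_daily` health
  ON f.date = health.date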
Show Me The Code
Theory is easy to talk about. To convince skeptical engineers, let me go into technical detail with three SQL checks that I always require in the Silver/Gold layer before data touches the Model.
First is the Cross-Platform Discrepancy check to catch Tracking Loss. We compare the figures reported by the Ad Network API against our internal Web Analytics. If the discrepancy is too large, data for that day must be labeled "Low Confidence" immediately.
WITH ad_network_stats AS (
  SELECT
    date,
    campaign_id,
    SUM(clicks) AS platform_clicks
  FROM `raw.facebook_ads`
  GROUP BY 1, 2
),
internal_tracker_stats AS (
  SELECT
    date,
    utm_campaign AS campaign_id,
    COUNT(DISTINCT session_id) AS tracked_sessions
  FROM `raw.clickstream_events`
  WHERE source = 'facebook'
  GROUP BY 1, 2
)
SELECT
  a.date,
  a.campaign_id,
  a.platform_clicks,
  -- COALESCE matters: a day with no tracked sessions at all is total tracking loss,
  -- and must not slip through as HIGH_CONFIDENCE via a NULL comparison
  COALESCE(b.tracked_sessions, 0) AS tracked_sessions,
  -- Calculate % discrepancy between the ad platform and our own tracker
  SAFE_DIVIDE(
    ABS(a.platform_clicks - COALESCE(b.tracked_sessions, 0)),
    a.platform_clicks
  ) AS discrepancy_rate,
  CASE
    -- If discrepancy > 20% and volume is significant -> data is semantically dirty
    WHEN SAFE_DIVIDE(
           ABS(a.platform_clicks - COALESCE(b.tracked_sessions, 0)),
           a.platform_clicks
         ) > 0.2
      AND a.platform_clicks > 100 THEN 'LOW_CONFIDENCE'
    ELSE 'HIGH_CONFIDENCE'
  END AS data_quality_flag
FROM ad_network_stats a
LEFT JOIN internal_tracker_stats b
  ON a.date = b.date
  AND a.campaign_id = b.campaign_id
Next is Context Awareness, where we teach the Model: "I was out of stock back then, don't blame me." This is the code block connecting Marketing Data and Supply Chain Data, calculating the availability rate of each SKU.
-- Feature Engineering: Calculate Availability Rate
WITH daily_inventory AS (
  SELECT
    date,
    sku_id,
    -- If the warehouse snapshots hourly, availability = hours in stock / 24h
    COUNTIF(stock_quantity > 0) / 24.0 AS availability_rate
  FROM `warehouse.inventory_hourly_snapshot`
  GROUP BY 1, 2
),
campaign_product_mapping AS (
  SELECT campaign_id, sku_id
  FROM `marketing.campaign_config`
)
SELECT
  m.date,
  m.campaign_id,
  m.impressions,
  m.conversions,
  -- Average availability across all SKUs the campaign promotes
  AVG(i.availability_rate) AS avg_sku_availability
FROM `silver.marketing_performance` m
JOIN campaign_product_mapping map
  ON m.campaign_id = map.campaign_id
JOIN daily_inventory i
  ON m.date = i.date
  AND map.sku_id = i.sku_id
GROUP BY 1, 2, 3, 4
Finally, the Silent Failure Detector uses a Z-Score to catch system errors, like site crashes or broken registration forms, that standard ETL pipelines never flag as red.
WITH daily_stats AS (
  SELECT
    date,
    source,
    sessions,
    transactions,
    SAFE_DIVIDE(transactions, sessions) AS cvr
  FROM `gold.traffic_summary`
),
stats_window AS (
  SELECT
    *,
    -- Rolling mean & stddev over the previous 14 days (excluding today)
    AVG(cvr) OVER (
      PARTITION BY source ORDER BY date
      ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
    ) AS moving_avg_cvr,
    STDDEV(cvr) OVER (
      PARTITION BY source ORDER BY date
      ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
    ) AS moving_std_cvr
  FROM daily_stats
)
SELECT
  *,
  -- Z-Score: (current value - rolling average) / rolling stddev
  (cvr - moving_avg_cvr) / NULLIF(moving_std_cvr, 0) AS z_score,
  CASE
    -- A drastic CVR drop (below -3 stddev) -> likely a system outage
    WHEN (cvr - moving_avg_cvr) / NULLIF(moving_std_cvr, 0) < -3 THEN 'POTENTIAL_OUTAGE'
    ELSE 'NORMAL'
  END AS system_health_tag
FROM stats_window
The ROI Problem: The True Value of Contextual Data
To see clearly why we must struggle through these Feature Engineering steps, consider a typical 11.11 Flash Sale scenario: the Hero product is Out of Stock from 10 AM to 2 PM, but Marketing forgets to turn off the ads.
A naive Model running on Raw Data sees high Spend and high Clicks but 0 Conversions during that window. Its conclusion: Facebook Ads at noon are extremely wasteful. Its next action: propose cutting the Facebook Ads budget entirely for the upcoming 12.12 noon slot. The consequence is losing massive potential revenue in the next sale, when stock is full but the ads have been wrongly turned off.
Conversely, a Model backed by the Feature Store sees the same high Spend and 0 Conversions, but it also sees that the is_out_of_stock flag is TRUE. Its conclusion: Conversion is low because of the stockout, while the ad channel's efficiency is still stable. It keeps the Attribution Score intact and suggests the Supply Chain team prepare stock better. The result is preserved ROI and intact operational processes.
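For intuition, here is a hedged sketch of how a downstream consumer can use that flag; gold.marketing_features and its columns are assumptions on my part. Computing the naive CVR and the in-stock CVR side by side makes the stockout damage visible instead of letting it silently drag down the channel's score.
-- Illustrative comparison (assumed table/columns): judge the channel only on
-- the demand it actually had a chance to convert.
SELECT
  campaign_id,
  SAFE_DIVIDE(SUM(conversions), SUM(clicks)) AS naive_cvr,
  -- CVR measured only while stock was available
  SAFE_DIVIDE(
    SUM(IF(is_out_of_stock, 0, conversions)),
    SUM(IF(is_out_of_stock, 0, clicks))
  ) AS in_stock_cvr
FROM `gold.marketing_features`
GROUP BY campaign_id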
Clearly, Data Engineering here is not just building a data pipeline; it is protecting data context.
Implementing a Circuit Breaker Pattern for the Data Pipeline
Sometimes data from third-party APIs is severely incorrect, for example Facebook returning Cost multiplied by 100 due to a currency-unit error. If we let this sneak into training, it will poison the model permanently, also known as Model Poisoning.
Let's install a Data Circuit Breaker, an automatic switch right before exporting data to the Model Training pipeline. The mechanism is very simple: if the metric discrepancy or the data quality score exceeds the allowed threshold, the breaker trips and the pipeline stops immediately. We absolutely DO NOT overwrite the Model Store with the new data; instead we use a Fallback Strategy, serving yesterday's data or an older Model version for temporary forecasting, while a PagerDuty alert wakes the Data Engineer up to fix it.
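As a minimal sketch, assuming a BigQuery scripting job and the data_quality_flag produced by the discrepancy check above (the table names and the 5% threshold are my assumptions), the breaker can be a simple ASSERT that aborts the script before the export step:
-- Data Circuit Breaker sketch as a BigQuery scripting step.
-- Table names and the 5% threshold are assumptions; tune them to your pipeline.
DECLARE low_confidence_ratio FLOAT64;

SET low_confidence_ratio = (
  SELECT SAFE_DIVIDE(COUNTIF(data_quality_flag = 'LOW_CONFIDENCE'), COUNT(*))
  FROM `gold.marketing_features`
  WHERE date = CURRENT_DATE()
);

-- Trip the breaker: a failed ASSERT aborts the script, so the export below never runs
-- and the Model Store keeps yesterday's snapshot (the Fallback Strategy).
ASSERT COALESCE(low_confidence_ratio, 1.0) < 0.05
  AS 'Circuit breaker tripped: too many LOW_CONFIDENCE rows today. Keeping the previous feature snapshot.';

-- Only reached when the breaker holds: publish today's features to the Model Store.
CREATE OR REPLACE TABLE `model_store.training_features` AS
SELECT * FROM `gold.marketing_features`;
Because the failed ASSERT kills the job before the CREATE OR REPLACE runs, the Model Store simply keeps its last good snapshot, and the job failure itself is what fires the PagerDuty alert through whatever alerting the scheduler already has wired up.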
My philosophy is that it's better to let the Model use stale data than to feed it toxic data.
Garbage In, Disaster Out
With traditional BI reports, a boss looking at a chart that is 10% wrong can mentally "adjust" based on experience. But with MMM, the Model is a naive black box. It treats every piece of data we feed it as the absolute truth.
Data Quality for AI must be 10 times stricter than Data Quality for BI. Don't use Dashboard standards to do Data Prep for a Model. Start by reviewing your pipeline tomorrow and ask yourself: are we tracking inventory status and system status in parallel with the clickstream? If not, start right away. Good luck, everyone.
Happy Engineering & Happy New Year!
The real risk isn't "bad data" that gets caught. It's "plausible" data that passes checks and quietly misleads decisions. Adding context plus circuit breakers is how you keep MMM outputs decision-grade.