Analytical architecture evolution - part 3

Part 3: The engine room of the ecosystem – the feature store

Let’s set ourselves a challenge – I want to be able to get a new column added to the ‘CRM datamart’, deploy a new predictive model, build a new campaign and create a whole bunch of new campaign reports all within 24 hours without calling in a small army of engineers!

The trick to solving this challenge is to think about each of these use cases as features (there's a quick code sketch after the list):

  • A new column = base feature
  • A new predictive model = derived feature
  • A new campaign = campaign feature
  • Campaign performance metrics = performance features
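
To make the mapping concrete, here's a minimal sketch of how these feature types might be represented; the names are illustrative, not the actual system's:

```python
from enum import Enum

class FeatureType(Enum):
    """Illustrative taxonomy mirroring the list above."""
    BASE = "base"                # a new column sourced from raw data
    DERIVED = "derived"          # computed from other features
    CAMPAIGN = "campaign"        # treatment-group membership flags
    PERFORMANCE = "performance"  # campaign metrics tracked over time
```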

If we cast our minds back to the previous posts, you'll remember we introduced the metadata-driven concept. This is crucial to the feature store, and the metadata itself is stored in our feature dictionary.

When an analyst wants to create a new column (a base feature) they build a new feature in the feature dictionary. A bit of information is required when creating a feature, because this metadata is passed into our data pipeline and used to configure the production process. We use a naming convention so that we know what each feature is, and then set up other metadata such as 'how often to evaluate this feature', 'what language am I going to use to specify the feature' and the most important piece of metadata: 'the code that I want the production process to execute'.
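
As an illustration, a feature-dictionary entry might look something like this (the field names and SQL are my own sketch, not the actual schema):

```python
# Hypothetical feature-dictionary entry; field names are illustrative.
new_feature = {
    "name": "cust_txn_count_30d",   # follows the naming convention
    "feature_type": "base",
    "refresh_frequency": "daily",   # how often to evaluate this feature
    "language": "sql",              # language the code is written in
    "code": """
        SELECT customer_id,
               COUNT(*) AS cust_txn_count_30d
        FROM transactions
        WHERE txn_date >= DATE_SUB(CURRENT_DATE, 30)
        GROUP BY customer_id
    """,
}
```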

What this means is that an analyst can build a new base feature, tell the process how often to evaluate it, specify the language (SQL, Python, R etc.) and then pass in the code. The operational process is waiting for this metadata and will evaluate the feature as soon as it appears.
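
Conceptually, the production process is just a loop over the dictionary that dispatches each entry to the right executor. A rough sketch, with stub executors standing in for whatever engines the platform actually provides:

```python
# Stub executors; in reality these would call Hive, a Python runtime,
# an R server and so on. Purely illustrative.
def run_sql(code): ...
def run_python(code): ...
def run_r(code): ...

EXECUTORS = {"sql": run_sql, "python": run_python, "r": run_r}

def evaluate(feature):
    """Dispatch a feature to the executor named in its metadata."""
    return EXECUTORS[feature["language"]](feature["code"])
```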

A campaign feature is a derived feature. For example, 'if A and B and C < 100 then 1 else 0' would set a binary flag indicating whether you were in a campaign treatment group, where A, B and C are other base features. The difference between a base and a derived feature is the source of the data: when we evaluate a base feature we use HDFS as the source, whereas the source for derived features is the feature store itself.
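
Evaluating that campaign flag might look like this (pandas is used purely for illustration; in practice the inputs come from the feature store):

```python
import pandas as pd

# A wide-format slice of the feature store: one row per customer,
# one column per base feature (A, B and C are illustrative names).
features = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "A": [True, True, False],
    "B": [True, False, True],
    "C": [50, 200, 30],
})

# Derived campaign feature: 1 if A and B and C < 100, else 0.
features["in_treatment_group"] = (
    features["A"] & features["B"] & (features["C"] < 100)
).astype(int)
```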

A predictive model is another type of derived feature: the analyst passes in the R or Python code and the production process evaluates the model (we're working on a way to take this even further!).
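
A minimal sketch of what that registered code might do, assuming a scikit-learn-style model serialised with pickle (the path and column names are hypothetical):

```python
import pickle

# Hypothetical scoring code an analyst would register as a derived feature.
# The inputs come from the feature store; the output is stored as a new
# feature like any other.
def churn_score(features_df):
    with open("/models/churn_model.pkl", "rb") as f:  # illustrative path
        model = pickle.load(f)
    # Probability of the positive class, one score per customer.
    return model.predict_proba(features_df[["A", "B", "C"]])[:, 1]
```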

A really cool extension of this pattern is to consider campaign performance metrics as features. What this means is that we can track any metric for any campaign as long as we can specify the metric in code (and the data is on the platform!).
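
A performance feature is then just another dictionary entry whose code computes the metric. For example (names and SQL are again illustrative):

```python
# Hypothetical performance feature: a campaign's conversion rate,
# re-evaluated daily like any other feature.
performance_feature = {
    "name": "spring_sale_conversion_rate",
    "feature_type": "performance",
    "refresh_frequency": "daily",
    "language": "sql",
    "code": """
        SELECT campaign_id,
               AVG(converted) AS conversion_rate
        FROM campaign_contacts
        WHERE campaign_id = 'spring_sale'
        GROUP BY campaign_id
    """,
}
```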

The feature store itself is held in a couple of formats:

  1. Key-Value pair (kind of!) – see the comment on loose coupling in Part 2; and
  2. Wide format – this is how analysts want to access the information.

As you can imagine, the wide-format schema is constantly evolving, so it is better suited to a NoSQL database. This format is also used as the input data when building predictive models, which removes a lot of the friction involved in preparing datasets for modelling.
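
The relationship between the two formats is essentially a pivot. A quick sketch in pandas (in practice this would run on the platform's own tooling):

```python
import pandas as pd

# Key-value (long) form: one row per (entity, feature) pair.
kv = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "feature":     ["A", "C", "A", "C"],
    "value":       [1, 50, 0, 200],
})

# Wide form: one row per customer, one column per feature -- the shape
# analysts query and the input shape for model training.
wide = kv.pivot(index="customer_id", columns="feature", values="value")
```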

The flexibility and lack of friction in the feature store give us an opportunity to dive a bit deeper into feature engineering – more on this in Part 4…

Great article, Mark. Have you had a look at Ivory as a possible feature repository? https://github.com/ambiata/ivory

Nice article, Mark! I can see how the key-value pair format supports loose coupling, and the wide format could be a final view built on top of the key-value pairs. The metadata-driven part is interesting; I'd be eager to know whether feature registration will make coding semi-automatic or even automatic, to speed up the data-sourcing process. Would you explore that a bit more in the following parts?
