Analytical architecture evolution - part 3

Part 3: The engine room of the ecosystem – the feature store

Let’s set ourselves a challenge – I want to be able to get a new column added to the ‘CRM datamart’, deploy a new predictive model, build a new campaign and create a whole bunch of new campaign reports all within 24 hours without calling in a small army of engineers!

The trick to solving this challenge is to think about each of these use cases as features (there's a quick code sketch after the list):

  • A new column = base feature
  • A new predictive model = derived feature
  • A new campaign = campaign feature
  • Campaign performance metrics = performance features
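
To make the mapping concrete, here's a minimal sketch of how these feature types might be represented; the names are illustrative, not the actual system's:

```python
from enum import Enum

class FeatureType(Enum):
    """Illustrative taxonomy mirroring the list above."""
    BASE = "base"                # a new column sourced from raw data
    DERIVED = "derived"          # computed from other features
    CAMPAIGN = "campaign"        # treatment-group membership flags
    PERFORMANCE = "performance"  # campaign metrics tracked over time
```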

If we cast our minds back to the previous posts, you'll remember we introduced the metadata-driven concept. This is crucial to the feature store, and the metadata itself is stored in our feature dictionary.

When an analyst wants to create a new column (a base feature) they build a new feature in the feature dictionary. A bit of information is required when creating a feature, because this metadata is passed into our data pipeline and used to configure the production process. We use a naming convention so that we know what each feature is, and then set up other metadata such as 'how often to evaluate this feature', 'what language am I going to use to specify the feature' and the most important piece of metadata: 'the code that I want the production process to execute'.
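
As an illustration, a feature-dictionary entry might look something like this (the field names and SQL are my own sketch, not the actual schema):

```python
# Hypothetical feature-dictionary entry; field names are illustrative.
new_feature = {
    "name": "cust_txn_count_30d",   # follows the naming convention
    "feature_type": "base",
    "refresh_frequency": "daily",   # how often to evaluate this feature
    "language": "sql",              # language the code is written in
    "code": """
        SELECT customer_id,
               COUNT(*) AS cust_txn_count_30d
        FROM transactions
        WHERE txn_date >= DATE_SUB(CURRENT_DATE, 30)
        GROUP BY customer_id
    """,
}
```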

What this means is that an analyst can build a new base feature, tell the process how often to evaluate it, specify the language (SQL, Python, R etc.) and then pass in the code. The operational process is waiting for this metadata and will evaluate the feature as soon as it appears.
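
Conceptually, the production process is just a loop over the dictionary that dispatches each entry to the right executor. A rough sketch, with stub executors standing in for whatever engines the platform actually provides:

```python
# Stub executors; in reality these would call Hive, a Python runtime,
# an R server and so on. Purely illustrative.
def run_sql(code): ...
def run_python(code): ...
def run_r(code): ...

EXECUTORS = {"sql": run_sql, "python": run_python, "r": run_r}

def evaluate(feature):
    """Dispatch a feature to the executor named in its metadata."""
    return EXECUTORS[feature["language"]](feature["code"])
```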

A campaign feature is a derived feature. For example, 'if A and B and C < 100 then 1 else 0' would set a binary flag indicating whether you were in a campaign treatment group, where A, B and C are other base features. The difference between a base and a derived feature is the source of the data: when we evaluate a base feature we use HDFS as the source, whereas the source for derived features is the feature store itself.
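
Evaluating that campaign flag might look like this (pandas is used purely for illustration; in practice the inputs come from the feature store):

```python
import pandas as pd

# A wide-format slice of the feature store: one row per customer,
# one column per base feature (A, B and C are illustrative names).
features = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "A": [True, True, False],
    "B": [True, False, True],
    "C": [50, 200, 30],
})

# Derived campaign feature: 1 if A and B and C < 100, else 0.
features["in_treatment_group"] = (
    features["A"] & features["B"] & (features["C"] < 100)
).astype(int)
```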

A predictive model is another type of derived feature: the analyst passes in the R or Python code and the production process evaluates the model (we're working on a way to take this even further!).
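
A minimal sketch of what that registered code might do, assuming a scikit-learn-style model serialised with pickle (the path and column names are hypothetical):

```python
import pickle

# Hypothetical scoring code an analyst would register as a derived feature.
# The inputs come from the feature store; the output is stored as a new
# feature like any other.
def churn_score(features_df):
    with open("/models/churn_model.pkl", "rb") as f:  # illustrative path
        model = pickle.load(f)
    # Probability of the positive class, one score per customer.
    return model.predict_proba(features_df[["A", "B", "C"]])[:, 1]
```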

A really cool extension of this pattern is to consider campaign performance metrics as features. What this means is that we can track any metric for any campaign as long as we can specify the metric in code (and the data is on the platform!).
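
A performance feature is then just another dictionary entry whose code computes the metric. For example (names and SQL are again illustrative):

```python
# Hypothetical performance feature: a campaign's conversion rate,
# re-evaluated daily like any other feature.
performance_feature = {
    "name": "spring_sale_conversion_rate",
    "feature_type": "performance",
    "refresh_frequency": "daily",
    "language": "sql",
    "code": """
        SELECT campaign_id,
               AVG(converted) AS conversion_rate
        FROM campaign_contacts
        WHERE campaign_id = 'spring_sale'
        GROUP BY campaign_id
    """,
}
```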

The feature store itself is held in a couple of formats:

  1. Key-Value pair (kind of!) – see the comment on loose coupling in Part 2; and
  2. Wide format – this is how analysts want to access the information.

As you can imagine, the wide-format schema is constantly evolving, so it is better suited to a NoSQL database. This format is also used as the input data when building predictive models, which removes a lot of the friction involved in preparing datasets for modelling.
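
The relationship between the two formats is essentially a pivot. A quick sketch in pandas (in practice this would run on the platform's own tooling):

```python
import pandas as pd

# Key-value (long) form: one row per (entity, feature) pair.
kv = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "feature":     ["A", "C", "A", "C"],
    "value":       [1, 50, 0, 200],
})

# Wide form: one row per customer, one column per feature -- the shape
# analysts query and the input shape for model training.
wide = kv.pivot(index="customer_id", columns="feature", values="value")
```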

The flexibility and lack of friction in the feature store give us an opportunity to dive a bit deeper into feature engineering – more on this in Part 4…

Great article, Mark. Have you had a look at Ivory as a possible feature repository? https://github.com/ambiata/ivory

Nice article, Mark! I can see how the key-value pair format supports loose coupling, and the wide format could be a final view built on top of the key-value pairs. The metadata-driven part is interesting; I'd be eager to know whether feature registration will make coding semi-automatic or even automatic, to speed up the data-sourcing process. Would you explore that a bit more in the following parts?
