Feature Engineering with Dates and Python - Summary

Feature engineering is often the most important part…

is what David Wind's research on winning Kagglers reveals. Yet it is somewhat neglected in machine learning texts, which tend to focus on the algorithms rather than on what is being fed into them. What is feature engineering, then?

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. [1]

The main tasks in feature engineering are:

  • Feature creation/extraction: This is the process of exposing data hidden inside fields in the raw data format. It generally does not vary with the algorithm consuming the features. A good understanding of the business domain, intuition and experience play a big role here.
  • Feature selection: This is the process of deciding which features are most useful for the models in question. Different techniques can be used, and they may vary with the algorithm using the features.

To demonstrate the process of feature extraction, let's take a simple date/time stamp field and see what features could be extracted from this one field.

When we look at a date/time stamp, a number of features, or pieces of information, are immediately obvious:

  • Year
  • Month
  • Day
  • Day of week
  • Week of year
  • Hour of day

Month and day of week can be quite useful in understanding the periodicity or seasonality of transactions. We may find that some actions are more probable on certain days of the week, or that some things happen around the same month every year. With Halloween around the corner, for example, you are probably shopping for candy right now.
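As a rough illustration, the basic features listed above can be pulled out of a single timestamp with Python's standard `datetime` module. This is only a minimal sketch: the function name `basic_date_features` and the dictionary keys are made up for this example, and it assumes ISO 8601 week numbering and Monday = 0 for day of week.

```python
from datetime import datetime


def basic_date_features(ts: datetime) -> dict:
    """Return the obvious calendar features of one timestamp.

    Hypothetical helper for illustration; key names are assumptions.
    """
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "day_of_week": ts.weekday(),         # Monday = 0 ... Sunday = 6
        "week_of_year": ts.isocalendar()[1],  # ISO 8601 week number
        "hour": ts.hour,
    }


# Example: a click recorded a few days before Halloween
features = basic_date_features(datetime(2015, 10, 28, 14, 30))
```

Numeric fields such as `day_of_week` can then be grouped or counted to check whether some actions really do cluster on particular days or months.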

Now, let's think of more interesting features that may involve lookups. How about seasons, or times of day? I was working for a client that had a ton of recipes on their site. One of the questions they asked was about a content recommender, and what the right time is to display each type of recipe. We used a time-of-day map to understand when people look at breakfast recipes, kids' lunch box recipes, appetizers, etc. Intuitively, you can imagine that people prepare for the next day's lunch and breakfast around or after dinner, especially if they have children. If you are coming to the end of your work day, you are probably thinking about dinner and what you could pick up on your way home. This feature was extracted from clickstream data to enrich it and give additional insight into what types of recipes to show at what time. Season was also a good predictor for understanding which recipes are timeless and which are more seasonal.

For another grocery client, we saw a huge uptick in flyer browsing between 8 and 10 am and from noon to 1 pm (lunch hour) on weekdays. Further, Wednesdays and Thursdays were the heaviest traffic days of the week, as most people planned their grocery shopping just before the weekend.
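A time-of-day or season feature of this kind is essentially a lookup on the hour or the month. The sketch below shows one way that could look; the hour boundaries, the labels and the Northern Hemisphere meteorological seasons are all assumptions chosen for illustration, not the mapping used for the clients described above.

```python
from datetime import datetime

# Hypothetical hour ranges (start inclusive, end exclusive); tune per domain.
TIME_OF_DAY = [
    (5, 11, "morning"),
    (11, 14, "midday"),
    (14, 17, "afternoon"),
    (17, 21, "evening"),
]

# Assumed Northern Hemisphere meteorological seasons, keyed by month number.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def time_of_day(ts: datetime) -> str:
    """Map the hour of a timestamp to a coarse time-of-day label."""
    for start, end, label in TIME_OF_DAY:
        if start <= ts.hour < end:
            return label
    return "night"  # everything outside the ranges above (21:00 to 04:59)


def season(ts: datetime) -> str:
    """Map the month of a timestamp to a season label."""
    return SEASON_BY_MONTH[ts.month]
```

Keeping the boundaries in plain tables makes it easy to adjust them once the clickstream shows where the real breakpoints are, for example a dedicated lunch-hour bucket.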

Distance between Holidays

Retail is a very seasonal business. Often, people are buying for an occasion or near an occasion. Intuitively, people may be purchasing for Valentine's Day, for Thanksgiving due to all the great sales, for Christmas, etc. To understand which customers are more driven by these special occurrences, a set of features needs to be created that measures the distance to each of them.
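One simple way to express such a distance is the number of days from a transaction date to the next occurrence of each holiday. The sketch below is a minimal version under stated assumptions: it only covers fixed-date holidays, the names `FIXED_HOLIDAYS`, `days_until_next` and `holiday_distance_features` are invented for this example, and floating holidays such as Thanksgiving would need a calendar lookup that is not shown here.

```python
from datetime import date

# Hypothetical table of fixed-date holidays as (month, day).
FIXED_HOLIDAYS = {
    "valentines_day": (2, 14),
    "christmas": (12, 25),
}


def days_until_next(d: date, month: int, day: int) -> int:
    """Days from d to the next occurrence of month/day (0 on the day itself)."""
    target = date(d.year, month, day)
    if target < d:
        # This year's occurrence has passed; use next year's.
        target = date(d.year + 1, month, day)
    return (target - d).days


def holiday_distance_features(d: date) -> dict:
    """Build one 'days_to_<holiday>' feature per entry in FIXED_HOLIDAYS."""
    return {
        f"days_to_{name}": days_until_next(d, month, day)
        for name, (month, day) in FIXED_HOLIDAYS.items()
    }


distances = holiday_distance_features(date(2015, 10, 28))
```

A customer whose purchases repeatedly fall at small values of one of these features is a plausible candidate for being occasion-driven.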

To read the detailed tutorials with Python code, please see:

Part I: Creating 9 basic features from date/time

Part II: Creating advanced features by combining with public holiday information


[1] http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/ 
