Feature Engineering: Transforming Data into Intelligence
In the world of machine learning, while algorithms and model architectures capture the headlines, the real magic often happens in a less glamorous but equally critical phase: feature engineering. I wanted to share what I learned from one of my projects, where thoughtful feature engineering delivered real business value to the product.
What is Feature Engineering?
It is the process of extracting useful information from raw data and transforming it into features that help a machine learning model work effectively. It involves selecting, transforming, and creating the variables a model learns from.
The Foundation: Understanding Your Data
Every successful feature engineering effort begins with a thorough exploration of the data. Understanding what your data represents in the real world is just as important as understanding its statistical properties. Tools like correlation matrices, scatter plots, and distribution plots provide great insight into the data and inspire meaningful new features.
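As a minimal sketch of that first exploration step (on synthetic stand-in data, not the project's real dataset), a distribution summary and a correlation matrix are often where I start:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for raw patient data (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'age': rng.integers(18, 90, 200),
    'triage_score': rng.integers(1, 6, 200),
    'length_of_stay': rng.gamma(2.0, 2.0, 200),
})

# Statistical profile: count, mean, std, quartiles per column
print(df.describe())

# Pairwise Pearson correlations between numeric columns
corr = df.corr(numeric_only=True)
print(corr)
```

Skews in `describe()` output or an unexpectedly strong correlation are usually the first hints of a feature worth engineering.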
Real World Example
Let's work through a real-world example to understand feature engineering. Before I present the problem statement, a bit of context will help explain its purpose.
I was part of a small, dynamic team consisting of an architect, a data scientist, and a couple of engineers. We were on the verge of building something promising that would help our product attract more business. Our mission was to develop a patient flow capacity suite system—a comprehensive solution with several key features, including the ability to predict whether a patient would become a frequent flyer.
In short, a frequent flyer here means a patient who frequently visits the hospital ED or is frequently admitted to hospital.
This gave us our problem statement:
Predict if a patient will become a frequent flyer based on their visit history and demographics.
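To make the target concrete, here is one way such a label could be derived from visit history (the threshold of 4+ visits in a year is my illustrative assumption here, not the project's actual definition):

```python
import pandas as pd

# Toy visit history for two patients (illustrative data)
visits = pd.DataFrame({
    'patient_id': ['P001'] * 5 + ['P002'],
    'visit_date': pd.to_datetime(
        ['2024-01-05', '2024-02-11', '2024-04-02', '2024-07-19', '2024-10-30',
         '2024-03-15'])
})

# Count ED visits per patient over the 12-month window
visit_counts = visits.groupby('patient_id')['visit_date'].count()

# Assumed threshold: 4+ visits in a year -> frequent flyer
frequent_flyer = (visit_counts >= 4).rename('frequent_flyer')
# P001 -> True (5 visits), P002 -> False (1 visit)
```

In practice the definition would come from clinicians and the business, but a rule like this gives the model a clean binary target to learn.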
Essential Tools
My feature engineering toolkit includes pandas and NumPy for data manipulation, scikit-learn for preprocessing and modelling, and matplotlib/seaborn for visualisation.
Let's talk about the impact of feature engineering with an example. Consider the extract of raw patient data below:
import pandas as pd

# Raw data: patient visit timestamps
visits = ['2024-01-01 14:30', '2024-01-02 02:15', '2024-01-03 23:45']

# Poor feature: just the timestamp strings (not useful for ML models)
raw_timestamps = visits

# Good features: extract meaningful patterns from the timestamps
visit_times = pd.Series(pd.to_datetime(visits))

# Time-based features
hour_of_day = visit_times.dt.hour          # [14, 2, 23] - identifies peak hours
day_of_week = visit_times.dt.dayofweek     # [0, 1, 2] - Monday, Tuesday, Wednesday
is_weekend = visit_times.dt.dayofweek >= 5 # [False, False, False]
is_night_visit = (visit_times.dt.hour >= 22) | (visit_times.dt.hour <= 6) # [False, True, True]

# Temporal pattern features
time_between_visits = visit_times.diff().dt.total_seconds() / 3600 # Hours between visits
visit_frequency = len(visits) # Total visit count
days_since_first_visit = (visit_times - visit_times.min()).dt.days # [0, 1, 2]
This transformation shows how thoughtful feature engineering significantly improves model performance and efficiency. The contrast between the poor feature and the good features demonstrates how feature engineering turns unusable raw data into actionable signals that directly help identify frequent flyer patients.
I am not an expert, and honestly, I don't think there is a defined formula for getting feature engineering exactly right on the first attempt. The beauty of feature engineering lies in its exploratory nature: you start with your best intuition about what features might influence the outcome you're trying to predict. From there, it becomes a systematic, creative, and iterative process of creating new features driven by your business requirements.
# Example: Patient visit data
raw_data = pd.DataFrame({
    'patient_id': ['P001', 'P001', 'P001', 'P002', 'P002'],
    'visit_date': ['2024-01-01', '2024-01-02', '2024-01-05', '2024-01-10', '2024-01-15'],
    'visit_time': ['14:30', '02:15', '16:45', '09:30', '20:15'],
    'age': [45, 45, 45, 32, 32],
    'gender': ['M', 'M', 'M', 'F', 'F'],
    'insurance': ['Medicare', 'Medicare', 'Medicare', 'Private', 'Private'],
    'chief_complaint': ['chest pain', 'chest pain', 'anxiety', 'headache', 'migraine'],
    'triage_score': [2, 1, 3, 4, 3],
    'length_of_stay': [4.5, 6.0, 2.5, 1.5, 2.0],
    'discharge_disposition': ['Discharged', 'Admitted', 'Discharged', 'Discharged', 'Discharged']
})
Let's import the required libraries:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
Let's derive some new features from the existing raw patient data.
# Convert to datetime
raw_data['visit_datetime'] = pd.to_datetime(raw_data['visit_date'] + ' ' + raw_data['visit_time'])

feature_rows = []
for patient_id in raw_data['patient_id'].unique():
    patient_data = raw_data[raw_data['patient_id'] == patient_id].copy()
    patient_data = patient_data.sort_values('visit_datetime')

    # 1. Basic Demographics
    features = {
        'patient_id': patient_id,
        'age': patient_data['age'].iloc[0],
        'gender': patient_data['gender'].iloc[0],
        'insurance_type': patient_data['insurance'].iloc[0]
    }

    # 2. Visit Features
    features.update({
        'total_visits': len(patient_data),
        'visits_per_month': len(patient_data) / 12,  # Assuming 1 year of data
        'unique_visit_dates': patient_data['visit_date'].nunique(),
        'multiple_visits_same_day': len(patient_data) - patient_data['visit_date'].nunique()
    })

    # 3. Temporal Pattern Features
    if len(patient_data) > 1:
        visit_intervals = patient_data['visit_datetime'].diff().dt.days.dropna()
        features.update({
            'avg_days_between_visits': visit_intervals.mean(),
            'std_days_between_visits': visit_intervals.std(),
            'min_days_between_visits': visit_intervals.min(),
            'max_days_between_visits': visit_intervals.max(),
            'quick_returns_24h': (visit_intervals <= 1).sum(),
            'quick_returns_48h': (visit_intervals <= 2).sum(),
            'total_timespan_days': (patient_data['visit_datetime'].max() - patient_data['visit_datetime'].min()).days
        })

    # 4. Clinical Features
    features.update({
        'avg_triage_score': patient_data['triage_score'].mean(),
        'max_triage_score': patient_data['triage_score'].max(),
        'min_triage_score': patient_data['triage_score'].min(),
        'triage_score_variance': patient_data['triage_score'].var(),
        'avg_length_of_stay': patient_data['length_of_stay'].mean(),
        'max_length_of_stay': patient_data['length_of_stay'].max(),
        'total_length_of_stay': patient_data['length_of_stay'].sum()
    })

    # 5. Behavioral Features
    hours = patient_data['visit_datetime'].dt.hour
    days = patient_data['visit_datetime'].dt.dayofweek
    features.update({
        'night_visits_ratio': ((hours >= 20) | (hours <= 6)).sum() / len(patient_data),
        'weekend_visits_ratio': (days >= 5).sum() / len(patient_data),
        'weekday_visits_ratio': (days < 5).sum() / len(patient_data),
        'morning_visits_ratio': hours.between(6, 12).sum() / len(patient_data),
        'evening_visits_ratio': hours.between(18, 23).sum() / len(patient_data)
    })

    # 6. Discharge Pattern Features
    discharge_counts = patient_data['discharge_disposition'].value_counts()
    total_visits = len(patient_data)
    features.update({
        'discharged_ratio': discharge_counts.get('Discharged', 0) / total_visits,
        'left_ama_ratio': discharge_counts.get('Left AMA', 0) / total_visits,
        'admitted_ratio': discharge_counts.get('Admitted', 0) / total_visits,
        'discharge_pattern_diversity': len(discharge_counts)
    })

    # 7. Chief Complaint Features
    complaints = patient_data['chief_complaint'].str.lower()
    complaint_counts = Counter(complaints)
    features.update({
        'unique_complaints': len(complaint_counts),
        'most_frequent_complaint_ratio': max(complaint_counts.values()) / len(patient_data) if complaint_counts else 0,
        'pain_related_visits': complaints.str.contains('pain', na=False).sum(),
        'anxiety_related_visits': complaints.str.contains('anxiety', na=False).sum(),
        'chest_pain_visits': complaints.str.contains('chest pain', na=False).sum(),
        'chronic_mentions': complaints.str.contains('chronic', na=False).sum()
    })

    feature_rows.append(features)
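Once you have one feature row per patient, the rows can be assembled into a model-ready frame. A minimal sketch (the feature values and `labels` below are hypothetical placeholders, since our real labels came from historical visit data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-patient feature rows, as produced by a loop like the one above
feature_df = pd.DataFrame([
    {'patient_id': 'P001', 'age': 45, 'total_visits': 3, 'avg_triage_score': 2.0, 'night_visits_ratio': 0.33},
    {'patient_id': 'P002', 'age': 32, 'total_visits': 2, 'avg_triage_score': 3.5, 'night_visits_ratio': 0.0},
    {'patient_id': 'P003', 'age': 67, 'total_visits': 8, 'avg_triage_score': 1.8, 'night_visits_ratio': 0.5},
    {'patient_id': 'P004', 'age': 51, 'total_visits': 1, 'avg_triage_score': 4.0, 'night_visits_ratio': 0.0},
])
# Hypothetical frequent-flyer labels (1 = frequent flyer)
labels = [0, 0, 1, 0]

X = feature_df.drop(columns=['patient_id'])
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X, labels)

# Feature importances show which engineered features the model leans on
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Inspecting the importances closes the loop: features that carry no signal are candidates to drop, and surprising ones are worth discussing with domain experts.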
This feature list keeps growing as your understanding of the business problem deepens and, of course, through collaboration with domain experts. I've enjoyed deriving features from raw data that helped my team gain insight into the underlying problems.
The Curse of Dimensionality
Adding features isn't free—each additional dimension makes the feature space sparser and can hurt model performance, especially with limited training data. I've learned to balance feature richness with dataset size.
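One pragmatic guard, sketched here on synthetic data (illustrative, not our exact pipeline), is to rank engineered features and keep only the most informative ones, for example with scikit-learn's SelectKBest:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 200 patients, 30 engineered features, only 5 informative
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=0, random_state=42)

# Keep the 10 features with the strongest univariate relationship to the label
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, '->', X_reduced.shape)  # (200, 30) -> (200, 10)
```

With a small training set, a deliberate pruning step like this often beats adding one more clever feature.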
Conclusion
Feature engineering builds the bridge between raw data and business value. The most important lesson I've learned is that effective feature engineering is iterative and experimental. Patience is the key!