Ehsan Ghoreishi’s Post

3 hidden ways ML models fail (even with good accuracy).

Most data scientists know overfitting and underfitting. But data leakage? That’s the silent killer. Here’s a quick breakdown from the infographic:

🔹 Underfitting (High Bias)
→ Model is too simple. Misses patterns in the data.
→ Solution: increase model complexity, add features.

🔹 Overfitting (High Variance)
→ Model memorizes the training data, including noise and outliers.
→ Solution: simplify, regularize, or get more data.

🔹 Data Leakage (The Silent Killer)
→ Information from the future or the test set leaks into training.
→ Result: spectacular validation metrics … total failure in production.
→ Solution: strict feature engineering, time-based splits, and constant vigilance.

Why this matters: a model that cheats (leakage) or overfits will never generalize. And a model that underfits leaves value on the table.

#MachineLearning #DataScience #ModelValidation #Python #MLOps
More Relevant Posts
My model scored 100% accuracy... Yay! But I didn't celebrate. Something felt wrong. A model that perfect on real-world messy data isn't a success, it's a warning sign.

So I went looking. Turns out I had been importing a dataset I previously worked on. It was corrupted. The model had essentially memorized answers it had already seen. The score was meaningless.

I restarted from a clean dataset. Ran everything again properly. Restart & Run All, no shortcuts. This time the numbers were honest: 83.8% cross-validation accuracy. 88.7% ROC-AUC. Less impressive on the surface. Far more valuable in reality.

Here's what the actual model pipeline looks like. I tested several algorithms on the same feature set. Logistic regression outperformed the others on this problem: binary classification, structured tabular data, limited sample size. Simple models often win when the problem fits them. This one did.

The pipeline:
→ StandardScaler for numerical features (Age, Fare, FamilySize)
→ One-hot encoding for passenger title (Miss, Mr, Mrs, Rare)
→ FamilySize engineered as a composite feature
→ LogisticRegression wrapped in a scikit-learn Pipeline
→ Serialised with joblib for API serving

The title encoding decision matters more than it looks. My first instinct was label encoding: assigning integers to each title. That was wrong. Label encoding implies an order: Mr=1, Mrs=2, Miss=3. There is no such order. One-hot encoding treats each title as an independent binary flag. That's the correct representation. Catching that distinction early saved the model from learning a relationship that doesn't exist.

I also found a bug during integration testing. The FamilySize feature was off by one: the player themselves wasn't being counted in their own family. A small error, but in a model where every feature matters, small errors compound. I documented it as a known issue rather than quietly patching it with a guess. Known bugs you can explain are better than hidden bugs you can't.

This is post 2 of a series documenting how I built an AI-driven simulation game powered by a real ML pipeline. Next post: the FastAPI backend, and how the model went from a notebook to a live prediction endpoint.

#MachineLearning #Python #ScikitLearn #MLEngineering
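For readers who want to see the shape of such a pipeline, here is a minimal sketch assuming scikit-learn and joblib. The column names come from the post; the toy data, hyperparameters, and filename are stand-ins, not the author's actual code.

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data using the post's column names; the values are made up
X = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "FamilySize": [2, 2, 1, 2],
    "Title": ["Mr", "Mrs", "Miss", "Mrs"],
})
y = [0, 1, 1, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Fare", "FamilySize"]),
    # One binary flag per title — no false ordering like Mr=1, Mrs=2, Miss=3
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Title"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)

joblib.dump(model, "model.joblib")  # hypothetical filename, serialized for API serving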
🚀 Feature Scaling & Transformation — With Real Example + Code

Most people jump to models… but ignore feature scaling, which can literally make or break performance.

💡 Real-World Example
Building a House Price Prediction Model 🏡
Features:
- Size = 2000 sq.ft
- Rooms = 3
👉 Without scaling → the model gives more importance to size ❌
👉 With scaling → fair contribution from both ✅

🔥 Types of Scaling
📌 Min-Max Scaling (0–1 range)
📌 Standardization (mean = 0, std = 1)
📌 Robust Scaling (handles outliers)
📌 Normalization (unit vector scaling)

💻 Quick Python Code (Scikit-Learn)

from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = [[2000, 3], [1500, 2], [1800, 4]]

# Min-Max Scaling
minmax = MinMaxScaler()
scaled_minmax = minmax.fit_transform(data)

# Standard Scaling
standard = StandardScaler()
scaled_standard = standard.fit_transform(data)

print("MinMax:\n", scaled_minmax)
print("Standard:\n", scaled_standard)

🔧 Feature Transformation
✔️ Log Transform → handles skewed data (e.g., salary)
✔️ Encoding → converts categories into numbers

⚠️ Pro Tip
Always split into train and test sets first, then fit the scaler on the training set only and apply it to the test set — fitting on the full dataset leaks test-set statistics into training (a sketch follows below).

✨ Final Thought
Better data > Better model.

#DataScience #MachineLearning #FeatureEngineering #Python #AI #Learning
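A minimal sketch of that pro tip, with made-up numbers: the scaler learns its statistics from the training split only, then reuses them on the test split.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[2000, 3], [1500, 2], [1800, 4],
              [2200, 5], [1600, 2], [1900, 3]], dtype=float)

X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std computed from training rows only
X_test_scaled = scaler.transform(X_test)        # same stats reused — no test-set leakage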
Yesterday I decided to build a Multiple Linear Regression model. Simple, right? 😄 Well, not exactly. I ran into one of the weirdest issues I’ve ever seen in a dataset.

I have my own data preprocessing template: tested many times, reliable, and it saves me a lot of time. So I trusted it 100%. But when I applied it and selected the independent and dependent variables, I got results that made ZERO sense.

At first, I thought: “Okay, maybe I messed up something small.” Then I tried again. And again. And again. Same weird output. At this point, I started questioning everything, even my own template 😅

Before giving up, I tried one last thing: instead of selecting columns by index, I used column names. And suddenly everything worked perfectly 🤯

So I went back to investigate further. And here’s the surprise: the column indices I was using didn’t match what actually existed in the dataset! 👉 Turns out there were hidden columns / unexpected structure issues messing with the indexing.

Lesson learned:
Never trust indices blindly.
Always double-check your dataset structure.
And sometimes column names will save your life 😄

Debugging data > building models sometimes. Has anyone faced something like this before?

#DataScience #MachineLearning #DataPreprocessing #Python #DataAnalytics #AI #Debugging
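A small illustration of the difference, with a hypothetical DataFrame standing in for the dataset (the extra column plays the role of the hidden one):

import pandas as pd

# Hypothetical data: an unexpected extra column shifts every position
df = pd.DataFrame({"hidden_id": [1, 2], "size": [2000, 1500], "price": [300, 220]})

X_fragile = df.iloc[:, 0]   # by index: silently grabs hidden_id, not size
X_robust = df["size"]       # by name: correct, and raises KeyError if missing

df.info()                    # quick structure check before selecting anything
print(df.columns.tolist())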
🚨 I spent about 5 hours yesterday tuning a model that just wouldn't learn. I was tweaking the learning rate and trying different architectures for this computer vision task. Literally nothing worked. Val accuracy was stuck and I was starting to feel pretty dumb.

Then I actually looked at the raw data again. Turns out, about 30% of my training images were corrupted or mislabeled during the last scraping script I ran. I was trying to use a "smart" model to fix "stupid" data.

👉 What I realized: cleaning data is 90% of the job, even if it's the boring part. If the loss curve looks weird, check your CSV before you check your layers. Fancy models won't save you from a messy dataset.

Cleaning the data took 10 minutes and the model trained fine after that. Anyone else ever wasted a whole day on something this simple?

#machinelearning #python #datascientist #ai
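One cheap way to catch corrupted files before training — a sketch assuming Pillow and a hypothetical flat directory of JPEGs. It only catches unreadable files, not mislabels.

from pathlib import Path
from PIL import Image

bad = []
for path in Path("train_images").glob("*.jpg"):  # hypothetical directory
    try:
        with Image.open(path) as img:
            img.verify()  # raises for truncated or corrupted files
    except Exception:
        bad.append(path)

print(f"{len(bad)} corrupted files found")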
Most ML models don’t fail because of bad algorithms. They fail because of bad data preparation.

Feature engineering is the step most beginners skip or rush. But it’s often the difference between a model that works and one that actually performs.

Here are 3 things I always check before training any model:

𝟭. 𝗠𝗶𝘀𝘀𝗶𝗻𝗴 𝗩𝗮𝗹𝘂𝗲𝘀
Missing data is not the end of the world. You can fill gaps using simple statistics like mean or median (univariate imputation), or go smarter with KNN imputation, which looks at similar data points to estimate what’s missing.

𝟮. 𝗢𝘂𝘁𝗹𝗶𝗲𝗿𝘀
Outliers can silently wreck your model. I use the IQR method to catch them: anything below Q1 - (1.5×IQR) or above Q3 + (1.5×IQR) gets flagged (a quick sketch of this check follows below). For normally distributed data, Z-scores do the job just as well.

𝟯. 𝗜𝗺𝗯𝗮𝗹𝗮𝗻𝗰𝗲𝗱 𝗗𝗮𝘁𝗮
If your dataset has 95% of one class and 5% of another, your model will just learn to ignore the minority. Fix it by downsampling the majority class or upweighting the minority. Both work. Pick based on your data size.

Get these three right and your model has a real shot.

What part of feature engineering do you find most tricky? Drop it below 👇

#MachineLearning #DataScience #Python #MLEngineering #FeatureEngineering
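The IQR check from point 2, as a minimal pandas sketch with toy numbers:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # toy data with one obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])  # flags the 95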
My ML model had 100% accuracy. And was completely useless.

That's not a paradox — that's overfitting. The model didn't learn. It memorised.

Here's the mathematical core most tutorials skip:

E[loss] = Bias² + Variance + σ²
→ Bias² = too simple → Underfitting
→ Variance = too complex → Overfitting
→ σ² = irreducible → always there

What this actually means in practice:
→ A degree-9 polynomial on 6 data points hits R² = 1.0 and oscillates wildly between them (a quick numerical sketch follows below)
→ A linear model on sine-wave data has near-zero variance — but massive bias
→ The optimal model isn't the simplest. Not the most complex. It's the one minimising Bias² + Variance

And the generalisation gap? Formally defined as:

gen_gap(f) = R(f) − R_emp(f)

When this value is ≫ 0, your model is learning noise, not signal.

The fix isn't "collect more data and hope." The fix is regularisation, which I derive fully in my paper: L1, L2, Dropout, and Early Stopping, all from first principles.

Which regularisation strategy do you use most, and why? Drop your answer below ↓ I'll reply to every comment personally.

#MachineLearning #DataScience #Statistics #MLOps #ArtificialIntelligence #DeepLearning #Python #DataAnalytics
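The degree-9 claim is easy to verify numerically — a sketch with numpy (expect a RankWarning, since there are more coefficients than points):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 6)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=6)

coefs = np.polyfit(x, y, deg=9)     # 10 parameters, 6 points: pure memorisation
y_hat = np.polyval(coefs, x)

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print("train R^2:", r2)             # ≈ 1.0

x_dense = np.linspace(0, 1, 200)
print("max |fit| between points:", np.abs(np.polyval(coefs, x_dense)).max())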
Small detail. Big bug.

Last week, a FutureWarning almost slipped into our ML pipeline. We were post-EDA, cleaning a dataset for model training. The task was simple: replace "Unknown" strings with NaN. Classic pandas:

df.replace("Unknown", np.nan)

😬 Then came the warning:

FutureWarning: Downcasting behavior in replace is deprecated.

My first reaction? Try to silence it:

pd.set_option('future.no_silent_downcasting', True)

But here’s what I’ve learned from maintaining production systems:
👉 Never silence a FutureWarning. It’s not noise. It’s pandas telling you: “Your implicit assumptions about data types will break in a future version.”

🔍 What’s really happening
Historically, replace() could silently convert integer columns into floats when introducing NaN. Pandas is now making this behavior explicit and warning you about it. 🫨 Silencing the warning doesn’t fix the issue. It hides a future type inconsistency.

💡 The senior approach
Make type behavior explicit:

df.replace("Unknown", np.nan).infer_objects(copy=False)

Or even better, explicitly define your schema after cleaning, instead of relying on implicit type inference (a short sketch follows below).

Key takeaway
🟢 A warning is not a bug. Silencing it is.
🟢 In production data science, every silent assumption is a potential failure point.
🟢 Write code that makes behavior explicit, not code that hides uncertainty.

#Python #DataScience #Pandas #MLOps #DataEngineering
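A compact sketch of the explicit version, with a hypothetical column — the second step pins the dtype deliberately instead of trusting inference:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": ["25", "Unknown", "31"]})  # hypothetical dirty column

# Explicit pattern: replace, then resolve leftover object dtypes deliberately
cleaned = df.replace("Unknown", np.nan).infer_objects(copy=False)

# Even better: pin the schema yourself rather than relying on inference
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")
print(cleaned.dtypes)  # age is float64 by decision, not by accident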
Just Built & Deployed My Machine Learning Project

From dataset to trained ML model to deployed prediction application. I developed a California House Price Prediction System using Machine Learning and deployed it with Streamlit.

The system predicts house prices based on important housing features such as:
• Median Income
• House Age
• Total Rooms
• Population
• Latitude & Longitude

Model Used
RandomForestRegressor

Tech Stack
• Python
• Pandas & NumPy
• Scikit-learn
• Random Forest Regression
• Streamlit (for deployment)

Live Demo
https://lnkd.in/dW8FuqCU

Source Code
https://lnkd.in/dB7Z4cgx

Model Performance

Training Set Results
MAE: 25,180
MSE: 1,431,165,852
RMSE: 37,830

Test Set Results
MAE: 34,073
MSE: 2,587,975,219
RMSE: 50,872
R² Score: 0.81

These results indicate that the model captures housing price patterns reasonably well and generalizes effectively to unseen data.

What I learned from this project
• Data preprocessing and feature engineering
• Training and evaluating regression models
• Understanding error metrics such as MAE, MSE, RMSE, and R²
• Deploying machine learning models using Streamlit

Next Improvements
• Hyperparameter tuning
• Experimenting with advanced models such as XGBoost and Gradient Boosting
• Adding visualization dashboards for deeper insights

Feedback and suggestions are welcome.

#MachineLearning #DataScience #MLEngineer #Python #AIProjects #Streamlit #DataAnalytics #ArchTechnologies
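For anyone who wants to reproduce the core of this, here is a minimal sketch using scikit-learn's built-in California housing loader as a stand-in. Note that its target is in units of $100,000, so error magnitudes won't match the post's dollar figures, and the author's actual dataset and preprocessing may differ.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)  # downloads on first use
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))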
Cross-validation is an essential technique for assessing how well a model generalizes to unseen data. Relying solely on training set performance can lead to overfitting and poor real-world results. A robust cross-validation strategy provides a more reliable estimate of model performance by systematically testing on multiple data splits.

Common cross-validation approaches include:
- k-fold cross-validation – splitting data into k subsets, training on k-1 and validating on the remaining fold, repeated k times
- Stratified k-fold – preserving class distribution in each fold for classification problems
- Time-series cross-validation – using expanding or rolling windows when temporal order matters

Implementing proper cross-validation early in the workflow prevents overoptimistic performance estimates and leads to models that truly generalize. I prioritize cross-validation as a non-negotiable step before any final model selection or hyperparameter tuning.

#DataScience #MachineLearning #ModelValidation #CrossValidation #Python #AI
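A minimal scikit-learn sketch of the first two approaches (iris is just a convenient built-in dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

X, y = load_iris(return_X_y=True)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # class ratios preserved per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# For temporal data, swap in TimeSeriesSplit(n_splits=5): each fold
# trains on an expanding window and validates on the period after it.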
Most forecasting models FAIL in industrial environments.

Why? Because:
• Data is irregular
• Transactions are high-value
• Patterns are non-linear

So I built a hybrid forecasting system.

Approach:
→ SARIMA for trend & seasonality
→ XGBoost & LightGBM for residual learning
→ Feature engineering (lags, rolling stats, macro signals)
→ Implemented entirely in Python

Results:
Baseline SARIMA → 10.9% error
Hybrid model → 4.2% error
That's a ~60% reduction in error.

Key Insight:
Combining statistical models with machine learning delivers far better results than using either alone — especially on real-world business data.

Tech Stack: Python, Pandas, SARIMA, XGBoost, LightGBM

This project helped me understand how theory translates into real business impact.

#MachineLearning #DataScience #Python #AI #TimeSeries #Forecasting
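The post doesn't share code, so here is a heavily simplified sketch of the residual-learning idea on synthetic data, assuming statsmodels and xgboost — the real system's features (rolling stats, macro signals) and multi-step forecasting logic are omitted.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from xgboost import XGBRegressor

# Synthetic monthly series: trend + yearly seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(120)
y = pd.Series(10 + 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120))
train = y[:100]

# Step 1: SARIMA captures trend & seasonality
sarima = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12)).fit(disp=False)
residuals = train - sarima.fittedvalues

# Step 2: a boosted model learns whatever structure is left, from lagged residuals
lagged = pd.concat({f"lag_{k}": residuals.shift(k) for k in (1, 2, 3)}, axis=1).dropna()
booster = XGBRegressor(n_estimators=200, max_depth=3)
booster.fit(lagged, residuals.loc[lagged.index])

# Final forecast = SARIMA forecast + predicted residual correction
# (multi-step residual prediction needs recursion over its own outputs; omitted here)
base_forecast = sarima.forecast(steps=20)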