I think it's time for a serious talk about the lack of model generalizability in ML. It's so bad that I've long heard it taken as an inevitability that model performance will tank nearly immediately after launching to production. For the record, that's not how this is supposed to go.

I'm seeing 3 main contributors to this phenomenon, and the good news is that they're all addressable:

1. Target leakage. This one is incredibly pervasive, and usually looks like including values in the feature space that are causally downstream of your target/outcome. If your accuracy looks too good to be true, draw a DAG (ideally, draw it before feature engineering) and check to make sure you're not "cheating" by training on future data that won't be available at time of inference in production.

2. Training on test. Sadly, this one is also much more common than you might think. "Fools, I would NEVER train on test," you may be saying. But consider the case with a lot of ensemble modeling approaches: you train a number of distinct ML models, and based on their performance on test, you either select the top performer or train the topmost "metalayer" of the ensemble on the basis of test performance across the various models. This is training on test, and it sets you up for a major performance dip once you introduce new or live data. Hold out a true validation set that's fully independent from the process through which you select, refine, or build your final ML model (see the sketch below).

3. The underlying ML approach is prone to overfit. This is especially common in higher-dimensional spaces with complex, nonlinear curvature and relationships, especially when the training data is relatively sparse in that space. This is a problem we've been working to address by introducing methods with better global regularization, which fit these complex spaces without overfitting.

It's a bit of tough love, but we as a community need to tighten up if we want to earn trust in ML and AI. I'd love to see a lot less talk about new tools and additions to the tech stack until we've got a firm grasp on the above and aren't constantly making excuses for models whose performance takes a nosedive immediately out of the gate.

#ml #ai #machinelearning #data #causalinference #prediction
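To make point 2 concrete, here is a minimal sketch, assuming scikit-learn and a generic tabular setup (the dataset and candidate models are illustrative): model selection happens on its own split, and a true holdout is scored exactly once, after the winner is chosen.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# Three-way split: train / selection / true holdout.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_select, X_holdout, y_select, y_holdout = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

candidates = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]
for m in candidates:
    m.fit(X_train, y_train)

# Model selection touches only the selection split...
best = max(candidates, key=lambda m: m.score(X_select, y_select))

# ...so the holdout score is an honest estimate of live performance.
print("holdout accuracy:", best.score(X_holdout, y_holdout))
```

The same discipline applies to ensemble metalayers: whatever data drives the choice or weighting of models is, functionally, training data.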
How to Address Overfitting in Machine Learning
Explore top LinkedIn content from expert professionals.
Summary
Overfitting in machine learning happens when a model performs well on its training data but struggles with new, unseen data because it has essentially "memorized" the training examples rather than learning the real patterns. Addressing overfitting means making models more reliable and better at generalizing to fresh data, which is crucial for trustworthy predictions.
- Use regularization methods: Apply techniques like L1 or L2 regularization to discourage your model from becoming too complex and fixating on the noise in your data.
- Validate and simplify: Always test your model on separate validation or test data and consider using simpler models that are less likely to latch onto random patterns.
- Curate diverse data: Instead of just adding more of the same kind of data, focus on collecting new, varied examples—especially the tricky cases your model finds challenging.
-
🚀 Deep Learning Playlist by Nitish Singh: Lectures 21–30

In Lectures 21–30, I moved deeper into improving, stabilizing, and optimizing neural networks.

🔍 Key Learnings

21: Improving Neural Network Performance
Explored the core parameters that influence model performance:
• Hidden layers & neurons
• Learning rate
• Batch size
• Activation functions
• Epochs
Also learned common challenges like insufficient data, vanishing gradients, overfitting, and slow training, and how optimization methods, transfer learning, and regularization help.

22: Early Stopping
Understood overfitting and how early stopping prevents it by monitoring validation loss. Learned how to tune "patience" and other parameters, and how tracking training vs. validation curves shows when the model begins to memorize rather than learn.

23: Normalization & Standardization
Learned why scaling inputs (like Age vs. Salary) is essential for stable learning.
• Normalization → [0, 1] range
• Standardization → mean = 0, std = 1
Applied these techniques and saw faster convergence and improved model stability.

24–25: Dropout (Theory + Practice)
Dropout = randomly turning off neurons during training to avoid overfitting. Saw its effect on:
• Regression
• Classification
Learned how the dropout rate p changes model behavior (low p → overfitting, high p → underfitting) and how CNNs/RNNs need different ratios.

26–27: Regularization (L1/L2)
Understood why overfitting happens and how L1 & L2 regularization reduce model complexity by penalizing large weights. Implemented L1/L2 and compared performance with vs. without regularization. Also explored data augmentation and simplifying the architecture.

28: Activation Functions: Dying ReLU
Studied the dying ReLU problem, where neurons permanently output zero and stop learning. Causes include:
• High learning rate
• Negative bias
Learned fixes:
• Lower the learning rate
• Add a positive bias
• Use Leaky ReLU / PReLU to keep gradients flowing

29–30: Weight Initialization (What NOT to Do → What to Do)
Covered why bad initialization causes vanishing/exploding gradients.
❌ Zero initialization
❌ Same-value initialization
❌ Very small / very large random values
Then learned the correct methods:
✔ Xavier initialization (for sigmoid/tanh)
✔ He initialization (for ReLU/Leaky ReLU)
Understanding initialization made it clear why deep networks need proper variance to train efficiently. (A short sketch of these pairings follows below.)

💡 Core Takeaways
🔹 Proper scaling, regularization, and initialization are just as important as architecture.
🔹 Overfitting can be controlled through dropout, early stopping, and L2 regularization.
🔹 Weight initialization + activation function pairing dramatically impacts training stability.
🔹 A well-tuned neural network learns faster, generalizes better, and avoids vanishing/exploding gradients.

✨ Reflection
These lectures strengthened my understanding of why neural networks behave the way they do, and how small design choices can make a big difference in performance.
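As a companion to lectures 28–30, here is a minimal Keras sketch, assuming TensorFlow 2.x, of the initializer/activation pairings described above; the layer sizes and input shape are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    # He init is the usual pairing for ReLU-family activations.
    tf.keras.layers.Dense(64, kernel_initializer="he_normal"),
    # Leaky ReLU keeps a small gradient on the negative side,
    # so neurons can't permanently "die".
    tf.keras.layers.LeakyReLU(),
    # Xavier (Glorot) init is the usual pairing for tanh/sigmoid.
    tf.keras.layers.Dense(32, activation="tanh", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

Both initializers set the variance of the starting weights so that activations neither vanish nor explode as depth grows, which is exactly the point of lectures 29–30.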
-
Exciting Advancements in Pinterest's Ad Conversion Models!

I just read this fascinating paper from Pinterest's engineering team about their work on embedding table optimization and multi-epoch training for ad conversion models. The team tackled two key challenges in deep learning recommendation systems:

1. Slow convergence of embedding tables: they developed a "Sparse Optimizer" that applies a higher learning rate specifically to embedding tables (50x larger than for other model components). This significantly improved convergence speed and model performance.

2. Multi-epoch overfitting: they observed that when training models for multiple epochs, performance would drop at epoch boundaries, especially for objectives with sparse labels. Their solution? A novel Frequency-Adaptive Learning Rate (FAL) approach that scales learning rates based on the log frequency of embedding rows, slowing down learning for infrequent IDs that are prone to overfitting. (A toy sketch of this idea follows below.)

What's particularly interesting is how they discovered that the severity of multi-epoch overfitting varies between objectives in their multitask model. For example, checkout prediction after a click (which has extremely sparse labels, at just 0.2% of click density) showed much more severe overfitting than objectives with denser labels.

They compared their FAL approach with an existing method called MEDA (Multi-Epoch learning with Data Augmentation) that re-initializes embedding tables between epochs. Both methods showed promising results, though interestingly, after several days of continual training with fresh data, the performance differences diminished.

The Pinterest team's work demonstrates how careful optimization of embedding tables can significantly improve both training efficiency and model performance in large-scale recommendation systems.
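The paper's code isn't reproduced here, but a toy, hypothetical sketch of the frequency-adaptive idea might look like the following: scale each embedding row's learning rate by the log of its ID frequency, so rare IDs (the ones prone to multi-epoch overfitting) learn more slowly. All names and constants below are illustrative, not Pinterest's.

```python
import numpy as np

def fal_scale(row_counts, max_count):
    """Per-row scale in (0, 1]: frequent rows train at full rate, rare rows slower."""
    counts = np.maximum(row_counts, 1)
    return np.log1p(counts) / np.log1p(max_count)

row_counts = np.array([1, 10, 1_000, 100_000])  # how often each embedding ID was seen
base_lr = 0.5                                   # a boosted embedding-table learning rate
per_row_lr = base_lr * fal_scale(row_counts, row_counts.max())

print(per_row_lr)  # rare IDs get a small LR; frequent IDs get ~base_lr
```

The appeal of this shape of solution is that it needs no extra state beyond an ID-frequency counter, unlike re-initializing tables between epochs.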
-
You're in a Senior Computer Vision interview at Google and the interviewer drops this scenario:

"We trained a high-capacity ResNet on 500k images, but it's still overfitting. My Product Manager wants to spend $20k to label another 500k random images scraped from the same source. Do you approve the budget?"

Don't say: "Yes! Deep learning models are data-hungry. To fix high variance, we just need to feed the beast more data."

That answer is how companies burn millions on compute with zero performance gain. The reality is that "𝘉𝘪𝘨 𝘋𝘢𝘵𝘢" is often just "𝘙𝘦𝘥𝘶𝘯𝘥𝘢𝘯𝘵 𝘋𝘢𝘵𝘢." If your model is overfitting, it has memorized the training set but fails on the validation set. Adding 500k more images from the exact same distribution (e.g., more sunny highway driving) often provides near-zero 𝐌𝐚𝐫𝐠𝐢𝐧𝐚𝐥 𝐈𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 𝐆𝐚𝐢𝐧. You aren't teaching the model new concepts; you're just reinforcing its existing biases.

The production bottleneck isn't 𝘷𝘰𝘭𝘶𝘮𝘦, it's 𝘤𝘰𝘷𝘦𝘳𝘢𝘨𝘦. It's like studying for a calculus exam by memorizing "2+2=4" a thousand times. You have "more data" but you haven't expanded your knowledge manifold.

𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: You need 𝘈𝘤𝘵𝘪𝘷𝘦 𝘓𝘦𝘢𝘳𝘯𝘪𝘯𝘨. Instead of random scraping, run inference on the unlabeled pool and only pay to label the samples where the model's confidence is low or entropy is high (see the sketch below). We don't need more data. We need 𝘏𝘢𝘳𝘥 𝘕𝘦𝘨𝘢𝘵𝘪𝘷𝘦𝘴 and edge cases that push the decision boundary.

𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝: "I would reject the budget. We don't need volume, we need variance. I'd use that budget to curate a smaller, higher-entropy dataset that targets the specific classes where the model is currently failing."

#MachineLearning #DeepLearning #ComputerVision #DataScience #AICareers #EngineeringManager
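Here is a minimal sketch of that entropy-based selection loop, assuming scikit-learn; the dataset, pool split, and 1% budget are illustrative stand-ins for a real unlabeled image pool.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5, random_state=0)
X_train, y_train, X_pool = X[:500], y[:500], X[500:]  # treat the rest as unlabeled

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predictive entropy per pool sample: high entropy = low confidence.
proba = model.predict_proba(X_pool)
entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Spend the labeling budget on the hardest samples, not random ones.
budget = int(0.01 * len(X_pool))
to_label = np.argsort(entropy)[-budget:]
print(f"selected {len(to_label)} high-entropy samples for labeling")
```

In production the same idea scales: run batch inference over the scrape, rank by uncertainty (or disagreement across an ensemble), and send only the top slice to annotators.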
-
🧵 1/ In high-dimensional bio data (transcriptomics, proteomics, metabolomics) you're almost guaranteed to find something "significant." Even when there's nothing there.

2/ Why? Because when you test 20,000 genes against a phenotype, some will look like they're associated. Purely by chance. It's math, not meaning.

3/ Here's the danger: you can build a compelling story out of noise. And no one will stop you, until it fails to replicate.

4/ As one paper put it: "Even if response and covariates are scientifically independent, some will appear correlated—just by chance." That's the trap. https://lnkd.in/ecNzUpJr

5/ High-dimensional data is a storyteller's dream. And a statistician's nightmare. So how do we guard against false discoveries? Let's break it down.

6/ Problem: spurious correlations. Cause: thousands of features, not enough samples. Fix: multiple testing correction (FDR, Bonferroni). Don't just take p < 0.05 at face value. Read my blog on understanding multiple-testing correction: https://lnkd.in/ex3S3V5g (a small simulation follows below)

7/ Problem: overfitting. Cause: the model learns noise, not signal. Fix: regularization (LASSO, Ridge, Elastic Net). Penalize complexity. Force the model to be selective. Read my blog post on regularization for scRNAseq marker selection: https://lnkd.in/ekmM2Pvm

8/ Problem: poor generalization. Cause: the model only works on your dataset. Fix: cross-validation (k-fold, bootstrapping). Train on part of the data, test on the rest. Always.

9/ Want to take it a step further? Replicate in an independent dataset. If it doesn't hold up in new data, it was probably noise.

10/ Another trick? Feature selection. Reduce dimensionality before modeling. Fewer variables = fewer false leads.

11/ Final strategy? Keep your models simple. Complexity fits noise. Simplicity generalizes.

12/ Here's your cheat sheet:
Problem: spurious signals → Fixes: FDR, Bonferroni, feature selection
Problem: overfitting → Fixes: LASSO, Ridge, cross-validation
Problem: poor generalization → Fixes: replication, simpler models

13/ Remember: the more dimensions you have, the easier it is to find a pattern that's not real. A result doesn't become truth just because it passes p < 0.05.

14/ Key takeaways: high-dimensional data creates false signals. Multiple-testing corrections aren't optional. Simpler is safer. Always validate. Replication is king.

15/ The story you tell with your data? Make sure it's grounded in reality, not randomness. Because the most dangerous lie in science... is the one told by your own data.

I hope you've found this post helpful. Follow me for more. Subscribe to my FREE newsletter chatomics to learn bioinformatics: https://lnkd.in/erw83Svn
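To see step 6 in action, here is a small simulation, assuming SciPy and statsmodels are available: 20,000 pure-noise "genes" tested against a phenotype yield roughly a thousand raw p < 0.05 hits, and Benjamini-Hochberg correction removes essentially all of them.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
phenotype = rng.normal(size=50)
genes = rng.normal(size=(20_000, 50))  # no real association, by construction

# One correlation test per gene; [1] extracts the p-value.
pvals = np.array([stats.pearsonr(g, phenotype)[1] for g in genes])
print("raw p < 0.05 'hits':", int((pvals < 0.05).sum()))  # ~1,000 false leads

# Benjamini-Hochberg FDR control at 5%.
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("BH-corrected hits:", int(reject.sum()))            # ~0, as it should be
```

Swapping `method="bonferroni"` gives the stricter family-wise correction mentioned in the same step.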
-
Too few parameters? Your marketing mix model (MMM) might underfit. Too many? It will probably hallucinate. Every marketing model lives somewhere on the bias-variance tradeoff. So how do you find the sweet spot?

A rough way to think about it: bias is your average error on in-sample data; variance is how much additional error shows up when you test the model on held-out data. The trick is that as you add more parameters, your in-sample error (bias) goes down, but your out-of-sample error (variance) starts going up. Eventually, you reach a point where adding more complexity hurts your model's performance.

Let's say you have 100 observations. Once you add 100 parameters, you can fit that dataset perfectly and have zero in-sample error. But try that same model on new data, and it will fall apart. (The sketch below shows this turning point on a toy dataset.)

Personally, I think of variance as that extra penalty: how much worse the model does out of sample compared to in-sample. And when variance grows faster than bias declines, your total error gets worse.

This is something we had to solve at Recast. Our underlying model needs enough complexity to capture true signal, but not so much that it overfits. There's no way around the bias-variance tradeoff, but there are ways to navigate it. That's why we use techniques like regularization to reduce the effective number of parameters, and cross-validation to check how we're doing. It's a fine line to walk, but you have to know where your model stands in this tradeoff.
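A minimal sketch of that turning point, assuming scikit-learn (a toy polynomial regression, not Recast's actual model): as the parameter count grows, in-sample fit keeps improving while cross-validated performance eventually collapses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

for degree in (1, 3, 10, 25):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    in_sample = model.fit(X, y).score(X, y)               # keeps rising with degree
    held_out = cross_val_score(model, X, y, cv=5).mean()  # peaks, then falls off
    print(f"degree={degree:2d}  in-sample R2={in_sample:.3f}  CV R2={held_out:.3f}")
```

The gap between the two scores is the "extra penalty" described above; the sweet spot is where the cross-validated score peaks, not where the in-sample score does.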
-
In deep learning, regularization is a technique to prevent overfitting, a bit like a student memorizing answers for a test but struggling with real-life applications. With regularization, you can make the model perform well on unseen data.

Popular regularization techniques:

1) Dropout
Imagine a basketball team where, each game, random players are benched. This way, the team doesn't over-rely on a few star players, making everyone step up. Similarly, dropout "drops" certain neurons during training, preventing the network from becoming overly dependent on specific ones.

2) L2 Regularization (Weight Decay)
Think of this like packing light for a hike. By keeping your load (or "weights") lighter, you stay more agile and adaptable. L2 regularization adds a small penalty to large weights, pushing the model toward simpler, more adaptable representations.

3) Early Stopping
Picture a runner preparing for a race: they stop training when they've reached peak fitness. Similarly, early stopping halts training when model performance stops improving, preventing overfitting and keeping the model at its best.

4) Data Augmentation
Imagine studying for an exam by practicing different types of questions. Data augmentation creates varied versions of data, like flipping or rotating images, helping models learn to recognize patterns from different angles and contexts.

All four are wired together in the sketch below. What's your go-to regularization technique? Share below!
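Here is a minimal Keras sketch, assuming TensorFlow 2.x, that combines all four techniques in one toy image classifier; the shapes and hyperparameters are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    # 4) Data augmentation: random flips/rotations, active only during training.
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.Flatten(),
    # 2) L2 weight decay penalizes large weights.
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    # 1) Dropout benches 30% of these neurons at each training step.
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# 3) Early stopping halts training once validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
# model.fit(x_train, y_train, validation_split=0.2, epochs=50, callbacks=[early_stop])
```

In practice these are complementary rather than competing: augmentation and L2 shape what the model can learn, while dropout and early stopping limit how hard it can memorize.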
-
𝐎𝐯𝐞𝐫𝐟𝐢𝐭𝐭𝐢𝐧𝐠 𝐢𝐬 𝐭𝐡𝐞 𝐬𝐢𝐥𝐞𝐧𝐭 𝐤𝐢𝐥𝐥𝐞𝐫 𝐨𝐟 𝐌𝐋 𝐦𝐨𝐝𝐞𝐥𝐬. And chances are, you've already run into it without knowing. Let's break it down simply, practically, and with zero nonsense.

𝗪𝐡𝐚𝐭 𝐞𝐱𝐚𝐜𝐭𝐥𝐲 𝐢𝐬 𝐎𝐯𝐞𝐫𝐟𝐢𝐭𝐭𝐢𝐧𝐠?
⤷ It's when your model performs well on the training data but fails on new/unseen data.
⤷ It doesn't learn patterns; it memorizes them.
⤷ It's like preparing for an exam by memorizing last year's paper word-for-word.

𝐑𝐞𝐚𝐥 𝐄𝐱𝐚𝐦𝐩𝐥𝐞:
Let's say you train a model to predict house prices. In training, it sees a house with a red door and a high price. Now it thinks red doors always mean a high price, even when that's false in real life. That's overfitting: learning noise, not truth.

𝗖𝐨𝐦𝐦𝐨𝐧 𝐒𝐢𝐠𝐧𝐬 𝐎𝐟 𝐎𝐯𝐞𝐫𝐟𝐢𝐭𝐭𝐢𝐧𝐠:
⤷ Training accuracy: 98%
⤷ Test accuracy: 65%
⤷ The gap is real. Your model fails in the real world, even though training looked perfect. (The sketch below shows how to measure this gap.)

𝗖𝐚𝐮𝐬𝐞𝐬:
⤷ Too many parameters
⤷ Not enough training data
⤷ Too many training epochs
⤷ Lack of regularization
⤷ A complex model for a simple task

𝗛𝐨𝐰 𝐓𝐨 𝐅𝐢𝐱 𝐈𝐭 𝐋𝐢𝐤𝐞 𝐀 𝐏𝐫𝐨:
⤷ 𝐒𝐢𝐦𝐩𝐥𝐢𝐟𝐲 𝐭𝐡𝐞 𝐦𝐨𝐝𝐞𝐥: use fewer layers or smaller trees.
⤷ 𝐀𝐝𝐝 𝐑𝐞𝐠𝐮𝐥𝐚𝐫𝐢𝐳𝐚𝐭𝐢𝐨𝐧: L1, L2, or dropout stop your model from getting too confident.
⤷ 𝐔𝐬𝐞 𝐄𝐚𝐫𝐥𝐲 𝐒𝐭𝐨𝐩𝐩𝐢𝐧𝐠: stop training when validation loss starts increasing.
⤷ 𝐀𝐮𝐠𝐦𝐞𝐧𝐭 𝐘𝐨𝐮𝐫 𝐃𝐚𝐭𝐚: in image models, rotate/crop images to add variety.
⤷ 𝐂𝐫𝐨𝐬𝐬-𝐯𝐚𝐥𝐢𝐝𝐚𝐭𝐞: test your model across different splits, not just one lucky test set.

𝐘𝐨𝐮𝐫 𝐌𝐢𝐬𝐬𝐢𝐨𝐧: Don't just build models that work on paper. Build models that generalize; that's what makes you a real ML engineer. No one gets hired for models that only work in a Jupyter notebook.

𝐓𝐋;𝐃𝐑:
⤷ Overfitting = memorizing the training data
⤷ Causes: overly complex models, small datasets
⤷ Fix with: regularization, early stopping, data augmentation
⤷ Goal: models that generalize, not just perform

That's a wrap!! Every day, I share posts on:
- Python 🐍
- AI/ML 🤖
- Data Science 🐼
- SW Dev 🛠
- AI Tools 🧰
- Roadmap ❗️
Find me → Arif Alam ✔️
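A minimal sketch of measuring that train/test gap, assuming scikit-learn: an unconstrained decision tree memorizes the data, while a depth-limited ("simplified") one closes the gap.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

deep = DecisionTreeClassifier(random_state=0)                   # unconstrained: memorizes
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)   # simplified model

for name, model in [("deep tree", deep), ("depth-3 tree", shallow)]:
    train_acc = model.fit(X, y).score(X, y)          # accuracy on data it has seen
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # accuracy on held-out folds
    print(f"{name}: train={train_acc:.2f}  cv={cv_acc:.2f}  gap={train_acc - cv_acc:.2f}")
```

The deep tree scores a perfect 1.00 on training data; its cross-validated score, and the resulting gap, is the overfitting signature described above.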
-
When building predictive models, overfitting is a common challenge. Shrinkage methods such as Ridge Regression, the Lasso, and the Elastic Net help address this by adding a penalty term to the objective function during training, which discourages large coefficients. The result is more robust models that generalize better to new data.

✔️ Ridge Regression shrinks coefficients by penalizing their squared values, making it a good choice when all features matter.
✔️ The Lasso forces some coefficients to zero, effectively performing feature selection; ideal when only a subset of features is important.
✔️ The Elastic Net combines the strengths of Ridge and the Lasso, balancing regularization and feature selection, which is especially useful when features are correlated.

However, there are some challenges to consider:

❌ Loss of interpretability: excessive shrinkage can make model coefficients difficult to interpret, as important predictors may have their effects reduced.
❌ Tuning required: these methods need careful tuning of hyperparameters (like λ and α) to find the right balance between bias and variance. Poor tuning can lead to either underfitting or overfitting.
❌ Not suitable for all situations: in some cases, simpler models like OLS (ordinary least squares) may perform just as well or better, especially when the sample size is large and multicollinearity isn't an issue.

🔹 In R: use the glmnet package to apply Ridge, Lasso, and Elastic Net.
🔹 In Python: use the sklearn.linear_model module for all three shrinkage methods (see the sketch below).

Want to dive deeper into these methods and learn how to apply them? Join my online course on Statistical Methods in R, where we explore these and other key techniques in more detail. More info: https://lnkd.in/d-UAgcYf

#programming #package #statisticsclass
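For the Python route, here is a minimal sketch using sklearn.linear_model; note that scikit-learn's `alpha` plays the role of λ above, its `l1_ratio` plays the role of the mixing parameter α, and the values shown are illustrative rather than tuned.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Toy data: 50 features, only 5 of which actually carry signal.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    nonzero = int((model.coef_ != 0).sum())
    print(f"{type(model).__name__:>10}: {nonzero} nonzero coefficients")

# Ridge keeps all 50 coefficients (shrunk toward zero); Lasso and Elastic Net
# zero many out, performing the feature selection described above.
```

In practice the tuning caveat matters: scikit-learn's RidgeCV, LassoCV, and ElasticNetCV select these penalties by cross-validation rather than by hand.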
-
Unlocking the Power of Dropout in Neural Networks for Superior Image Classification!

In deep learning, especially in high-dimensional data environments like image processing, models can sometimes not only learn but also memorize the training data. This phenomenon, known as overfitting, occurs when a model learns the detailed noise and random fluctuations in the training data to the extent that it negatively impacts performance on new, unseen data.

🔹 Why Does Overfitting Occur?
Neural networks are excellent pattern recognizers by nature. They adjust their internal parameters (weights) to minimize the error between their predictions and the actual outputs. During this process, if a model has excessive capacity (too many parameters relative to the number of observations) and insufficient regularization, it may start fitting the noise instead of just the signal. This is similar to memorizing the answers for an exam rather than understanding the underlying concepts.

🔸 Introducing Dropout as a Solution
Dropout is a key technique to combat overfitting. It works by randomly deactivating a subset of neurons in the network during training. This prevents neurons from co-adapting too strongly and encourages the model to develop a more robust, generalized understanding of the features. (A minimal implementation follows below.)

🔹 Demonstration through Animation
My latest animation showcases how applying dropout to hidden layers effectively reduces overfitting. It visually demonstrates the flow of data from left to right through active neurons, while inactive (dropped) neurons do not pass data.

Did this animation help clarify the concept of dropout for you? Share your feedback in the comments.

#DeepLearning #ImageClassification #AI #MachineLearning #NeuralNetworks #DataScience
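For readers who want the mechanism in code rather than animation, here is a minimal NumPy sketch of inverted dropout, the variant most frameworks implement; the shapes and drop rate are illustrative. Each neuron is kept with probability 1 - p during training, and survivors are scaled up so expected activations match inference, where dropout is switched off entirely.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero out a random subset of units during training."""
    if not training:
        return activations                        # inference: all neurons active
    keep = rng.random(activations.shape) >= p     # randomly deactivate a fraction p
    return activations * keep / (1.0 - p)         # rescale survivors to keep the mean

hidden = np.ones((2, 8))                   # toy hidden-layer activations
print(dropout(hidden, p=0.5))              # roughly half the units zeroed, rest doubled
print(dropout(hidden, training=False))     # unchanged at inference time
```

The rescaling step is why no change is needed at test time: the expected value of each activation is the same in both modes.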