My model hit 89% accuracy. I was proud of it.

Then I tested it on different data. It dropped to 71%. Just like that. Same model. Same code. Totally different result. I had no explanation.

The problem wasn't the model. It was how I was testing it. I was splitting my data once, 80% train, 20% test, and trusting whatever number came out. My model wasn't learning real patterns. It was memorising that one specific slice of data.

Cross-validation changed how I think about this completely. Instead of trusting one number, you get five. But here's what nobody told me early on: the standard deviation matters more than the mean.

Mean: 0.87 │ Std: 0.02 → Stable. Trust it.
Mean: 0.87 │ Std: 0.12 → Fragile. Dig deeper.

Both look identical on a single split. Cross-validation exposes the truth. A single accuracy number isn't a result. It's a guess.

I now run this before trusting any model, because a model that only works on the data you showed it isn't a model. It's just an expensive lookup table.

Have you ever confidently presented a model that later turned out to be wrong? 👇

#MachineLearning #Python #DataScience #CrossValidation #LearningInPublic
Cross-validation exposes model fragility, not accuracy
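For anyone who wants to try this, here is a minimal sketch of the idea using scikit-learn's cross_val_score. The dataset and classifier are placeholders chosen for illustration, not the model from the post.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data and model; swap in your own
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation: five scores instead of one
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Scores:", scores.round(3))
print(f"Mean: {scores.mean():.2f}")
print(f"Std:  {scores.std():.2f}")   # a large std is the warning sign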
More Relevant Posts
🔷 A simple train test split is not always enough. I learned this the hard way when my model looked great on paper and struggled on real data.

📌 Here is what nobody tells you about splitting data properly.

The basic split gives you two sets: training and testing. That works for simple projects. But what if you need to tune your model? You test different settings, pick the best one, and evaluate on the test set. The problem is that you have now indirectly used the test set to make decisions. It is no longer a fair judge.

This is where a three-way split becomes important (a runnable version of this follows below):

🔹 X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
🔹 X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

Now you have three sets.

Training set: the model learns here. 70 percent of your data.
Validation set: you tune and compare models here. 15 percent.
Test set: you evaluate the final model here. Once. Never again. 15 percent.

The test set is sacred. You look at it exactly one time at the very end.

One more thing that most people miss: always stratify your split when your target column is imbalanced.

🔹 train_test_split(X, y, stratify=y, test_size=0.2)

stratify=y makes sure both sets have the same proportion of each class. Without it you might end up with a training set that barely sees the minority class and a model that has no idea it exists.

The split is not a formality. It is a decision that shapes every result that follows. Get it right before you touch anything else.

❓ What split ratio do you use for your projects and why?

#DataScience #MachineLearning #Python
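Here is the promised sketch: a runnable 70/15/15 split with stratification. The DataFrame, file name, and target column are assumptions made for the example.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with a binary "target" column
df = pd.read_csv("data.csv")
X, y = df.drop(columns="target"), df["target"]

# 70% train, 30% temporary pool, stratified so class proportions survive the split
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Split the pool in half: 15% validation, 15% test (stratify again)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))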
Ever opened a dataset and thought… “why is this so messy?” 😅

Same here. While working with Pandas, I realized data cleaning isn’t complicated — it’s just a few powerful steps repeated smartly 👇

🧹 Missing values? → isna() to find them, fillna() or dropna() to handle them
🔁 Duplicate rows? → drop_duplicates() and move on
🔧 Wrong data types breaking your logic? → astype() fixes it in seconds
🧼 Messy text (extra spaces, weird formats)? → str.strip() and str.lower() clean it instantly
📊 Before trusting data? → info() and value_counts() give a quick reality check

Good analysis starts with clean data. That simple shift has already changed how I look at datasets. Still learning, but this is one of the most useful lessons so far.

#DataAnalytics #Python #Pandas #DataCleaning #LearningJourney
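Strung together, those steps look roughly like this. The file and column names are invented for illustration.

import pandas as pd

df = pd.read_csv("orders.csv")   # hypothetical file

# Quick reality check before touching anything
df.info()
print(df["status"].value_counts())

# Missing values: find them, then fill or drop
print(df.isna().sum())
df["price"] = df["price"].fillna(df["price"].median())
df = df.dropna(subset=["customer_id"])

# Duplicate rows
df = df.drop_duplicates()

# Wrong data types
df["quantity"] = df["quantity"].astype(int)   # assumes no missing values left in this column

# Messy text
df["city"] = df["city"].str.strip().str.lower()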
80% of analysis time is data cleaning. Here's the playbook.

Nobody posts about this part. It's not glamorous. But it's where the real work happens.

This free notebook covers:
→ Identifying missing values (isnull, info, patterns)
→ Visualizing missingness — is it random or systematic?
→ Imputation strategies: mean, median, mode, forward fill
→ When to drop vs when to impute (decision framework)
→ Finding duplicates (exact and fuzzy)
→ Deduplication: keep first, keep last, custom logic
→ Validating your cleaned dataset

Real messy data. Not textbook-clean CSVs. The kind of data you'll actually encounter at work.

Free: https://lnkd.in/gBG_CBqH

Day 2/7. Yesterday was SQL. Tomorrow: Advanced Pandas.

#DataCleaning #Python #Pandas #DataAnalyst #DataScience #DataQuality #FreeResources #DataAnalytics
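The notebook itself isn't reproduced here, but the deduplication options it lists map onto pandas roughly like this (the tiny DataFrame is made up for the example):

import pandas as pd

df = pd.DataFrame({
    "email":  ["a@x.com", "a@x.com", "b@y.com"],
    "signup": ["2024-01-01", "2024-03-01", "2024-02-10"],
})

# Exact duplicates across all columns
exact = df.drop_duplicates()

# Keep the first occurrence per key...
first = df.drop_duplicates(subset="email", keep="first")

# ...or the last one
last = df.drop_duplicates(subset="email", keep="last")

# Custom logic: keep the most recent signup per email
latest = (
    df.sort_values("signup")
      .drop_duplicates(subset="email", keep="last")
)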
Real-world data is messy.

In courses, we get clean CSVs. In business, we get schema drift, missing values, and chaotic source systems.

To solve actual problems, you need a bridge between how we store data and how we use data. That bridge is where the real value lives. It’s the shift from simply "cleaning" data to engineering reliable, scalable pipelines that the business can actually trust.

Stop looking for the perfect dataset. Start building the bridge that creates it. 🏗️

#DataAnalytics #DataStrategy #DataEngineering #Python #SQL
Understanding your data is 80% of the job. 📊

Before jumping into complex models, you need to "interrogate" your dataset. If you don't understand the data from the ground up, you'll never find the real solution the business is looking for.

I believe that EDA (Exploratory Data Analysis) is where the magic happens. It’s how you bridge the gap between raw numbers and actual insights.

Here are the 5 essential functions I use to start any project: ⬇️
(Swipe right to see the toolkit!)

Which one is your favorite? 👇

#DataAnalysis #Pandas #Python #EDA #BusinessIntelligence #IndustrialEngineering
🚀 Just built an end-to-end ML model to predict Insurance Charges!

Worked on the classic insurance.csv dataset using Python, pandas, seaborn & scikit-learn.

What I did:
EDA + visualizations (age, BMI, smoker impact)
Preprocessed data (OneHotEncoder + StandardScaler)
Trained Linear Regression & Random Forest Regressor

Model Results:
Linear Regression: R² = 0.7836 | MAE = $4,181
Random Forest: R² = 0.8656 | MAE = $2,544 (Winner 🔥)

Sample Prediction (40M, BMI 28.5, 2 kids, non-smoker, northwest):
→ Linear: $8,416
→ Random Forest: $6,894

Great hands-on practice with regression pipelines! Would love your feedback 👇

Have you worked on similar projects?

#DataScience #MachineLearning #Python #ScikitLearn #Regression
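The post doesn't include the code, but a pipeline along these lines would reproduce the general setup. The column names follow the usual insurance.csv schema, and the specific preprocessing choices are assumptions, not the author's exact steps.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("insurance.csv")
X, y = df.drop(columns="charges"), df["charges"]

categorical = ["sex", "smoker", "region"]
numerical = ["age", "bmi", "children"]

# One-hot encode categoricals, scale numericals, then fit a random forest
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(f"R²: {r2_score(y_test, pred):.4f} | MAE: {mean_absolute_error(y_test, pred):,.0f}")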
So there’s this exciting concept in data called “imputation.” Okay, it’s not that exciting, I just like the name, but it’s actually pretty important.

It’s basically when you deal with missing values by filling them in using the rest of the dataset. Not in a vague “surrounding data” way, but using actual methods like mean, median, or mode, sometimes forward or backward fill, and in more serious cases even models to estimate what should be there.

The other option is to just delete the missing data. Either drop the rows or even the whole column. This is common with large datasets, especially when the missing values are small enough that removing them won’t mess with the overall analysis. But it’s not something you just do blindly, because depending on why the data is missing, you can end up biasing your results without realizing it.

So yeah, it sounds like a small step, but it actually matters.

#LearningInPublic #Python #DataCleaning #DataAnalysis #Data
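In pandas, the options described above look something like this sketch; the dataset and column names are illustrative.

import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical dataset

# Fill with a summary statistic
df["income"] = df["income"].fillna(df["income"].median())
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward / backward fill (useful for ordered or time-indexed data)
df["temperature"] = df["temperature"].ffill()
df["pressure"] = df["pressure"].bfill()

# Or delete instead: drop rows missing a key field, or a mostly-empty column
df = df.dropna(subset=["user_id"])
df = df.drop(columns=["optional_notes"])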
Pandas is about to get replaced.

Not tomorrow. But in 2 years, half of you will have switched to Polars. And the other half will be wondering why their scripts are still slow.

Polars is:
→ 5-30x faster than Pandas (on real benchmarks)
→ Memory-efficient (no more OOM errors on 10GB datasets)
→ Written in Rust (lazy evaluation, query optimization built in)
→ A cleaner, more consistent API than Pandas
→ Native support for streaming data (no chunking required)

My free notebook walks through the fundamentals:
→ Polars DataFrames — creation, inspection, indexing
→ The expressions API (the thing that makes Polars fast)
→ Filtering, selecting, sorting — the Pandas equivalents
→ group_by with expressions (way cleaner than agg)
→ Lazy evaluation — query optimizer explained
→ Side-by-side Pandas vs Polars benchmarks

If you've never heard of Polars, you're about to. Get ahead of the curve.

https://lnkd.in/gDXKkV75

Day 2/7.

#Polars #Python #DataEngineering #DataAnalytics #Pandas #Rust #DataFrames #OpenSource
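For a small taste of the expressions API and lazy evaluation, here is a sketch using made-up data and the group_by naming from recent Polars releases.

import polars as pl

df = pl.DataFrame({
    "city":  ["Lagos", "Accra", "Lagos", "Nairobi"],
    "sales": [120, 90, 300, 150],
})

# Eager: expressions instead of column-by-column mutation
summary = (
    df.filter(pl.col("sales") > 100)
      .group_by("city")
      .agg(pl.col("sales").sum().alias("total_sales"))
      .sort("total_sales", descending=True)
)
print(summary)

# Lazy: build a query plan, let the optimizer rearrange it, then collect
lazy_summary = (
    df.lazy()
      .filter(pl.col("sales") > 100)
      .group_by("city")
      .agg(pl.col("sales").mean())
      .collect()
)
print(lazy_summary)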
If You Don’t Understand This Problem, You Don’t Understand Stacks

Today I tackled a fundamental problem that looks simple at first — but really tests your understanding of logic and data structures.

💡 The Challenge:
Given a string of brackets () { } [ ], determine whether it is valid.

🧠 My Approach:
Instead of checking everything at the end, I used a stack (LIFO principle) to validate each step in real-time.
• Push opening brackets
• On closing bracket → match with the last opened one
• If mismatch occurs → invalid
• If everything matches & stack is empty → valid

🔥 Key Learning:
This problem taught me how powerful simple data structures can be when used correctly.

🐍 Python Solution 👇

📌 Consistency in solving such problems is helping me build strong problem-solving skills.

#Python #DSA #FullStack #AI #Logic #LeetCode #AIDriven
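The original solution isn't attached here, but a stack-based check following the steps described would look roughly like this:

def is_valid(s: str) -> bool:
    # Map each closing bracket to its expected opening bracket
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)          # push opening brackets
        elif ch in pairs:
            # a closing bracket must match the most recently opened one
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                  # valid only if nothing is left open

print(is_valid("({[]})"))   # True
print(is_valid("(]"))       # False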
🐍 Data Science tip: automate variable type detection before choosing your preprocessing strategy.

One of the most overlooked steps in data preparation is correctly identifying the nature of each variable, because imputation and transformation strategies depend entirely on variable type.

Instead of guessing, you can systematically classify variables using simple Python logic:

categorical = df.select_dtypes(include=['object', 'category']).columns
numerical = df.select_dtypes(include=['int64', 'float64']).columns
ordinal = [col for col in numerical if df[col].nunique() < 10]

💡 Then adapt your preprocessing strategy accordingly:
Categorical → mode / encoding
Numerical → mean or median
Ordinal / discrete → careful handling (depends on context)

🔍 Key idea: before choosing how to impute or transform data, you must first understand what type of variable you're working with.

Good data science starts with structure, not models.

#Python #DataScience #MachineLearning #DataEngineering #Pandas
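Applied to a concrete DataFrame, the classification plus a matching imputation step could look like this sketch. The cardinality threshold of 10 comes from the post; the file name and the default handling of each group are illustrative assumptions.

import pandas as pd

df = pd.read_csv("customers.csv")   # hypothetical dataset

categorical = df.select_dtypes(include=["object", "category"]).columns
numerical = df.select_dtypes(include=["int64", "float64"]).columns
ordinal = [col for col in numerical if df[col].nunique() < 10]
continuous = [col for col in numerical if col not in ordinal]

# Categorical → mode
for col in categorical:
    df[col] = df[col].fillna(df[col].mode()[0])

# Continuous numerical → median (robust to outliers)
for col in continuous:
    df[col] = df[col].fillna(df[col].median())

# Ordinal / discrete → context-dependent; a simple mode fill as a default
for col in ordinal:
    df[col] = df[col].fillna(df[col].mode()[0])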