🔷 A simple train-test split is not always enough. I learned this the hard way when my model looked great on paper and struggled on real data.

📌 Here is what nobody tells you about splitting data properly.

The basic split gives you two sets: training and testing. That works for simple projects. But what if you need to tune your model? You test different settings, pick the best one, and evaluate on the test set. The problem is that you have now indirectly used the test set to make decisions. It is no longer a fair judge.

This is where a three-way split becomes important.

🔹 X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
🔹 X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

Now you have three sets.

Training set. The model learns here. 70 percent of your data.
Validation set. You tune and compare models here. 15 percent.
Test set. You evaluate the final model here. Once. Never again. 15 percent.

The test set is sacred. You look at it exactly one time, at the very end.

One more thing that most people miss: always stratify your split when your target column is imbalanced.

🔹 train_test_split(X, y, stratify=y, test_size=0.2)

stratify=y makes sure both sets have the same proportion of each class. Without it you might end up with a training set that barely sees the minority class, and a model that has no idea it exists.

The split is not a formality. It is a decision that shapes every result that follows. Get it right before you touch anything else.

❓ What split ratio do you use for your projects, and why?

#DataScience #MachineLearning #Python
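A minimal sketch putting both ideas together, a stratified three-way split (the dataset is synthetic just to make it runnable):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your real X, y (roughly 10% minority class)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# 70% train, 30% held out; stratify so class proportions survive the split
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Split the held-out 30% in half: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150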
My model hit 89% accuracy. I was proud of it.

Then I tested it on different data. It dropped to 71%. Just like that. Same model. Same code. Totally different result. I had no explanation.

The problem wasn't the model. It was how I was testing it. I was splitting my data once, 80% train, 20% test, trusting whatever number came out. My model wasn't learning real patterns. It was memorising that one specific slice of data.

Cross-validation changed how I think about this completely. Instead of trusting one number, you get five.

But here's what nobody told me early on: the standard deviation matters more than the mean.

Mean: 0.87 │ Std: 0.02 → Stable. Trust it.
Mean: 0.87 │ Std: 0.12 → Fragile. Dig deeper.

Both look identical on a single split. Cross-validation exposes the truth. A single accuracy number isn't a result. It's a guess.

I now run this before trusting any model, because a model that only works on the data you showed it isn't a model. It's just an expensive lookup table.

Have you ever confidently presented a model that later turned out to be wrong? 👇

#MachineLearning #Python #DataScience #CrossValidation #LearningInPublic
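If you want to try this, a minimal sketch (the model and data here are placeholders; any estimator works):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Mean: {scores.mean():.2f} | Std: {scores.std():.2f}")
# Low std across folds: stable. High std: fragile, dig deeper.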
Ever opened a dataset and thought… "why is this so messy?" 😅

Same here. While working with Pandas, I realized data cleaning isn't complicated — it's just a few powerful steps repeated smartly 👇

🧹 Missing values? → isna() to find them, fillna() or dropna() to handle them
🔁 Duplicate rows? → drop_duplicates() and move on
🔧 Wrong data types breaking your logic? → astype() fixes it in seconds
🧼 Messy text (extra spaces, weird formats)? → str.strip() and str.lower() clean it instantly
📊 Before trusting data? → info() and value_counts() give a quick reality check

Good analysis starts with clean data. That simple shift has already changed how I look at datasets.

Still learning, but this is one of the most useful lessons so far.

#DataAnalytics #Python #Pandas #DataCleaning #LearningJourney
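All of those steps chained on a toy frame (the columns are invented for illustration):

import pandas as pd

# The usual suspects: stray spaces, mixed case, a duplicate row, a missing value
df = pd.DataFrame({
    "name": [" Alice ", "BOB", "BOB", None],
    "age": ["25", "30", "30", "40"],
})

print(df.isna().sum())                                             # find missing values
df["name"] = df["name"].str.strip().str.lower().fillna("unknown")  # clean text, fill gaps
df["age"] = df["age"].astype(int)                                  # fix the data type
df = df.drop_duplicates()                                          # drop duplicate rows
df.info()                                                          # quick reality check
print(df["name"].value_counts())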
Data collection series · Post 07
Imputation strategies — beyond filling with the mean

Filling missing values with the mean is fast. It's also quietly wrong in most cases. Here are 4 better strategies — and exactly when to use each.

▼

Mean imputation is the default. Everyone learns it first. It's one line of code. It ships fast.

But it has a serious flaw: it collapses variance.

Replace 500 missing values with the mean — and your distribution gets an artificial spike right in the middle. Your correlations weaken. Your model learns a distorted world.

There are better options. Here's the practical guide.

---

#Python #DataScience #DataQuality #DataCleaning #Analytics #DataAnalyst #DataAnalytics #DataEngineering #ImputationStrategies
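The variance collapse is easy to demonstrate for yourself; a minimal sketch with arbitrary numbers:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Knock out half the values, then mean-impute them
s_missing = s.copy()
s_missing[rng.choice(1000, size=500, replace=False)] = np.nan
imputed = s_missing.fillna(s_missing.mean())

print(f"Original std: {s.std():.2f}")        # about 10
print(f"Imputed std:  {imputed.std():.2f}")  # visibly smaller: the artificial spike at the mean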
🐍 Data Science tip: automate variable type detection before choosing your preprocessing strategy.

One of the most overlooked steps in data preparation is correctly identifying the nature of each variable, because imputation and transformation strategies depend entirely on variable type.

Instead of guessing, you can systematically classify variables using simple Python logic:

categorical = df.select_dtypes(include=['object', 'category']).columns
numerical = df.select_dtypes(include=['int64', 'float64']).columns
ordinal = [col for col in numerical if df[col].nunique() < 10]

💡 Then adapt your preprocessing strategy accordingly:

Categorical → mode / encoding
Numerical → mean or median
Ordinal / discrete → careful handling (depends on context)

🔍 Key idea: before choosing how to impute or transform data, you must first understand what type of variable you're working with.

Good data science starts with structure, not models.

#Python #DataScience #MachineLearning #DataEngineering #Pandas
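A sketch of how that classification can feed straight into per-type imputation, here with scikit-learn's ColumnTransformer (the toy columns are invented):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "city": ["NY", "LA", None, "NY"],
    "income": [55000.0, np.nan, 61000.0, 48000.0],
})

categorical = df.select_dtypes(include=["object", "category"]).columns
numerical = df.select_dtypes(include=["int64", "float64"]).columns

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
    ("num", SimpleImputer(strategy="median"), numerical),     # median
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, encoded categorical columns + 1 numerical)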
I have used *args and **kwargs for years by copy-pasting patterns I found on Stack Overflow. Today I actually understand them.

The simple version:

*args = accept any number of positional arguments as a tuple
**kwargs = accept any number of keyword arguments as a dictionary

Why does this matter in data work? Imagine a validation function. You want it to accept any number of rules — not just 2, not just 5. Any number.

Without *args:

def validate(data, rule1, rule2, rule3):
    # what if I have 10 rules?
    pass

With *args:

def validate(data, *rules):
    for rule in rules:
        if not rule(data):
            print(f'Failed: {rule.__name__}')

Now I can call:

validate(df, check_nulls, check_schema, check_dates, check_amounts)

Any number of rules. Clean interface. One function definition.

**kwargs is for when the rules need configuration. Extend the signature to def validate(data, *rules, **config) and you can also pass:

validate(data, null_threshold=0.05, date_column='txn_date')

The insight from Corey: *args and **kwargs are not advanced Python. They are the way Python lets functions be flexible. Once you see that, they become obvious.

What patterns clicked for you only after someone explained WHY, not just HOW?

----

#Python #LearningInPublic #DataEngineering #CodingTips #PythonFunctions
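Both together in one runnable sketch (the rule implementations are hypothetical stand-ins, not a real validation library):

import pandas as pd

def check_nulls(df):
    return not df.isna().any().any()

def check_schema(df):
    return {"amount", "txn_date"} <= set(df.columns)

def validate(data, *rules, **config):
    # rules arrives as a tuple of functions, config as a dict of settings
    print(f"config: {config}")
    for rule in rules:
        status = "ok" if rule(data) else "FAILED"
        print(f"{rule.__name__}: {status}")

df = pd.DataFrame({"amount": [10.0, 25.5], "txn_date": ["2024-01-01", "2024-01-02"]})
validate(df, check_nulls, check_schema, null_threshold=0.05)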
📊 Day 12 | Choosing the Right Test & Practical Tips 🧠📊

Today, I learned how to choose the right statistical test based on the data and the problem. After exploring multiple statistical tests, I realized that the most important skill is not just knowing tests, but knowing when to use which test.

The selection depends on:
🔹 Type of data (numerical or categorical)
🔹 Number of groups (1, 2, or more)
🔹 Relationship between data (independent or dependent)

Some simple rules I learned (see the sketch below):
✔ One group vs. a fixed value → One-sample t-test
✔ Two independent groups → Two-sample t-test
✔ Same group (before/after) → Paired t-test
✔ More than two groups → ANOVA
✔ Categorical data → Chi-square test

I also learned some common mistakes:
❌ Relying only on the p-value without understanding the data
❌ Not checking assumptions like normality
❌ Misinterpreting results

To understand this better, I applied multiple tests on a dataset using Python 💻 This helped me see how different tests are used in different scenarios. Instead of guessing, we can now select the right test and make data-driven decisions 📊🚀

#Statistics #HypothesisTesting #DataScience #DataAnalytics #LearningInPublic #Python
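A minimal sketch of a few of these rules with scipy (the data is simulated, just to show the calls):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(70, 5, 30)
group_b = rng.normal(73, 5, 30)
after = group_a + rng.normal(2, 3, 30)

print(stats.shapiro(group_a).pvalue)             # check normality first
print(stats.ttest_1samp(group_a, 70).pvalue)     # one group vs. a fixed value
print(stats.ttest_ind(group_a, group_b).pvalue)  # two independent groups
print(stats.ttest_rel(group_a, after).pvalue)    # same group, before/after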
I built a complete 𝗨𝘀𝗲𝗱 𝗖𝗮𝗿 𝗣𝗿𝗶𝗰𝗲 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗼𝗿 from scratch, creating a full end-to-end pipeline that handles everything from raw data to a live application.

Instead of relying on a pre-built dataset, I identified a unique problem and built my own data source using web scraping. My goal was to move beyond tutorials and mimic a real-world 𝗱𝗮𝘁𝗮 𝘀𝗰𝗶𝗲𝗻𝗰𝗲 workflow.

• 𝗦𝗰𝗿𝗮𝗽𝗶𝗻𝗴: Automated data collection to get real-time market prices.
• 𝗣𝗿𝗲𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴: Cleaning messy web data into a machine-learning-ready format.
• 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴: Training a robust regressor to find the patterns.
• 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁: Building a Flask web app to make the model accessible to anyone.

The workflow: 𝗦𝗰𝗿𝗮𝗽𝗲 𝗗𝗮𝘁𝗮 → 𝗖𝗹𝗲𝗮𝗻 & 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺 → 𝗧𝗿𝗮𝗶𝗻 𝗠𝗼𝗱𝗲𝗹 → 𝗗𝗲𝗽𝗹𝗼𝘆

#MachineLearning #DataScience #Python #Flask #WebScraping #PortfolioProject

Check out the full documentation and code on GitHub: https://lnkd.in/gAZp4iKq
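For context, the deployment step of such a pipeline can be as small as this sketch (model.pkl and the feature layout are placeholders, not the project's actual code):

# Assumes a trained regressor saved as model.pkl; the feature order is illustrative
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [year, mileage, ...]
    price = model.predict([features])[0]
    return jsonify({"predicted_price": float(price)})

if __name__ == "__main__":
    app.run(debug=True)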
🚀 Day 12 of #M4aceLearningChallenge

Today, I dove deeper into NumPy, focusing on array indexing, slicing, and boolean masking — essential skills for efficient data manipulation.

🔍 Key Concepts Learned:

✅ Indexing in NumPy Arrays
Just like Python lists, NumPy arrays can be indexed, but with more flexibility:

import numpy as np
arr = np.array([10, 20, 30, 40])
print(arr[0])  # Output: 10

✅ Slicing Arrays
Extracting subsets of data:

print(arr[1:3])  # Output: [20 30]

✅ 2D Array Indexing

arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d[0, 1])  # Output: 2

✅ Boolean Masking (Powerful Feature 💡)
Filtering data based on conditions:

filtered = arr[arr > 20]
print(filtered)  # Output: [30 40]

🧠 What I Found Interesting:
Boolean masking makes it incredibly easy to filter datasets without writing complex loops — a huge advantage when working with large data.

💡 Real-World Relevance:
These techniques are widely used in data cleaning, data analysis, and machine learning preprocessing.

---

I'm getting more comfortable working with arrays and understanding how powerful NumPy can be in handling structured data efficiently. Looking forward to building more with this! 🚀

#M4aceLearningChallenge #DataScience #MachineLearning #Python #NumPy #LearningJourney
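One small extension worth knowing once single conditions click: masks combine with & and | (parentheses required):

import numpy as np

arr = np.array([10, 20, 30, 40])
print(arr[(arr > 10) & (arr < 40)])    # Output: [20 30]
print(arr[(arr == 10) | (arr == 40)])  # Output: [10 40]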
📊 Recently explored the 𝘆𝗱𝗮𝘁𝗮-𝗽𝗿𝗼𝗳𝗶𝗹𝗶𝗻𝗴 library for exploratory data analysis (EDA) of pandas DataFrames, and it's a game changer!

It generates a complete summary of the dataset with powerful visualizations, helping you quickly understand:

1️⃣ Dataset overview (structure, types)
2️⃣ Missing values detection
3️⃣ Distribution analysis
4️⃣ Correlation insights
5️⃣ Automatic visual reports

💡 One key takeaway: before starting any data project, it's highly valuable to review your dataset at least once with a ydata-profiling report. It saves time, highlights hidden patterns, and improves decision-making.

🚀 Turning raw data into insights becomes much more efficient!

#DataScience #EDA #Python #DataAnalysis #MachineLearning #LearningJourney
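Typical usage is only a few lines (the CSV path is a placeholder; any DataFrame works):

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("your_dataset.csv")
report = ProfileReport(df, title="EDA Report")
report.to_file("eda_report.html")  # one self-contained HTML report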