Jeorge Silva’s Post

10 models, 1 loop, and a lot of learning 🚀

One of the most fascinating takeaways from my Data Science journey so far is that there's no such thing as a "silver bullet." An algorithm that shines in one scenario might fail miserably in another.

Today, I decided to automate my benchmarking process. Instead of manually testing algorithms one by one, I built a Python workflow that pre-processes the data and evaluates 10 different models at once using Cross-Validation (a sketch of the loop follows below).

💡 Key learnings from this experiment:

The power of Pipelines: It keeps the code clean and ensures pre-processing steps (like KNNImputer or MinMaxScaler) are locked to the model, preventing data leakage.

Interpretation matters: Seeing a negative score for Lasso while Random Forest hit 0.92+ gave me immediate insight into the nature of my dataset (likely highly non-linear).

Efficiency: Automating repetitive tasks frees up time for the actual analysis and tuning.

Seeing that final list of scores print out brings a huge sense of satisfaction! On to the next steps. 📈

Question for the network: Do you usually test a wide range of models in the initial phase, or do you skip straight to the heavy hitters (like XGBoost/LightGBM)? 👇

#DataScience #MachineLearning #Python #ScikitLearn #Coding #LearningJourney
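A minimal sketch of the kind of loop described above, not the author's exact code: it assumes a feature matrix X and target y are already loaded, and the three models shown stand in for the full list of 10.

from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

models = {
    "Lasso": Lasso(),
    "RandomForest": RandomForestRegressor(random_state=42),
    "KNN": KNeighborsRegressor(),
}

for name, model in models.items():
    # The Pipeline ties imputation and scaling to the model, so each CV fold
    # fits the pre-processing on its own training split only -- no data leakage.
    pipe = Pipeline([
        ("imputer", KNNImputer()),
        ("scaler", MinMaxScaler()),
        ("model", model),
    ])
    # Default scoring for regressors is R^2, which can go negative for a
    # poorly fitting model -- consistent with the Lasso result in the post.
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")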


The pipeline is solid; however, I'd suggest removing Ridge and Lasso, as they are regressors being used in a classification setting (they likely tried to regress on the class labels 0/1).
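If the task really is classification, a hedged sketch of the suggested swap (standard scikit-learn names; the surrounding setup is assumed):

from sklearn.linear_model import RidgeClassifier, LogisticRegression

classifiers = {
    # RidgeClassifier is the L2-regularised classification counterpart of Ridge.
    "Ridge (clf)": RidgeClassifier(),
    # L1-penalised logistic regression plays the role of Lasso for classification;
    # the 'liblinear' solver supports the L1 penalty.
    "Lasso (clf)": LogisticRegression(penalty="l1", solver="liblinear"),
}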

You could do this with AutoGluon in "one line of code" :-)
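For reference, that AutoGluon one-liner would look roughly like this (the label column name "target" and the train/test DataFrames are assumptions):

from autogluon.tabular import TabularPredictor

# fit() trains and compares a zoo of models automatically on a pandas DataFrame
predictor = TabularPredictor(label="target").fit(train_data)
# leaderboard() prints per-model scores, much like the manual loop above
print(predictor.leaderboard(test_data))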
