Pipelines in ML really change how you build models. So I rebuilt my Customer Churn Prediction project, this time using a proper ML pipeline. 🔧

What I improved:
• Built an end-to-end Pipeline using ColumnTransformer
• Switched from a single train-test split to 5-fold cross-validation
• Removed an unnecessary feature-selection step (Chi-Square)
• Handled class imbalance using the F1 score and class_weight
• Tuned models such as Random Forest and XGBoost

📊 Key Results (F1 Score):
• Logistic Regression → ~0.62
• Decision Tree → ~0.60
• Random Forest → ⭐ ~0.63
• XGBoost → ~0.56

💡 Key Learnings:
• My earlier results were slightly optimistic because they relied on a single train-test split
• Cross-validation gave me more honest and stable performance estimates
• Random Forest performed best → suggesting non-linear patterns in the data
• Logistic Regression performed almost as well → the dataset isn't highly complex
• XGBoost underperformed → showing that advanced models need proper tuning

Check out both versions of the code here: https://lnkd.in/gV5Cb5iC

This project helped me move from "just building models" to actually understanding how ML systems should be structured and evaluated in practice. Would love to hear your feedback or suggestions!

#MachineLearning #DataScience #Python #ScikitLearn #XGBoost #Analytics #LearningJourney
Customer Churn Prediction with ML Pipeline Improvements
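Not the author's actual code, but a minimal sketch of the setup the post describes, assuming a CSV with a binary "churn" target; all file and column names here are placeholders:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("churn.csv")  # hypothetical file name
    X = df.drop(columns=["churn"])
    y = df["churn"]

    # Hypothetical column names, for illustration only
    numeric_cols = ["tenure", "monthly_charges"]
    categorical_cols = ["contract_type", "payment_method"]

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

    # class_weight="balanced" compensates for the minority churn class
    pipe = Pipeline([
        ("prep", preprocess),
        ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])

    # 5-fold CV scored on F1, mirroring the evaluation in the post
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(scores.mean())

Keeping the preprocessing inside the pipeline is what makes the cross-validation honest: each fold fits the scaler and encoder on its own training split, so no information leaks from the validation folds.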
More Relevant Posts
Excited to share my latest Machine Learning Project!

Food Delivery Time Prediction Using XGBoost Regressor with Hyperparameter Tuning

Built an end-to-end ML regression model to predict food delivery time (in minutes) from real-world factors like distance, traffic, weather, ratings and vehicle type, using 1,000 records.

What I did:
- Performed feature engineering: created 4 new meaningful features
- Applied One-Hot Encoding (no data leakage)
- Built a baseline XGBoost Regressor (gradient boosting)
- Tuned n_estimators, max_depth, learning_rate, subsample & colsample_bytree
- Evaluated using MAE, MSE, RMSE and R² Score
- Built a prediction report showing actual vs. predicted values

Results:
Before tuning → R²: 93.89% | MAE: 3.770 | RMSE: 4.751
After tuning → R²: 95.39% | MAE: 3.482 | RMSE: 4.129

Best parameters found:
n_estimators: 200 | max_depth: 3 | learning_rate: 0.1
subsample: 0.8 | colsample_bytree: 0.8

Regression model comparison:
Random Forest → R²: 94.08%
AdaBoost → R²: 94.64%
XGBoost → R²: 95.39% (best!)

Key learnings:
→ XGBoost's gradient boosting outperformed the other regression models here
→ colsample_bytree=0.8 & subsample=0.8 help reduce overfitting
→ R² improved by 1.50 percentage points after tuning
→ All 4 metrics improved after hyperparameter tuning
→ Average prediction error of only 3.48 minutes
→ Feature engineering significantly boosted model performance

Tools: Python | XGBoost | Pandas | Scikit-learn | NumPy

Grateful for the guidance from Abhishek Jivrakh Sir during this project.

GitHub Repository: https://lnkd.in/g-eRVMaR

#MachineLearning #DataScience #Python #XGBoost #GradientBoosting #Regression #FeatureEngineering #FoodDelivery #MLProject #ScikitLearn #AI #HyperparameterTuning
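The post doesn't show the tuning code, but one common way to search the listed parameters looks roughly like this; the grid values are illustrative, and X_train/y_train are assumed to hold the encoded features and delivery times:

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBRegressor

    param_grid = {
        "n_estimators": [100, 200],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.05, 0.1],
        "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8, 1.0],
    }

    # Exhaustive search over the grid with 5-fold CV, scored on MAE
    search = GridSearchCV(XGBRegressor(random_state=42), param_grid,
                          cv=5, scoring="neg_mean_absolute_error")
    search.fit(X_train, y_train)
    print(search.best_params_)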
𝐎𝐮𝐭𝐥𝐢𝐞𝐫𝐬 -- 𝐎𝐧𝐞 𝐩𝐫𝐨𝐛𝐥𝐞𝐦 𝐈 𝐤𝐞𝐞𝐩 𝐟𝐚𝐜𝐢𝐧𝐠 𝐰𝐡𝐢𝐥𝐞 𝐛𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐦𝐨𝐝𝐞𝐥𝐬…

While working on a recent dataset before model building, I ran into a common issue: outliers.

We all know: "Outliers are unusual data points that behave very differently from the rest of the data." But what I realized in practice is that outliers are not always "bad".

𝐖𝐡𝐞𝐫𝐞 𝐨𝐮𝐭𝐥𝐢𝐞𝐫𝐬 𝐜𝐫𝐞𝐚𝐭𝐞 𝐩𝐫𝐨𝐛𝐥𝐞𝐦𝐬
Some ML algorithms are sensitive to outliers:
1. Linear Regression
2. Logistic Regression
3. AdaBoost
4. Deep learning models
These models can get biased because a few extreme values pull the learning in the wrong direction.

𝐁𝐮𝐭 𝐬𝐨𝐦𝐞𝐭𝐢𝐦𝐞𝐬 𝐰𝐞 𝐍𝐄𝐄𝐃 𝐨𝐮𝐭𝐥𝐢𝐞𝐫𝐬
Example: fraud detection. Fraud transactions are the outliers, so removing them means removing the actual problem. The decision depends on business context, not just the data.

𝐇𝐨𝐰 𝐈 𝐡𝐚𝐧𝐝𝐥𝐞𝐝 𝐨𝐮𝐭𝐥𝐢𝐞𝐫𝐬 𝐢𝐧 𝐦𝐲 𝐰𝐨𝐫𝐤𝐟𝐥𝐨𝐰
There are mainly two approaches (see the sketch below):
1. Trimming (removing outliers) --> completely removing extreme values
2. Capping (winsorization) --> limiting values to a threshold instead of removing them

The detection method depends on the distribution:
1. 𝐍𝐨𝐫𝐦𝐚𝐥 𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧 --> 𝐙-𝐒𝐜𝐨𝐫𝐞 rule: Mean ± 3 × Standard Deviation
2. 𝐒𝐤𝐞𝐰𝐞𝐝 𝐃𝐚𝐭𝐚 --> 𝐈𝐐𝐑 𝐌𝐞𝐭𝐡𝐨𝐝

Outliers are not just noise. They can be signal, depending on the problem.

#datascience #machinelearning #modelbuilding #outlier #python #Statistics #dataanalyst
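A small pandas sketch of both approaches, using a hypothetical DataFrame df with a numeric "amount" column:

    import pandas as pd

    s = df["amount"]  # hypothetical numeric column

    # Z-score rule for roughly normal data: trim rows outside mean ± 3 std
    mu, sigma = s.mean(), s.std()
    trimmed = df[(s >= mu - 3 * sigma) & (s <= mu + 3 * sigma)]

    # IQR rule for skewed data: cap values at the whisker bounds (winsorization)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df["amount_capped"] = s.clip(lower, upper)

Trimming shrinks the dataset; capping keeps every row but bounds the extremes, which is often safer when each row carries other useful signal.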
Excited to share my latest Machine Learning Project!

Real Estate Price Prediction Using Decision Tree Regressor with Hyperparameter Tuning

Built an end-to-end ML regression model to predict house prices (in USD) based on features like size, bedrooms, age, location, type, condition and furnishing, using a real estate dataset of 750 records.

What I did:
- Performed One-Hot Encoding before the train-test split (no data leakage)
- Built a baseline Decision Tree Regressor
- Applied GridSearchCV with an expanded parameter grid
- Used max_features tuning for better generalization
- Evaluated using MAE, MSE, RMSE and R² Score

Results:
Before tuning → R²: 97.04% | MAE: 28,319 | RMSE: 35,556
After tuning → R²: 97.45% | MAE: 26,279 | RMSE: 32,988

Best parameters found:
Criterion: friedman_mse | Max Depth: None | Max Features: None
Min Samples Leaf: 3 | Min Samples Split: 2

Key learnings:
→ A correct ML pipeline prevents data leakage
→ Expanding the parameter grid improves tuning results
→ R² above 97% indicates a very strong fit on this dataset
→ All 4 metrics improved after hyperparameter tuning
→ A Decision Tree Regressor can match more complex models on clean data

This project gave me deep hands-on experience in regression modeling, feature encoding and hyperparameter optimization!

Tools used: Python | Pandas | Scikit-learn | NumPy

Grateful for the guidance from Abhishek Jivrakh Sir during this project.

GitHub repository: https://lnkd.in/gsAPDMrW

#MachineLearning #DataScience #Python #DecisionTree #Regression #RealEstate #GridSearchCV #HyperparameterTuning #MLProject #ScikitLearn #AI #DataAnalysis
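A rough sketch of the GridSearchCV step described above; the grid covers the parameters named in the post, but the exact value ranges are my assumption:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    param_grid = {
        "criterion": ["squared_error", "friedman_mse"],
        "max_depth": [None, 5, 10],
        "max_features": [None, "sqrt"],
        "min_samples_leaf": [1, 3, 5],
        "min_samples_split": [2, 5],
    }

    # X_train is assumed to be one-hot encoded before the split
    search = GridSearchCV(DecisionTreeRegressor(random_state=42),
                          param_grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)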
#Hello_Connection 🚀

Just completed my Heart Stroke Prediction Web App using Machine Learning! I built this project to understand the complete ML pipeline, from data analysis to deployment.

🔍 What I did in this project:
- Performed EDA (Exploratory Data Analysis)
- Cleaned and preprocessed the dataset
- Trained multiple models: Logistic Regression, KNN, Decision Tree, SVM, Naive Bayes
- Selected Logistic Regression as the final model based on the best accuracy
- Deployed the model using Streamlit

💡 The app takes user health inputs like age, cholesterol and blood pressure, and predicts whether the person is at:
✔️ Low Risk
❌ High Risk of heart disease

🛠️ Tech used: Python | Pandas | Scikit-learn | Streamlit | Joblib

This project really helped me understand how ML models are used in real-world applications.

📌 GitHub project: https://lnkd.in/g5ZfpgYA

I'd love to hear your feedback and suggestions! You can check out the app here: 👇
https://lnkd.in/gwZW_AT4

#MachineLearning #DataScience #Python #Streamlit #AI #Projects #LearningJourney
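A minimal Streamlit sketch of this kind of app, assuming a model saved with Joblib; the feature names, value ranges and file name are placeholders, not the project's actual inputs:

    import joblib
    import streamlit as st

    model = joblib.load("heart_model.pkl")  # hypothetical saved pipeline

    # Collect user health inputs from the browser
    age = st.number_input("Age", 1, 120, 45)
    chol = st.number_input("Cholesterol (mg/dL)", 100, 600, 200)
    bp = st.number_input("Resting blood pressure", 80, 220, 120)

    if st.button("Predict"):
        # Feed the inputs to the model in the order it was trained on
        pred = model.predict([[age, chol, bp]])[0]
        st.write("❌ High Risk" if pred == 1 else "✔️ Low Risk")

Saved as app.py, this runs locally with: streamlit run app.py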
MODEL TUNING — PRACTICAL CHEAT SHEET (STEP-BY-STEP)

1. Start with a Baseline Model
Train with default settings. Goal: get a number to beat.
    model = RandomForestClassifier()
    model.fit(X_train, y_train)

2. Evaluate Performance
Check your current level.
    accuracy_score(y_test, model.predict(X_test))

3. Focus on Important Parameters
Do not tune everything. For RandomForest:
- n_estimators
- max_depth
- min_samples_split
Rule: tune only 2-3 parameters.

4. Tune One Parameter at a Time
    for n in [50, 100, 200]:
        model = RandomForestClassifier(n_estimators=n)
        model.fit(X_train, y_train)
Pick the value that gives the best result.

5. Use Grid Search (Thorough)
    GridSearchCV(model, params, cv=5)
Use when the dataset is small or medium.

6. Use Random Search (Faster)
    RandomizedSearchCV(model, params, n_iter=10)
Use when the dataset is large.

7. Check Overfitting
If train score >> test score, fix by:
- Reducing max_depth
- Increasing min_samples_split
- Using cross-validation
Goal: train score ≈ test score.

8. Use Cross Validation
    cross_val_score(model, X, y, cv=5)
Use 5-fold or 10-fold.

9. Check Feature Importance
    model.feature_importances_
Remove low-importance features.

10. Save the Best Model
    joblib.dump(model, "best_model.pkl")

QUICK CHECKLIST
- Start simple
- Tune 2-3 parameters only
- Use cross validation
- Compare before vs. after tuning
- Stop if the improvement is less than 1 percent
- Data quality is more important than tuning

COMMON METRICS
Classification: Accuracy, Precision, Recall, F1 Score
Regression: MAE, RMSE, R²

GOLDEN RULE
Try -> Measure -> Adjust -> Repeat
Small changes lead to better results.

#MachineLearning #ModelTuning #HyperparameterTuning #DataScience #Python #ScikitLearn #MLOps #AI #DataEngineering #Analytics #LearnML #MLBasics #TechCareers #Coding #AIProjects
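Pulling the fragments above into one self-contained script, with make_classification standing in for real data:

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Steps 1-2: baseline to beat
    baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    print("baseline:", accuracy_score(y_test, baseline.predict(X_test)))

    # Steps 3-5 & 8: tune only a few important parameters with 5-fold CV
    params = {"n_estimators": [50, 100, 200], "max_depth": [None, 10]}
    search = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5)
    search.fit(X_train, y_train)
    print("tuned:", accuracy_score(y_test, search.predict(X_test)))

    # Step 10: save the best model found
    joblib.dump(search.best_estimator_, "best_model.pkl")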
Excited to share my latest Machine Learning Project!

Income Prediction Using XGBoost Classifier with Hyperparameter Tuning

Built an end-to-end ML classification model to predict whether an individual earns more or less than $50K/year, using the Adult Census Dataset (32,561 records).

What I did:
- Handled missing values encoded as ' ?' in the dataset
- Applied One-Hot Encoding with drop_first=True (no multicollinearity)
- Built a baseline XGBoost Classifier (gradient boosting)
- Tuned n_estimators, max_depth, learning_rate, subsample & colsample_bytree
- Evaluated using Accuracy, Precision, Recall, F1 & a Classification Report

Results:
Before tuning → Accuracy: 87.25% | Precision: 0.777 | F1: 0.705
After tuning → Accuracy: 87.67% | Precision: 0.796 | F1: 0.711

Best parameters found:
n_estimators: 200 | max_depth: 5 | learning_rate: 0.1
subsample: 1.0 | colsample_bytree: 0.8

Model comparison across my portfolio:
Decision Tree → ~85.00%
Random Forest → ~86.17%
AdaBoost → ~86.40%
XGBoost → 87.67% (best!)

Key learnings:
→ XGBoost uses gradient boosting, one of the most powerful boosting approaches
→ colsample_bytree=0.8 reduces overfitting by using 80% of the features per tree
→ XGBoost outperformed all 4 ensemble models in my portfolio
→ Weighted-average F1 score of 0.87, the highest in my portfolio
→ GridSearchCV with 5-fold CV found the best combination

Tools: Python | XGBoost | Pandas | Scikit-learn | NumPy

Grateful for the guidance from Abhishek Jivrakh Sir during this project.

GitHub Repository: https://lnkd.in/gj6EpCnc

#MachineLearning #DataScience #Python #XGBoost #GradientBoosting #EnsembleLearning #MLProject #ScikitLearn #AI #HyperparameterTuning #Classification #DataAnalysis
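A sketch of the preprocessing and final model described above, assuming a local CSV of the Adult Census data; the file name, column name, and label string are placeholders that may differ from the actual dataset export:

    import pandas as pd
    from xgboost import XGBClassifier

    df = pd.read_csv("adult.csv")            # hypothetical file name
    df = df.replace(" ?", pd.NA).dropna()    # missing values are encoded as ' ?'

    # One-hot encode with drop_first=True to avoid multicollinearity
    X = pd.get_dummies(df.drop(columns=["income"]), drop_first=True, dtype=int)
    y = (df["income"] == " >50K").astype(int)  # label string may vary by file version

    # Best parameters reported in the post
    model = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1,
                          subsample=1.0, colsample_bytree=0.8)
    model.fit(X, y)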
Logistic Regression: From Lines to Logic! 📊

Have you ever wondered how machines make "Yes" or "No" decisions? Whether it's spotting spam emails or predicting if a customer will subscribe, Logistic Regression is the go-to tool! 🛠️

Here is a simple 3-step breakdown of how it works:

1️⃣ Linear Prediction: We start with a basic line (y = mx + b). But since a line can output any value out to infinity, it doesn't give us a clear "yes/no" answer.

2️⃣ The Sigmoid "Magic": We pass that line through the Sigmoid Function. This acts like a "squasher," taking any number and squeezing it between 0 and 1. 🔄

3️⃣ Binary Output: Now we have a probability! 📈 Above 0.5? It's a 1 (Yes!). Below 0.5? It's a 0 (No!).

It's simple, powerful, and the foundation of many classification tasks in Data Science. 💡

What's your favorite classification algorithm? Let's discuss below! 👇

#DataScience #MachineLearning #Python #LogisticRegression #AI #LearningJourney #DataAnalytics
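Those three steps fit in a few lines of NumPy; the slope and intercept here are made-up numbers for illustration:

    import numpy as np

    def sigmoid(z):
        # Squashes any real number into the range (0, 1)
        return 1 / (1 + np.exp(-z))

    m, b = 0.8, -2.0          # illustrative slope and intercept
    x = 3.5                   # one input feature
    z = m * x + b             # step 1: linear prediction (can be any real number)
    p = sigmoid(z)            # step 2: squash to a probability
    label = int(p >= 0.5)     # step 3: threshold at 0.5 for yes/no
    print(p, label)           # ~0.69 -> 1 (Yes!)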
🚀 Car Price Prediction project using ML – Part 2: Model Building & Evaluation

Continuing from yesterday's update on Data Cleaning & ETL, today I focused on the next critical phase of the ML pipeline 👇

🔹 Data Splitting
Divided the dataset into training and testing sets to ensure unbiased model evaluation.

🔹 Model Training
Experimented with multiple algorithms:
• Linear Regression
• Random Forest Regressor
• XGBoost Regressor

🔹 Hyperparameter Tuning
Applied GridSearchCV to optimize model performance and find the best parameters.

🔹 Results & Insights
After comparing all models, Linear Regression performed the best for this dataset: simple, interpretable, and effective.

💡 Key takeaway: Sometimes simpler models outperform complex ones, depending on the data.

Looking forward to taking this further with model evaluation metrics and deployment 🚀

#MachineLearning #DataScience #MLProjects #Python #AI #LearningInPublic
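A sketch of how such a three-model comparison might look with 5-fold cross-validation; X and y are assumed to be the cleaned car features and prices from Part 1:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(random_state=42),
        "XGBoost": XGBRegressor(random_state=42),
    }

    # Compare mean R² across folds rather than a single split
    for name, model in models.items():
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(f"{name}: {r2:.3f}")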
Stock Price Prediction Using SVM | Machine Learning Project 📈

I'm excited to share my latest project, where I built a stock price prediction model using Python and Scikit-Learn! Stock markets are notoriously volatile, making them a perfect challenge for Data Science. In this project, I leveraged Support Vector Regression (SVR) to analyze and predict price movements.

Key technical highlights:
• Feature Engineering: used Pandas for date indexing and created lagged price values to capture time-series trends.
• Model Optimization: implemented GridSearchCV to fine-tune hyperparameters (C, gamma, and the kernel), significantly boosting the model's accuracy.
• Data Scaling: applied StandardScaler to normalize input features for better SVR performance.
• Visualization: used Matplotlib to plot actual vs. predicted prices, making the results easy to interpret.

Results: The tuned SVR model captured the market trends with a low RMSE, demonstrating the effectiveness of SVMs in financial forecasting.

Check out the video below to see the full workflow and results! 🎥👇

#MachineLearning #DataScience #Python #SVM #StockMarket #AI #PredictiveAnalytics #ScikitLearn
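A rough sketch of the lagged-feature and scaled-SVR setup described above; the file and column names are placeholders, not the project's actual data:

    import pandas as pd
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    # Hypothetical price series indexed by date
    prices = pd.read_csv("prices.csv", index_col="Date", parse_dates=True)["Close"]

    # Lagged values as features: predict today's close from the last 3 closes
    df = pd.DataFrame({f"lag_{i}": prices.shift(i) for i in range(1, 4)})
    df["target"] = prices
    df = df.dropna()
    X, y = df.drop(columns=["target"]), df["target"]

    # Scaling lives inside the pipeline so CV folds don't leak statistics
    pipe = make_pipeline(StandardScaler(), SVR())
    grid = {"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.01],
            "svr__kernel": ["rbf", "linear"]}
    # Note: plain KFold here for simplicity; TimeSeriesSplit would avoid
    # look-ahead bias in real time-series work
    search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
    print(search.best_params_)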
Exciting Update: My First MLflow Experiment on Credit Card Fraud Detection!

I've just completed my first end-to-end machine learning experiment using MLflow for a Credit Card Fraud Detection project. Here's what I did:

I compared 4 different models:
- Decision Tree
- Random Forest
- XGBoost
- LightGBM

I used metrics like accuracy, precision, recall, F1 score, and Cohen's Kappa to evaluate their performance on a real-world dataset.

Key Takeaways:
- MLflow helped me track experiments, log parameters and metrics, and easily compare models.
- The models had varied performances, and the choice of evaluation metric (especially recall for fraud detection) made a big difference in model selection.
- I used techniques like class weighting and scale_pos_weight to handle the imbalanced dataset effectively.

Tech Stack:
- Python
- MLflow for experiment tracking
- Scikit-learn, XGBoost, and LightGBM for model building
- Pandas, NumPy for data manipulation
- Matplotlib, Seaborn for data visualization

If you're working on any ML project, I highly recommend giving MLflow a try; it's a great tool for managing experiments and improving reproducibility.

#MachineLearning #MLflow #DataScience #CreditCardFraudDetection #AI #Python #ScikitLearn #XGBoost #LightGBM #ModelEvaluation #MLOps #DataScienceCommunity #MachineLearningExperiment
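A minimal sketch of what logging one such run with MLflow can look like; the model, parameters, and split variables are illustrative, not the project's actual configuration:

    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, recall_score

    # X_train, X_test, y_train, y_test assumed prepared elsewhere;
    # class_weight="balanced" addresses the fraud class imbalance
    with mlflow.start_run(run_name="random_forest"):
        params = {"n_estimators": 200, "class_weight": "balanced"}
        model = RandomForestClassifier(**params).fit(X_train, y_train)
        preds = model.predict(X_test)

        # Log everything needed to compare this run against the others
        mlflow.log_params(params)
        mlflow.log_metric("recall", recall_score(y_test, preds))
        mlflow.log_metric("f1", f1_score(y_test, preds))
        mlflow.sklearn.log_model(model, "model")

Repeating this block per model (one run each) lets you open `mlflow ui` in a browser and compare all the runs side by side.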