Model Adequacy Checking - Project

OBJECTIVE
The goal of this project is to compare the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the coefficient of determination (r2) and determine which is the most suitable measure for model adequacy checking.

BACKGROUND OF THE STUDY
Some data scientists rely on MSE and MAE alone for model adequacy checking, which is misleading: these metrics only report the deviation of the estimated values from their true values, in squared and absolute terms respectively. The closer they are to zero the smaller the error, but they depend on the scale of the data, so there is no single value or range that can serve as an objective yardstick for a decision. The coefficient of determination (r2-score), by contrast, measures the proportion of variation in the target variable explained by the feature variables. If it is greater than or equal to 80%, the model can be considered an adequate fit and r2 can serve as a goodness-of-fit measure for the dataset under study.

DATASET INFORMATION
Rows (observations): 200
Columns (features): 3
Target variable: Sales
Features: TV, Radio & Newspaper

STEPS PERFORMED
Data cleaning
Model trained: Linear Regression
Dataset plot: Radio ad spend vs Sales
Adequacy evaluation: MSE = 27.6, MAE = 4.6, r2-score = 0.107 (10.7%)

CONCLUSION
The scatter plot shows a weak relationship between radio advertisement spending and sales, and the r2-score strongly supports this: only 10.7% of the variation in sales is explained by radio ad spending.

I'm happy to share this Model Adequacy Checking project I worked on. Check it out here: https://lnkd.in/d4eVzhiv

#DataScience #Utiva #Python #MachineLearning
Model Adequacy Checking with MAE, MSE, and r2-score
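As a rough illustration of the project's adequacy check (not the original code), the three metrics could be computed with scikit-learn along these lines; the file name advertising.csv and the 80/20 split are placeholder assumptions:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder file name; the data set described above has 200 rows with TV, Radio, Newspaper and Sales
df = pd.read_csv("advertising.csv")

X = df[["Radio"]]   # single feature, matching the Radio ad spend vs Sales plot
y = df["Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("r2 :", r2_score(y_test, y_pred))   # bounded scale, so it can be judged against a threshold such as 0.8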
More Relevant Posts
Common Flaw in Principal Component Analysis - PCA ‼️

How many times do we need to remind people that PCA plots without the percentage of variance explained by each axis are essentially meaningless? The % of variance explained by each axis defines how much of the data the plot actually represents. Always report it for each axis. I have mentioned this multiple times, yet it continues to be disregarded. If someone does not grasp how PCA works, they should refrain from posting on it.

‼️ Other Common Flaws ‼️
💥 Thinking that PCA is always relevant
💥 Applying PCA to categorical or count data
💥 Overinterpreting separation in PCA plots
💥 Ignoring outliers
💥 Using PCA on variables that are not linearly related
💥 Interpreting loadings incorrectly
💥 Forgetting that PCA is unsupervised
💥 Interpreting too few or too many PCs
💥 Incorrect interpretation of biplots
💥 I will stop here for the moment, it is Sunday after all...

If you are interested, I can expand on these points in a follow-up post. Let me know.

#understanding #PCA #plots #statistics
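As a rough sketch of the main point (not the author's code), the % of variance per axis can be read off scikit-learn's explained_variance_ratio_ and written into the axis labels; the random matrix X below is a placeholder for real numeric data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # stand-in for your numeric data matrix

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)

pc1, pc2 = pca.explained_variance_ratio_ * 100   # % of variance explained by each axis

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel(f"PC1 ({pc1:.1f}% of variance)")
plt.ylabel(f"PC2 ({pc2:.1f}% of variance)")
plt.title("PCA score plot with % variance reported on each axis")
plt.show()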
PCA Loading Plots are powerful tools for visualizing how each variable contributes to the principal components in a data set. Principal Component Analysis (PCA) simplifies complex data by transforming it into key patterns and relationships. Loading plots help identify key variables and understand their relationships, enhancing data interpretation and analysis.

✔️ Loading Plots Explained:
Loading plots are an essential tool within PCA that help you understand the contributions of each original variable to the principal components. They allow you to see which variables are most influential in your data set, providing insights into the data's structure and relationships. This can be particularly useful for feature selection, data interpretation, and improving model performance. A Loading Plot displays the variables as vectors (arrows) within the principal component space. The direction of each arrow indicates the influence of the variable on the principal components, while the length of the arrow represents the strength of its contribution. By examining these plots, you can quickly identify which variables are driving the patterns in your data.

✔️ Interpreting Loading Plots:
When interpreting loading plots, pay attention to the direction and length of the arrows. Arrows pointing in the same direction indicate a positive correlation, whereas arrows pointing in opposite directions indicate a negative correlation. Longer arrows signify variables with a stronger influence on the principal components. This visualization helps in understanding the role of each variable and their interrelationships.

I have created several tutorials on PCA loading plots in collaboration with Paula Villasante Soriano and Cansu Kebabci:
Loading Plot explained: https://lnkd.in/esR5YGH7
What are Loadings: https://lnkd.in/eMUHYcKj
Loading Plot in R: https://lnkd.in/eD3Kg8CX
Loading Plot in R (Video ft. Albert Rapp): https://lnkd.in/e6XxhNsH
Loading Plot in Python: https://lnkd.in/eG9BEyD4

I have also created an extensive online course on PCA, covering both theoretical concepts and practical applications in R programming. See this link for additional information: https://lnkd.in/eUnAqErz

#python #datavisualization #datascientists
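The linked tutorials show the full workflow in R and Python; as a rough sketch only (not the tutorial code), a loading plot can be drawn in Python along these lines, with random placeholder data and the column names var1-var4 invented for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(80, 4)), columns=["var1", "var2", "var3", "var4"])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
# Scale component coefficients by the square root of the explained variance to obtain loadings
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig, ax = plt.subplots()
for name, (x, y) in zip(df.columns, loadings):
    ax.arrow(0, 0, x, y, head_width=0.03, length_includes_head=True)
    ax.text(x * 1.1, y * 1.1, name)
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
ax.set_xlabel("PC1 loading")
ax.set_ylabel("PC2 loading")
ax.set_title("Loading plot: arrow direction = sign, arrow length = strength of contribution")
plt.show()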
Now, I just trained a Logistic Regression model on the Iris dataset! Quite Interesting 🤔

Steps I followed:
• Loaded & explored data
• Split into train/test sets
• Scaled features
• Trained & predicted

1. Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

2. Load dataset
iris = load_iris()

3. Create combined DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target_class"] = iris.target

4. Split input (X) and output (y)
X = df[iris.feature_names]
y = df["target_class"]

5. Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

6. Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

7. Train logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

8. Predict and check accuracy
y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {acc * 100:.2f}%")

Logistic Regression predicts probabilities for categories, not logistics!

#MachineLearning #DataScience #Python #LogisticRegression #IrisDataset #LMES #UPTOR #MohanSivaraman
When building regression models, watch out for significant predictors! 🚨 Sometimes, variables that seem important might lose their significance when the model gets better. Here's why you should be cautious:

⚠️ Model Improvement: As you refine your model by adding more data or adjusting parameters, the significance of predictors can change.
⚠️ Multicollinearity: Variables might appear significant individually but lose their importance when considered alongside others due to multicollinearity.
⚠️ Overfitting: Beware of overfitting, where the model fits too closely to the training data, making it less accurate on new data.
⚠️ Data Quality: Ensure your data sets are clean and representative to avoid misleading results.
⚠️ Consider Context: Understand the context of your analysis. A variable may be significant in one scenario but not in another.

Consider the models shown in the graph below: initially, the predictor variable "life" appeared to be significant in a simpler model. However, its p-value increased as additional variables were incorporated, so the variable is no longer classified as significant in the expanded model.

Remember, interpreting regression results requires careful consideration of various factors. Always validate your findings and be open to adjusting your model for better accuracy!

I recently hosted a webinar titled "Data Analysis & Visualization in R," where I covered various topics, including regression model comparison. I've developed a mini-course based on this live webinar, where I offer the live session recording, exercises with solutions, and additional resources. For more information, visit this link: https://lnkd.in/dr9xU8kD

#DataViz #datavis #Python #RStats #DataScience #database #Python3 #R
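A minimal sketch of this effect with statsmodels, using synthetic data in which an invented predictor "life" is collinear with "income" (the variable names and data are placeholders for illustration, not the data behind the graph):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200
income = rng.normal(size=n)
life = 0.9 * income + rng.normal(scale=0.3, size=n)   # strongly collinear with income
y = 2.0 * income + rng.normal(size=n)                 # the outcome is actually driven by income

df = pd.DataFrame({"y": y, "life": life, "income": income})

simple = smf.ols("y ~ life", data=df).fit()             # "life" looks highly significant on its own
expanded = smf.ols("y ~ life + income", data=df).fit()  # its p-value typically rises once income is added

print("p-value of life, simple model:  ", round(simple.pvalues["life"], 4))
print("p-value of life, expanded model:", round(expanded.pvalues["life"], 4))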
Binary logistic regression is a powerful statistical method used to model the relationship between a binary target variable and one or more predictor variables. It's commonly used in situations where the outcome is categorical, such as predicting whether a customer will buy a product (yes/no) or whether a patient has a disease (present/absent).

When properly implemented, binary logistic regression offers several benefits:
✔️ Accurate Predictions: It helps in making precise predictions about binary outcomes, which can be crucial for decision-making in fields like marketing, healthcare, and finance.
✔️ Variable Impact: By examining the coefficients, you can understand the impact of different variables on the probability of the outcome.
✔️ Flexibility: Logistic regression can handle multiple predictor variables, making it suitable for complex models.

However, if not handled correctly, there can be drawbacks:
❌ Overfitting: Using too many predictor variables can cause the model to become overly complex and perform poorly on new data.
❌ Misinterpretation: The model's output probabilities need careful interpretation, as incorrect conclusions can lead to faulty decisions.
❌ Assumption Dependence: Logistic regression relies on certain assumptions, such as the linear relationship between predictors and the log odds of the target variable. Violating these assumptions can reduce model reliability.

To implement binary logistic regression in practice, you can use these tools:
🔹 R: Use the glm() function from the base package to fit a logistic regression model. The ggplot2 package can be used to visualize the data and model predictions.
🔹 Python: Use LogisticRegression from the scikit-learn library to create a logistic regression model. Libraries like matplotlib or seaborn can help visualize the results.

The visualization of this post demonstrates a logistic regression model, showcasing how the model predicts probabilities for a binary outcome based on a continuous predictor. It includes a curve that represents the predicted probabilities, helping us understand how changes in the predictor variable affect the likelihood of different outcomes.

If you want to dive deeper into this topic and learn how to apply these techniques in R, check out my online course on Statistical Methods in R. It covers binary logistic regression and many other related topics in detail. See this link for additional information: https://lnkd.in/d-UAgcYf

#research #rprogramminglanguage #datavisualization #ggplot2 #dataanalytic #bigdata #datavisualization #dataanalytics #tidyverse
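As a rough sketch of the Python route named above (the synthetic data and plotting details are assumptions for illustration):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic example: one continuous predictor, binary outcome
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
true_p = 1 / (1 + np.exp(-(1.5 * x[:, 0] - 0.5)))   # true probabilities
y = rng.binomial(1, true_p)

model = LogisticRegression().fit(x, y)
print("Coefficient:", model.coef_[0][0], "Intercept:", model.intercept_[0])

# Predicted probability curve, similar in spirit to the post's visualization
grid = np.linspace(x.min(), x.max(), 200).reshape(-1, 1)
plt.scatter(x, y, alpha=0.3)
plt.plot(grid, model.predict_proba(grid)[:, 1])
plt.xlabel("Predictor")
plt.ylabel("Predicted probability of outcome = 1")
plt.show()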
Mean imputation is a straightforward method for handling missing values in numerical data, but it can significantly distort the relationships between variables. By replacing missing values with the mean of observed data, this approach artificially reduces variability and weakens correlations, leading to misleading results in analysis.

Why Does Mean Imputation Distort Correlations?
❌ No variability in imputed values: Mean imputation assigns the same value to all missing entries, failing to reflect the natural variability of the data.
❌ Weakens relationships: The imputed values introduce artificial uniformity that diminishes or masks the strength of correlations between variables.
❌ Biased downstream analyses: Statistical tests and predictive models relying on the data's correlation structure may produce inaccurate or unreliable results.

A Visual Example: The attached image demonstrates how mean imputation can disrupt correlations between variables. The black points represent the original observed values, showing the natural relationship between variables X1 and X2. The red and green points represent imputed values for X1 and X2, respectively, placed at their mean values. This disrupts the overall pattern, artificially aligning the data along the mean and weakening the true correlation between X1 and X2.

A Better Approach: To preserve relationships between variables, predictive mean matching is a superior alternative. This method selects observed values closest to the predicted value for a missing entry, maintaining variability and the natural correlation structure. When combined with multiple imputation, it also accounts for uncertainty, ensuring more robust and reliable results for downstream analyses.

For a detailed explanation of mean imputation, its drawbacks, and better alternatives, check out my full tutorial here: https://lnkd.in/d2vfiSmf

Sign up for my free email newsletter to stay informed about data science, statistics, Python, and R. More info: http://eepurl.com/gH6myT

#RStats #Data #datasciencetraining #Python #StatisticalAnalysis #DataAnalytics
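A minimal sketch of the distortion in Python with pandas (the synthetic bivariate data and the 30% missingness rate are assumptions for illustration):

import numpy as np
import pandas as pd

# Bivariate data with a known positive correlation between X1 and X2
rng = np.random.default_rng(7)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=500)
df = pd.DataFrame({"X1": x1, "X2": x2})

# Set 30% of X1 to missing at random, then mean-impute
df_missing = df.copy()
mask = rng.random(500) < 0.3
df_missing.loc[mask, "X1"] = np.nan
df_imputed = df_missing.fillna(df_missing.mean())

print("Correlation, original data:", round(df["X1"].corr(df["X2"]), 3))
print("Correlation, mean-imputed: ", round(df_imputed["X1"].corr(df_imputed["X2"]), 3))   # noticeably weaker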
Quantile regression is a valuable tool for analyzing the relationship between variables, especially when data is not evenly distributed or has outliers. Unlike traditional linear regression, which focuses only on the mean, quantile regression allows us to predict different points across the distribution of the target variable.

Challenges:
❌ Compared to linear regression, quantile regression requires more computational power and can be harder to interpret for non-experts.
❌ Larger sample sizes might be needed to achieve stable and reliable quantile estimates, especially for extreme percentiles.
❌ The model's results might be less intuitive if you are accustomed to traditional regression techniques, which could limit ease of communication.

Advantages:
✔️ Quantile regression helps to explore trends at various quantiles, offering a more detailed picture of your data.
✔️ This method is highly effective for non-normal data, particularly when there are outliers or heavy tails.
✔️ It is ideal for situations where extreme values or various percentiles are as important as the central trend.

How to handle quantile regression in practice:
🔹 R: Use the quantreg package to apply quantile regression. The rq() function allows you to specify the quantiles you're interested in.
🔹 Python: In Python, statsmodels provides quantile regression with the QuantReg() function to analyze different percentiles of your data.

The attached visualization is based on a Wikipedia image (link: https://lnkd.in/e7eYbpPg) and illustrates quantile regression lines at various percentiles, showing how predicted values differ across the distribution.

To explain this topic in further detail, I collaborated with Micha Gengenbach to create a comprehensive tutorial: https://lnkd.in/eyb_DFr8

Curious to learn more about statistics and R programming? Join my online course, "Statistical Methods in R." See this link for additional information: https://lnkd.in/d-UAgcYf

#statistical #analysis #database
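A minimal sketch of the statsmodels route named above, fitted on synthetic heteroscedastic data (the data-generating step is an assumption for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Spread of y grows with x, so the quantile lines fan out
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=400)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x, size=400)
df = pd.DataFrame({"x": x, "y": y})

model = smf.quantreg("y ~ x", data=df)
for q in (0.1, 0.5, 0.9):
    fit = model.fit(q=q)   # one regression line per quantile
    print(f"quantile {q}: intercept = {fit.params['Intercept']:.2f}, slope = {fit.params['x']:.2f}")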
Quantile regression: a valuable resource for understanding data beyond the mean. A must-read for faculty and students aiming to deepen their grasp of advanced statistical modeling.
Outliers may look like just “one or two unusual points,” but their impact can be huge! They can distort the regression line, bias coefficients, reduce reliability, and violate model assumptions. That’s why — before interpreting any regression results — it’s essential to identify, investigate, and handle outliers carefully. #Statistics #RegressionAnalysis #DataScience
Outliers can have a significant impact on regression analysis, often skewing the results and leading to misleading insights. Understanding how outliers affect regression models is essential for accurate data analysis and informed decision-making.

Challenges of Ignoring Outliers:
❌ Skewed Results: Outliers can significantly skew the regression line, leading to incorrect conclusions about the relationship between variables.
❌ Reduced Model Performance: A model that fails to account for outliers may have reduced predictive power and accuracy.
❌ Misleading Interpretations: Outliers can create false impressions of trends and correlations that don't genuinely exist in the data.

The visualization of this post demonstrates how outliers can significantly affect a regression model. On the left, the plot shows a linear regression without outliers, where the regression line accurately represents the relationship between the predictor and target variables. On the right, the plot includes several outliers at the top right, clearly illustrating how these extreme values can distort the regression line, making it less representative of the overall data trend and leading to potential misinterpretations.

Note: Extreme values should not be removed without careful evaluation. This example uses a synthetic data set for illustration purposes. However, in practice, it is crucial to thoroughly assess whether removing extreme data points is appropriate. Often, alternative methods, such as data transformation and robust regression, can address outliers effectively while preserving data integrity.

Handling Outliers in Practice:
🔹 R: Use the dplyr package for data manipulation and ggplot2 for visualizing the impact of outliers on regression.
🔹 Python: Leverage pandas for handling data and matplotlib or seaborn for creating visual representations to analyze the effect of outliers.

To dive deeper into concepts like this, join my online course on Statistical Methods in R. Learn more by visiting this link: https://lnkd.in/d-UAgcYf

#analysisskills #datavisualization #dataanalytics #datastructure #visualanalytics
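A rough sketch of the effect in Python with scikit-learn (the synthetic data and the three added outliers are assumptions for illustration, not the data set behind the post's figure):

import numpy as np
from sklearn.linear_model import LinearRegression

# Clean linear data, then a few extreme points added in the top-right corner
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)

x_out = np.append(x, [11.0, 11.5, 12.0])
y_out = np.append(y, [60.0, 65.0, 70.0])   # far above the underlying trend

clean = LinearRegression().fit(x.reshape(-1, 1), y)
with_outliers = LinearRegression().fit(x_out.reshape(-1, 1), y_out)

print("Slope without outliers:", round(clean.coef_[0], 2))          # close to the true slope of 2
print("Slope with outliers:   ", round(with_outliers.coef_[0], 2))  # pulled upward by three points
# A robust estimator such as sklearn.linear_model.HuberRegressor is far less affected by such points.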
🚀 New Blog - Exploratory Data Analysis (EDA)

I'm excited to share my latest blog: "Mastering Exploratory Data Analysis (EDA)!" https://eda1.hashnode.dev/

EDA is a crucial step in any Data Science or Machine Learning workflow. Instead of jumping directly into modeling, EDA helps us understand the dataset, detect missing values, identify patterns, and visualize relationships between features.

I practiced EDA using the dataset:
✔ Viewing dataset structure (head, sample, shape)
✔ Checking class distributions
✔ Detecting missing values
✔ Performing correlation analysis with heatmaps
✔ Visualizing feature relationships using pairplots

Key takeaway: "Better understanding of data leads to better models."

#DataScience #EDA #MachineLearning #Python #Visualization #LearningJourney
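As a starting point, the steps listed above can be sketched in Python with pandas and seaborn roughly as follows (the file name and the "target" column are placeholders, not details from the blog):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")   # placeholder file name

print(df.head())                     # dataset structure
print(df.shape)                      # rows, columns
print(df["target"].value_counts())   # class distribution ("target" is a placeholder column name)
print(df.isnull().sum())             # missing values per column

sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlation heatmap
plt.show()
sns.pairplot(df)                     # pairwise feature relationships
plt.show()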
Which measure do you use for your Model Accuracy Checking?