Correlation tells you what moved together. Causal inference tells you what actually caused it.

After this, you'll be able to estimate the causal effect of an intervention (a promo, a product change, a policy shift) from observational data. No A/B test required.

The technique: Propensity Score Matching (PSM) in Python.

𝗦𝘁𝗲𝗽 𝟭: 𝗜𝗻𝘀𝘁𝗮𝗹𝗹

```bash
pip install causalinference
```

𝗦𝘁𝗲𝗽 𝟮: 𝗣𝗿𝗲𝗽𝗮𝗿𝗲 𝘆𝗼𝘂𝗿 𝗱𝗮𝘁𝗮

You need three columns: outcome Y, binary treatment D, and confounders X.

```python
import pandas as pd

df = pd.read_csv("observational_data.csv")
Y = df["revenue"].values
D = df["received_promo"].values  # 1 = treated, 0 = control
X = df[["age", "tenure", "spend_last_90d"]].values
```

𝗦𝘁𝗲𝗽 𝟯: 𝗕𝘂𝗶𝗹𝗱 𝗮𝗻𝗱 𝗿𝘂𝗻 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹

```python
from causalinference import CausalModel

model = CausalModel(Y, D, X)
model.est_via_matching()
print(model.estimates)
```

𝗦𝘁𝗲𝗽 𝟰: Read your results

The key output is the ATE (Average Treatment Effect): the estimated causal lift, adjusted for selection bias.

📌 Always run `model.summary_stats` first. If the treated and control groups don't overlap in their propensity score distributions, your estimate is invalid; check covariate balance before trusting any number (a minimal overlap check is sketched below).

The result: instead of "promo users had 23% higher revenue," you can say "the promo caused a £42 average revenue lift, controlling for age and prior spend." That's a claim your finance team can't easily dismiss.

Have you applied causal inference in a real project? What's the hardest part to justify to non-technical stakeholders?

#DataAnalytics #Data #Python #DataScience #Analytics #Statistics #CausalInference #BusinessIntelligence
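A minimal sketch of that overlap and balance check, continuing from Step 2's Y, D, X and assuming the `causalinference` API (`summary_stats`, `est_propensity_s`, and `trim_s` are methods of `CausalModel`; verify against your installed version):

```python
from causalinference import CausalModel

model = CausalModel(Y, D, X)

# Covariate balance: large normalized differences between treated and
# control means signal weak overlap between the two groups.
print(model.summary_stats)

# Estimate propensity scores, then trim units with extreme scores so
# matching only happens where the groups actually overlap.
model.est_propensity_s()
model.trim_s()

# Re-run matching on the trimmed sample and compare the ATE.
model.est_via_matching()
print(model.estimates)
```

If the ATE moves a lot after trimming, the original estimate was leaning on regions with little or no common support.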
📊 Beyond the Bell Curve: Handling "Messy" Data in Python

As data scientists, we often dream of perfect, Gaussian (normal) distributions. But in the real world, especially with variables like car prices or housing data, the data is rarely "normal." I recently worked through a project involving left-skewed and non-parametric data. Here's a breakdown of how I handled it using Python:

1️⃣ Identifying the Shape
Before running any tests, I used Matplotlib to visualize the distribution. A high bin count (150) helped reveal a significant left skew, where the mean was being pulled down by a long tail of lower-priced entries.

import matplotlib.pyplot as plt
plt.hist(prices, bins=150)
plt.show()

2️⃣ The Transformation Strategy
When data is left-skewed, standard parametric tests (like t-tests) can become biased. To pull that tail back toward the center and achieve symmetry, I explored square ($x^2$) and cube ($x^3$) transformations. By stretching the right side of the distribution more than the left, these power transformations can often "normalize" the data, allowing for more powerful statistical modeling (a rough sketch of both routes follows below).

3️⃣ When to Stay Non-Parametric
If the data is truly non-parametric (multimodal or containing extreme gaps), forcing a transformation isn't the answer. In those cases, I pivot to rank-based tests:
✅ Mann-Whitney U (instead of the t-test)
✅ Kruskal-Wallis (instead of ANOVA)
✅ Spearman's rank (instead of Pearson correlation)

The takeaway: Don't just import your library and hit "run." Understanding the shape of your data is the difference between a biased model and an accurate insight. 💡

#DataScience #Python #Statistics #MachineLearning #Pandas #DataAnalytics #DataIntegrity
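A rough sketch of both routes with SciPy. The two "price" samples below are synthetic stand-ins (not data from the post), just so the snippet runs end to end:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic left-skewed "prices" for two groups (stand-ins for real data)
prices_a = 50_000 - rng.gamma(shape=2.0, scale=5_000, size=500)
prices_b = 52_000 - rng.gamma(shape=2.0, scale=5_000, size=500)

# Power transform: exponents > 1 stretch the upper values more than the
# lower ones, pulling a long left tail back toward the centre.
print("skew before:", stats.skew(prices_a))
print("skew after :", stats.skew(prices_a ** 3))

# If the shape still isn't close to normal, use a rank-based test instead.
# Mann-Whitney U compares two independent samples without assuming normality.
u_stat, p_value = stats.mannwhitneyu(prices_a, prices_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")
```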
Most people approach data analytics as a checklist of tools. That's the wrong approach. High-quality work comes from understanding structure, not just execution.

At the core sits business understanding. Everything else supports it. Data comes in. It gets cleaned. Then explored using SQL or Python. Findings are shaped into visuals. Finally, those visuals are turned into decisions. Add AI on top, and the speed increases. But clarity still depends on how well the foundation is built.

Here's where most go wrong:
They jump straight to dashboards.
They skip context.
They ignore data quality.
The result looks good, but fails in real decisions.

Strong analysts don't work in steps. They think in systems. Every part connects. Every layer affects the outcome. If one piece is weak, everything built on top of it becomes unreliable.

That's the difference between reporting numbers and driving decisions.

Your weakest link?

#dataanalytics #businessanalytics #datascience #datavisualization #powerbi #sql #python #aiforbusiness #datastorytelling
📊 𝗖𝗵𝗲𝗰𝗸 𝗠𝗶𝘀𝘀𝗶𝗻𝗴 𝗩𝗮𝗹𝘂𝗲𝘀 𝗶𝗻 𝗗𝗮𝘁𝗮𝘀𝗲𝘁

Before building any ML model, always check for missing values ❗ Ignoring them can lead to poor results 😬

🔍 ➤ 1) Check total missing values (count)
df.isna().sum()
➡️ Shows the missing count per column 📊

📉 ➤ 2) Missing values percentage (in %)
(df.isna().sum() / len(df)) * 100
➡️ Helps decide whether to drop 🗑️ or fill (imputation) 🧩 (a small drop-or-fill sketch follows below)

📊 𝗩𝗶𝘀𝘂𝗮𝗹𝗶𝘇𝗲 𝗠𝗶𝘀𝘀𝗶𝗻𝗴 𝗩𝗮𝗹𝘂𝗲𝘀

📌 ➤ 1) Bar chart
df.isna().sum().plot(kind='bar', figsize=(15, 4))

🔥 ➤ 2) Heatmap
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.heatmap(df.isna(), cbar=False)
plt.title("Missing Value Heatmap")
plt.show()

🎨 Dark color (almost black / blue) → value is NOT missing ✅ (data is present)
⚪ Light / white color → value IS missing ❌ (NaN)

📑 𝗦𝘂𝗺𝗺𝗮𝗿𝘆 𝗧𝗮𝗯𝗹𝗲 (clean report)
import pandas as pd

missing_report = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean() * 100
}).sort_values(by="missing_pct", ascending=False)
missing_report

🚀 Clean data = better models 💯 Always handle missing values before training!

#DataScience #MachineLearning #Python #DataAnalysis #GitHub #LearningJourney
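For the drop-or-fill decision, here's a minimal follow-up sketch continuing from `df` and `missing_report` above. The 50% threshold and the median/mode strategy are illustrative choices, not part of the original post:

```python
# Drop columns that are mostly empty; fill the rest with a simple strategy.
threshold = 50  # % missing above which a column is dropped (illustrative)
cols_to_drop = missing_report[missing_report["missing_pct"] > threshold].index
df = df.drop(columns=cols_to_drop)

# Numeric columns: fill with the median; other columns: fill with the mode.
for col in df.columns[df.isna().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
```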
𝗜𝗳 𝘆𝗼𝘂 𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝗱𝗮𝘁𝗮, 𝘆𝗼𝘂 𝗸𝗻𝗼𝘄 𝘁𝗵𝗶𝘀 — 𝗽𝗵𝗼𝗻𝗲 𝗻𝘂𝗺𝗯𝗲𝗿𝘀 𝗮𝗿𝗲 𝗻𝗲𝘃𝗲𝗿 𝗰𝗹𝗲𝗮𝗻

Sometimes they come with spaces, sometimes with country codes, sometimes with special characters like "+", "-", or even brackets. And sometimes they even come with .00 at the end because of how the data is stored or exported. If we don't clean them properly, it becomes very difficult to use that data for analysis or communication.

In Pandas, cleaning phone number columns is actually simple once you understand the approach.

First, I usually convert the column to string format. This avoids unexpected issues, especially when numbers are stored as integers, floats, or mixed types.

After that, the main step is removing unwanted characters. Using regular expressions, we can keep only digits and remove everything else, including symbols and spaces. The one catch is the float tail: strip ".00" first, otherwise its zeros get glued onto the number. For example:

df['phone'] = (
    df['phone'].astype(str)
    .str.replace(r'\.0+$', '', regex=True)   # drop float tails like ".00" first
    .str.replace(r'[^0-9]', '', regex=True)  # then keep digits only
)

These two chained replacements handle most messy formats.

One important step I always follow is standardizing the final output. No matter how the number comes, I take only the last 10 digits. This helps remove country codes like +91 and keeps the data consistent. Something like:

df['phone'] = df['phone'].str[-10:]

Next comes validation. Not every cleaned number is valid. Some may be too short or too long, so I often filter numbers based on length to make sure we only keep meaningful data (a short validation sketch follows below). If needed, I also format the numbers again in a clean and readable way.

What I learned from this is simple: data cleaning is not about writing complex code, it's about thinking clearly about the problem. Once the logic is clear, Pandas makes the job very easy. Small steps like this make a big difference when working with large datasets.

#DataScience #DataAnalytics #Python #Pandas #DataCleaning
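A short validation sketch, continuing from the cleaned `df['phone']` column above. The strict 10-digit rule and the display format are illustrative choices based on the post's +91 example; adjust for your locale:

```python
# Keep only rows whose cleaned phone number has exactly 10 digits.
valid_mask = df['phone'].str.fullmatch(r'\d{10}')
print(f"Dropping {(~valid_mask).sum()} invalid numbers out of {len(df)}")
df = df[valid_mask]

# Optional: reformat for readability, e.g. "98765 43210".
df['phone_display'] = df['phone'].str[:5] + ' ' + df['phone'].str[5:]
```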
I remember coming across this a few years ago in a course textbook. A perfect illustration of the power of visual storytelling behind data when numbers hold hidden nuances.
Assistant Professor of Socio-Computing | Machine Learning Instructor | Academic Researcher in Statistics & Data Science
4 datasets. Same mean. Same variance. Same correlation. Same regression line. Yet they look completely different when you plot them. 👇

This is 𝑨𝒏𝒔𝒄𝒐𝒎𝒃𝒆'𝒔 𝑸𝒖𝒂𝒓𝒕𝒆𝒕, a simple but powerful lesson every data analyst needs to know.

In 1973, statistician Francis Anscombe created 4 datasets that are statistically identical on paper:
✅ Same mean
✅ Same variance
✅ Same correlation
✅ Same linear regression line

But the moment you visualize them? They tell 4 completely different stories.

Dataset 1 → Clean linear relationship (the "normal" one)
Dataset 2 → A curve: linear regression is the wrong model entirely
Dataset 3 → Perfect line with ONE outlier destroying everything
Dataset 4 → All points identical except one extreme outlier

The numbers said they were the same. The charts said otherwise.

The lesson?
👉 Never trust summary statistics alone.
👉 Always visualize your data BEFORE analysis.
👉 Outliers, curves, and patterns hide behind averages.

In Python, you can load Anscombe's Quartet in 2 lines (a quick plotting sketch follows below):

import seaborn as sns
df = sns.load_dataset('anscombe')

Next time someone hands you a mean and a correlation coefficient, ask to see the plot first 😀😁.

#DataLessonsWithSahar #DataTrapsBySahar #DataScience #DataVisualization #Python #Statistics #DataAnalysis #MachineLearning #Analytics
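A quick way to draw all four panels with their fitted lines, using the seaborn sample dataset loaded above (its columns are 'x', 'y', and 'dataset'):

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('anscombe')

# One regression plot per dataset in a 2x2 grid: identical fitted lines,
# completely different point clouds.
sns.lmplot(data=df, x='x', y='y', col='dataset', col_wrap=2, ci=None, height=3)
plt.show()

# The summary statistics really are (almost) identical across the four sets.
print(df.groupby('dataset')[['x', 'y']].agg(['mean', 'var']))
```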
Great post on the power of simple plots (e.g. Anscombe's Quartet). Data scientists are like airline pilots: they can fly on instruments, but ultimately need to see the runway. #AnscombeQuartet #Seaborn #Plot
Nobody warns you about this when you start working with data.

I once had a huge dataset with multiple subheaders, inconsistent formatting, and way too much going on. Honestly, I did not even know where to start. I spent so much time just trying to make sense of it before even writing a single line of analysis.

And even after cleaning it, the work was not over. Understanding what the data is actually saying, digging through it, and finding meaningful insights... that is a whole different challenge. And it takes time. A lot of it.

But when it finally clicked, when the data was clean, the insights made sense, and the dashboard actually came together, it felt like I had moved mountains.

That is when I realized that the real work in data is not the fancy visualization at the end. It is everything that comes before it: cleaning, restructuring, understanding, and finding the story hidden in the numbers.

That part does not get talked about enough. But honestly, that is where most of the learning happens.

#DataAnalytics #Python #Pandas #DataVisualization #DashboardDesign
🔶 drop_duplicates() catches exact copies. But real data has a sneakier problem that it completely misses: same person, slightly different entry, same Employee ID.

Employee_ID: 10234 | Name: John Kamau | Dept: Sales
Employee_ID: 10234 | Name: J. Kamau | Dept: sales

Those look different enough that drop_duplicates() won't touch them. But they're the same person entered twice.

Here's how to catch it:

# Find IDs appearing more than once
duplicate_ids = df[df.duplicated(subset=["Employee_ID"], keep=False)]
print(f"Records with duplicate IDs: {len(duplicate_ids)}")
print(duplicate_ids.sort_values("Employee_ID").head(20))

🔷 This shows every row that shares an ID with another row. Now you can actually investigate instead of guessing.

The fix depends on what you find:

# Keep only the most recent entry per employee
df = df.sort_values("date_added", ascending=False)
df = df.drop_duplicates(subset=["Employee_ID"], keep="first")

Soft duplicates are dangerous for one reason: your analysis treats one person as two data points. Your model learns from the same person twice. Your headcount reports are wrong from the start. And none of it raises an error. Everything looks fine.

📍 Check for duplicates by key columns, not just identical rows. That extra step catches what the default function misses (a small normalization sketch follows below).

❓ Have you ever found soft duplicates in a dataset? What gave it away?

#DataCleaning #Python #DataScience
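Going one step further, you can normalize the text columns before comparing, so near-identical entries collapse onto the same key. A minimal sketch built on the example rows above (the extra clean row and the normalization rules are illustrative):

```python
import pandas as pd

# Illustrative frame based on the post's example rows, plus one clean row
df = pd.DataFrame({
    "Employee_ID": [10234, 10234, 10551],
    "Name": ["John Kamau", "J. Kamau ", "Aisha Noor"],
    "Dept": ["Sales", "sales", "Finance"],
})

# Normalize the comparison columns: trim whitespace, lowercase, and collapse
# repeated spaces, so "Sales" and " sales " compare as equal.
for col in ["Name", "Dept"]:
    df[col + "_norm"] = (
        df[col].astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
    )

# Rows that share an Employee_ID but still disagree on a normalized field
# are the soft duplicates worth a manual look.
dupes = df[df.duplicated(subset=["Employee_ID"], keep=False)]
conflicts = dupes.groupby("Employee_ID")[["Name_norm", "Dept_norm"]].nunique()
print(conflicts[(conflicts > 1).any(axis=1)])
```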
I finally understand why data scientists say they spend 80% of their time on data. 📊

This week, instead of just reading about the ML lifecycle, I actually did the second step: data collection. 🎯

I built my own dataset called "TMDB Top Rated Movies" using their public API. 🎬

It was interesting to see how data can come from different sources: some datasets are already available in formats like CSV and JSON, while others can be retrieved from SQL databases. I also learned that data can be collected through APIs or even web scraping, depending on the use case.

Nothing fancy. Just:
🐍 Python
📡 A bunch of API calls
🔄 Figuring out how to loop through pages without breaking everything (a rough pagination sketch follows below)

In the end, I pulled together 10,000+ movie records: clean, structured, and ready for actual analysis or ML. 📁✅

This part felt more like real engineering than anything I have done in a notebook. 🛠️

Small step. But it's real. 🚀

Dataset link: https://lnkd.in/dG7EcE5q

#MachineLearning #DataScience #Python #LearningByDoing
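A rough sketch of what such a pagination loop could look like with the requests library, assuming TMDB's v3 /movie/top_rated endpoint with an API key. The endpoint shape, field names, and page limit should be checked against the current TMDB docs; this is not the author's actual script:

```python
import time
import requests
import pandas as pd

API_KEY = "YOUR_TMDB_API_KEY"  # hypothetical placeholder
BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"

records = []
for page in range(1, 501):  # TMDB returns ~20 movies per page
    resp = requests.get(BASE_URL, params={"api_key": API_KEY, "page": page}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    records.extend(payload.get("results", []))
    if page >= payload.get("total_pages", page):
        break  # stop once the API reports no more pages
    time.sleep(0.25)  # be polite to the API

df = pd.DataFrame(records)
df.to_csv("tmdb_top_rated_movies.csv", index=False)
print(f"Collected {len(df)} movie records")
```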
🚀 Data Cleaning & Exploratory Data Analysis (EDA) in Action

Yesterday, I worked on cleaning and analyzing a real-world dataset using Python (Pandas, Matplotlib, Seaborn). Here's a quick summary of what I explored:

🔹 Data Type Conversion
Converted the Price column into numeric (float64) format, making it ready for analysis and calculations.

🔹 Descriptive Statistics
Using df.describe(), I discovered:
Most app ratings are between 4.0 and 4.5
App prices are mostly free, with a few outliers up to $400
Installs are highly skewed, with some apps reaching 1B+ downloads

🔹 Missing Values Analysis
Found a total of 4,881 missing values
Highest missing data in Size (~15.6%) and Rating (~13.6%)
Other columns had minimal or no missing values

🔹 Data Quality Insights
Detected outliers in Price and Rating
Identified skewed distributions in Installs and Price
Highlighted columns requiring data cleaning

🔹 Visualization
Created a heatmap with Seaborn to visually identify missing values across the dataset 📊

💡 Key Learning: Before jumping into modeling, understanding your data through EDA and cleaning is critical. It helps uncover hidden patterns, errors, and insights that directly impact results (a compact sketch of these steps follows below).

🔥 More projects coming soon on my GitHub! Let's connect and grow together in Data Analytics 🚀

#DataAnalytics #Python #Pandas #DataCleaning #EDA #Seaborn #Matplotlib #MachineLearning #DataScience
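A compact sketch of the conversion and outlier steps described above. The tiny dataframe and the exact string formats ("$4.99", "1,000,000+") are illustrative stand-ins for the real dataset:

```python
import pandas as pd

# Illustrative rows in the shape the post describes (Google Play-style columns)
df = pd.DataFrame({
    "Price": ["0", "$4.99", "$399.99", "Free"],   # mixed, dirty price strings
    "Rating": [4.1, 4.5, 3.9, None],
    "Installs": ["1,000,000+", "500+", "10,000+", "50+"],
})

# Price like "$4.99" -> float64; anything non-numeric becomes NaN
df["Price"] = pd.to_numeric(
    df["Price"].str.replace("$", "", regex=False), errors="coerce"
)

# Installs like "1,000,000+" -> numeric, so skew and outliers can be measured
df["Installs"] = pd.to_numeric(
    df["Installs"].str.replace(r"[+,]", "", regex=True), errors="coerce"
)

# Descriptive statistics plus a simple IQR-based outlier flag for Price
print(df.describe())
q1, q3 = df["Price"].quantile([0.25, 0.75])
outliers = df[df["Price"] > q3 + 1.5 * (q3 - q1)]
print(f"Price outliers: {len(outliers)}")
```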