Catch Hidden Duplicate Records in Your Data with Python

🔶 drop_duplicates() catches exact copies. But real data has a sneakier problem that it completely misses.

Same person. Slightly different entry. Same Employee ID.

Employee_ID: 10234 | Name: John Kamau | Dept: Sales
Employee_ID: 10234 | Name: J. Kamau | Dept: sales

Those look different enough that drop_duplicates() won't touch them. But they're the same person entered twice.

Here's how to catch it:

```python
# Find every row whose ID appears more than once
duplicate_ids = df[df.duplicated(subset=["Employee_ID"], keep=False)]

print(f"Records with duplicate IDs: {len(duplicate_ids)}")
print(duplicate_ids.sort_values("Employee_ID").head(20))
```

🔷 This shows every row that shares an ID with another row. Now you can actually investigate instead of guessing. The fix depends on what you find:

```python
# Keep only the most recent entry per employee
df = df.sort_values("date_added", ascending=False)
df = df.drop_duplicates(subset=["Employee_ID"], keep="first")
```

Soft duplicates are dangerous for one reason: your analysis treats one person as two data points. Your model learns from the same person twice. Your headcount reports are wrong from the start. And none of it raises an error. Everything looks fine.

📍 Check for duplicates by key columns, not just identical rows. That extra step catches what the default function misses.

❓ Have you ever found soft duplicates in a dataset? What gave it away?

#DataCleaning #Python #DataScience
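A related trick worth knowing: many soft duplicates come from nothing more than case and whitespace differences, so normalizing text columns before the duplicate check catches them too. Here is a minimal sketch with made-up sample data (the column names follow the post; the sample values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample frame containing one soft duplicate pair
df = pd.DataFrame({
    "Employee_ID": [10234, 10234, 10551],
    "Name": ["John Kamau", "J. Kamau", "Alice Njeri"],
    "Dept": ["Sales", "sales ", "HR"],
})

# Normalize the text key before comparing: strip whitespace and
# lowercase, so "Sales" and "sales " compare equal
normalized = df.assign(Dept=df["Dept"].str.strip().str.lower())

# Flag every row that shares the normalized key with another row
dupes = normalized[
    normalized.duplicated(subset=["Employee_ID", "Dept"], keep=False)
]
print(dupes)
```

With this normalization, both John Kamau rows are flagged even though their raw Dept strings differ.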
