Catch Hidden Duplicate Records in Your Data with Python

🔶 drop_duplicates() catches exact copies. But real data has a sneakier problem that it completely misses.

Same person. Slightly different entry. Same Employee ID.

Employee_ID: 10234 | Name: John Kamau | Dept: Sales
Employee_ID: 10234 | Name: J. Kamau | Dept: sales

Those look different enough that drop_duplicates() won't touch them. But they're the same person entered twice.

Here's how to catch it:

```python
# Find every row whose ID appears more than once
duplicate_ids = df[df.duplicated(subset=["Employee_ID"], keep=False)]

print(f"Records with duplicate IDs: {len(duplicate_ids)}")
print(duplicate_ids.sort_values("Employee_ID").head(20))
```

🔷 This shows every row that shares an ID with another row. Now you can actually investigate instead of guessing. The fix depends on what you find:

```python
# Keep only the most recent entry per employee
df = df.sort_values("date_added", ascending=False)
df = df.drop_duplicates(subset=["Employee_ID"], keep="first")
```

Soft duplicates are dangerous for one reason: your analysis treats one person as two data points. Your model learns from the same person twice. Your headcount reports are wrong from the start. And none of it raises an error. Everything looks fine.

📍 Check for duplicates by key columns, not just identical rows. That extra step catches what the default function misses.

❓ Have you ever found soft duplicates in a dataset? What gave it away?

#DataCleaning #Python #DataScience
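A related trick worth knowing: many soft duplicates come from nothing more than case and whitespace differences, so normalizing text columns before the duplicate check catches them too. Here is a minimal sketch with made-up sample data (the column names follow the post; the sample values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample frame containing one soft duplicate pair
df = pd.DataFrame({
    "Employee_ID": [10234, 10234, 10551],
    "Name": ["John Kamau", "J. Kamau", "Alice Njeri"],
    "Dept": ["Sales", "sales ", "HR"],
})

# Normalize the text key before comparing: strip whitespace and
# lowercase, so "Sales" and "sales " compare equal
normalized = df.assign(Dept=df["Dept"].str.strip().str.lower())

# Flag every row that shares the normalized key with another row
dupes = normalized[
    normalized.duplicated(subset=["Employee_ID", "Dept"], keep=False)
]
print(dupes)
```

With this normalization, both John Kamau rows are flagged even though their raw Dept strings differ.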
