
Python in Data Science #006

A funny thing happens in real projects: the “modeling work” starts failing, and the root cause is almost always upstream. Not because the algorithm is wrong, but because the data cleaning was ad hoc, inconsistent, and almost impossible to reproduce.

Always treat data cleaning as a repeatable, versioned transformation, and never clean directly on raw data. A cheatsheet is useful, but the real upgrade is turning those steps (missing values, duplicates, types, outliers, invalid rows) into a predictable workflow you can rerun tomorrow and get the same dataset.

It also reduces silent leakage: if you “peek” at the full dataset to decide thresholds or imputation, you can accidentally bake test-set information into training. The trade-off is a bit more upfront discipline, but you gain trust: in your results, in your features, and in your handoffs to stakeholders.

import pandas as pd

df_raw = pd.read_csv("data.csv")
df = df_raw.copy()  # never mutate the raw frame
df = df.drop_duplicates()  # remove exact duplicate rows
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # invalid dates become NaT
df["sales"] = df["sales"].fillna(0)  # missing sales treated as zero
df["name"] = df["name"].str.strip().str.lower()  # normalize text keys
df = df[df["sales"] >= 0]  # drop invalid negative rows

What it improves: reproducibility, debugging speed, and confidence that changes are intentional (not accidental).

Common mistake/trap: “quick fixes” applied in place on raw data, then forgetting what was changed (or applying different rules each run).

When I’d tune it (or when I wouldn’t): I tune cleaning rules (thresholds, outlier caps, imputations) only on the training split; I never set rules based on the full dataset. See the split sketch at the end of the post.

#python #datascience #datacleaning
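A minimal sketch of the “rerun tomorrow, get the same dataset” part: the same steps wrapped in one function (clean_sales is a name I made up for illustration). Commit it, and every run applies identical rules.

import pandas as pd

def clean_sales(df_raw: pd.DataFrame) -> pd.DataFrame:
    # One versioned entry point: same input, same rules, same output every run.
    df = df_raw.copy()
    df = df.drop_duplicates()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["sales"] = df["sales"].fillna(0)
    df["name"] = df["name"].str.strip().str.lower()
    return df[df["sales"] >= 0]

df = clean_sales(pd.read_csv("data.csv"))

Because the function never touches df_raw, any change in the output is something you wrote, not something that drifted between runs.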

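And for the leakage point, a sketch of tuning cleaning rules on the training split only (the 99th-percentile cap and the median fill here are illustrative choices, not fixed rules):

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)

# Fit cleaning parameters on the training split only...
cap = train["sales"].quantile(0.99)  # outlier cap learned from train
fill = train["sales"].median()       # imputation value learned from train

# ...then apply the same frozen values to both splits.
train = train.assign(sales=train["sales"].fillna(fill).clip(upper=cap))
test = test.assign(sales=test["sales"].fillna(fill).clip(upper=cap))

The test split never influences the numbers; it only receives them.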