Darya S.’s Post

Why do analysts spend 80% of their time on data preparation? ⏳

I recently faced a classic challenge: a CRM export full of "surprises." The Revenue column was a mess: a mix of numbers, currency symbols, "unknown" strings, and missing values. Running any calculation on the raw data would result in an immediate error.

The Problem: convert the data to a numeric type, strip out noise, and handle missing values without losing data or heavily skewing the statistics.

______________________

My Python Solution:

import pandas as pd

# Example of "dirty" data
data = {'revenue': ['100$', ' 150 ', 'unknown', '1,200.50', None]}
df = pd.DataFrame(data)

# 1. Clean noise: keep only digits and the decimal point
# Note: I'm assuming a dot is the decimal separator here
df['revenue_clean'] = df['revenue'].str.replace(r'[^\d.]', '', regex=True)

# 2. Convert to numeric (non-parseable values become NaN)
df['revenue_clean'] = pd.to_numeric(df['revenue_clean'], errors='coerce')

# 3. Handle missing values using median imputation
# This preserves sample size and is less sensitive to outliers than the mean
median_val = df['revenue_clean'].median()
df['revenue_clean'] = df['revenue_clean'].fillna(median_val)

print(df)

______________________

Why this approach?

Regex Flexibility: it cleans most currency formats in a single line (a comma-decimal variant is sketched below).

Strategic Coercion: errors='coerce' in to_numeric is a lifesaver. It systematically turns "garbage" strings into NaN, which Pandas handles natively.

Median vs Mean: outliers are common in financial data, and median imputation distorts the distribution less than a simple average (a toy comparison is sketched below).

The Result: a reliable dataset ready for a dashboard or a deep dive.

What are your go-to data cleaning methods? Let's discuss in the comments! 👇

#dataanalysis #python #pandas #datacleaning #analytics #datascience #sql #LinkedIn
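______________________

A note on the decimal-separator assumption above: if the export uses European formatting ('.' for thousands, ',' for decimals), a minimal sketch like the one below, with made-up sample values, normalizes the separators before converting:

import pandas as pd

# Hypothetical European-style export: '.' is the thousands separator,
# ',' is the decimal separator (values are illustrative only)
data_eu = {'revenue': ['100$', ' 150 ', 'unknown', '1.200,50', None]}
df_eu = pd.DataFrame(data_eu)

# Keep digits, dots, and commas, then normalize:
# drop the thousands dots and turn the decimal comma into a dot
cleaned = (
    df_eu['revenue']
    .str.replace(r'[^\d.,]', '', regex=True)
    .str.replace('.', '', regex=False)   # remove thousands separators
    .str.replace(',', '.', regex=False)  # decimal comma -> decimal dot
)
df_eu['revenue_clean'] = pd.to_numeric(cleaned, errors='coerce')
print(df_eu)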
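And to make the Median vs Mean point concrete, a toy comparison with made-up numbers: a single large deal drags the mean far above the typical value, while the median barely moves:

import pandas as pd

# Five deals, one extreme outlier (numbers are illustrative only)
revenue = pd.Series([100.0, 120.0, 130.0, 150.0, 50_000.0])
print(revenue.mean())    # 10100.0 -- pulled up by the outlier
print(revenue.median())  # 130.0   -- robust to the outlier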


