SQL & PySpark String Functions for Data Cleaning

🚀 Day 11 of 30 — SQL & PySpark Challenge Series
📌 String Functions — TRIM, SPLIT, REGEXP & CONCAT

Raw data is almost always messy: extra spaces, mixed case, comma-separated tags, emails stored as plain text. Here's a compact string-cleaning toolkit in SQL & PySpark.

─────────────────────────────
🗄️ #SQL:

SELECT
    UPPER(TRIM(name)) AS clean_name,
    LOWER(TRIM(city)) AS clean_city,
    SPLIT_PART(tags, ',', 1) AS first_tag,
    REGEXP_EXTRACT(email, '@(.+)', 1) AS email_domain,
    CONCAT(UPPER(TRIM(name)), ' | ', LOWER(TRIM(city))) AS label
FROM customers;

─────────────────────────────
⚡ #PySpark:

from pyspark.sql import functions as F

df = (
    df
    .withColumn("clean_name", F.upper(F.trim(F.col("name"))))
    .withColumn("clean_city", F.lower(F.trim(F.col("city"))))
    .withColumn("first_tag", F.split(F.col("tags"), ",").getItem(0))
    .withColumn("email_domain", F.regexp_extract(F.col("email"), r"@(.+)", 1))
    .withColumn("label", F.concat(
        F.upper(F.trim(F.col("name"))),
        F.lit(" | "),
        F.lower(F.trim(F.col("city"))),
    ))
)

─────────────────────────────
💡 Quick reference:
• TRIM → strips leading/trailing spaces (do this first, before any comparison or join)
• SPLIT_PART → extracts a field from a delimited string by 1-based position
• REGEXP_EXTRACT → pulls a regex capture group, the most flexible string tool of the four
• CONCAT → combines columns with literals (wrap literals in F.lit() in PySpark)

⚠️ Dialect watch:
• SPLIT_PART → Snowflake, Redshift, DuckDB, Postgres
• SPLIT(tags, ',')[0] → Spark SQL; SPLIT(tags, ',')[OFFSET(0)] in BigQuery
• REGEXP_EXTRACT → Spark SQL, BigQuery; Snowflake and Redshift use REGEXP_SUBSTR instead
Always check your platform's string docs!

#SQL #PySpark #Databricks #SparkSQL #Snowflake #BigQuery #DataEngineering #DataEngineer #ETL #dbt #ApacheSpark #30DayChallenge #SQLTips #DataQuality #DataPipeline
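If you want to sanity-check the cleaning logic without spinning up a Spark session, the same four operations can be sketched in plain Python. This is a minimal illustration, not the post's pipeline: the `rows` sample data and the `clean` helper are hypothetical, and `str.split`/`re.search` stand in for SPLIT_PART and REGEXP_EXTRACT.

```python
import re

# Hypothetical sample rows mimicking the customers table in the post.
rows = [
    {"name": "  Alice ", "city": " OSLO ", "tags": "vip,retail",
     "email": "alice@example.com"},
]

def clean(row):
    # TRIM + UPPER / LOWER
    name = row["name"].strip().upper()
    city = row["city"].strip().lower()
    # SPLIT_PART(tags, ',', 1): first field of a comma-delimited string
    first_tag = row["tags"].split(",")[0]
    # REGEXP_EXTRACT(email, '@(.+)', 1): capture everything after the @
    m = re.search(r"@(.+)", row["email"])
    domain = m.group(1) if m else ""
    # CONCAT with a literal separator
    return {
        "clean_name": name,
        "clean_city": city,
        "first_tag": first_tag,
        "email_domain": domain,
        "label": f"{name} | {city}",
    }

print(clean(rows[0]))
# → {'clean_name': 'ALICE', 'clean_city': 'oslo', 'first_tag': 'vip',
#    'email_domain': 'example.com', 'label': 'ALICE | oslo'}
```

One behavioral difference worth noting: when the pattern or delimiter doesn't match, SQL's SPLIT_PART and REGEXP_EXTRACT typically return an empty string, while naive Python indexing can raise; the `if m else ""` guard above mirrors the SQL behavior for the regex case.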
