𝗜𝗳 𝘆𝗼𝘂 𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝗱𝗮𝘁𝗮, 𝘆𝗼𝘂 𝗸𝗻𝗼𝘄 𝘁𝗵𝗶𝘀 — 𝗽𝗵𝗼𝗻𝗲 𝗻𝘂𝗺𝗯𝗲𝗿𝘀 𝗮𝗿𝗲 𝗻𝗲𝘃𝗲𝗿 𝗰𝗹𝗲𝗮𝗻.

Sometimes they come with spaces, sometimes with country codes, sometimes with special characters like "+", "-", or brackets. And sometimes they even end in .00 because of how the data was stored or exported. If we don't clean them properly, the column becomes very difficult to use for analysis or communication.

In Pandas, cleaning a phone number column is simple once you understand the approach.

First, I convert the column to string format. This avoids unexpected issues when numbers are stored as integers, floats, or mixed types.

Next, the main step is removing unwanted characters. Using a regular expression, we can keep only digits and drop everything else, including symbols and spaces. For example:

df['phone'] = df['phone'].astype(str).str.replace(r'[^0-9]', '', regex=True)

This one line handles most messy formats.

One step I always follow is standardizing the final output. No matter how the number arrives, I take only the last 10 digits. This removes country codes like +91 and keeps the data consistent:

df['phone'] = df['phone'].str[-10:]

Next comes validation. Not every cleaned number is valid; some may be too short or too long. So I filter numbers by length to keep only meaningful data, and if needed I format them again in a clean, readable way.

What I learned from this is simple: data cleaning is not about writing complex code, it is about thinking clearly about the problem. Once the logic is clear, Pandas makes the job very easy. Small steps like this make a big difference when working with large datasets.

#DataScience #DataAnalytics #Python #Pandas #DataCleaning
Cleaning Phone Numbers with Pandas in Python
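Putting the steps from the post together, here is a minimal, self-contained sketch. The sample values, and the extra step that strips a trailing ".0"/".00" left over from float exports before removing other characters, are assumptions added for illustration, not part of the original post.

```python
import pandas as pd

# Hypothetical messy phone numbers, including a float-export artifact
df = pd.DataFrame({"phone": ["+91 98765-43210", "(022) 2345 6789", 9876543210.0, "12345"]})

df["phone"] = (
    df["phone"]
    .astype(str)
    .str.replace(r"\.0+$", "", regex=True)   # drop a trailing .0/.00 from float exports first
    .str.replace(r"[^0-9]", "", regex=True)  # keep digits only
    .str[-10:]                               # standardize: last 10 digits, drops +91 and similar prefixes
)

# Validation: keep only rows with exactly 10 digits
df = df[df["phone"].str.len() == 10]
print(df)
```

The length filter at the end is the validation step described in the post; adjust the expected length to your own country's number format.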
🚀 🔥 𝑺𝒕𝒐𝒑 𝑺𝒕𝒓𝒖𝒈𝒈𝒍𝒊𝒏𝒈 𝒘𝒊𝒕𝒉 𝑫𝒊𝒓𝒕𝒚 𝑫𝒂𝒕𝒂 — 𝑴𝒂𝒔𝒕𝒆𝒓 𝑷𝒚𝒕𝒉𝒐𝒏 𝑫𝒂𝒕𝒂 𝑪𝒍𝒆𝒂𝒏𝒊𝒏𝒈 𝒊𝒏 𝑴𝒊𝒏𝒖𝒕𝒆𝒔 (2026)

Most people learn Python… but fail at real data work ❌ because they ignore ONE skill 👇

👉 Data Cleaning ⚡ Here's your cheat sheet to become a PRO:

🧹 Fix Missing Data
df.isnull().sum()
df.fillna(method='ffill')  (or df.ffill() on newer pandas, where method= is deprecated)
df.dropna()

🧹 Remove Duplicates
df.drop_duplicates()

🧹 Understand Your Data
df.head()
df.info()
df.describe()

🧹 Clean Columns
df.rename(columns={'old':'new'})
df.astype({'col':'int'})

🧹 Filter Smartly
df.query("salary > 50000")
df[df['role'].isin(['DE','DS'])]

🧹 Merge Like a Pro
pd.merge(df1, df2, on='id')
df.groupby('team').agg({'salary':'mean'})

🎯 Reality Check (2026):
👉 80% of time = cleaning data
👉 20% of time = analysis
If your data is messy → your results are wrong ❌

💬 Be honest — do you enjoy data cleaning or hate it? 😅👇

#Python #Pandas #DataCleaning #DataEngineering #DataScience #MachineLearning #Analytics #LearnPython #TechCareers #Coding #BigData
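To make the cheat sheet concrete, here is a minimal sketch that runs the main steps end to end on a tiny, made-up table; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Tiny, hypothetical dataset to exercise the cheat-sheet steps
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "role": ["DE", "DS", "DS", None],
    "salary": [55000.0, 48000.0, 48000.0, None],
})

print(df.isnull().sum())               # missing values per column
df = df.ffill()                        # forward-fill gaps
df = df.drop_duplicates()              # remove exact duplicate rows
df = df.rename(columns={"id": "employee_id"})
df = df.astype({"salary": "int"})      # fix the dtype after filling

high_earners = df.query("salary > 50000")
mean_by_role = df.groupby("role").agg({"salary": "mean"})
print(high_earners)
print(mean_by_role)
```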
📊 𝗖𝗵𝗲𝗰𝗸 𝗠𝗶𝘀𝘀𝗶𝗻𝗴 𝗩𝗮𝗹𝘂𝗲𝘀 𝗶𝗻 𝗗𝗮𝘁𝗮𝘀𝗲𝘁

Before building any ML model, always check for missing values ❗ Ignoring them can lead to poor results 😬

🔍 ➤ 1) Check total missing values (count)
df.isna().sum()
➡️ Shows the missing count per column 📊

📉 ➤ 2) Missing value percentage (in %)
(df.isna().sum() / len(df)) * 100
➡️ Helps decide whether to drop 🗑️ or fill (imputation) 🧩

📊 𝗩𝗶𝘀𝘂𝗮𝗹𝗶𝘇𝗲 𝗠𝗶𝘀𝘀𝗶𝗻𝗴 𝗩𝗮𝗹𝘂𝗲𝘀

📌 ➤ 1) Bar chart
df.isna().sum().plot(kind='bar', figsize=(15,4))

🔥 ➤ 2) Heatmap
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
sns.heatmap(df.isna(), cbar=False)
plt.title("Missing Value Heatmap")
plt.show()

🎨 Dark color (almost black / blue) → value is NOT missing ✅ (data is present)
⚪ Light / white color → value IS missing ❌ (NaN)

📑 𝗦𝘂𝗺𝗺𝗮𝗿𝘆 𝗧𝗮𝗯𝗹𝗲 (clean report)
missing_report = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean() * 100
}).sort_values(by="missing_pct", ascending=False)
missing_report

🚀 Clean data = better models 💯 Always handle missing values before training!

#DataScience #MachineLearning #Python #DataAnalysis #GitHub #LearningJourney
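A self-contained version of the checks above, assuming a small made-up DataFrame with gaps (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps, just to exercise the checks above
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan, 40],
    "city": ["Pune", "Delhi", None, "Mumbai", "Delhi"],
    "income": [52000, 61000, np.nan, 45000, 58000],
})

print(df.isna().sum())                    # missing count per column
print((df.isna().sum() / len(df)) * 100)  # missing percentage per column

missing_report = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean() * 100,
}).sort_values(by="missing_pct", ascending=False)
print(missing_report)
```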
🔷 My model could see all the right information. It was still getting things wrong. And I could not figure out why.

Then I plotted the total_amount_spent column and saw the problem immediately. A few customers had spent 50,000 shillings. Most had spent between 500 and 3,000. The column was not a bell curve. It was a spike on the left and a long flat tail stretching to the right.

The model was spending most of its energy trying to understand those big spenders at the far end. The regular customers in the middle were getting ignored. The data was right. The scale was wrong.

So I transformed it. 💠

import numpy as np
df["amount_spent_log"] = np.log1p(df["total_amount_spent"])

log1p adds 1 before taking the log, so zero values do not break anything.

After the transformation the distribution looked like a proper curve. The model could now treat the difference between a 500 shilling and a 1,000 shilling customer with the same attention it gave to the difference between a 20,000 and a 40,000 shilling customer.

Same data. Completely different picture.

That is feature transformation. You are not creating new information. You are not extracting hidden information. You are changing the shape of what already exists so the model can actually read it properly.

• Engineering asks: what new information can I create?
• Extraction asks: what hidden information can I uncover?
• Transformation asks: what shape does this information need to be in?

📍 All three are different tools. All three are necessary. Knowing which one your data needs is the skill.

❓ Have you ever had a model improve significantly just by transforming a column you already had?

#DataScience #MachineLearning #Python
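A small sketch of the before/after effect, using made-up spend values rather than the author's data:

```python
import numpy as np
import pandas as pd

# Hypothetical spend column: many small values, a few very large ones
df = pd.DataFrame({"total_amount_spent": [500, 800, 1200, 1800, 2500, 3000, 50000]})

df["amount_spent_log"] = np.log1p(df["total_amount_spent"])

# Skewness drops sharply after the log1p transform
print(df["total_amount_spent"].skew())  # strongly right-skewed
print(df["amount_spent_log"].skew())    # much closer to symmetric
```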
𝑴𝒐𝒔𝒕 𝒄𝒐𝒎𝒑𝒂𝒏𝒊𝒆𝒔 𝒔𝒕𝒐𝒓𝒆 𝒕𝒉𝒆𝒊𝒓 𝒅𝒂𝒕𝒂 𝒕𝒉𝒆 𝒘𝒓𝒐𝒏𝒈 𝒘𝒂𝒚. 𝑯𝒆𝒓𝒆'𝒔 𝒘𝒉𝒚 𝒊𝒕 𝒎𝒂𝒕𝒕𝒆𝒓𝒔.

When you work with data in Python, you're likely using pandas. And pandas made a very deliberate choice: it stores data in 𝐜𝐨𝐥𝐮𝐦𝐧𝐬, not rows. This isn't a technical detail. It has real consequences for your team's speed and infrastructure costs.

𝐑𝐨𝐰 𝐬𝐭𝐨𝐫𝐚𝐠𝐞 (𝐡𝐨𝐰 𝐉𝐒𝐎𝐍 𝐰𝐨𝐫𝐤𝐬): Every record is a self-contained dictionary. Great for APIs and transactional systems — you always grab the full object.

𝐂𝐨𝐥𝐮𝐦𝐧𝐚𝐫 𝐬𝐭𝐨𝐫𝐚𝐠𝐞 (𝐡𝐨𝐰 𝐩𝐚𝐧𝐝𝐚𝐬 𝐰𝐨𝐫𝐤𝐬): Every column is a contiguous list. All ages together. All names together. All cities together.

Why does this matter in practice?

→ 𝐒𝐩𝐞𝐞𝐝. When you calculate the average age of your customers, columnar storage loops over a single array of integers in memory. Row storage has to dig into each individual record, one by one. The difference at scale is enormous.

→ 𝐌𝐞𝐦𝐨𝐫𝐲. In row storage, the key "age" is repeated for every single row. In columnar storage, it's stored once. With millions of records, this adds up fast.

→ 𝐕𝐞𝐜𝐭𝐨𝐫𝐢𝐳𝐚𝐭𝐢𝐨𝐧. NumPy can apply operations to an entire column at C-level speed. With row-oriented data, you're stuck with Python-level loops.

→ 𝐂𝐨𝐦𝐩𝐫𝐞𝐬𝐬𝐢𝐨𝐧. Columns compress well because similar values live next to each other. This is why formats like Parquet are so efficient for storage and I/O.

The rule of thumb:
- Building APIs or handling transactions? 𝐑𝐨𝐰-𝐨𝐫𝐢𝐞𝐧𝐭𝐞𝐝 𝐢𝐬 𝐟𝐢𝐧𝐞.
- Running aggregations, filters, ML pipelines, or any analytical workload? 𝐂𝐨𝐥𝐮𝐦𝐧𝐚𝐫 𝐢𝐬 𝐭𝐡𝐞 𝐫𝐢𝐠𝐡𝐭 𝐭𝐨𝐨𝐥.

If you're frequently converting pandas DataFrames back to JSON records (𝘥𝘧.𝘵𝘰_𝘥𝘪𝘤𝘵(𝘰𝘳𝘪𝘦𝘯𝘵='𝘳𝘦𝘤𝘰𝘳𝘥𝘴')), you're often leaving significant performance on the table.

The data format you choose upstream shapes the cost and speed of every analysis downstream. Choose deliberately.

At Arraxis, we help companies make practical decisions about how they store, structure, and use their data.

#DataEngineering #Python #Pandas #DataStrategy #Analytics #BusinessIntelligence
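A rough, hypothetical benchmark of the speed point above. Exact timings depend on the machine, but the gap between a Python-level loop over row dicts and a vectorized column operation is typically large:

```python
import time
import pandas as pd

# One million hypothetical records, first as row-oriented dicts, then as a columnar DataFrame
n = 1_000_000
rows = [{"name": f"user{i}", "age": i % 80, "city": "Nairobi"} for i in range(n)]
df = pd.DataFrame(rows)  # each column becomes one contiguous array

t0 = time.perf_counter()
row_mean = sum(r["age"] for r in rows) / n   # Python-level loop over dicts
t1 = time.perf_counter()
col_mean = df["age"].mean()                  # vectorized over a single array
t2 = time.perf_counter()

print(f"row-oriented: {t1 - t0:.3f}s, columnar: {t2 - t1:.3f}s")
```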
📌 𝗣𝘆𝘁𝗵𝗼𝗻 𝗟𝗶𝘀𝘁 𝗠𝗲𝘁𝗵𝗼𝗱𝘀 — 𝘄𝗵𝗮𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗵𝗮𝗽𝗽𝗲𝗻𝘀 𝗯𝗲𝗵𝗶𝗻𝗱 𝘁𝗵𝗲 𝘀𝗰𝗲𝗻𝗲𝘀

We think:
→ 𝗮𝗽𝗽𝗲𝗻𝗱() returns a new list ❌
→ 𝗰𝗼𝗽𝘆() creates a deep copy ❌
→ 𝘀𝗼𝗿𝘁() gives a new sorted output ❌

𝗥𝗲𝗮𝗹𝗶𝘁𝘆? 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗲𝗹𝘆 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁. And this is exactly why 𝘀𝗺𝗮𝗹𝗹 𝗺𝗶𝘀𝘁𝗮𝗸𝗲𝘀 𝘁𝘂𝗿𝗻 𝗶𝗻𝘁𝗼 𝗯𝗶𝗴 𝗱𝗮𝘁𝗮 𝗯𝘂𝗴𝘀. Let's fix that 👇

🔹 𝗮𝗽𝗽𝗲𝗻𝗱(x) → adds an item to the end
💡 Modifies the original list
🚫 Returns: None

🔹 𝗶𝗻𝘀𝗲𝗿𝘁(i, x) → adds an item at a specific index
💡 Keeps order control
🚫 Returns: None

🔹 𝗲𝘅𝘁𝗲𝗻𝗱(iterable) → adds multiple items
💡 Used when merging datasets
🚫 Returns: None

🔹 𝗽𝗼𝗽([i]) → removes and returns an element
💡 Useful in pipelines and buffering
✅ Returns: the removed item

🔹 𝗿𝗲𝗺𝗼𝘃𝗲(x) → removes the first occurrence
⚠️ Error if not found
🚫 Returns: None

🔹 𝗰𝗼𝗽𝘆() → creates a shallow copy
⚠️ Nested objects are still shared
✅ Returns: a new list

🔹 𝗰𝗼𝘂𝗻𝘁(x) → counts occurrences
💡 Helpful in validations
✅ Returns: an integer

🔹 𝗶𝗻𝗱𝗲𝘅(x) → finds the position of a value
⚠️ Error if not found
✅ Returns: the index

🔹 𝗿𝗲𝘃𝗲𝗿𝘀𝗲() → reverses the list (in place)
🚫 Returns: None

🔹 𝘀𝗼𝗿𝘁() → sorts the list (in place)
⚠️ Doesn't return a new list
🚫 Returns: None

• Most list methods modify the original list.
• Only a few return values: 👉 𝗽𝗼𝗽() 👉 𝗰𝗼𝘂𝗻𝘁() 👉 𝗶𝗻𝗱𝗲𝘅() 👉 𝗰𝗼𝗽𝘆()

🔥 If you assume a return value where there is none… your pipeline will silently break.

👉 Which list method confused you the most before this?

#Python #DataEngineering #LearnPython #CodingTips #ETL #DataAnalytics #TechContent
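A quick sketch of the two pitfalls that bite most often (the values are arbitrary examples):

```python
import copy

data = [3, 1, 2]

# Pitfall 1: sort() mutates in place and returns None
result = data.sort()
print(result)                 # None
print(data)                   # [1, 2, 3]
new_list = sorted([3, 1, 2])  # use sorted() when you need a new list

# Pitfall 2: copy() is shallow, so nested objects stay shared
nested = [[1, 2], [3, 4]]
shallow = nested.copy()
shallow[0].append(99)
print(nested[0])              # [1, 2, 99] -- the inner list is the same object
deep = copy.deepcopy(nested)  # deepcopy() breaks the link
```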
Stop wasting 4 hours on EDA. Do it in 4 lines of code. ⏳

Exploratory Data Analysis (EDA) is the most critical step in any data project, but let's be honest: writing the same df.describe(), plt.scatter(), and sns.heatmap() code over and over is a soul-crushing time sink. In the industry, we use AutoEDA libraries to get 80% of the insights with 2% of the effort. 🚀

Here are my top 3 picks for your toolkit:

1️⃣ ydata-profiling (formerly Pandas Profiling): the "Gold Standard." It generates a massive, interactive HTML report covering correlations, missing values, and detailed stats for every column.

2️⃣ Sweetviz: the "Comparison King." Perfect for spotting data drift. If you need to see exactly how your train set differs from your test set, this is the tool.

3️⃣ AutoViz: the "Speed Demon." It automatically identifies the most important features and selects the best charts (scatter, box, violin) for you. It's incredibly fast, even on larger datasets.

The reality check: ⚠️ Are these used for real-time streaming data? Usually, no. They are batch tools meant for the initial discovery phase or for sanity-checking a new data dump. For live monitoring, you're better off with Grafana or Great Expectations. But for your next CSV or SQL export? Don't start from scratch. Automate the boring stuff so you can focus on the actual strategy.

Which one is your go-to? Or are you still team Matplotlib/Seaborn for everything? 👇

#DataScience #Python #MachineLearning #Analytics #Efficiency #CodingTips
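For reference, typical usage really is only a few lines each. This is a sketch assuming ydata-profiling and Sweetviz are installed and that you have train/test CSVs on disk; the file names are placeholders:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling
import sweetviz as sv                      # pip install sweetviz

train = pd.read_csv("train.csv")   # placeholder paths
test = pd.read_csv("test.csv")

# One-command interactive profile of a single dataset
ProfileReport(train, title="Train EDA", minimal=True).to_file("train_profile.html")

# Side-by-side comparison to spot drift between train and test
sv.compare([train, "Train"], [test, "Test"]).show_html("train_vs_test.html")
```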
🔶 drop_duplicates() catches exact copies. But real data has a sneakier problem that it completely misses.

Same person. Slightly different entry. Same employee ID.

Employee_ID: 10234 | Name: John Kamau | Dept: Sales
Employee_ID: 10234 | Name: J. Kamau | Dept: sales

Those look different enough that drop_duplicates() won't touch them. But they're the same person entered twice.

Here's how to catch it:

# Find IDs appearing more than once
duplicate_ids = df[df.duplicated(subset=["Employee_ID"], keep=False)]
print(f"Records with duplicate IDs: {len(duplicate_ids)}")
print(duplicate_ids.sort_values("Employee_ID").head(20))

🔷 This shows every row that shares an ID with another row. Now you can actually investigate instead of guessing.

The fix depends on what you find:

# Keep only the most recent entry per employee
df = df.sort_values("date_added", ascending=False)
df = df.drop_duplicates(subset=["Employee_ID"], keep="first")

Soft duplicates are dangerous for one reason: your analysis treats one person as two data points. Your model learns from the same person twice. Your headcount reports are wrong from the start. And none of it raises an error. Everything looks fine.

📍 Check for duplicates by key columns, not just identical rows. That extra step catches what the default function misses.

❓ Have you ever found soft duplicates in a dataset? What gave it away?

#DataCleaning #Python #DataScience
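A self-contained version of the snippets above on a tiny, made-up table (the names, IDs, and dates are illustrative only):

```python
import pandas as pd

# Hypothetical employee table containing one "soft" duplicate
df = pd.DataFrame({
    "Employee_ID": [10234, 10234, 10500],
    "Name": ["John Kamau", "J. Kamau", "Amina Yusuf"],
    "Dept": ["Sales", "sales", "Finance"],
    "date_added": pd.to_datetime(["2024-01-10", "2024-03-02", "2024-02-15"]),
})

# Flag every row whose Employee_ID appears more than once
duplicate_ids = df[df.duplicated(subset=["Employee_ID"], keep=False)]
print(duplicate_ids.sort_values("Employee_ID"))

# Resolution: keep the most recent entry per employee
df = df.sort_values("date_added", ascending=False)
df = df.drop_duplicates(subset=["Employee_ID"], keep="first")
print(df)
```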
Most people approach data analytics as a checklist of tools. That's the wrong approach. High-quality work comes from understanding structure, not just execution.

At the core sits business understanding. Everything else supports it. Data comes in. It gets cleaned. Then it is explored using SQL or Python. Findings are shaped into visuals. Finally, those visuals are turned into decisions. Add AI on top and the speed increases, but clarity still depends on how well the foundation is built.

Here's where most go wrong: they jump straight to dashboards, they skip context, they ignore data quality. The result looks good but fails in real decisions.

Strong analysts don't work in steps. They think in systems. Every part connects. Every layer affects the outcome. If one piece is weak, everything built on top of it becomes unreliable.

That's the difference between reporting numbers and driving decisions.

Your weakest link?

#dataanalytics #businessanalytics #datascience #datavisualization #powerbi #sql #python #aiforbusiness #datastorytelling
Day 19 — Merging & Joining Data in Pandas

As I continue deepening my understanding of pandas, today's focus was on something very practical: combining datasets. In real-world scenarios, data rarely comes in a single clean table. You often have multiple datasets that need to be brought together before any meaningful analysis can happen. That's where pandas functions like merge(), join(), and concat() come in.

Here's a quick breakdown of what I learned:

🔹 merge()
This is similar to SQL joins. It combines datasets based on a common column and supports inner, left, right, and outer joins.
Example: pd.merge(df1, df2, on="id", how="inner")

🔹 join()
Used mainly for combining DataFrames based on their index. It's a bit more concise when working with indexed data.

🔹 concat()
Used to stack DataFrames either vertically (adding more rows) or horizontally (adding more columns).
Example: pd.concat([df1, df2], axis=0)

💡 Key Insight: Understanding when to use each method is crucial.
- Use merge() when working with relational data
- Use concat() when stacking data
- Use join() for index-based alignment

This concept is especially important in data cleaning and preprocessing, where datasets often come from different sources. Each day, pandas feels less like a tool and more like a language for working with data.

#M4aceLearningChallenge #Day19 #DataScience #MachineLearning #Python #Pandas #DataAnalysis
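A small runnable sketch of all three approaches side by side, using made-up tables (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical tables to illustrate the three approaches
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Ben", "Cleo"]})
orders = pd.DataFrame({"id": [1, 2, 2], "amount": [250, 90, 40]})

# merge(): SQL-style join on a common column
merged = pd.merge(customers, orders, on="id", how="inner")

# join(): index-based alignment
scores = pd.DataFrame({"score": [0.9, 0.7, 0.8]}, index=[1, 2, 3])
joined = customers.set_index("id").join(scores)

# concat(): stack rows (axis=0) or columns (axis=1)
more_customers = pd.DataFrame({"id": [4], "name": ["Dede"]})
stacked = pd.concat([customers, more_customers], axis=0, ignore_index=True)

print(merged, joined, stacked, sep="\n\n")
```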
Correlation tells you what moved together. Causal inference tells you what actually caused it.

After this, you'll be able to estimate the true causal effect of any intervention (a promo, a product change, a policy shift) from observational data. No A/B test required.

The technique: Propensity Score Matching (PSM) in Python.

𝗦𝘁𝗲𝗽 𝟭: 𝗜𝗻𝘀𝘁𝗮𝗹𝗹
```bash
pip install causalinference
```

𝗦𝘁𝗲𝗽 𝟮: 𝗣𝗿𝗲𝗽𝗮𝗿𝗲 𝘆𝗼𝘂𝗿 𝗱𝗮𝘁𝗮
You need three columns: outcome Y, binary treatment D, and confounders X.
```python
import pandas as pd

df = pd.read_csv("observational_data.csv")
Y = df["revenue"].values
D = df["received_promo"].values   # 1 = treated, 0 = control
X = df[["age", "tenure", "spend_last_90d"]].values
```

𝗦𝘁𝗲𝗽 𝟯: 𝗕𝘂𝗶𝗹𝗱 𝗮𝗻𝗱 𝗿𝘂𝗻 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹
```python
from causalinference import CausalModel

model = CausalModel(Y, D, X)
model.est_via_matching()
print(model.estimates)
```

𝗦𝘁𝗲𝗽 𝟰: 𝗥𝗲𝗮𝗱 𝘆𝗼𝘂𝗿 𝗿𝗲𝘀𝘂𝗹𝘁𝘀
The key output is the ATE (Average Treatment Effect): the estimated causal lift, adjusted for selection bias.

📌 Always check `model.summary_stats` first. If the treated and control groups don't overlap in the propensity score distribution, your estimate is invalid; check covariate balance before trusting any number.

The result: instead of "promo users had 23% higher revenue," you can say "the promo caused a £42 average revenue lift, controlling for age and prior spend." That's a claim your finance team can't easily dismiss.

Have you applied causal inference in a real project? What's the hardest part to justify to non-technical stakeholders?

#DataAnalytics #Data #Python #DataScience #Analytics #Statistics #CausalInference #BusinessIntelligence
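As a follow-up to the overlap warning above, here is a sketch of the balance check, continuing from the `model` object built in Step 3. The method names follow the causalinference package's documented API, but treat the exact options as assumptions:

```python
# Continuing from Step 3: check balance and overlap before trusting the ATE
model.est_propensity_s()               # fit propensity scores
print(model.summary_stats)             # covariate means and normalized differences, treated vs control

# If overlap is poor, restrict to the common-support region and re-estimate
model.trim_s()
model.est_via_matching(bias_adj=True)  # matching with bias adjustment
print(model.estimates)
```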