Data Cleaning in Python for Data Analysts

Day 4 of My Data Analyst Journey: Data Cleaning in Python

Today I practiced data cleaning techniques in Python, focusing on messy, real-world text data.

Problem Statement:
I had a dataset of customer feedback containing:
• Extra spaces
• Mixed casing (UPPER/lower)
• Punctuation (., !, ?)

Objective: Clean and standardize the feedback text for better analysis.

What I implemented:
• Removed punctuation using .replace()
• Converted text to lowercase
• Removed leading and trailing spaces using .strip()
• Handled lists inside a dictionary

Python Code:

feedback_data = {
    'S_No': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Name': ['Ravi', 'Meera', 'Sam', 'Anu', 'Raj', 'Divya', 'Arjun', 'Kiran', 'Leela', 'Nisha'],
    'Feedback': [
        ' Very GOOD Service!!!',
        'poor support, not happy ',
        'GREAT experience! will come again.',
        'okay okay...',
        ' not BAD',
        'Excellent care, excellent staff!',
        'good food and good ambience!',
        'Poor response and poor handling of issue',
        'Satisfied. But could be better.',
        'Good support... quick service.'
    ],
    'Rating': [5, 2, 5, 3, 2, 5, 4, 1, 3, 4]
}

# Characters to remove from every string value
punctuation = ".,!?"
cleaned_feedbackdata = {}

for key, value in feedback_data.items():
    if isinstance(value, list):
        new_list = []
        for item in value:
            # Only strings are cleaned; numbers pass through unchanged.
            # Note: this also lowercases the 'Name' column.
            if isinstance(item, str):
                item = item.strip().lower()
                for p in punctuation:
                    item = item.replace(p, "")
            new_list.append(item)
        cleaned_feedbackdata[key] = new_list
    else:
        cleaned_feedbackdata[key] = value

print(cleaned_feedbackdata)

Outcome: Cleaned and structured feedback data, ready for analysis such as sentiment detection, keyword extraction, and insights generation.

Key Learning: Data cleaning is one of the most important steps in data analysis. Clean data = better insights!

#Python #DataCleaning #DataAnalytics #LearningJourney #BeginnerToPro #CodingPractice #100DaysOfCode
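The same cleaning steps can also be written with str.translate, which removes all punctuation in one pass instead of one .replace() call per character. This is a minimal sketch, not the author's method: the helper name clean_text and the sample list are made up for illustration, and it removes every character in string.punctuation, which is a wider set than the four characters used in the post.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    no_punct = text.lower().translate(_PUNCT_TABLE)
    # split() with no arguments splits on any whitespace and drops empty
    # pieces, so re-joining with one space also trims both ends.
    return " ".join(no_punct.split())


# Hypothetical sample values in the style of the feedback column.
feedback = ["  Very   GOOD Service!!!", "okay okay...", "Satisfied. But could be better."]
cleaned = [clean_text(item) for item in feedback]
print(cleaned)
```

One caveat of deleting punctuation outright: text such as "good...quick" would become "goodquick". Replacing punctuation with a space before collapsing whitespace avoids that if it matters for the analysis.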


Ishant Bhardwaj
5d
🚀 Strings & String Methods in Python #Day31

If variables are containers, strings are how Python stores and handles text data. Names, emails, passwords, customer data, file paths, web scraping, data cleaning: strings are everywhere.

🔹 What is a String?
A string is a sequence of characters enclosed in quotes.

name = "Harry"
city = 'Delhi'

Both single and double quotes work the same. Strings can contain:
✅ Letters
✅ Numbers (as text)
✅ Symbols
✅ Spaces

"Python"
"12345"
"Hello @2026"

🔹 Multiline Strings
Use triple quotes for text spanning multiple lines:

message = """This is
a multi line
string"""

Useful for documentation, SQL queries, or long messages.

🔹 String Indexing
Each character has a position (index).

text = "Python"

P y t h o n
0 1 2 3 4 5

print(text[0])  # P
print(text[3])  # h

⚡ Indexing starts from 0.

Python also supports negative indexing:

text[-1]  # n
text[-2]  # o

Very useful when working from the end of a string.

✂️ String Slicing
Slicing extracts a portion of a string. The start index is included, the end index is not.

text[0:3]  # Pyt
text[2:]   # thon
text[:4]   # Pyth

Negative slicing:

text[-3:]  # hon

Powerful and widely used in data manipulation.

🔹 len() Function
Find the length of a string:

len("Python")  # 6

Even spaces are counted.

🛠 Common String Methods

1. lower() and upper()
"PYTHON".lower()  # 'python'
"python".upper()  # 'PYTHON'
Useful for standardizing text.

2. strip()
Removes leading and trailing whitespace (not spaces inside the text):
" hello ".strip()  # 'hello'
Great for cleaning raw data.

3. replace()
"Hello World".replace("World", "Python")  # 'Hello Python'

4. split()
Turns a string into a list:
"apple,banana,orange".split(",")  # ['apple', 'banana', 'orange']
Used heavily in data parsing.

5. join()
Opposite of split:
",".join(["apple", "banana", "orange"])  # 'apple,banana,orange'

6. find()
Find the position of text:
"Hello World".find("World")  # 6
Returns the index of the first match, or -1 if not found.

7. startswith() and endswith()
email = "test@example.com"  # example value
email.endswith(".com")      # True
email.startswith("test")    # True
Very useful in validation.

🔍 Checking String Content
isalpha(), isdigit(), isalnum()

Examples:
"Python".isalpha()     # True
"123".isdigit()        # True
"Python123".isalnum()  # True
Useful for validation logic.

🔄 Strings Are Immutable
Important concept:

text = "Python"
text[0] = "J"  # ❌ TypeError

Strings cannot be modified in place. Any "change" creates a new string, for example "J" + text[1:].

💡 Why Strings Matter in Data Analytics
Strings are everywhere in analytics:
📌 Cleaning messy datasets
📌 Working with CSV files
📌 Parsing emails & text
📌 Filtering data
📌 Web scraping
📌 Text analysis

Mastering strings makes data cleaning much easier. Python strings may look simple, but they're one of the most useful tools in programming.

#Python #PythonProgramming #DataAnalytics #PowerBI #Excel #MicrosoftPowerBI #MicrosoftExcel #DataAnalysis #DataAnalysts #CodeWithHarry #DataVisualization #DataCollection #DataCleaning
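Several of the methods above are usually combined in a single validation step. The sketch below shows one way to normalize and roughly check email-like strings using strip(), lower(), count() and partition(). The function names and sample values are hypothetical, and the check is deliberately loose: it is not a full email validator.

```python
def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase an email-like string."""
    return raw.strip().lower()


def looks_like_email(value: str) -> bool:
    """Rough check: one '@', text before it, and a dotted domain after it."""
    if value.count("@") != 1:
        return False
    local, _, domain = value.partition("@")
    has_dot_inside = "." in domain and not domain.startswith(".") and not domain.endswith(".")
    return bool(local) and has_dot_inside


# Made-up inputs: one valid after cleaning, two invalid.
raw_emails = ["  Harry@Example.COM ", "no-at-sign.com", "a@b"]
for raw in raw_emails:
    email = normalize_email(raw)
    print(repr(email), looks_like_email(email))
```

Because strings are immutable, normalize_email returns a new string and leaves the original value untouched.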
Obiageli Innocent
3w
Day 12/30 - Nested Data Structures in Python

Today everything clicked. Lists, dicts, tuples: they don't live separately. Real data nests them together.

What is Nesting?
Nesting means placing one data structure inside another. A list can contain dictionaries. A dictionary can contain lists. A dictionary can even contain other dictionaries. This is how Python represents complex, real-world data, and it is the same structure used in JSON APIs, databases, and config files.

Four Common Nesting Patterns
List inside Dict -> a dictionary key holds a list as its value, e.g. a student's list of scores
Dict inside List -> a list contains multiple dictionaries, e.g. a list of student records
Dict inside Dict -> a key holds another dictionary, e.g. a user with a nested address object
List inside List -> a list contains other lists, e.g. rows and columns in a grid or table

How to Access Nested Data
You access nested data by chaining brackets, one for each level you go deeper:

data["student"]["scores"][0]  # open the dict, go to the "scores" key, grab index 0

Rule: count the levels of nesting, then use that many brackets to reach the value.

Looping Through Nested Structures
When your data is a list of dictionaries, use a for loop to go through each dictionary, then use bracket notation to pull out values. This is the most common real-world pattern: reading records from an API or database.

Code Example 1: List Inside a Dict

student = {
    "name": "Obiageli",
    "scores": [88, 92, 75, 95],
    "passed": True
}

print(student["scores"])      # [88, 92, 75, 95]
print(student["scores"][0])   # 88
print(student["scores"][-1])  # 95

Key Learnings
☑ Nesting = placing one data structure inside another
☑ Access nested data by chaining brackets, one bracket per level
☑ A list of dictionaries is the most common pattern; it's how API and database data looks
☑ Use a for loop to go through a list of dicts and pull values from each record
☑ Nested structures are the foundation of JSON. Master this and real-world data won't feel foreign

My Takeaway
Nested data structures are where all the previous days connect. Lists, tuples, sets, dictionaries: they don't live in isolation. Real data combines all of them. Today I started seeing data the way Python sees it.

#30DaysOfPython #Python #LearnToCode #CodingJourney #WomenInTech
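The post describes looping through a list of dictionaries but only shows the list-inside-dict case. A minimal sketch of the looping pattern follows. The records, names, and the "address" key are invented for illustration and are not from the original post.

```python
# Hypothetical records: a list of dicts, each holding a list and a nested dict.
students = [
    {"name": "Ada", "scores": [88, 92, 75], "address": {"city": "Lagos"}},
    {"name": "Ben", "scores": [60, 71], "address": {"city": "Accra"}},
]

averages = {}
for student in students:
    scores = student["scores"]          # level 2: list inside the dict
    city = student["address"]["city"]   # level 3: dict inside the dict
    averages[student["name"]] = sum(scores) / len(scores)
    print(student["name"], city, averages[student["name"]])

# Three levels deep: second record -> "scores" list -> first item.
first_score_of_second = students[1]["scores"][0]

# .get() avoids a KeyError when a key may be missing from some records.
phone = students[0].get("phone", "not provided")
```

Each bracket in students[1]["scores"][0] corresponds to one level of nesting, which matches the "one bracket per level" rule in the post.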
Nischal Karki
2w
Day 2 of Learning Python: I Just Built My First Real Data Audit System 📊🐍

Today I didn't just "learn Python". I used it to analyze structured, company-style audit data and built a Mistake Scoring System that automatically evaluates performance. Honestly, it felt like stepping into real business intelligence work.

💡 What I built today:
Using Pandas, I processed an audit dataset and generated insights like:
📌 Total deals per responsible person
📌 Pipeline distribution per team member
📌 Mistake scoring based on missing actions (follow-ups, updates, documents)
📌 Final performance summary ranking everyone by errors

⚙️ The idea behind the system:
Instead of manually checking performance, I created a rule-based scoring system where:
Missing documents = +1 error
No follow-up = +1 error
No comment update = +1 error
Unresolved status = +3 heavy penalty
This turns raw data into actionable performance insights.

💻 Code I used:

import pandas as pd

file_path = r"Insert your Excel data file path here"

Note: The r before the string makes it a raw string, so Python does not treat backslashes in a Windows path as escape characters. Either save the Excel file in the folder the script runs from, or provide the full file path.

df = pd.read_excel(file_path)

# CLEAN DATA
df.columns = df.columns.str.strip()  # remove stray spaces from column names
df = df.fillna("No")                 # treat every missing cell as "No"

# MISTAKE SCORE SYSTEM
df["Mistake Score"] = 0
df.loc[df["Document/RF Request"] == "No", "Mistake Score"] += 1
df.loc[df["Comment Updates"] == "No", "Mistake Score"] += 1
df.loc[df["Follow up"] == "No", "Mistake Score"] += 1
df.loc[df["Status"].str.lower() == "unresolved", "Mistake Score"] += 3

# ANALYSIS
print(df["Responsible"].value_counts())
print(df.groupby(["Responsible", "Pipeline"]).size())

mistakes = df.groupby("Responsible")["Mistake Score"].sum().sort_values(ascending=False)
print(mistakes)

summary = df.groupby("Responsible").agg(
    Total_Deals=("Responsible", "count"),
    Total_Mistakes=("Mistake Score", "sum")
)
print(summary.sort_values("Total_Mistakes", ascending=False))

🚀 Key takeaway:
Even simple Python + Excel data can be turned into a decision-support system that highlights performance gaps quickly.

Day 2 of learning, and I'm already seeing how powerful data can be in real business environments. Can't wait to build dashboards and automate even more next 🔥

#Python #DataAnalysis #Pandas #LearningInPublic #DataScience #Automation #BusinessIntelligence #CareerGrowth
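The scoring rules themselves can be checked without an Excel file. The sketch below restates them as a plain-Python function over dictionaries, so each rule can be tested on a few hand-written rows before running it on real data. It is an illustration under assumptions, not the author's code: the sample deals and the names mistake_score, NO_PENALTIES and UNRESOLVED_PENALTY are hypothetical, and only the column names and point values come from the post.

```python
from collections import defaultdict

# Rules from the post: column name -> points added when the value is "No".
NO_PENALTIES = {"Document/RF Request": 1, "Comment Updates": 1, "Follow up": 1}
UNRESOLVED_PENALTY = 3


def mistake_score(deal: dict) -> int:
    """Score one deal; a missing column counts as "No", like fillna("No")."""
    score = sum(
        points
        for column, points in NO_PENALTIES.items()
        if deal.get(column, "No") == "No"
    )
    if str(deal.get("Status", "")).strip().lower() == "unresolved":
        score += UNRESOLVED_PENALTY
    return score


# Made-up deals for two people, "A" and "B".
deals = [
    {"Responsible": "A", "Document/RF Request": "Yes", "Comment Updates": "No",
     "Follow up": "Yes", "Status": "Resolved"},
    {"Responsible": "A", "Document/RF Request": "No", "Comment Updates": "No",
     "Follow up": "No", "Status": "Unresolved"},
    {"Responsible": "B", "Document/RF Request": "Yes", "Comment Updates": "Yes",
     "Follow up": "Yes", "Status": "resolved"},
]

totals = defaultdict(int)
for deal in deals:
    totals[deal["Responsible"]] += mistake_score(deal)

# Highest total first, mirroring sort_values(ascending=False).
ranking = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Keeping the rules in one dictionary makes it easy to change a penalty in one place and re-run the checks.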
Manish Mohapatra
1w
📊 Detecting & Treating Outliers in Python - The Data Points That Can Mislead You

You've cleaned missing values. Your dataset looks fine. But there's one more hidden problem most beginners miss: outliers. Sometimes a single outlier can badly distort your analysis.

🔹 Why Do Outliers Matter?
Because they can quietly break your results:
❌ Skew averages
❌ Mislead insights
❌ Affect visualizations
❌ Reduce model accuracy
👉 One extreme value can lead to one wrong conclusion

What is an Outlier?
An outlier is a data point that is significantly different from the rest of the data. It can be extremely high or extremely low. Either way, it does not represent the typical pattern.

Examples from real data:
An employee with a salary of ₹500 in a company where the average salary is ₹60,000
A customer who ordered 9,000 units when everyone else ordered between 5 and 50
An age value of 150 in a health dataset

These are not just unusual. Left untreated, they can distort your analysis.

Step 1: Detect Outliers Visually
Always start by looking at the data.

import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame with a numeric 'salary' column
# Box plot to spot outliers visually
sns.boxplot(x=df['salary'])
plt.show()

A box plot shows at a glance which values fall far outside the normal range. Any dot beyond the whiskers is a candidate outlier.

Step 2: Detect Outliers Using the IQR Method
The IQR (Interquartile Range) method is a simple, widely used way to flag outliers numerically. Values more than 1.5 × IQR below the first quartile or above the third quartile are flagged.

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Find outliers
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
print(outliers)

Anything below the lower limit or above the upper limit is flagged as an outlier.

Step 3: Treat the Outliers
Now you have three choices depending on your situation. Pick one; they are alternatives, not steps to run in sequence.

Remove them, when the outlier is clearly an error.

df = df[(df['salary'] >= lower) & (df['salary'] <= upper)]

Cap them, replacing extreme values with the boundary limit.

df['salary'] = df['salary'].clip(lower=lower, upper=upper)

Replace with the median, when you want to keep the row but fix the value.

median = df['salary'].median()
df['salary'] = df['salary'].apply(
    lambda x: median if x < lower or x > upper else x
)

How to Decide Which Method to Use

Situation | Best Approach
Value is a data entry error | Remove it
Value is extreme but possible | Cap it
You cannot afford to lose rows | Replace with median

Here is something beginners are rarely told: outliers are not always mistakes. Sometimes they are the most interesting part of your data: the customer who spends the most, the employee who performs the best, the product that sells far beyond expectations.

Your job is not to blindly remove them. Your job is to understand them first, then decide. That is what separates a careful analyst from a careless one. 💡

#DataAnalytics #Python #DataCleaning #Outliers #DataAnalyst #LearningData
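To see the IQR arithmetic on concrete numbers, here is a small worked sketch using only the standard library. The order quantities are invented, loosely following the "9,000 units" example in the post. It assumes that statistics.quantiles with method="inclusive" interpolates linearly between data points, which should agree with the default behaviour of pandas' quantile() on the same values.

```python
import statistics

# Made-up order quantities with one extreme value.
orders = [5, 12, 18, 25, 30, 42, 50, 9000]

# n=4 returns the three quartile cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(orders, n=4, method="inclusive")
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences.
outliers = [x for x in orders if x < lower or x > upper]

# Capping: pull extreme values back to the nearest fence.
capped = [min(max(x, lower), upper) for x in orders]

# Compare how much one value moves the mean.
mean_all = statistics.mean(orders)
mean_without = statistics.mean(x for x in orders if x not in outliers)

print("Q1, Q3, IQR:", q1, q3, iqr)
print("fences:", lower, upper)
print("outliers:", outliers)
print("mean with / without outlier:", mean_all, mean_without)
```

The comparison of the two means shows why a single extreme value matters: the median barely notices it, while the mean is pulled far away from the typical order size.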
anuj chhetri
3w
My Data Science Journey: Python Tuple, Set, Dictionary & the Collections Library

Today's focus was on Python's core data structures (tuples, sets, and dictionaries) along with the collections module, which extends them for real-world use cases.

What I Learned:

Tuple
- Ordered, immutable, allows duplicates
- Single-element tuples require a trailing comma → ("cat",)
- Supports packing and unpacking → x, y = 10, 30
- Cannot be modified after creation (item assignment raises TypeError)
- Faster than lists in certain operations
- Used in scenarios like geographic coordinates and fixed records
- Can be used as dictionary keys when all elements are hashable (unlike lists)

Set
- Unordered, mutable, stores unique elements only
- No indexing or slicing support
- An empty set must be created using set() ({} creates a dict)
- .remove() raises KeyError if the element is not found
- .discard() removes safely without error
- Supports operations like union, intersection, difference, symmetric_difference
- Methods like issubset(), issuperset(), isdisjoint() help in set comparisons
- frozenset provides an immutable version of a set
- Offers O(1) average time complexity for membership checks

Dictionary
- Key-value pair structure, mutable, keys must be unique, preserves insertion order (Python 3.7+)
- Built on hash tables for fast lookups
- user["key"] → raises KeyError if missing
- user.get("key", default) → safe access with fallback
- Methods: keys(), values(), items() for iteration
- pop(), popitem(), update(), clear(), del for modifications
- Widely used in real-world data like APIs and JSON responses
- Common pattern: list of dictionaries for structured datasets

Collections Library
- namedtuple → tuple with named fields for better readability
- deque → efficient queue with O(1) appends and pops on both ends
- ChainMap → groups multiple dictionaries into one view without copying them
- OrderedDict → maintains order with additional utilities like move_to_end()
- UserDict, UserList, UserString → useful for customizing built-in behaviors with validation and extensions

Performance Insight (membership test, x in container)
- List → O(n)
- Tuple → O(n)
- Set → O(1) (average lookup)
- Dictionary → O(1) (average key lookup)

Key Insight:
Understanding when to use each data structure, and how collections enhances them, is crucial for writing efficient, scalable, and clean Python code.

Read the full breakdown with examples on Medium 👇 https://lnkd.in/gvv5ZBDM

#DataScienceJourney #Python #Tuple #Set #Dictionary #Collections #Programming #DataStructures
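A few of the points above are easier to remember once seen running. The following is a short sketch using namedtuple, deque and ChainMap from the standard library, plus the remove/discard difference on sets. All values and names (Point, office, settings and so on) are arbitrary illustrations, not taken from the linked article.

```python
from collections import ChainMap, deque, namedtuple

# namedtuple: a tuple whose fields can be read by name.
Point = namedtuple("Point", ["lat", "lon"])
office = Point(lat=28.61, lon=77.21)   # arbitrary coordinates
visits = {office: 3}                    # hashable, so usable as a dict key

# deque with maxlen: keeps only the most recent items.
recent = deque(maxlen=3)
for event in ["login", "view", "edit", "logout"]:
    recent.append(event)                # "login" is dropped once the deque is full
oldest = recent.popleft()               # O(1) removal from the left end

# ChainMap: looks up keys in the first mapping, then falls back to the next.
defaults = {"theme": "light", "page_size": 20}
overrides = {"page_size": 50}
settings = ChainMap(overrides, defaults)
defaults["theme"] = "dark"              # visible through the ChainMap: it is a view, not a copy

# Sets: discard() is silent, remove() raises KeyError for a missing element.
tags = {"a", "b"}
tags.discard("z")
try:
    tags.remove("z")
    remove_raised = False
except KeyError:
    remove_raised = True

print(office.lat, list(recent), settings["page_size"], settings["theme"])
```

The ChainMap lines show the practical meaning of "without copying": later changes to the underlying dictionaries are reflected in lookups.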

Python — Tuple, Set, Dictionary & the collections Library: A Complete Guide medium.com
SELVASUNDAR RAJAN
1w
✅ *Top Python Interview Q&A - for Data Science Roles* 🌱

*1️⃣ What is Pandas and why use it?*
Pandas is one of Python's most widely used libraries for data analysis and manipulation. It provides DataFrames (Excel-like tables) and Series (columns). It is well suited to cleaning, transforming, and analyzing CSV/Excel data.
```
import pandas as pd
df = pd.read_csv('sales.csv')  # Load data
print(df.head())               # First 5 rows
print(df.shape)                # (rows, columns)
```

*2️⃣ How do you load a CSV file into Pandas?*
Use pd.read_csv(). CSV is the most common data source in interviews, and options such as nrows let you read only part of a large file.
```
df = pd.read_csv('data.csv')
# Common options:
df = pd.read_csv('data.csv', sep=';', encoding='utf-8', nrows=1000)
```

*3️⃣ What is the difference between DataFrame and Series?*
DataFrame = table (rows + columns)
Series = single column
A DataFrame has a 2D structure; a Series is 1D.
```
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})  # DataFrame
series = df['A']                               # Series
print(type(df))      # <class 'pandas.core.frame.DataFrame'>
print(type(series))  # <class 'pandas.core.series.Series'>
```

*4️⃣ How do you check basic info about DataFrame?*
Use info(), describe(), head(), tail(), shape, columns. Essential for data exploration.
```
df.info()           # Data types, memory, non-null counts
df.describe()       # Stats (mean, std, min, max)
print(df.head(3))   # First 3 rows
print(df.shape)     # e.g. (1000, 5)
print(df.columns)   # e.g. Index(['name', 'age', 'city'], dtype='object')
```

*5️⃣ How do you select single column from DataFrame?*
Use df['column_name'], or df.column_name if the name has no spaces and does not clash with a DataFrame attribute. Both return a Series.
```
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
names = df['name']  # Series
ages = df.age       # Same kind of access, returns the 'age' Series
print(names[0])     # Alice
```

*6️⃣ How do you filter rows based on condition?*
Use boolean indexing. It is the most common data selection method.
```
# Age > 25
high_age = df[df['age'] > 25]
# Multiple conditions (each condition needs its own parentheses)
adult_male = df[(df['age'] > 18) & (df['gender'] == 'M')]
```

*7️⃣ How do you add a new column to DataFrame?*
Simple assignment. The new column has one value per row.
```
df['bonus'] = df['salary'] * 0.1           # 10% bonus
df['high_earner'] = df['salary'] > 50000   # Boolean column
df['name_length'] = df['name'].str.len()   # String length
```

*8️⃣ How do you sort DataFrame by column?*
Use sort_values(). Pass ascending=False for descending order. Common for ranking.
```
# Sort by salary (descending), modifying df in place
df.sort_values('salary', ascending=False, inplace=True)
# Multiple columns (returns a new DataFrame)
df_sorted = df.sort_values(['department', 'salary'], ascending=[True, False])
```

*9️⃣ How do you check for missing values?*
isnull().sum() gives the count per column. A critical first step in data cleaning.
```
print(df.isnull().sum())
# age        5
# salary     0
# city      10
print(df.isna().sum())  # Same as isnull()
```
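The snippets above each assume a different DataFrame, so none of them runs on its own. The sketch below ties a few of the answers together on one small, made-up table, so the effect of missing values on filtering and sorting is visible. It requires pandas to be installed; the names and numbers are invented for illustration.

```python
import pandas as pd

# Made-up data with one missing age and one missing salary.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara", "Dev"],
    "age": [25, 30, 41, None],
    "salary": [48000, 62000, None, 55000],
})

# Q9: count missing values per column.
missing = df.isnull().sum()

# Q6: boolean indexing. Comparisons with NaN are False, so Dev is excluded.
high_age = df[df["age"] > 25]

# Q8: sorting. Rows with a missing sort key are placed last by default.
ranked = df.sort_values("salary", ascending=False)

# Q7: new boolean column. A missing salary compares as False.
df["high_earner"] = df["salary"] > 50000

print(missing)
print(high_age["name"].tolist())
print(ranked["name"].tolist())
print(df["high_earner"].tolist())
```

This is why checking isnull().sum() first matters: missing values silently drop out of filters and comparisons rather than raising an error.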
Shravan Kharade
4w
🚀 Day 7 of My Python Learning Journey | String Methods | Business Analyst Aspirant

Continuing my Python journey to strengthen my skills for a Business Analyst role 📊

Today, I worked on String Methods in Python, which are extremely useful for data cleaning, transformation, and preprocessing, all key tasks in real-world analytics.

💻 Topic: String Methods in Python

# Remove spaces
text1 = " hello python learners "
print("Clean text:", text1.strip())

# Upper & Lower case
print("Upper:", text1.upper().strip())
print("Lower:", text1.lower().strip())

# Replace text
print("Replace:", text1.replace("python", "SQL").strip())

# Count occurrences
print("Count of 'o':", text1.count("o"))

# Check start
print("Starts with hello:", text1.strip().startswith("hello"))

# Check numeric
mobile = "9876543210"
print("Is numeric:", mobile.isnumeric())

# Split & Join
msg = "Welcome to python Course"
words = msg.split()
print("Words list:", words)
joined_text = "_".join(words)
print("Joined text:", joined_text)

# Find position
print("Index of 'p':", msg.find("p"))

# Extract domain
email = "student@example.com"
domain = email[email.find("@") + 1:]
print("Domain:", domain)

# Data Cleaning Example (Price)
price_text = "Price : ₹3500/-"
clean_price = (
    price_text.replace("Price :", "")
    .replace("₹", "")
    .replace("/-", "")
    .strip()
)
print("Clean price:", clean_price)

💡 Key Learnings:
Cleaned raw text data using strip() and replace()
Transformed text using upper(), lower(), split(), and join()
Extracted useful information (like email domain)
Practiced real-world data cleaning (price formatting)

📌 These skills are directly applicable in:
✔ Data Cleaning
✔ Excel / SQL transformations
✔ Power BI datasets

I'm learning Python through Satish Dhawale sir's course (SkillCourse) and practicing daily 💻

🔥 Next step: Applying these concepts on real datasets and analytics projects

Let's connect if you're also learning Python or Data Analytics 🤝

#Python #StringMethods #DataCleaning #BusinessAnalyst #DataAnalytics #LearningJourney #SkillDevelopment #SatishDhawale #SkillCourse #UpGrad
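The chained .replace() calls in the price example work for one exact format. When the surrounding text varies, a regular expression that pulls out the digits can be more forgiving. The sketch below is one possible approach, not the course's method: parse_price is a hypothetical helper, it returns an int so the value can be used in calculations, and it ignores decimals.

```python
import re


def parse_price(text: str):
    """Return the first number in text as an int, or None if there is none."""
    # A digit followed by any mix of digits and thousands separators.
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# Made-up inputs in a few different formats.
samples = ["Price : ₹3500/-", "MRP ₹1,25,000 only", "Price: N/A"]
for sample in samples:
    print(sample, "->", parse_price(sample))
```

Returning None for unparseable text keeps bad rows visible, so they can be counted or reviewed instead of silently turning into zero.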
Shravan Kharade
3w
🚀 Day 16 of My Python Learning Journey | Sets | Business Analyst Aspirant

Continuing my journey towards becoming a Business Analyst, I explored Sets in Python, a powerful data structure for handling unique data and performing comparisons 📊

Sets are extremely useful in real-world analytics for removing duplicates and comparing datasets.

💻 Topic: Sets in Python

# Create a set (duplicates automatically removed)
fruits = {"Apple", "Banana", "Apple", "Mango"}
print(fruits)

# Add item
fruits.add("Orange")
print(fruits)

# Remove item
fruits.discard("Banana")
print(fruits)

# Set Operations
a = {1, 2, 3}
b = {3, 4, 5}
print("Union:", a | b)
print("Intersection:", a & b)
print("Difference:", b - a)
print("Symmetric Difference:", a ^ b)

# Remove duplicates from list
cities = ["Mumbai", "Pune", "Delhi", "Mumbai"]
unique = set(cities)
print("Unique cities:", unique)

# Find missing values (list1 and list2 are sets, despite the names)
list1 = {"SQL", "Excel", "Power BI"}
list2 = {"SQL", "Power BI"}
missing = list1 - list2
print("Missing:", missing)

# Find common skills
deptA = {"SQL", "Excel", "Python"}
deptB = {"Excel", "Python", "Power BI"}
print("Common skills:", deptA & deptB)

💡 Key Learnings
✔ Learned how to store unique values using sets
✔ Removed duplicates efficiently
✔ Performed set operations (union, intersection, difference)
✔ Compared datasets to find missing and common values

📌 Why Sets are Important for Business Analytics
Sets are used in:
🔹 Data cleaning (removing duplicates)
🔹 Comparing datasets (common / missing values)
🔹 Skill gap analysis
🔹 Data validation and filtering

I'm learning Python through Satish Dhawale sir (SkillCourse) and practicing daily 💻

🔥 Next step: Applying these concepts on real datasets and analytics problems

Let's connect if you're also learning Python or Data Analytics 🤝

#Python #Sets #DataAnalytics #BusinessAnalyst #LearningJourney #DataCleaning #PythonProgramming #Automation #Upskilling #SQL #PowerBI #SatishDhawale #SkillCourse #UpGrad #DataScience #BusinessAnalytics #DataProcessing #CareerGrowth #AnalyticsJourney #ProblemSolving
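One detail worth knowing when using set() to remove duplicates from a list: sets are unordered, so the original order of the cities is not guaranteed to survive. If order matters, dict.fromkeys() is a common alternative, since dictionaries preserve insertion order. The sketch below shows that, plus a hypothetical helper (dedupe_normalized) for values that differ only in case or surrounding spaces; the sample values are made up.

```python
cities = ["Mumbai", "Pune", "Delhi", "Mumbai", "Pune"]

# dict keys are unique and keep first-seen order.
unique_in_order = list(dict.fromkeys(cities))


def dedupe_normalized(values):
    """Drop duplicates that differ only by case or surrounding whitespace."""
    seen = set()      # normalized keys already encountered (O(1) lookups)
    result = []
    for value in values:
        key = value.strip().lower()
        if key not in seen:
            seen.add(key)
            result.append(value.strip())  # keep the first spelling seen
    return result


messy = [" mumbai", "Mumbai ", "Pune", "PUNE", "Delhi"]
print(unique_in_order)
print(dedupe_normalized(messy))
```

The set inside dedupe_normalized is used only for fast membership checks, while the list carries the order.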
J V Pavan
4d
This PySpark code is a typical implementation of a reliable streaming pipeline.

⚙️ Phase 1: The Continuous Engine
This part of the code tells Databricks to keep the "engine" running 24/7.

(spark.readStream.table("source_append_table")
    .filter("(status IS NULL) AND (record_type = 'file_type')")
    .writeStream
    .foreachBatch(load_all_and_route_errors)  # calls the logic below
    .option("checkpointLocation", "/mnt/delta/checkpoints/dual_target_load")
    .trigger(processingTime='10 seconds')     # ✅ starts a new micro-batch every 10 seconds
    .start()
)

🛠️ Phase 2: The Validation & Routing Function
This is the internal logic (load_all_and_route_errors) that runs for every micro-batch of new data. Spark calls it with two arguments, the micro-batch DataFrame and a batch id, so the fragments below form the body of def load_all_and_route_errors(microBatchDf, batchId). F refers to pyspark.sql.functions (from pyspark.sql import functions as F).

1. Persisting Data (The Memory Guard) 💾

microBatchDf.persist()

Action: Caches the incoming micro-batch.
Why: Since we are writing to two tables (Main and Error), we don't want Spark to compute the batch twice. Caching it here avoids redoing the read and filter for the second write.

2. The Validation Engine (The Inspector) ⚖️

# Keep only the checks that fired. F.filter (PySpark 3.1+) is used because
# array_remove(arr, None) yields null rather than dropping the null entries.
errors = F.filter(
    F.array(
        F.when(F.col("order_id").isNull(), "Missing order_id"),
        F.when(F.col("price") < 0, "Negative price")
    ),
    lambda reason: reason.isNotNull()
)

Action: Captures WHY a row failed. It creates a list of errors for every row.
Note: Unlike simple filters, this gives you an audit trail of reasons for every bad record.

3. Flagging the Data 🏷️

validated_df = (microBatchDf
    .withColumn("validation_errors", errors)  # assumed column name; stores the reasons
    .withColumn("validation_status",
        F.when(F.size(errors) > 0, "Invalid").otherwise("Valid"))
)

Action: Tags every row as either Valid or Invalid based on the results of the Validation Engine, and keeps the reasons alongside it.

🍴 Phase 3: The Fork in the Road (Dual Write)

Path A: The Clean Production Table ✅

only_valid_records = validated_df.filter("validation_status = 'Valid'")
(only_valid_records.write
    .format("delta")
    .mode("append")
    .saveAsTable("main_target_table"))

Strategy: Only rows with zero errors move forward. This keeps your business dashboards clean and trustworthy.

Path B: The Quarantine/Error Table 🚨

invalid_records = validated_df.filter("validation_status = 'Invalid'")
if not invalid_records.isEmpty():
    (invalid_records.write
        .format("delta")
        .mode("append")
        .saveAsTable("error_records_table"))

Strategy: Redirects bad data to a separate log. Because the reasons are stored with each row, engineers can immediately see that "Row X failed because of a negative price."

🧹 Phase 4: Final Cleanup

microBatchDf.unpersist()

Action: Releases the cached batch.
Why: In a continually running job, cached batches that are never released accumulate and put pressure on cluster memory over time, which can end in out-of-memory (OOM) errors.

💡 Summary of "Continuous" Best Practices
Use Job Clusters: In Databricks, run this as a "Continuous" job so that Databricks automatically restarts it if the cloud provider has a hiccup.

Final tip: since this now runs continually, make sure your cluster is sized correctly for a 24/7 workload!
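The collect-reasons-then-route idea is independent of Spark and can be tried on plain Python records first. The sketch below mirrors the two checks from the post (missing order_id, negative price) on a list of dictionaries. It is an illustration under assumptions, not the pipeline itself: validate, route, the sample batch, and the "validation_errors" key are hypothetical names.

```python
def validate(record: dict) -> list:
    """Return the list of reasons this record is invalid (empty if valid)."""
    errors = []
    if record.get("order_id") is None:
        errors.append("Missing order_id")
    price = record.get("price")
    # A missing price is not flagged here, matching a null-safe "< 0" check.
    if price is not None and price < 0:
        errors.append("Negative price")
    return errors


def route(batch):
    """Split a batch into (valid, invalid); invalid rows carry their reasons."""
    valid, invalid = [], []
    for record in batch:
        errors = validate(record)
        if errors:
            invalid.append({**record, "validation_errors": errors})
        else:
            valid.append(record)
    return valid, invalid


# Made-up micro-batch with one clean row and two bad ones.
batch = [
    {"order_id": 1, "price": 9.5},
    {"order_id": None, "price": -2.0},
    {"order_id": 3, "price": -1.0},
]
valid, invalid = route(batch)
print(valid)
print(invalid)
```

Collecting every failing reason, rather than stopping at the first, is what makes the quarantine table useful for debugging: a row with two problems shows both.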
Vivek Bhave
2w Edited
Can SQL do feature analysis?

We usually do feature analysis in Python, but what if we cannot load millions of rows into Python? Can we do it with SQL?

To figure this out, I took the problem of customer churn and tried to understand why customers are leaving and what we can do about it. For this, I looked at the behavior of churned customers across the different groups of each feature. For example, does a high number of support calls lead to churn?

To study customer behavior, I calculated the churn rate for each group of each feature using AVG() in SQL. I used churn rate because it allows comparison irrespective of group size.

For numerical features like payment delay, I first computed the churn rate for each value using GROUP BY. I then looked for sudden jumps in churn rate between adjacent values, treated those jumps as the thresholds of behavioral change, and labeled the resulting groups using a CASE expression. For categorical features, the churn rate per category can be calculated directly.

To decide which features are important, I used these criteria:
1. The churn rate must differ significantly for at least one group compared to the others. This suggests that the threshold of that group is a breaking point in customer behavior.
2. The pattern should be stable, to avoid random noise.
3. Group sizes should be comparable.

Example: Issue Level (Support Calls)

+-------------+------------+
| Issue Level | Churn Rate |
+-------------+------------+
| Low         | 0.10       |
| Medium      | 0.25       |
| High        | 0.80       |
+-------------+------------+

The churn rate stays comparatively low across the low and medium levels but increases sharply at the high issue level. Customers stayed patient while their support calls were within the medium level. Once that threshold is crossed, 80% of the customers leave. That means one should resolve support issues before a customer reaches the high issue level; otherwise, the customer is likely to leave.

In the customer churn dataset, the features are: Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Contract Length, Total Spend, Last Interaction, and Churn.

For more detailed analysis, check out the GitHub repo (Notebooks/SQL_Analysis folder): https://lnkd.in/gUx9vgyE

#SQL #FeatureAnalysis #CustomerChurn #DataAnalytics #DataScience #SQLAnalytics #ChurnAnalysis #DataEngineering #BehavioralAnalysis #AnalyticsEngineering #BigData #DataCommunity
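The CASE plus AVG() pattern described above can be tried locally with Python's built-in sqlite3 module. This is a sketch under assumptions rather than the author's query: the table name, the tiny made-up dataset, and the thresholds (3 and 7 support calls) are invented for illustration. Because churn is stored as 0 or 1, AVG(churn) is the churn rate of each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (support_calls INTEGER, churn INTEGER)")

# Made-up rows: (support_calls, churn), with churn = 1 meaning the customer left.
rows = [
    (0, 0), (1, 0), (2, 0), (3, 1),
    (4, 0), (5, 1), (6, 1), (7, 0),
    (8, 1), (9, 1), (9, 1), (10, 0),
]
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Hypothetical thresholds; in practice they come from inspecting per-value rates.
query = """
SELECT
    CASE
        WHEN support_calls <= 3 THEN 'Low'
        WHEN support_calls <= 7 THEN 'Medium'
        ELSE 'High'
    END AS issue_level,
    COUNT(*) AS customers,
    AVG(churn) AS churn_rate
FROM customers
GROUP BY issue_level
ORDER BY churn_rate
"""
result = conn.execute(query).fetchall()
for issue_level, customers, churn_rate in result:
    print(issue_level, customers, churn_rate)
conn.close()
```

Including COUNT(*) next to the rate makes it easy to check the third criterion above, that group sizes are comparable, before reading anything into the churn rates.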