🚀 𝐃𝐚𝐲 𝟔: 🔥 𝐒𝐮𝐩𝐞𝐫 𝐞𝐱𝐜𝐢𝐭𝐞𝐝 𝐭𝐨 𝐬𝐡𝐚𝐫𝐞 𝐭𝐨𝐝𝐚𝐲'𝐬 𝐯𝐞𝐫𝐲 𝐢𝐦𝐩𝐨𝐫𝐭𝐚𝐧𝐭 𝐭𝐨𝐩𝐢𝐜! Today we are diving into one of the most crucial steps in data preprocessing.

📊 𝐇𝐚𝐧𝐝𝐥𝐢𝐧𝐠 𝐌𝐢𝐬𝐬𝐢𝐧𝐠 𝐕𝐚𝐥𝐮𝐞𝐬 & 𝐃𝐮𝐩𝐥𝐢𝐜𝐚𝐭𝐞 𝐃𝐚𝐭𝐚 𝐢𝐧 𝐏𝐚𝐧𝐝𝐚𝐬

In real-world datasets, data is never perfect. You will always face:
❌ Missing values (NaN)
❌ Duplicate records
If we don't handle them properly, they can completely distort our analysis, dashboards, and insights.

📌 𝟏. 𝐇𝐚𝐧𝐝𝐥𝐢𝐧𝐠 𝐌𝐢𝐬𝐬𝐢𝐧𝐠 𝐕𝐚𝐥𝐮𝐞𝐬
Missing values need careful treatment before analysis.
🔹 Check missing values: df.isnull().sum()
🔹 Remove missing data: df.dropna() for rows, df.dropna(axis=1) for columns
🔹 Fill missing data: df.fillna(0)

📌 𝟐. 𝐇𝐚𝐧𝐝𝐥𝐢𝐧𝐠 𝐃𝐮𝐩𝐥𝐢𝐜𝐚𝐭𝐞 𝐃𝐚𝐭𝐚
Duplicate rows can mislead KPIs and reporting accuracy.
🔹 Find duplicates: df.duplicated() and df.duplicated().sum()
🔹 View duplicates: df[df.duplicated()]
🔹 Remove duplicates: df.drop_duplicates()

Data cleaning is not just a step; it is the foundation of every successful analysis. 🚀 Feeling excited to continue this learning journey step by step!

#DataAnalytics #Python #Pandas #DataCleaning #MissingValues #DuplicateData
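The commands above can be strung together on a tiny example. A minimal sketch, assuming an invented DataFrame with one NaN and one fully duplicated row:

```python
import numpy as np
import pandas as pd

# Toy dataset (invented): one missing value, one fully duplicated row
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cal"],
    "sales": [100.0, 200.0, 200.0, np.nan],
})

# 1. Missing values
print(df.isnull().sum())           # NaN count per column
filled = df.fillna(0)              # replace NaN with 0
dropped = df.dropna()              # or drop rows that contain NaN

# 2. Duplicates
print(df.duplicated().sum())       # number of repeated rows
print(df[df.duplicated()])         # view them
deduped = df.drop_duplicates()     # keep the first occurrence
```

Note that dropna() and drop_duplicates() return new DataFrames; the original df is left untouched unless you assign the result back.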
🚀 Day 14: Building My First Complete Data Analysis Workflow

Today I worked on a complete mini data analysis project, combining everything I've learned so far in my Data Science journey.

📊 Project: Dataset Analysis using Pandas & Matplotlib

📌 What I did:
-> Loaded a real dataset using Pandas
-> Explored the data structure and summary
-> Handled missing values
-> Performed basic analysis
-> Visualized results using charts

💻 Concepts used:
-> Data cleaning
-> Data analysis
-> Data visualization

⚠️ Challenge I faced: Handling missing data correctly and deciding what to fill required careful thinking.

💡 Example from my code:
df["Age"] = df["Age"].fillna(df["Age"].mean())
(Assigning the result back is safer than fillna(..., inplace=True) on a selected column, a chained-inplace pattern that recent pandas versions deprecate.)

📊 Key insight: Data becomes meaningful only after cleaning and visualizing; it's not just about numbers.

🎯 Next step: Working on more structured projects and improving analytical thinking.

📌 Would appreciate suggestions: what should be my next step to improve as a beginner in Data Science?

#Day14 #DataScience #Python #Pandas #Matplotlib #Projects #LearningJourney
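The mean-imputation line from the post works like this on a toy Series (the "Age" column name matches the post; the values are invented):

```python
import numpy as np
import pandas as pd

# Invented data; "Age" matches the column named in the post
df = pd.DataFrame({"Age": [25.0, 30.0, np.nan, 35.0]})

# Assign the result back instead of calling fillna(..., inplace=True)
# on the selected column; that chained-inplace pattern is deprecated
# in recent pandas versions.
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df["Age"].tolist())  # NaN replaced by the mean of the rest, 30.0
```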
𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐭 𝟗𝟎 — 𝐃𝐚𝐲 𝟗: 𝐇𝐚𝐧𝐝𝐥𝐢𝐧𝐠 𝐎𝐮𝐭𝐥𝐢𝐞𝐫𝐬 𝐢𝐧 𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬

𝐎𝐮𝐭𝐥𝐢𝐞𝐫𝐬 — extreme or unusual values — can heavily influence analysis results if not handled correctly. Identifying and managing them is essential for building reliable and trustworthy insights.

🔍 𝐇𝐨𝐰 𝐭𝐨 𝐈𝐝𝐞𝐧𝐭𝐢𝐟𝐲 𝐎𝐮𝐭𝐥𝐢𝐞𝐫𝐬
𝐕𝐢𝐬𝐮𝐚𝐥 𝐦𝐞𝐭𝐡𝐨𝐝𝐬: Box plots, scatter plots, and histograms help spot unusual patterns at a glance.
𝐒𝐭𝐚𝐭𝐢𝐬𝐭𝐢𝐜𝐚𝐥 𝐭𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞𝐬: Methods like Z-scores and the Interquartile Range (IQR) highlight values that fall far from the normal range.

𝐑𝐞𝐦𝐨𝐯𝐢𝐧𝐠 𝐎𝐮𝐭𝐥𝐢𝐞𝐫𝐬 (𝐖𝐡𝐞𝐧 𝐀𝐩𝐩𝐫𝐨𝐩𝐫𝐢𝐚𝐭𝐞)
𝐓𝐫𝐢𝐦𝐦𝐢𝐧𝐠: Eliminating a small percentage of the most extreme values from both ends of the dataset.
𝐖𝐢𝐧𝐬𝐨𝐫𝐢𝐳𝐚𝐭𝐢𝐨𝐧: Limiting extreme values by replacing them with the nearest acceptable percentile.

𝐂𝐚𝐩𝐩𝐢𝐧𝐠 𝐄𝐱𝐭𝐫𝐞𝐦𝐞 𝐕𝐚𝐥𝐮𝐞𝐬
Define upper and lower limits and replace values outside these boundaries with predefined cutoff points.

𝐃𝐚𝐭𝐚 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧
𝐋𝐨𝐠 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧: Useful for reducing skewness and minimizing the influence of very large values.
𝐒𝐪𝐮𝐚𝐫𝐞 𝐫𝐨𝐨𝐭 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧: Another effective approach for moderating extreme variations.

𝐈𝐦𝐩𝐮𝐭𝐚𝐭𝐢𝐨𝐧 𝐓𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞𝐬
𝐌𝐞𝐚𝐧 𝐨𝐫 𝐦𝐞𝐝𝐢𝐚𝐧 𝐢𝐦𝐩𝐮𝐭𝐚𝐭𝐢𝐨𝐧: Replacing extreme values with a central tendency measure.
𝐊𝐍𝐍 𝐢𝐦𝐩𝐮𝐭𝐚𝐭𝐢𝐨𝐧: Using similar data points to estimate a more reasonable value.

🧠 𝐖𝐡𝐲 𝐔𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐢𝐧𝐠 𝐎𝐮𝐭𝐥𝐢𝐞𝐫𝐬 𝐌𝐚𝐭𝐭𝐞𝐫𝐬
𝐌𝐞𝐚𝐧𝐢𝐧𝐠𝐟𝐮𝐥 𝐨𝐮𝐭𝐥𝐢𝐞𝐫𝐬: Rare but valid events should often be retained.
𝐃𝐚𝐭𝐚 𝐞𝐫𝐫𝐨𝐫𝐬: Outliers caused by measurement or entry errors can be corrected or removed.

✅ 𝐂𝐡𝐨𝐨𝐬𝐢𝐧𝐠 𝐭𝐡𝐞 𝐑𝐢𝐠𝐡𝐭 𝐀𝐩𝐩𝐫𝐨𝐚𝐜𝐡
There's no one-size-fits-all solution. The right technique depends on:
How extreme the outliers are
How frequently they occur
Their impact on the analysis
And, most importantly, domain knowledge

🔑 Thoughtful handling of outliers leads to more accurate models and better decision-making.
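The IQR technique and the capping idea from the post can be sketched in a few lines of pandas. A toy example (the values are invented; 1.5×IQR is the conventional multiplier):

```python
import pandas as pd

# Invented measurements; 95 is an obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 14, 95])

# IQR rule: flag values more than 1.5×IQR outside the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())   # the flagged extreme value(s)

# Capping: clip everything to the computed bounds instead of dropping
capped = s.clip(lower, upper)
```

Whether to drop, cap, or keep the flagged values is exactly the domain-knowledge judgment call the post describes.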
Follow Sudeesh Koppisetti for such informative content on data analytics #DataAnalytics #DataAnalyst90 #SQL #Python #PowerBI #CareerGrowth #LearningResources #Books #DataPipelines #LinkedInLearning #PersonalGrowth #TechJourney
STOP SEARCHING, START ANALYZING. SAVE THIS INSTEAD.

The biggest bottleneck for Data Analysts isn't the code—it's the constant context switching. Stop wasting 20 minutes Googling syntax you've used 100 times. I've found the Ultimate Data Analyst Visual Cheat Sheet.

📌 WHAT'S INSIDE THIS TOOLKIT:
🔹 The Math: Descriptive & Inferential Stats (p-values, t-tests).
🔹 The Cleanup: Data Preprocessing (Outliers, Scaling, Missing Values).
🔹 The Visuals: EDA Essentials (Heatmaps, Pairplots, Boxplots).
🔹 The Logic: ML Basics & Time Series (Regression to ARIMA).
🔹 The Syntax: Python + SQL + R + Excel quick refs.

💡 THE REALITY: Data Analysis isn't about memorizing every library. It's about knowing which method to apply and when. This guide handles the "how" so you can focus on the "why."

📥 Download, Save, and Excel.
🔁 REPOST this to help a fellow analyst save 10 hours this week.

#DataAnalytics #Python #SQL #DataScience #MachineLearning #CareerGrowth #TechTips
Data Analytics Journey: Day 3 Update 🚀

Today was all about getting to know my data! Now that I can load datasets, the next step is Exploratory Data Analysis (EDA)—understanding what's actually inside the files.

📚 What I Learned
I mastered three essential Pandas commands for a quick data check:
head(): View the first few rows and get a feel for the content.
info(): Check data types, columns, and identify missing values.
describe(): See a statistical summary (mean, median, etc.) of numeric data.

💻 Task Completed
Dataset Exploration: Successfully performed an initial audit of my dataset in Google Colab. 🔍

💡 Why This Matters
You can't analyze what you don't understand. These commands are the "first look" that helps an analyst identify patterns or errors before diving into deeper work. It's about building a solid baseline for every project.

10% complete. The roadmap is looking bright!

#DataAnalytics #Day3 #EDA #Pandas #Python #LearningJourney #GoogleColab #DataAnalyst
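The three commands above can be tried on a tiny made-up DataFrame; info() is usually the quickest way to spot a missing value:

```python
import numpy as np
import pandas as pd

# Tiny invented dataset with one missing temperature
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "temp": [31.0, np.nan, 29.0],
})

print(df.head())      # first rows: eyeball the content
df.info()             # dtypes and non-null counts: reveals the missing temp
print(df.describe())  # count / mean / std / quartiles for numeric columns
```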
🚀 Week 7 of 30 on my Data Engineering journey: The Art of Data Cleaning! 🧹

After focusing on data ingestion last week, it is time to tackle the reality of raw data: it is almost always messy. I have been getting hands-on with pandas to validate, clean, and transform datasets.

Here is what I focused on this week:
📝 String & Type Conversion: Stripping out unwanted characters (like currency symbols) and converting object types to integers for proper analysis.
📅 Date Validation: Identifying logical errors in time-series data, such as filtering out impossible "future" sign-up dates using the datetime module.
🚧 Out-of-Range Data: Applying business logic to handle outliers—whether that means dropping them, setting custom limits with .loc, or imputing values.
🔍 Spotting Duplicates: Using the .duplicated() method with specific column subsets to accurately identify repeating records.
🛠️ Treating Duplicates: Going beyond simple drops. I learned how to use .groupby() and .agg() to combine overlapping records intelligently so no valuable data is lost.

The transition from raw, messy data to a clean, structured DataFrame is incredibly satisfying!

#DataEngineering #DataGlobalHub #DataCamp #Python
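The .groupby()/.agg() combining step described above can be sketched like this (a hypothetical user/signup/spend schema, not the actual course dataset):

```python
import pandas as pd

# Hypothetical schema: two overlapping records for the same user
df = pd.DataFrame({
    "user": ["a1", "a1", "b2"],
    "signup": pd.to_datetime(["2024-01-05", "2024-01-07", "2024-02-10"]),
    "spend": [50.0, 20.0, 30.0],
})

# Rather than dropping one record and losing data, combine them:
# keep the earliest sign-up date per user and sum the spend.
combined = df.groupby("user", as_index=False).agg(
    signup=("signup", "min"),
    spend=("spend", "sum"),
)
print(combined)
```

A plain drop_duplicates(subset="user") would have thrown away one of a1's spend values; the aggregation preserves it.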
🚀 𝐌𝐚𝐬𝐭𝐞𝐫𝐢𝐧𝐠 𝐭𝐡𝐞 𝐅𝐨𝐮𝐧𝐝𝐚𝐭𝐢𝐨𝐧: 𝟏𝟓 𝐏𝐚𝐧𝐝𝐚𝐬 𝐂𝐨𝐦𝐦𝐚𝐧𝐝𝐬 𝐟𝐨𝐫 𝐄𝐯𝐞𝐫𝐲 𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐭

They say 80% 𝙤𝙛 𝙙𝙖𝙩𝙖 𝙨𝙘𝙞𝙚𝙣𝙘𝙚 𝙞𝙨 𝙙𝙖𝙩𝙖 𝙘𝙡𝙚𝙖𝙣𝙞𝙣𝙜, and they aren't wrong. If you can't clean it, you can't analyze it. To build a solid data pipeline, you need a reliable toolkit. These 15 Pandas commands are the backbone of my workflow, handling about 90% of the heavy lifting in any exploratory data analysis (EDA):

🔍 𝟭. 𝗗𝗮𝘁𝗮 𝗘𝘅𝗽𝗹𝗼𝗿𝗮𝘁𝗶𝗼𝗻 & 𝗜𝗻𝘀𝗽𝗲𝗰𝘁𝗶𝗼𝗻
read_csv(): The starting point for most flat-file datasets.
info(): Essential for checking data types and memory usage.
head(): Quickly verify that your data loaded correctly.

🎯 𝟮. 𝗣𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻 𝗦𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻
loc[]: Accessing groups of rows and columns by labels.
iloc[]: Integer-location based indexing for precise slicing.

🛠️ 𝟯. 𝗛𝗮𝗻𝗱𝗹𝗶𝗻𝗴 𝗠𝗶𝘀𝘀𝗶𝗻𝗴 𝗩𝗮𝗹𝘂𝗲𝘀 (𝗗𝗮𝘁𝗮 𝗜𝗻𝘁𝗲𝗴𝗿𝗶𝘁𝘆)
dropna(): Removing null values to prevent skewed analysis.
fillna(): Imputing missing data to maintain dataset volume.

🔄 𝟰. 𝗥𝗲𝘀𝗵𝗮𝗽𝗶𝗻𝗴 & 𝗔𝗴𝗴𝗿𝗲𝗴𝗮𝘁𝗶𝗼𝗻
groupby(): The "split-apply-combine" powerhouse for finding patterns.
merge(): Essential for joining relational datasets (SQL-style).

📊 𝟱. 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀
describe(): Generate descriptive statistics (mean, std, percentiles) instantly.
value_counts(): Perfect for understanding the distribution of categorical data.

🧹 𝟲. 𝗗𝗮𝘁𝗮𝗙𝗿𝗮𝗺𝗲 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻
query(): For writing clean, readable filtering conditions.
drop() & rename(): Critical for maintaining a tidy, professional schema.

Clean data is the difference between a project that provides value and one that provides noise. Mastering these commands ensures your data-driven insights are built on a professional, accurate foundation.

What is your "go-to" command that didn't make this list? Let's discuss in the comments! 👇

#DataAnalytics #Python #Pandas #DataScience #DataCleaning #DataEngineering #Coding #DataVisualization #CareerInData #TechTips
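A few of these commands chained together on invented data, to show how groupby(), merge(), and query() fit into one small pipeline:

```python
import pandas as pd

# Invented tables: per-sale revenue and per-region targets
sales = pd.DataFrame({"region": ["N", "S", "N"], "rev": [10, 20, 30]})
targets = pd.DataFrame({"region": ["N", "S"], "target": [35, 25]})

# groupby(): split-apply-combine to total the revenue per region
totals = sales.groupby("region", as_index=False)["rev"].sum()

# merge(): SQL-style join onto the targets table
report = totals.merge(targets, on="region")

# query(): a readable filter for regions that hit their target
hit = report.query("rev >= target")
print(hit)
```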
This post is about data visualization.

Heatmaps look simple — but they're one of the fastest ways to spot patterns in data. Here's a quick way to read one:

🔹 Color = value: Darker (or warmer) colors usually mean higher values; lighter colors mean lower.
🔹 Check the scale: Always look at the color bar — it tells you what those colors actually represent.
🔹 Look for patterns: Blocks, clusters, or gradients often reveal relationships at a glance.
🔹 Use annotations (if available): Numbers inside the cells remove guesswork and improve clarity.
🔹 For correlation heatmaps, values range from -1 to +1:
+1 → strong positive relationship
0 → no relationship
-1 → strong negative relationship

👉 The real power of a heatmap is not the colors — it's how quickly it helps you see the story hidden in your data.

#DataVisualization #DataScience #Analytics #Seaborn #Python
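The correlation matrix behind a correlation heatmap can be computed with pandas alone; the heatmap is just this matrix rendered as colors. A toy example with invented columns:

```python
import pandas as pd

# Invented columns: y rises with x, z falls as x rises
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [8, 6, 4, 2],
})

corr = df.corr()
print(corr.round(2))
# x vs y is close to +1 (strong positive); x vs z is close to -1
# (strong negative). A plotting call such as seaborn.heatmap(corr,
# annot=True) would render this matrix as colors with the numbers
# written inside the cells.
```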
🚀 Mastering Data Wrangling with Pandas – My Go-To Cheat Sheet!

If you're working with data, you already know how powerful Pandas is. But remembering all the functions? That's where a solid cheat sheet becomes a game changer. Here are some key takeaways I keep coming back to 👇

🔹 Data Transformation Made Easy
Reshape data with melt() and pivot()
Combine datasets using concat() and merge()

🔹 Efficient Data Selection
Filter rows with conditions
Select columns using loc[] and iloc[]
Use query() for cleaner logic

🔹 Cleaning & Preparation
Handle missing values with fillna() and dropna()
Remove duplicates and reset indexes

🔹 Powerful Aggregations
Group data using groupby()
Apply functions like mean(), sum(), count()

🔹 Feature Engineering
Create new columns with assign()
Apply transformations using vectorized operations

🔹 Exploration & Insights
Quick summaries with describe()
Understand structure using info()

💡 One concept that stood out for me: tidy data = better analysis.
Each column = a variable
Each row = an observation
Simple idea, but it makes everything easier and more scalable.

Whether you're a beginner or an experienced analyst, having these essentials at your fingertips can save hours of work.

📌 What's your most-used Pandas function? Drop it below 👇

#DataAnalytics #Python #Pandas #DataScience #DataWrangling #Analytics #Learning #PowerBI #SQL
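The melt()/pivot() reshaping and the tidy-data idea can be shown with a small invented table:

```python
import pandas as pd

# Wide, "untidy" invented table: each quarter is its own column
wide = pd.DataFrame({
    "store": ["A", "B"],
    "Q1": [100, 150],
    "Q2": [120, 130],
})

# melt(): wide -> tidy/long form, one row per (store, quarter) observation
tidy = wide.melt(id_vars="store", var_name="quarter", value_name="sales")

# pivot(): the reverse, long -> wide
back = tidy.pivot(index="store", columns="quarter", values="sales")
```

In the tidy form, every column is a variable (store, quarter, sales) and every row is one observation, which is exactly the shape groupby() and most plotting libraries expect.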