🚀 Removing Outliers using the IQR Method in Python

Outliers can seriously impact your data analysis and model performance. Instead of ignoring them, it’s important to detect and handle them properly. 📊 One of the most reliable techniques is the Interquartile Range (IQR) method.

📌 How it works:
• Calculate Q1 (25th percentile) and Q3 (75th percentile)
• Compute IQR = Q3 − Q1
• Define boundaries: Lower Fence = Q1 − 1.5 × IQR, Upper Fence = Q3 + 1.5 × IQR

Any value outside these boundaries is considered an outlier.

import numpy as np

def detect_outliers(data, k=1.5):
    # np.percentile does not need pre-sorted input, so the original
    # data.sort() call is dropped (it also mutated the caller's list).
    arr = np.array(data, dtype=float)
    Q1 = np.percentile(arr, 25, method='linear')
    Q3 = np.percentile(arr, 75, method='linear')
    IQR = Q3 - Q1
    lower = Q1 - k * IQR
    upper = Q3 + k * IQR
    mask = (arr >= lower) & (arr <= upper)
    return {
        "outliers": arr[~mask].tolist(),
        "clean_data": arr[mask].tolist()
    }

student_score = [10, 12, 45, 34, 20, 33, 35, 40, 55, 44, 48, 53, 90, 98]
print(detect_outliers(student_score))

📈 Output Insight: For these scores, Q1 = 33.25 and Q3 = 51.75, so IQR = 18.5 and the fences are 5.5 and 79.5.
Outliers detected → [90.0, 98.0]
Clean data → the remaining values within range

🎯 Why use IQR?
✅ Robust to skewed data
✅ Easy to implement
✅ Works well for real-world datasets

⚠️ Tip: Don’t blindly remove outliers — sometimes they carry valuable insights!

💬 Good data preprocessing leads to better models.

#DataScience #Python #MachineLearning #DataAnalytics #Statistics #Pandas #AI #Learning

Stop guessing Python libraries. Use the right tool for the task.

Start learning → https://lnkd.in/dBMXaiCv

⬇️ What to use and when

Data handling
• pandas → tables, joins, cleaning
• NumPy → arrays, math, speed

Visualization
• Matplotlib → full control
• Seaborn → quick stats plots
• Plotly → interactive dashboards

Machine learning
• scikit-learn → models, pipelines, metrics
• statsmodels → statistical tests

Boosting
• XGBoost → strong on tabular
• LightGBM → fast on large data
• CatBoost → handles categories

AutoML
• PyCaret → fast experiments
• H2O → scalable models
• FLAML → cost-efficient tuning

Deep learning
• PyTorch → flexible research
• TensorFlow → production ready
• Keras → simple interface

NLP
• spaCy → production pipelines
• NLTK → basics
• Transformers → pretrained models

⬇️ Simple path
Start with pandas + scikit-learn
Then add Plotly
Then try XGBoost
Then move to PyTorch if needed

This is the exact stack used in real projects.

⬇️ Learn step by step
Best Python Courses: https://lnkd.in/dAJCHqaj
Data Science Guide: https://lnkd.in/dxgvqnVs
AI Courses: https://lnkd.in/dqQDSEEA

Question: Which library do you use most today?

#Python #DataScience #MachineLearning #AI #ProgrammingValley

Python is much more than a scripting language in data projects. It is often the bridge between raw tabular data and real machine learning value.

In real-world scenarios, structured tables rarely arrive “ML-ready.” They need cleaning, standardization, feature engineering, missing value treatment, categorical encoding, scaling, and validation before any model can generate trustworthy results.

That is where Python becomes a strategic tool. With libraries like pandas, NumPy, and scikit-learn, it turns messy business data into high-quality datasets prepared for prediction, classification, clustering, and optimization.

A good ML model does not start with the algorithm. It starts with well-transformed data.

In many projects, the real competitive advantage is not only building the model, but designing a transformation pipeline (a small sketch follows below) that is:
• scalable
• reproducible
• explainable
• production-ready

That is why strong data professionals know: better data transformation > more complex models

How much of your ML success comes from modeling itself, and how much comes from data preparation?

#Python #MachineLearning #DataEngineering #DataScience #FeatureEngineering #ETL #DataPreparation #AI #Analytics #LinkedInTech

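To make the pipeline idea concrete, here is a minimal scikit-learn sketch of such a transformation pipeline. It is an illustration, not code from the post; the column names ("age", "income", "segment", "region") are invented for the example.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns; swap in your own schema.
numeric_cols = ["age", "income"]
categorical_cols = ["segment", "region"]

preprocess = ColumnTransformer([
    # Numeric: impute medians, then scale.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical: impute the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# The whole transformation + model becomes one reproducible object:
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
# model.fit(X_train, y_train); model.predict(X_new)

Because every step lives inside one fitted object, the same transformations are applied identically at training and prediction time, which is what makes the pipeline reproducible and production-ready.
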
🧠 Group Anagrams: The "Fingerprint" Strategy

In this problem, I moved beyond the standard sorting approach (O(n · m log m)) to a more efficient Frequency Array strategy (O(n · m)).

Memory Management: I learned how Python handles memory during loops. By declaring count = [0] * 26 inside the outer loop, I’m giving each word a fresh "sheet of paper" to record its letter frequency. Once that word is processed and "locked" as a tuple (to serve as a dictionary key), Python’s garbage collector steps in to clean up the old list.

The Data Science Connection: This frequency array isn't just a coding trick; it's the foundation of One-Hot Encoding and Bag of Words in Data Science. It’s how we turn raw text into numerical vectors that AI models can actually understand.

🔍 Longest Common Prefix: The Power of Vertical Scanning

Instead of checking one word at a time, I focused on Vertical Scanning—checking the first letter of every word, then the second, and so on.

Complexity: Achieved O(S) time complexity, where S is the total number of characters. By using the shortest word as my base, I ensured zero wasted cycles and no IndexError traps.

Pythonic Elegance: I explored the zip(*strs) strategy. It’s amazing how Python can "unpack" a list and group characters by their index in a single line.

The Sorting Shortcut: A clever logic leap—if you sort the list, you only need to compare the first and last strings. If they share a prefix, everything in the middle must share it too.

The takeaway? Code isn't just about getting the right answer; it's about knowing how your data sits in RAM and how to make every operation count. Both approaches are sketched below. Onto the next one! 🐍💻

#DataScience #Python #SoftwareEngineering #Neetcode #ProblemSolving #TechLearning

"6 down, 244 to go. The dashboard might show 6/250, but the real progress is in the 'Medium' difficulty milestone I hit today and the logic I've mastered behind the scenes."

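For reference, here is a minimal sketch of the two techniques described above, assuming lowercase ASCII input (the function names are mine, not from the author's solutions):

from collections import defaultdict

def group_anagrams(strs):
    # Each word is keyed by its 26-letter frequency "fingerprint",
    # converted to a tuple so it can serve as a dict key.
    groups = defaultdict(list)
    for word in strs:
        count = [0] * 26          # fresh frequency sheet per word
        for ch in word:
            count[ord(ch) - ord('a')] += 1
        groups[tuple(count)].append(word)
    return list(groups.values())

def longest_common_prefix(strs):
    # Vertical scan: zip(*strs) yields the i-th character of every
    # word; stop at the first column that is not uniform.
    prefix = []
    for column in zip(*strs):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return ''.join(prefix)

print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
print(longest_common_prefix(["flower", "flow", "flight"]))  # fl
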
Standard classification models tell you if a customer will leave, but Survival Analysis tells you when.

I just published a new deep dive into Survival Analysis using Python and the lifelines library. Using telco churn data, I explore:

✅ The Kaplan-Meier Estimator: visualizing the "survival" journey of a subscriber.
✅ Cox Proportional Hazards: identifying exactly which behaviors (like high charges or complaints) accelerate the risk of churn.
✅ Censoring: how to handle customers who haven't churned yet without biasing your data.

Treating churn like a timeline.

Check out the full article and breakdown at Towards Data Science: https://lnkd.in/evH9Fk2R

#DataScience #MachineLearning #SurvivalAnalysis #Python #ChurnPrediction #Analytics

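For readers who want to try this themselves, here is a minimal lifelines sketch on a tiny invented frame. It is not the article's code; real telco data would have many more rows and covariates.

import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter

# Toy churn data: 'tenure' in months; churned=0 rows are right-censored
# (the customer was still subscribed when observation ended).
df = pd.DataFrame({
    "tenure":          [1, 5, 8, 12, 20, 24, 30, 45, 60, 72],
    "churned":         [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
    "monthly_charges": [90, 85, 40, 95, 35, 80, 45, 30, 99, 25],
})

# Kaplan-Meier: non-parametric estimate of the survival curve.
kmf = KaplanMeierFitter()
kmf.fit(durations=df["tenure"], event_observed=df["churned"])
kmf.plot_survival_function()
plt.show()

# Cox proportional hazards: which covariates raise the churn hazard.
cph = CoxPHFitter()
cph.fit(df, duration_col="tenure", event_col="churned")
cph.print_summary()
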
Day 2: Mastering the Architecture of Data – Python Data Structures! 🏗️ (Gen AI Revision)

After laying the foundation yesterday, Day 2 was all about the building blocks. In Gen AI development, how you store and manipulate data (tokens, embeddings, prompts) defines the efficiency of your model. Today was a deep dive into Python Data Structures. It’s not just about knowing list or dict; it’s about knowing why and where to use them for memory efficiency and speed.

🧠 What I Mastered Today:

Strings & Immutability: deep dive into slicing, advanced formatting (f-strings), and understanding why strings are immutable—a key concept when handling large text datasets for LLMs.

Lists & Tuples: beyond basic indexing. Focused on list comprehensions for clean code and using tuples for data integrity (immutable sequences).

Sets for Performance: leveraging hash-based lookups for unique element extraction and mathematical set operations (union/intersection)—crucial for data preprocessing.

Dictionaries (The Powerhouse): building efficient word frequency counters and nested structures. Understanding O(1) complexity for fast data retrieval.

I didn't just read theory; I solved 15+ mini-problems ranging from character frequency analysis to complex list flattening—all without using external libraries to keep the logic raw and sharp. Two representative drills are sketched below.

💻 GitHub Progress: I’ve pushed the practice.py file with all 15+ solved challenges to my repo: day02_data_structures/
🔗 https://lnkd.in/gikzc-K8

The journey to an MNC as a Gen AI dev is about consistency. Two days down, 88 to go. 🚀

#Python #DataStructures #GenAI #GenerativeAI #100DaysOfCode #AIDevelopment #TechJourney #MNCGoal #RevisionSeries #BackendDevelopment

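As a flavor of the kind of no-library drills mentioned above, here is a minimal sketch of two of them; the exact problems and names are my illustration, not the contents of the author's repo:

def char_frequency(text):
    # Dictionary as a counter: O(1) average lookup per character.
    freq = {}
    for ch in text:
        freq[ch] = freq.get(ch, 0) + 1
    return freq

def flatten(nested):
    # Recursive list flattening without external libraries.
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

print(char_frequency("banana"))      # {'b': 1, 'a': 3, 'n': 2}
print(flatten([1, [2, [3, 4]], 5]))  # [1, 2, 3, 4, 5]
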
🚀 Day 30 of My AI & Machine Learning Journey

Today I learned about Timestamp in Pandas — how machines understand date & time data efficiently.

🔹 Step 1: What is a Timestamp?
A Timestamp represents a specific moment in time.
👉 Example: Oct 24, 2022 → a date; April 16, 2026, 4:05 PM → an exact time

🔹 Step 2: Creating Timestamps

import pandas as pd
pd.Timestamp('2022-10-24')
pd.Timestamp('2022')
pd.Timestamp('16 April 2026')
pd.Timestamp('2026-04-16 04:17')

💡 Pandas is smart — it understands different formats automatically

🔹 Step 3: Using Python datetime

import datetime as dt
dt.datetime(2026, 4, 16, 4, 21, 56)
pd.Timestamp(dt.datetime(2026, 4, 16, 4, 21, 56))

👉 Convert Python datetime → Pandas Timestamp

🔹 Step 4: Extracting Information

x = pd.Timestamp('2026-04-16 04:21:56')
x.year, x.month, x.day, x.hour, x.minute, x.second

👉 Easily access parts of a date/time

🔹 Step 5: Why Pandas Timestamp?
❓ Python datetime already exists… so why Pandas?
👉 Python datetime = easy but slow
👉 Pandas Timestamp = fast + scalable

🔹 Step 6: Power of NumPy datetime64

import numpy as np
np.array('2026-04-16', dtype='datetime64')

👉 Stores the date as a 64-bit integer
👉 Very fast for large datasets

🔹 Step 7: Final Understanding
👉 Pandas Timestamp = Python datetime (easy) + NumPy datetime64 (fast)
👉 Used for: time series data, data analysis, machine learning pipelines

💡 Final Realization: handling date & time is not just about storing values… it’s about performance + flexibility + analysis 🚀

#MachineLearning #Python #Pandas #DataScience #TimeSeries #LearningJourney #DataAnalysis 🚀

🔍 Exploratory Data Analysis (EDA) with Python

Before building any model, you need to understand your data. That's exactly what EDA is about.

EDA is the process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions — using visual and statistical methods.

Here's how I approach it with Python:

1. Load & Inspect the Data

import pandas as pd
df = pd.read_csv("data.csv")
df.head()
df.info()
df.describe()

→ Understand shape, dtypes, null values, and basic statistics right away.

2. Handle Missing Values

df.isnull().sum()
df.fillna(df.median(numeric_only=True), inplace=True)

→ Never ignore nulls — they skew your results silently. (numeric_only=True keeps median() from failing on non-numeric columns.)

3. Univariate Analysis

import seaborn as sns
sns.histplot(df['age'], kde=True)

→ Understand the distribution of each feature individually.

4. Bivariate & Multivariate Analysis

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
sns.pairplot(df, hue='target')

→ Find correlations and relationships between features.

5. Detect Outliers

sns.boxplot(x=df['salary'])

→ Outliers can destroy model performance if ignored.

6. Feature Distribution by Class

sns.violinplot(x='target', y='feature', data=df)

→ See how features behave across different classes.

💡 EDA is not optional — it's the foundation of every reliable ML pipeline. The better you understand your data, the better your model will be.

What's your go-to EDA library? Drop it in the comments 👇

#DataScience #Python #EDA #MachineLearning #Pandas #Seaborn #Analytics #DataAnalysis #AI

Why is data visualization so important?

There’s a famous statistical example called Anscombe’s quartet that perfectly illustrates this. It consists of four datasets whose descriptive statistics are the same: they have the same mean, variance, correlation, and even regression line. But this “average behavior” tells very little about what’s actually going on with the data.

When the data is plotted, we see completely different patterns:
• One shows a clear linear relationship
• Another hides a curve
• One is driven by a single outlier
• Another looks random except for one influential point

This is why visualization matters:
👉 It exposes patterns that summary metrics hide
👉 It reveals outliers that can mislead your models
👉 It helps avoid false conclusions
👉 It turns abstract numbers into intuitive insight

And the best part? It’s incredibly easy to get started. With Python, just a few lines using libraries like matplotlib or seaborn can completely change how you understand your data. A simple scatter plot can reveal what pages of statistics cannot.

Before you trust the model, plot the data.

#DataScience #DataVisualization #Python #Analytics #MachineLearning #DataAnalytics #BigData #DataDriven #Statistics #AI #ArtificialIntelligence #DataLiteracy #BusinessIntelligence #DataStorytelling #Insight #PredictiveModeling #DeepLearning #ExploratoryDataAnalysis #STEM #Tech #Innovation

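Conveniently, seaborn ships Anscombe's quartet as one of its example datasets, so the "few lines" claim is easy to verify (note: load_dataset fetches the data from seaborn's online repository, so this sketch assumes internet access):

import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet: four datasets with near-identical summary stats.
df = sns.load_dataset("anscombe")

# One regression plot per dataset: same fitted line, four very
# different point clouds.
sns.lmplot(data=df, x="x", y="y", col="dataset",
           col_wrap=2, ci=None, height=3)
plt.show()
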
🚀 Built an AI Data Analyzer using Python & Streamlit

I developed an AI-powered application that converts raw, unstructured data into meaningful insights.

🔍 Key Features:
• Supports CSV, Excel, TXT, PDF
• AI cleans and structures raw data
• Generates tables and visualizations (bar & pie charts)
• Provides AI-based insights
• Exports final results as a PDF report

⚡ Workflow: Upload → AI Cleaning → Data Preview → Charts → AI Insights → PDF Report

🎥 Demo Video: https://lnkd.in/gD5h_REg
📂 GitHub Repo: https://lnkd.in/g2g94Vq3
💼 Let’s connect: https://lnkd.in/gbEr9cKj

#AI #MachineLearning #DataAnalysis #Python #Streamlit #Projects #DataScience

🚀 Data Cleaning in Python Cheat Sheet

I created this visual guide to help beginners understand the most important steps in data cleaning using Python and Pandas. Data cleaning is one of the most important parts of any data project, and this cheat sheet covers the full workflow from start to finish.

👉 What this cheat sheet includes (a condensed code walkthrough follows below)
- Importing essential libraries
- Understanding data structure using info and head
- Exploring data with describe and value counts
- Standardizing formats like dates and text
- Removing duplicate rows
- Handling missing values with fill or drop
- Fixing inconsistent strings
- Filtering logically incorrect data
- Removing outliers using the IQR method
- Renaming columns for clean and readable datasets
- Saving cleaned data safely

This is a great quick reference for anyone learning data analysis, preparing datasets, or doing real-world projects.

👤 Follow Mounica Tamalampudi for more content on Data Science, AI, ML, and Agentic AI
💾 Save this post for future reference
🔁 Repost if this helps your network

#DataCleaning #Python #Pandas #DataScience #DataPreparation #DataAnalysis #ML #AI #MachineLearning #Analytics #DataEngineer #DataAnalyst #TechLearning #AgenticAI #LLM #MLOps #LLMOps #DeepLearning #DL

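As a rough illustration of several steps from the list (not the cheat sheet itself; the file and column names are hypothetical):

import pandas as pd

# Load a hypothetical raw file and inspect its structure.
df = pd.read_csv("raw_data.csv")
print(df.info())
print(df.describe())

# Standardize formats, drop duplicates, handle missing values.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["city"] = df["city"].str.strip().str.title()
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# Remove outliers in 'income' with the 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Rename columns for readability and save the cleaned dataset.
df = df.rename(columns={"signup_date": "signup_dt"})
df.to_csv("clean_data.csv", index=False)
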