
Python in Data Science #006

A funny thing happens in real projects: the “modeling work” starts failing, and the root cause is almost always upstream. Not because the algorithm is wrong, but because the data cleaning was ad hoc, inconsistent, and almost impossible to reproduce.

Always treat data cleaning as a repeatable, versioned transformation, and never clean directly on raw data. A cheatsheet is useful, but the real upgrade is turning those steps (missing values, duplicates, types, outliers, invalid rows) into a predictable workflow you can rerun tomorrow and get the same dataset.

It also reduces silent leakage: if you “peek” at the full dataset to decide thresholds or imputation, you can accidentally bake test-set information into training. The trade-off is a bit more upfront discipline, but you gain trust: in your results, in your features, and in your handoffs to stakeholders.

import pandas as pd

df_raw = pd.read_csv("data.csv")
df = df_raw.copy()  # never mutate the raw frame
df = df.drop_duplicates()  # remove exact duplicate rows
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # invalid dates become NaT
df["sales"] = df["sales"].fillna(0)  # missing sales treated as zero
df["name"] = df["name"].str.strip().str.lower()  # normalize text keys
df = df[df["sales"] >= 0]  # drop invalid negative rows

What it improves: reproducibility, debugging speed, and confidence that changes are intentional (not accidental).

Common mistake/trap: “quick fixes” applied in place on raw data, then forgetting what was changed (or applying different rules each run).

When I’d tune it (or when I wouldn’t): I tune cleaning rules (thresholds, outlier caps, imputations) only on the training split; I never set rules based on the full dataset. See the split sketch at the end of the post.

#python #datascience #datacleaning
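A minimal sketch of the “rerun tomorrow, get the same dataset” part: the same steps wrapped in one function (clean_sales is a name I made up for illustration). Commit it, and every run applies identical rules.

import pandas as pd

def clean_sales(df_raw: pd.DataFrame) -> pd.DataFrame:
    # One versioned entry point: same input, same rules, same output every run.
    df = df_raw.copy()
    df = df.drop_duplicates()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["sales"] = df["sales"].fillna(0)
    df["name"] = df["name"].str.strip().str.lower()
    return df[df["sales"] >= 0]

df = clean_sales(pd.read_csv("data.csv"))

Because the function never touches df_raw, any change in the output is something you wrote, not something that drifted between runs.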

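And for the leakage point, a sketch of tuning cleaning rules on the training split only (the 99th-percentile cap and the median fill here are illustrative choices, not fixed rules):

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)

# Fit cleaning parameters on the training split only...
cap = train["sales"].quantile(0.99)  # outlier cap learned from train
fill = train["sales"].median()       # imputation value learned from train

# ...then apply the same frozen values to both splits.
train = train.assign(sales=train["sales"].fillna(fill).clip(upper=cap))
test = test.assign(sales=test["sales"].fillna(fill).clip(upper=cap))

The test split never influences the numbers; it only receives them.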