🚀 Day 21 of My AI & Machine Learning Journey

Today I learned important Pandas DataFrame functions that are widely used in real-world data analysis.

🔹 1. astype() → Change data type
ipl['ID'] = ipl['ID'].astype('int32')

🔹 2. value_counts() → Count frequency
ipl['Player_of_Match'].value_counts()

🔹 3. sort_values() → Sort data
movies.sort_values('title_x')

🔹 4. rank() → Rank values
batsman['rank'] = batsman['runs'].rank(ascending=False)

🔹 5. sort_index() → Sort by index
movies.sort_index()

🔹 6. set_index() → Set column as index
df.set_index('name', inplace=True)

🔹 7. reset_index() → Reset index
df.reset_index()

🔹 8. unique() → Get unique values
ipl['Season'].unique()

🔹 9. nunique() → Count unique values
ipl['Season'].nunique()

🔹 10. isnull() / notnull() → Check missing values
students.isnull()
students.notnull()

🔹 11. dropna() → Remove missing values
students.dropna()

🔹 12. fillna() → Fill missing values
students.fillna(0)

🔹 13. drop_duplicates() → Remove duplicates
df.drop_duplicates()

🔹 14. drop() → Delete rows/columns
df.drop(columns=['col1'])

🔹 15. apply() → Apply a custom function
df['new'] = df.apply(func, axis=1)

💡 Biggest Takeaway: These functions are essential for data cleaning, transformation, and preparation before building ML models.

Learning practical data handling step by step 🚀

#MachineLearning #Python #Pandas #DataScience #DataCleaning #LearningJourney
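A few of these calls can be tried together on a tiny stand-in table. This is only a sketch — the `ipl` column names mirror the post's examples, and the values are made up for illustration:

```python
import pandas as pd

# Toy match data standing in for the IPL dataset used in the post
ipl = pd.DataFrame({
    'ID': [1.0, 2.0, 3.0, 4.0],
    'Player_of_Match': ['Kohli', 'Dhoni', 'Kohli', 'Rohit'],
    'Season': ['2019', '2019', '2020', None],
})

# astype(): shrink the float IDs to a compact integer type
ipl['ID'] = ipl['ID'].astype('int32')

# value_counts(): frequency of each award winner
counts = ipl['Player_of_Match'].value_counts()

# fillna() + nunique(): patch a missing season, then count distinct seasons
ipl['Season'] = ipl['Season'].fillna('2020')
n_seasons = ipl['Season'].nunique()

# apply(): derive a new column row by row with axis=1
ipl['label'] = ipl.apply(
    lambda row: f"{row['Player_of_Match']}-{row['Season']}", axis=1)
```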
Pandas DataFrame Functions for Data Analysis and Machine Learning
🚀 Day 25 of My AI & Machine Learning Journey

Today I learned about MultiIndex (Hierarchical Indexing) in Pandas — a powerful way to handle higher-dimensional data.

🔹 What is MultiIndex?
Normally:
• Series → 1D (1 index needed)
• DataFrame → 2D (row + column needed)
👉 But with MultiIndex, we can use multiple levels of indexing.

🔹 MultiIndex in Series
We can create multiple index levels.
Example:
index = pd.MultiIndex.from_product(
    [['cse','ece'], [2019,2020,2021,2022]]
)
s = pd.Series([1,2,3,4,5,6,7,8], index=index)

👉 Access data:
s[('cse', 2022)]
s['ece']

🔹 stack() & unstack()
👉 Convert between formats:
• unstack() → MultiIndex Series → DataFrame
• stack() → DataFrame → MultiIndex Series

🔹 Why MultiIndex?
👉 Used to represent high-dimensional data in lower dimensions.
Example:
5D → 2D
10D → 2D

🔹 MultiIndex in DataFrame
👉 MultiIndex in rows:
df.loc['cse']
👉 MultiIndex in columns:
df['delhi']
df['mumbai']['avg_package']

🔹 MultiIndex in Both Rows & Columns
👉 Creates a higher-dimensional structure:
branch_df3
💡 To access a value → need multiple keys (row + column levels)

💡 Biggest Takeaway: MultiIndex helps manage complex, multi-dimensional data in a structured and readable way.

#MachineLearning #Python #Pandas #DataScience #DataAnalysis #LearningJourney #AdvancedPython 🚀
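The Series example above runs as written; here it is as a minimal, self-contained sketch, including the stack()/unstack() round trip:

```python
import pandas as pd

# Two index levels: branch x year, as in the post's example
index = pd.MultiIndex.from_product([['cse', 'ece'], [2019, 2020, 2021, 2022]])
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=index)

# Access with a full key tuple, or a partial key for a sub-Series
val = s[('cse', 2022)]   # single value
ece = s['ece']           # all four years for the 'ece' branch

# unstack(): MultiIndex Series -> DataFrame (inner level becomes columns)
df = s.unstack()

# stack(): DataFrame -> MultiIndex Series again
back = df.stack()
```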
🚀 Day 17 of My AI & Machine Learning Journey

Today I explored Pandas Series in depth — including its attributes, methods, and working with CSV data.

🔹 Series Attributes
These help us understand the structure of data:
• size → Total number of elements (including missing values)
• dtype → Data type of elements
• name → Name of the series
• is_unique → Checks if values are unique
• index → Shows index labels
• values → Returns actual data

🔹 Creating Series from CSV
By default, read_csv() loads data as a DataFrame. To convert it into a Series, we use:
👉 .squeeze()
Example:
Single column → Converted into Series
Multiple columns → Use index_col to select index

🔹 Important Series Methods
• head() → Shows first 5 rows
• tail() → Shows last 5 rows
• sample() → Picks random row (avoids bias)
• value_counts() → Frequency of values
• sort_values() → Sort data (asc/desc)
• sort_index() → Sort by index

👉 Method Chaining: Combining multiple methods together
Example: sort → head → value

🔹 Mathematical Operations
• count() → Counts values (ignores missing)
• sum() → Total
• mean() → Average
• median() → Middle value
• mode() → Most frequent value
• std() → Standard deviation
• var() → Variance
• min() / max() → Smallest / Largest value

🔹 describe() Method
Gives a quick summary of the dataset:
• Count
• Mean
• Std
• Min / Max
• Percentiles (25%, 50%, 75%)

💡 Biggest Takeaway: Pandas Series provides powerful tools to analyze, clean, and understand data efficiently.

Learning deeper into data handling step by step 🚀

#MachineLearning #Python #Pandas #DataScience #LearningJourney #TechGrowth
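A quick sketch of these attributes and methods on a made-up Series (in practice this would come from something like `pd.read_csv('file.csv').squeeze()`):

```python
import pandas as pd

# Small Series standing in for one squeezed CSV column
runs = pd.Series([45, 45, 80, 12, 100, 45], name='runs')

# Attributes
size = runs.size            # counts every element, including missing values
unique = runs.is_unique     # False here — 45 repeats

# Method chaining: sort descending, then keep the top two scores
top2 = runs.sort_values(ascending=False).head(2)

# value_counts(): frequency of each value
freq = runs.value_counts()

# describe(): count, mean, std, min/max, and percentiles in one call
summary = runs.describe()
```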
Building a Machine Learning Model for Time Series Forecasting

Over the past few days, I’ve been working on a machine learning project focused on predicting future values using real-world financial data.

🔍 What I worked on:
• Data collection and preprocessing using pandas
• Feature engineering and handling missing values
• Implementing regression models such as Linear Regression
• Training and evaluating models using scikit-learn
• Using historical data to forecast future trends
• Visualizing predictions with matplotlib

📊 Key Techniques Applied:
• Data cleaning and transformation
• Train-test splitting
• Model training and evaluation
• Time series forecasting using shifted labels
• Scaling features for better model performance

📈 What I achieved:
• Built a working model that predicts future values based on historical patterns
• Compared actual vs. predicted results using visual plots
• Gained a deeper understanding of how machine learning models learn from data

💡 Key takeaway: Machine learning is not just about building models — it’s about understanding data, preparing it properly, and interpreting results effectively.

🎯 Next steps:
• Improve model accuracy with advanced techniques
• Explore additional models and comparisons
• Build more real-world projects and expand my portfolio

I’m excited to continue growing in Data Science and Machine Learning and apply these skills to real-world problems.

#MachineLearning #DataScience #Python #AI #DataAnalysis #LearningJourney
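The "shifted labels" idea mentioned above can be sketched in a few lines. This is an illustrative toy (synthetic data, not the author's actual dataset): today's value becomes the feature, and the value several steps ahead becomes the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic trending "price" series standing in for real financial data
rng = np.random.default_rng(0)
prices = pd.DataFrame({'price': np.linspace(100, 200, 200)
                                + rng.normal(0, 2, 200)})

# Shifted label: today's price predicts the price 5 steps ahead
horizon = 5
prices['target'] = prices['price'].shift(-horizon)
prices = prices.dropna()          # the last `horizon` rows have no label

X = prices[['price']].values
y = prices['target'].values

# Keep time order: never shuffle when splitting a time series
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)   # R^2 on the held-out tail
```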
🚀 Choosing the Right Machine Learning Model with Scikit-Learn

Selecting the perfect algorithm for your data can feel like navigating a maze. Whether you're dealing with Classification, Regression, Clustering, or Dimensionality Reduction, having a clear roadmap is a game-changer. I’ve put together this high-resolution "Cheat Sheet" based on the Scikit-Learn workflow to help you make faster, data-driven decisions.

💡 Key Takeaways from the Map:
• Start Small: Always check your sample size first (>50 samples is the baseline).
• Classification: Use when you need to predict a category (e.g., Spam vs. Not Spam).
• Regression: Your go-to for predicting continuous values (e.g., stock prices).
• Clustering: Perfect for finding hidden patterns in unlabeled data.
• Dimensionality Reduction: Essential for simplifying complex datasets without losing the "signal."

🔍 Quick Tips:
1. If you have labeled data, start with Linear SVC or SGD Classifier.
2. If you're predicting a quantity and have fewer than 100K samples, Lasso or ElasticNet are great starting points.
3. Don't forget to scale your data before diving into these models!

Which part of the ML workflow do you find most challenging? Let's discuss in the comments! 👇

#MachineLearning #DataScience #ScikitLearn #AI #Python #DataAnalytics #TechTips #MLOps
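As one hedged example of following the map's "labeled data → Linear SVC" branch (with scaling first, per tip #3) — synthetic data, not a real benchmark:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Well over 50 labeled samples predicting a category -> LinearSVC branch
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale inside a pipeline so the test split never leaks into the scaler
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```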
MACHINE Learning finally made… VISIBLE

For the longest time, Machine Learning felt like a black box to me. Models go in → predictions come out → but what actually happens inside?

Then I discovered something powerful: visualizing ML instead of just coding it. I started exploring Jupyter notebooks that rebuild core ML algorithms from scratch — not just using libraries, but actually seeing how they learn — and everything changed.

What clicked for me:
• Convergence isn’t just theory anymore — you can literally watch the model getting closer to the optimal solution
• Loss landscapes become intuitive — instead of confusing graphs, they start to feel like "terrain" the model is navigating
• Gradients finally make sense — not just formulas, but directional decisions the model takes step by step

The biggest realization: most people try to memorize Machine Learning, but the real growth happens when you visualize and feel the learning process 📊

If you're learning ML right now, try this. Instead of jumping straight into libraries like pandas or scikit-learn…
1️⃣ Spend time understanding how things work under the hood
2️⃣ Rebuild simple models
3️⃣ Visualize every step

Because once you see it… you can’t unsee it. And that’s when you stop being a "user" …and start thinking like a data scientist.

#MachineLearning #DataScience #Python #AI #LearningInPublic #JupyterNotebook #DeepLearning #Analytics #TechCareers #DataAnalytics
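The "rebuild simple models and watch them converge" idea can be tried in a dozen lines of plain NumPy — a minimal gradient-descent sketch, with the recorded losses being exactly the "terrain heights" the post describes:

```python
import numpy as np

# Tiny linear model y = w*x fit by hand, so each gradient step is visible
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x            # true weight is 3

w = 0.0                # start far from the answer
lr = 0.01
losses = []
for _ in range(200):
    pred = w * x
    loss = np.mean((pred - y) ** 2)       # height of the loss "terrain" at w
    grad = np.mean(2 * (pred - y) * x)    # slope of the terrain at w
    w -= lr * grad                        # step downhill
    losses.append(loss)
```

Plotting `losses` (or the path of `w`) makes convergence visible: the curve drops steeply at first, then flattens as `w` approaches 3.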
🚀 Day 20 of My AI & Machine Learning Journey

Today I learned how to select, fetch, and filter data from a Pandas DataFrame — one of the most important skills in data analysis.

🔹 1. Selecting Data using iloc & loc
• iloc → works with index positions
• loc → works with index labels
Example:
movies.iloc[1] → fetch 2nd row
movies.iloc[0:5] → first 5 rows
movies.iloc[[0,5,6]] → multiple rows
stud.loc['kunal'] → fetch by label
stud.loc[['kunal','lakshay']] → multiple rows

🔹 2. Selecting Rows & Columns Together
Using iloc:
movies.iloc[0:3, 0:3]
Using loc:
movies.loc[0:2, 'title_x':'poster_path']

🔹 3. Filtering Data (Very Important 🔥)
Using conditions:
ipl[ipl['MatchNumber'] == 'Final']
Multiple conditions:
ipl[(ipl['City'] == 'Kolkata') & (ipl['WinningTeam'] == 'Chennai Super Kings')]

🔹 4. Real-World Examples
• Number of Super Over matches
ipl[ipl['SuperOver'] == 'Y'].shape[0]
• Toss winner = Match winner %
(ipl[ipl['TossWinner'] == ipl['WinningTeam']].shape[0] / ipl.shape[0]) * 100
• Movies with rating > 8
movies[movies['imdb_rating'] > 8]

🔹 5. Adding New Columns
movies['Country'] = 'India'
Creating from an existing column:
movies['lead actor'] = movies['actors'].str.split('|').apply(lambda x: x[0])

💡 Biggest Takeaway: Data analysis is all about selecting the right data and filtering it correctly.

Learning real-world data handling step by step 🚀

#MachineLearning #Python #Pandas #DataScience #DataAnalysis #LearningJourney
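The filtering patterns run as written on a toy table. A sketch with made-up rows mimicking the post's IPL columns:

```python
import pandas as pd

# Invented rows for illustration only
ipl = pd.DataFrame({
    'City': ['Kolkata', 'Mumbai', 'Kolkata', 'Chennai'],
    'WinningTeam': ['Chennai Super Kings', 'Mumbai Indians',
                    'Chennai Super Kings', 'Chennai Super Kings'],
    'TossWinner': ['Chennai Super Kings', 'Mumbai Indians',
                   'Kolkata Knight Riders', 'Chennai Super Kings'],
    'SuperOver': ['N', 'Y', 'N', 'N'],
})

# Multiple conditions: & between comparisons, parentheses around each one
csk_in_kolkata = ipl[(ipl['City'] == 'Kolkata') &
                     (ipl['WinningTeam'] == 'Chennai Super Kings')]

# Counting rows that satisfy a condition
super_overs = ipl[ipl['SuperOver'] == 'Y'].shape[0]

# Toss winner also won the match, as a percentage of all rows
toss_win_pct = (ipl[ipl['TossWinner'] == ipl['WinningTeam']].shape[0]
                / ipl.shape[0]) * 100
```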
🚀 Day 19 of My AI & Machine Learning Journey

Today I learned about one of the most important concepts in data analysis — the Pandas DataFrame.

💡 A DataFrame is like a table (rows + columns), and each column is a Series.

🔹 Creating a DataFrame
We can create a DataFrame in different ways:
Using a list:
students_data = [[100,80,10],[90,70,7]]
pd.DataFrame(students_data, columns=['iq','marks','package'])
Using a dictionary:
data = {'iq':[100,90],'marks':[80,70],'package':[10,7]}
pd.DataFrame(data)
Using CSV (real-world data):
pd.read_csv('file.csv')

🔹 DataFrame Attributes
• shape → number of rows & columns
• dtypes → data types
• columns → column names
• values → actual data
Example: movies.shape

🔹 Important Methods
• head() → first rows
• tail() → last rows
• sample() → random rows
• info() → dataset info
• describe() → statistics
Example:
movies.head()
movies.describe()

🔹 Handling Data
• isnull().sum() → missing values
• duplicated().sum() → duplicate rows
• rename() → rename columns
Example:
students.rename(columns={'marks':'percent'})

🔹 Mathematical Operations
• sum()
• mean()
• median()
Example:
students.mean()
students.sum(axis=1)

🔹 Selecting Data
Single column → Series:
movies['title']
Multiple columns → DataFrame:
movies[['title','year']]

🔹 Setting Index
We can set a column as the index:
students.set_index('name', inplace=True)

💡 Biggest Takeaway: The DataFrame is the backbone of data analysis — every ML project starts with understanding the data properly.

Learning with practical examples 🚀

#MachineLearning #Python #Pandas #DataFrame #DataScience #LearningJourney #TechGrowth
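Putting a few of these together on the post's own toy data (the index labels are added here for illustration):

```python
import pandas as pd

# Same dictionary data as the post: one key per column
data = {'iq': [100, 90], 'marks': [80, 70], 'package': [10, 7]}
students = pd.DataFrame(data, index=['kunal', 'lakshay'])

shape = students.shape                              # (rows, columns)
renamed = students.rename(columns={'marks': 'percent'})

# Single column -> Series; list of columns -> DataFrame
iq = students['iq']
subset = students[['iq', 'package']]

# Row-wise sum with axis=1
totals = students.sum(axis=1)
```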
𝗗𝗮𝘆 𝟭𝟯 𝗼𝗳 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗔𝗜/𝗠𝗟 🚀

Today I dove into data preprocessing — specifically centering and scaling, one of the most impactful steps before training a model.

𝗞𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀:
Why it matters: Features with wildly different ranges (like duration in milliseconds vs. speechiness as a decimal) can bias models that rely on distance — like KNN — making scaling essential.

𝗠𝗲𝘁𝗵𝗼𝗱𝘀 𝗰𝗼𝘃𝗲𝗿𝗲𝗱:
• 𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗮𝘁𝗶𝗼𝗻 — subtract the mean, divide by the standard deviation → zero mean, unit variance
• 𝗠𝗶𝗻-𝗠𝗮𝘅 𝗡𝗼𝗿𝗺𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 — scales data to [0, 1]
• 𝗖𝗲𝗻𝘁𝗲𝗿𝗶𝗻𝗴 — subtract the mean so each feature is centered at zero

𝗪𝗵𝗮𝘁 𝗜 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝗱 𝗶𝗻 𝘀𝗰𝗶𝗸𝗶𝘁-𝗹𝗲𝗮𝗿𝗻:
• Using StandardScaler from sklearn.preprocessing
• Applying fit_transform on training data and transform on test data (to prevent data leakage!)
• Building a Pipeline that chains scaling + KNN together cleanly
• Combining GridSearchCV with a pipeline for tuned cross-validation

𝗧𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁 𝘁𝗵𝗮𝘁 𝗯𝗹𝗲𝘄 𝗺𝘆 𝗺𝗶𝗻𝗱:
KNN on unscaled data → 53% accuracy.
KNN on scaled data → 81% accuracy.
That's a 50%+ relative boost 𝗷𝘂𝘀𝘁 𝗳𝗿𝗼𝗺 𝗽𝗿𝗲𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴! 🤯

Small steps, big impact. Preprocessing isn't glamorous, but it's where good models are made.

#100DaysOfML #MachineLearning #DataScience #ScikitLearn #Python #AI #LearningInPublic #Day13
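A minimal sketch of the scaling + KNN + GridSearchCV pipeline described above, using scikit-learn's built-in wine dataset rather than the author's data (so the accuracy numbers here are not the 53%/81% from the post):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Wine features span very different ranges, so KNN benefits from scaling
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Tune n_neighbors with cross-validation; the scaler is refit inside each
# fold, so validation folds never leak into its statistics
grid = GridSearchCV(pipe, {'knn__n_neighbors': [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

acc = grid.score(X_test, y_test)
```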
🚀 Day 38 of My Data Science And Machine Learning Journey — ColumnTransformer

Building a machine learning pipeline is powerful… but what if your dataset has different types of features? 🤔 That’s where ColumnTransformer comes in! ✅

🔍 What is ColumnTransformer?
In Scikit-learn, ColumnTransformer allows you to apply different transformations to different columns in your dataset.
👉 Example:
• Scale numerical features
• Encode categorical features
All in one step 💡

⚙️ Why use ColumnTransformer?
✔️ Handles mixed data (numerical + categorical)
✔️ Applies transformations selectively
✔️ Integrates smoothly with Pipeline
✔️ Reduces manual preprocessing errors
✔️ Makes the workflow cleaner & more scalable

🧠 Core Idea
Instead of applying transformations to the whole dataset ❌ you treat each column based on its type ✅
👉 Numerical → Scaling
👉 Categorical → Encoding
👉 Combined → Ready for the model

🔥 Real Insight
Think of ColumnTransformer as a smart dispatcher 🚦 — it sends each column to the right preprocessing step before feeding it into the model.

📌 Pro Tip: Combine ColumnTransformer + Pipeline to build a complete end-to-end ML workflow 🚀

#MachineLearning #DataScience #AI #Python #ScikitLearn #MLJourney #LearningInPublic
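The "numerical → scaling, categorical → encoding" dispatch looks like this in scikit-learn (toy columns invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Mixed-type toy data: one numeric column, one categorical column
df = pd.DataFrame({'age': [25, 35, 45, 55],
                   'city': ['delhi', 'mumbai', 'delhi', 'pune']})

# Each column list is routed to its own transformer, all in one step
ct = ColumnTransformer([
    ('num', StandardScaler(), ['age']),     # numerical -> scaling
    ('cat', OneHotEncoder(), ['city']),     # categorical -> encoding
])

# Result: 1 scaled column + 3 one-hot columns = 4 model-ready features
X = ct.fit_transform(df)
```

Dropping `ct` in as the first step of a `Pipeline` then gives the end-to-end workflow the post recommends.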
Day 15/60: Turning Numbers into Stories! 📈✨

Data is powerful, but visual data is persuasive. Today for the #60DaysOfCode challenge with ABTalksOnAI and Anil Bajpai, I moved from data cleaning to Data Visualization using Matplotlib. 🎨📊

The Mission: 🎯 Take a year's worth of raw sales figures and identify the growth pattern.

The Insight: 💡 Looking at a table of numbers, it’s hard to see the "big picture." But with a line chart, the story becomes clear instantly: you can see exactly when the seasonal peaks happen and where the growth accelerates.

The Tech: 🛠️
• Library: Matplotlib (the gold standard for Python plotting)
• Feature: Added markers and grids to make the chart readable and "boardroom ready"

Why this matters for AI: 🤖 In AI, we don't just "build" models; we monitor them. We use line charts to track Loss and Accuracy. If the line goes down, the model is learning; if it stays flat, we have a problem. Visualization is the "dashboard" of the AI engine. 🏎️💨

Moving into the world of visuals feels like a whole new level of communication. Onward! 🚀

#ABTalks #60DaysOfCode #Matplotlib #DataVisualization #Python #AI #DataScience #MachineLearning #StorytellingWithData
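A sketch of the chart described above — the monthly sales figures are made up, and the Agg backend is used so it renders without a display:

```python
import matplotlib
matplotlib.use('Agg')               # render off-screen, no display needed
import matplotlib.pyplot as plt

# A year of invented monthly sales with growth accelerating late in the year
months = list(range(1, 13))
sales = [10, 12, 11, 14, 16, 15, 18, 20, 19, 22, 25, 30]

fig, ax = plt.subplots()
ax.plot(months, sales, marker='o')  # markers make each data point visible
ax.grid(True)                       # grid lines for "boardroom ready" reading
ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.set_title('Monthly Sales Trend')

line = ax.lines[0]                  # the plotted line, for inspection
```

The same `ax.plot` pattern tracks training Loss or Accuracy per epoch: swap `months` for epoch numbers and `sales` for the metric values.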