Mastering MultiIndex in Pandas for Higher Dimensional Data

🚀 Day 25 of My AI & Machine Learning Journey

Today I learned about MultiIndex (hierarchical indexing) in Pandas — a powerful way to handle higher-dimensional data.

🔹 What is MultiIndex?
Normally:
• Series → 1D (one index needed)
• DataFrame → 2D (row + column needed)
👉 But with MultiIndex, we can use multiple levels of indexing.

🔹 MultiIndex in Series
We can create multiple index levels. Example:

index = pd.MultiIndex.from_product(
    [['cse', 'ece'], [2019, 2020, 2021, 2022]]
)
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=index)

👉 Access data:
s[('cse', 2022)]
s['ece']

🔹 stack() & unstack()
👉 Convert between formats:
• unstack() → MultiIndex Series → DataFrame
• stack() → DataFrame → MultiIndex Series

🔹 Why MultiIndex?
👉 It lets us represent high-dimensional data in lower-dimensional structures. Example: 5D → 2D, 10D → 2D.

🔹 MultiIndex in DataFrame
👉 MultiIndex in rows:
df.loc['cse']
👉 MultiIndex in columns:
df['delhi']
df['mumbai']['avg_package']

🔹 MultiIndex in Both Rows & Columns
👉 Creates a higher-dimensional structure (branch_df3 in my notebook).
💡 To access a value, you need multiple keys (row + column levels).

💡 Biggest Takeaway: MultiIndex helps manage complex, multi-dimensional data in a structured and readable way.

#MachineLearning #Python #Pandas #DataScience #DataAnalysis #LearningJourney #AdvancedPython 🚀
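To make the snippets above runnable end-to-end, here is a minimal sketch; the `names=['branch', 'year']` level labels are an addition for readability, not part of the original post:

```python
import pandas as pd

# Two index levels: branch x year (names added for readability)
index = pd.MultiIndex.from_product(
    [['cse', 'ece'], [2019, 2020, 2021, 2022]],
    names=['branch', 'year'],
)
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=index)

print(s[('cse', 2022)])   # full key -> a single scalar value
print(s['ece'])           # partial key -> sub-Series of all 'ece' years

df = s.unstack()          # inner level becomes columns: MultiIndex Series -> DataFrame
s2 = df.stack()           # and back: DataFrame -> MultiIndex Series
```

This pair of calls is the "5D → 2D" idea in miniature: each extra index level is an axis you can pivot in and out of a flat 2D table.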
More Relevant Posts
🚀 Day 21 of My AI & Machine Learning Journey

Today I learned important Pandas DataFrame functions that are widely used in real-world data analysis.

🔹 1. astype() → Change data type
ipl['ID'] = ipl['ID'].astype('int32')

🔹 2. value_counts() → Count frequency
ipl['Player_of_Match'].value_counts()

🔹 3. sort_values() → Sort data
movies.sort_values('title_x')

🔹 4. rank() → Ranking values
batsman['rank'] = batsman['runs'].rank(ascending=False)

🔹 5. sort_index() → Sort by index
movies.sort_index()

🔹 6. set_index() → Set column as index
df.set_index('name', inplace=True)

🔹 7. reset_index() → Reset index
df.reset_index()

🔹 8. unique() → Get unique values
ipl['Season'].unique()

🔹 9. nunique() → Count unique values
ipl['Season'].nunique()

🔹 10. isnull() / notnull() → Check missing values
students.isnull()
students.notnull()

🔹 11. dropna() → Remove missing values
students.dropna()

🔹 12. fillna() → Fill missing values
students.fillna(0)

🔹 13. drop_duplicates() → Remove duplicates
df.drop_duplicates()

🔹 14. drop() → Delete rows/columns
df.drop(columns=['col1'])

🔹 15. apply() → Apply custom function
df['new'] = df.apply(func, axis=1)

💡 Biggest Takeaway: These functions are essential for data cleaning, transformation, and preparation before building ML models.

Learning practical data handling step by step 🚀

#MachineLearning #Python #Pandas #DataScience #DataCleaning #LearningJourney
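A few of these functions chained together on a toy table, as a minimal sketch; the player names and run values are invented stand-ins for the IPL data the post references:

```python
import pandas as pd

# Invented stand-in for the IPL data referenced above
df = pd.DataFrame({
    'name': ['Kohli', 'Rohit', 'Kohli', 'Gill'],
    'runs': [82.0, 45.0, 82.0, None],
})

print(df['name'].value_counts())                   # frequency of each value
print(df['name'].nunique())                        # count of unique values
df['runs'] = df['runs'].fillna(0).astype('int32')  # fill missing, then change dtype
df['rank'] = df['runs'].rank(ascending=False)      # rank within the column (ties averaged)
df = df.drop_duplicates().sort_values('runs', ascending=False)
df = df.set_index('name')                          # a column becomes the index ...
df = df.reset_index()                              # ... and back again
```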
🚀 Day 17 of My AI & Machine Learning Journey

Today I explored Pandas Series in depth — including its attributes, methods, and working with CSV data.

🔹 Series Attributes
These help us understand the structure of data:
• size → Total number of elements (including missing values)
• dtype → Data type of elements
• name → Name of the series
• is_unique → Checks if values are unique
• index → Shows index labels
• values → Returns actual data

🔹 Creating Series from CSV
By default, read_csv() loads data as a DataFrame. To convert it into a Series, we use:
👉 .squeeze()
Example:
Single column → Converted into Series
Multiple columns → Use index_col to select the index

🔹 Important Series Methods
• head() → Shows first 5 rows
• tail() → Shows last 5 rows
• sample() → Picks random rows (avoids bias)
• value_counts() → Frequency of values
• sort_values() → Sort data (asc/desc)
• sort_index() → Sort by index
👉 Method Chaining: combining multiple methods together
Example: sort → head → value

🔹 Mathematical Operations
• count() → Counts values (ignores missing)
• sum() → Total
• mean() → Average
• median() → Middle value
• mode() → Most frequent value
• std() → Standard deviation
• var() → Variance
• min() / max() → Smallest / Largest value

🔹 describe() Method
Gives a quick summary of the dataset:
• Count
• Mean
• Std
• Min / Max
• Percentiles (25%, 50%, 75%)

💡 Biggest Takeaway: Pandas Series provides powerful tools to analyze, clean, and understand data efficiently.

Learning deeper into data handling step by step 🚀

#MachineLearning #Python #Pandas #DataScience #LearningJourney #TechGrowth
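A minimal sketch of the CSV-to-Series workflow and of method chaining as described above; the file name 'subs.csv' and its single-column layout are assumptions:

```python
import pandas as pd

# Hypothetical one-column CSV; squeeze() collapses the DataFrame to a Series
subs = pd.read_csv('subs.csv').squeeze()

print(subs.size, subs.dtype, subs.is_unique)   # structural attributes
print(subs.sample(3))                          # 3 random rows, avoids ordering bias

# Method chaining: sort descending, take the top 5, then average them
print(subs.sort_values(ascending=False).head().mean())

print(subs.describe())   # count, mean, std, min/max, quartiles in one call
```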
I wanted to do something different. A simple guide to teach everyone how to REALLY use AI in their day-to-day work.

Here's how I solved one of the most annoying problems in data modelling: DATE FORMATS.

Every source system has its own opinion. One writes 01/04/2025. Another writes Apr 1, 2025. Another writes 20250401. They all mean the same thing — your brain can differentiate all the dates, but your pipeline doesn't know that.

The old fix? Write regex until you question life. Handle every format. Pray you don't miss one. Update it every time a new source lands.

We took a different approach.

─────────────────────────

We deployed Meta Llama 3.2 1B Instruct on Databricks Model Serving — a genuinely lightweight, 1B-parameter model — as a single REST endpoint.

The idea is simple:
→ Load your table into pandas
→ Send each raw date value to the model with one instruction: "convert this to YYYY-MM-DD"
→ Write the standardised result back to the table (in a new column of course, don't overwrite it you silly)

That's it. No hardcoded format list. No regex hell.

─────────────────────────

To handle large tables without waiting forever, we run 20 parallel workers using Python's ThreadPoolExecutor. Each worker fires off requests to the endpoint concurrently — so a table that would take minutes row-by-row finishes in a fraction of the time.

─────────────────────────

Why does this beat hardcoding? The model handles edge cases you'd never think to write rules for — ambiguous formats, locale differences (is 05/06 May 6th or June 5th?), partial dates, nulls.

When your parsing logic needs to change? You update the prompt in one place. Every pipeline picks it up automatically.

A 1B model doing a focused job well beats a complicated rule engine every time.

#Databricks #DataEngineering #DataQuality #LLM #Llama #ModelServing #LakehouseArchitecture #IsaacSong
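For the curious, here is a minimal sketch of the fan-out pattern described above, assuming a chat-style serving endpoint. The URL, token, payload shape, response schema, and column names are all placeholders; Databricks serving payloads vary by deployment, so treat this as an outline rather than the author's exact code:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

# Placeholders: substitute your own workspace URL and token
ENDPOINT = "https://<workspace>/serving-endpoints/llama-3-2-1b/invocations"
HEADERS = {"Authorization": "Bearer <token>"}

def standardise(raw):
    """Ask the model to normalise one raw date string to YYYY-MM-DD."""
    if pd.isna(raw):
        return None   # pass nulls through untouched
    payload = {"messages": [{
        "role": "user",
        "content": f"Convert this date to YYYY-MM-DD. Reply with only the date: {raw}",
    }]}
    resp = requests.post(ENDPOINT, headers=HEADERS, json=payload, timeout=30)
    # Assumes an OpenAI-style response body; adjust to your endpoint's schema
    return resp.json()["choices"][0]["message"]["content"].strip()

df = pd.read_csv("dates.csv")                     # hypothetical source table
with ThreadPoolExecutor(max_workers=20) as pool:  # 20 parallel workers, as in the post
    results = list(pool.map(standardise, df["raw_date"]))
df["date_std"] = results                          # new column; the original stays intact
```

Batching several raw values per request would cut the number of endpoint calls further; the per-value version above simply mirrors the row-by-row description in the post.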
🚀 End-to-End Machine Learning Pipeline – From Data to Deployment

In my recent project, I implemented a complete machine learning workflow covering all stages from data extraction to deployment. Here's the structured pipeline I followed:

🔹 Data Extraction
SQL queries, APIs, and file-based sources

🔹 Data Loading & Transformation
Pandas and NumPy for cleaning, handling missing values, and feature creation

🔹 Exploratory Data Analysis (EDA)
Understanding distributions, correlations, and class imbalance

🔹 Train-Test Split
Using stratified sampling to preserve class distribution

🔹 Feature Engineering & Transformation
ColumnTransformer, StandardScaler, and encoding techniques

🔹 Model Building
Logistic Regression, KNN, Naive Bayes, and ensemble models

🔹 Model Evaluation
Cross-validation with focus on PR-AUC, Recall, and F1-score

🔹 Hyperparameter Tuning
GridSearchCV / RandomizedSearchCV for optimization

🔹 Final Evaluation
Confusion Matrix and Precision-Recall tradeoff analysis

🔹 Deployment
Built an interactive application using Streamlit

💡 Key Learning: Building a model is just one part — designing a robust pipeline and evaluating it correctly is what makes it production-ready.

#MachineLearning #DataScience #MLOps #Python #AI #EndToEnd #Streamlit #DataAnalytics
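A minimal sketch of the middle of that pipeline (stratified split, ColumnTransformer, model, GridSearchCV scored on PR-AUC) on synthetic data; the feature names and parameter grid are illustrative, not the project's actual ones:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in dataset: two numeric features plus one derived categorical
Xn, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                            n_redundant=0, random_state=0)
X = pd.DataFrame(Xn, columns=['feat_a', 'feat_b'])
X['segment'] = np.where(Xn[:, 0] > 0, 'a', 'b')

# Stratified split preserves the class distribution in both halves
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

pre = ColumnTransformer([
    ('num', StandardScaler(), ['feat_a', 'feat_b']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['segment']),
])
pipe = Pipeline([('pre', pre), ('clf', LogisticRegression(max_iter=1000))])

# Tune with cross-validation, scored on PR-AUC (average precision)
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]},
                    scoring='average_precision', cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```

Keeping the preprocessing inside the Pipeline matters: the scaler and encoder are re-fit on each CV fold, so no test-fold statistics leak into training.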
Machine Learning/Artificial Intelligence Day 12

Today, I worked on a large sales dataset and ran 7 different analyses to uncover hidden patterns.

What I did:
First, I loaded the dataset into Jupyter using pandas. The data had thousands of rows with sales records across different regions, products, and shipping methods.

Then I asked specific questions:
1. Which region made the most sales?
2. Which product sold the highest quantity?
3. Which ship mode had the most delays?
4. How do sales trend across different months?
5. Which product category brings in the most revenue?
6. Is there a relationship between discount and profit?
7. Which region prefers which ship mode?

Tools I used:
· pandas – to clean, filter, and group the data
· seaborn & matplotlib – to create histograms, bar charts, pie charts, and line graphs
· Jupyter – for writing and testing my code
· Google Colab – to share the notebook and collaborate
· GitHub – to save, track, and share my work

What I found:
One region alone made up 40% of total sales. One product sold three times more than others. And the fastest ship mode actually had the most late deliveries – a surprising insight.

Why this matters:
For AI/ML, understanding your data before building models is half the work. A good chart can save hours of wrong assumptions. And sharing work on GitHub keeps everything organized and open for feedback.

What I learned today:
EDA is not just about making charts. It is about asking the right questions. Visualization is not just about pretty colors. It is about telling a clear story. Collaboration is not just about sharing files. It is about making your work useful to others.

Learning step by step, staying consistent every day!

#M4ACELearningChallenge #LearningInPublic #30DaysOfAIML #EDA #DataVisualization #Python #pandas #seaborn #matplotlib #GitHub
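Three of those questions expressed in code, as a minimal sketch; the Superstore-style column names ('Region', 'Sales', 'Ship Mode', 'Discount', 'Profit') are a guess at the schema, not confirmed by the post:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('sales.csv')   # hypothetical file and column names

# Q1: which region made the most sales?
print(df.groupby('Region')['Sales'].sum().sort_values(ascending=False))

# Q6: is there a relationship between discount and profit?
sns.scatterplot(data=df, x='Discount', y='Profit')
plt.show()

# Q7: which region prefers which ship mode? (counts per combination)
print(pd.crosstab(df['Region'], df['Ship Mode']))
```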
Building a Machine Learning Model for Time Series Forecasting

Over the past few days, I've been working on a machine learning project focused on predicting future values using real-world financial data.

🔍 What I worked on:
Data collection and preprocessing using pandas
Feature engineering and handling missing values
Implementing regression models such as Linear Regression
Training and evaluating models using scikit-learn
Using historical data to forecast future trends
Visualizing predictions with matplotlib

📊 Key Techniques Applied:
Data cleaning and transformation
Train-test splitting
Model training and evaluation
Time series forecasting using shifted labels
Scaling features for better model performance

📈 What I achieved:
Built a working model that predicts future values based on historical patterns
Compared actual vs predicted results using visual plots
Gained deeper understanding of how machine learning models learn from data

💡 Key takeaway: Machine learning is not just about building models—it's about understanding data, preparing it properly, and interpreting results effectively.

🎯 Next steps:
Improve model accuracy with advanced techniques
Explore additional models and comparisons
Build more real-world projects and expand my portfolio

I'm excited to continue growing in Data Science and Machine Learning and apply these skills to real-world problems.

#MachineLearning #DataScience #Python #AI #DataAnalysis #LearningJourney
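The "shifted labels" idea as a minimal sketch: predict the value a few rows ahead by shifting the target column backwards. The file name and the 'Close' column are assumptions; note shuffle=False so the test set is the chronological tail:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('prices.csv')                 # hypothetical price history

horizon = 5                                    # forecast 5 rows ahead
df['target'] = df['Close'].shift(-horizon)     # shifted label: the future value
df = df.dropna(subset=['target'])              # the last rows have no future label

X, y = df[['Close']], df['target']
# shuffle=False keeps time order: train on the past, test on the recent tail
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

model = LinearRegression().fit(X_tr, y_tr)
print(model.score(X_te, y_te))                 # R² on unseen "future" data
```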
🚀 Day 4 Complete – Real AI/ML Engineering Begins

Today I learned something most beginners ignore 👇
👉 Machine Learning is NOT just about models. It's about data preparation.

💡 In fact:
80% of ML work = cleaning, transforming & understanding data
Only 20% = model building

🔧 What I implemented today:
✔ Data Cleaning using Pandas (handling missing values)
✔ Data Imputation (Mean & Median techniques)
✔ Feature Scaling using MinMaxScaler
✔ Exploratory Data Analysis (EDA)
• Heatmap
• Pairplot
• Histogram
• Boxplot

🐞 Real Bug I Faced:
Tried saving files → got a directory error
Fix? 👉 Learned to handle file systems like a real developer using os.makedirs()

🧠 Key Insight:
Bad data = Bad model
Clean data = Powerful predictions

📊 Biggest Learning:
Visualization helped me see patterns instead of guessing them
✔ Experience strongly impacts Salary
✔ All features showed positive correlation
✔ Dataset was clean with no major outliers

🚀 This journey is changing my mindset: from writing code ➡ to thinking like an engineer

#AI #MachineLearning #DataScience #LearningInPublic #Python #GitHub #EDA #100DaysOfCode #TechJourney
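A minimal sketch of the three fixes mentioned above (mean/median imputation, MinMaxScaler, and the os.makedirs() directory fix) on an invented experience/salary toy frame:

```python
import os

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented toy data with missing values
df = pd.DataFrame({'experience': [1, 3, None, 7, 10],
                   'salary': [30_000, 50_000, 45_000, None, 120_000]})

# Imputation: median is robust to outliers, mean is the simple default
df['experience'] = df['experience'].fillna(df['experience'].median())
df['salary'] = df['salary'].fillna(df['salary'].mean())

# Feature scaling: squeeze both columns into [0, 1]
df[['experience', 'salary']] = MinMaxScaler().fit_transform(
    df[['experience', 'salary']])

# The directory bug: create the output folder before saving
os.makedirs('outputs', exist_ok=True)
df.to_csv('outputs/clean.csv', index=False)
```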
Day 23: Real-World Data Ingestion & Feature Extraction in Pandas 🐍🤖

To build autonomous Agents and robust RAG pipelines, you need a flawless data foundation. Today, I completed my Pandas deep dive, shifting away from theory and executing end-to-end data extraction on messy, real-world datasets.

Here are the core engineering takeaways from the final projects:

🌍 Real-World Data Ingestion: Handled importing and profiling massive .csv datasets. In Generative AI, this is step zero. Before an LLM can process a document, the raw data must be loaded and structured cleanly in memory.

🧩 Advanced Feature Extraction: Applied custom Python functions across unstructured text columns to parse hidden variables and generate brand-new, clean data points. This is exactly how you generate high-quality metadata to enrich documents before feeding them into a Vector Database.

🔎 Precision Querying: Chained operations like .loc, .nlargest(), and conditional masking to extract highly specific insights. When building Agentic AI, writing this backend logic is how you give an Agent a functional "Database Search" tool.

With NumPy (matrix math) and Pandas (data wrangling) officially locked in, the computational architecture is set. It is finally time to start building the "brain".

#Python #GenAI #AgenticAI #MachineLearning #Pandas #LangChain #DataEngineering #100DaysOfCode
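What feature extraction plus precision querying can look like in practice, as a minimal sketch; the file, the "title (year)" text pattern, and the column names are all hypothetical:

```python
import pandas as pd

df = pd.read_csv('listings.csv')   # hypothetical messy real-world CSV

# Feature extraction: parse a hidden 'year' out of titles like "Dune (2021)"
def extract_year(title):
    text = str(title)
    return int(text.split('(')[-1].rstrip(')')) if '(' in text else None

df['year'] = pd.to_numeric(df['title'].apply(extract_year), errors='coerce')

# Precision querying: conditional mask + .loc + .nlargest chained together
top = (df.loc[(df['year'] >= 2015) & df['price'].notna()]
         .nlargest(5, 'price')[['title', 'price']])
print(top)
```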
🚀 Day 19 of My AI & Machine Learning Journey

Today I learned about one of the most important concepts in data analysis — the Pandas DataFrame.

💡 A DataFrame is like a table (rows + columns), and each column is a Series.

🔹 Creating a DataFrame
We can create a DataFrame in different ways:

Using a list:
students_data = [[100,80,10],[90,70,7]]
pd.DataFrame(students_data, columns=['iq','marks','package'])

Using a dictionary:
data = {'iq':[100,90],'marks':[80,70],'package':[10,7]}
pd.DataFrame(data)

Using a CSV (real-world data):
pd.read_csv('file.csv')

🔹 DataFrame Attributes
• shape → number of rows & columns
• dtypes → data types
• columns → column names
• values → actual data
Example:
movies.shape

🔹 Important Methods
• head() → first rows
• tail() → last rows
• sample() → random rows
• info() → dataset info
• describe() → statistics
Example:
movies.head()
movies.describe()

🔹 Handling Data
• isnull().sum() → missing values
• duplicated().sum() → duplicate rows
• rename() → rename columns
Example:
students.rename(columns={'marks':'percent'})

🔹 Mathematical Operations
• sum()
• mean()
• median()
Example:
students.mean()
students.sum(axis=1)

🔹 Selecting Data
Single column → Series:
movies['title']
Multiple columns → DataFrame:
movies[['title','year']]

🔹 Setting the Index
We can set a column as the index:
students.set_index('name', inplace=True)

💡 Biggest Takeaway: DataFrame is the backbone of data analysis — every ML project starts with understanding data properly.

Learning with practical examples 🚀

#MachineLearning #Python #Pandas #DataFrame #DataScience #LearningJourney #TechGrowth
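The post's snippets consolidated into one runnable cell; the constructors are taken verbatim from the post, and the inspection calls simply apply its listed attributes and methods to that data:

```python
import pandas as pd

# Construct the same small table two ways, as in the post
students_data = [[100, 80, 10], [90, 70, 7]]
df_from_list = pd.DataFrame(students_data, columns=['iq', 'marks', 'package'])

data = {'iq': [100, 90], 'marks': [80, 70], 'package': [10, 7]}
students = pd.DataFrame(data)

print(students.shape, list(students.columns))   # attributes
print(students.describe())                      # summary statistics
print(students.isnull().sum())                  # missing values per column
print(students.duplicated().sum())              # duplicate rows
print(students.sum(axis=1))                     # row-wise totals
students = students.rename(columns={'marks': 'percent'})
```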
🔵 Machine learning project to predict California house prices using the Scikit-learn dataset 🔵

1. Data Loading
I imported the California Housing dataset from Scikit-learn, converted it into a pandas DataFrame, and added the target column (MedHouseValue), which represents median house prices.

2. Data Exploration
I checked the dataset structure, visualized distributions using histograms, and checked relationships between features using a correlation heatmap. It helped me understand which features might influence house prices and how the variables relate to each other.

3. Split data into training and testing
I separated features (X) and target (y), then split the data into 80% training and 20% testing.

4. Feature Scaling
I used StandardScaler to normalize the features. Linear models perform better when features are on the same scale, and it helps training stability.

5. Linear Regression
I trained a Linear Regression model. The results:
MAE ≈ 0.53
RMSE ≈ 0.72
R² ≈ 0.61
My model explains about 61% of the variation in house prices. Errors are moderate → predictions are okay but not great. The scatter plot showed predictions somewhat aligned, but not tightly. Linear regression is too simple to fully capture housing market complexity.

6. Random Forest
I trained a Random Forest Regressor (an ensemble of decision trees). Results:
MAE ≈ 0.33
RMSE ≈ 0.50
R² ≈ 0.81
My model now explains about 81% of the variation. Errors are much smaller than Linear Regression, and predictions are much closer to actual values.

7. Conclusion
Random Forest clearly performed better because:
It captures non-linear relationships
It handles complex feature interactions
It is more flexible than linear models

#python #machinelearning #ml #datascience #ai #linearregression #randomforest #supervisedlearning #project #learning
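A minimal sketch of that comparison. Note that scikit-learn itself names the target column 'MedHouseVal'; exact scores depend on the split and hyperparameters, so expect numbers near, not identical to, the ones quoted above:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target                 # target column: MedHouseVal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale for the linear model; fit on train only to avoid leakage
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Trees don't need scaling, so the forest gets the raw features
for name, model, Xa, Xb in [
    ('LinearRegression', LinearRegression(), X_tr_s, X_te_s),
    ('RandomForest', RandomForestRegressor(random_state=42), X_tr, X_te),
]:
    pred = model.fit(Xa, y_tr).predict(Xb)
    print(f'{name}: MAE={mean_absolute_error(y_te, pred):.2f}, '
          f'R²={r2_score(y_te, pred):.2f}')
```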