Getting Started with Pandas in Python
📖 Introduction
Pandas is an open-source Python library for data manipulation, analysis, and cleaning. It provides two main data structures, the Series (one-dimensional) and the DataFrame (two-dimensional), that make working with structured data intuitive and fast.
With Pandas, you can load, inspect, summarize, group, and reshape tabular data, all of which the sections below cover.
In this tutorial, we will walk through the fundamental operations in Pandas step by step. To make concepts more concrete, we will use demo datasets inspired by the California Housing dataset and the MNIST digit recognition dataset. Along the way, we will show both code snippets and example outputs, ensuring you can follow along and replicate the workflow in your own projects.
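To make the two data structures concrete, here is a minimal sketch. The variable names `ages` and `frame` are illustrative only and are not used elsewhere in this tutorial.

```python
import pandas as pd

# A Series is a single labelled column of values (one-dimensional)
ages = pd.Series([41, 21, 52], name="HouseAge")

# A DataFrame is a table of named columns (two-dimensional)
frame = pd.DataFrame({"HouseAge": [41, 21, 52], "MedInc": [8.3, 7.1, 6.5]})

print(ages.ndim, frame.ndim)   # 1 2

# Selecting one column of a DataFrame gives back a Series
column = frame["HouseAge"]
print(type(column).__name__)   # Series
```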
1. Importing Libraries
import pandas as pd
import numpy as np
2. Loading a Demo Dataset
In practice, we might load the California Housing dataset using scikit-learn. Here, we simulate a smaller demo dataset:
california_demo = pd.DataFrame({
"MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
"HouseAge": [41, 21, 52, 12, 18],
"AveRooms": [6.5, 7.3, 5.9, 4.1, 6.0],
"AveOccup": [2.5, 3.1, 2.0, 1.8, 2.2],
"Latitude": [34.19, 36.77, 33.84, 35.48, 32.82]
})
california_demo.head()
Output:
MedInc HouseAge AveRooms AveOccup Latitude
0 8.3 41 6.5 2.5 34.19
1 7.1 21 7.3 3.1 36.77
2 6.5 52 5.9 2.0 33.84
3 2.3 12 4.1 1.8 35.48
4 3.8 18 6.0 2.2 32.82
3. Shape of DataFrame
california_demo.shape
Output:
(5, 5)
This dataset has 5 rows and 5 columns.
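Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a two-column subset of the demo values, and the name `demo` is an assumption made for illustration.

```python
import pandas as pd

# Two columns taken from the demo values above
demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "HouseAge": [41, 21, 52, 12, 18],
})

n_rows, n_cols = demo.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)         # 5 2
print(len(demo))              # 5, the number of rows
print(demo.size)              # 10, rows * columns
```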
4. Reading CSV Files
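CSV files are normally loaded with pd.read_csv. For demonstration, we instead build a mini MNIST-style dataset in memory, with one label column and three pixel columns: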
mnist_demo = pd.DataFrame({
"label": [7, 2, 1, 0, 4],
"pixel1": [0, 0, 0, 0, 0],
"pixel2": [0, 255, 128, 64, 32],
"pixel3": [0, 0, 0, 0, 0]
})
mnist_demo.head()
Output:
label pixel1 pixel2 pixel3
0 7 0 0 0
1 2 0 255 0
2 1 0 128 0
3 0 0 64 0
4 4 0 32 0
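Since this section builds its data in memory, here is a small sketch of what reading the same kind of data from CSV could look like. The CSV text is a made-up stand-in for a file on disk; `pd.read_csv` also accepts a file path in place of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """label,pixel1,pixel2,pixel3
7,0,0,0
2,0,255,0
1,0,128,0
"""

# read_csv accepts a path or any file-like object
mnist_from_csv = pd.read_csv(io.StringIO(csv_text))

print(mnist_from_csv.shape)           # (3, 4)
print(list(mnist_from_csv.columns))   # ['label', 'pixel1', 'pixel2', 'pixel3']
```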
5. Exporting to CSV
california_demo.to_csv("california_demo.csv", index=False)
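This writes the DataFrame to california_demo.csv in the current working directory. Passing index=False leaves the row index out of the file.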
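To see what the index argument changes, the sketch below calls `to_csv` without a path, which returns the CSV text instead of writing a file. The small `demo` frame is an assumption made for illustration.

```python
import io
import pandas as pd

demo = pd.DataFrame({"MedInc": [8.3, 7.1], "HouseAge": [41, 21]})

with_index = demo.to_csv()                 # no path given, so the CSV text is returned
without_index = demo.to_csv(index=False)

print(with_index.splitlines()[0])          # ,MedInc,HouseAge
print(without_index.splitlines()[0])       # MedInc,HouseAge

# Reading the indexed version back produces an extra unnamed column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))              # ['Unnamed: 0', 'MedInc', 'HouseAge']
```

Writing with index=False avoids that extra column when the file is read back.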
6. Creating a Random DataFrame
random_df = pd.DataFrame(np.random.rand(4, 3))
random_df
Output (your values will differ, since no random seed is set):
0 1 2
0 0.238485 0.876123 0.132942
1 0.653210 0.481772 0.903324
2 0.120984 0.392114 0.725892
3 0.994812 0.211847 0.554008
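If reproducible values are needed, one option is a seeded NumPy generator. This is a sketch, not the approach used above: the seed value 42 and the column names are arbitrary choices.

```python
import numpy as np
import pandas as pd

columns = ["a", "b", "c"]  # illustrative column names

# A seeded generator produces the same numbers on every run
rng = np.random.default_rng(seed=42)
random_df = pd.DataFrame(rng.random((4, 3)), columns=columns)

# A second generator with the same seed reproduces the table exactly
same_df = pd.DataFrame(np.random.default_rng(seed=42).random((4, 3)), columns=columns)

print(random_df.equals(same_df))   # True
```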
7. Viewing Rows
california_demo.head(3) # first 3 rows
Output:
MedInc HouseAge AveRooms AveOccup Latitude
0 8.3 41 6.5 2.5 34.19
1 7.1 21 7.3 3.1 36.77
2 6.5 52 5.9 2.0 33.84
california_demo.tail(2) # last 2 rows
Output:
MedInc HouseAge AveRooms AveOccup Latitude
3 2.3 12 4.1 1.8 35.48
4 3.8 18 6.0 2.2 32.82
8. DataFrame Info
california_demo.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 5 non-null float64
1 HouseAge 5 non-null int64
2 AveRooms 5 non-null float64
3 AveOccup 5 non-null float64
4 Latitude 5 non-null float64
9. Missing Values
california_demo.isnull().sum()
Output:
MedInc 0
HouseAge 0
AveRooms 0
AveOccup 0
Latitude 0
dtype: int64
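The demo data has no gaps, so every count is zero. The sketch below uses a made-up frame with two missing entries to show what non-zero counts look like, along with two common ways to handle them. Filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry per column
demo = pd.DataFrame({
    "MedInc": [8.3, np.nan, 6.5],
    "HouseAge": [41, 21, np.nan],
})

missing_per_column = demo.isnull().sum()
print(missing_per_column.tolist())   # [1, 1]

# Option 1: drop every row that contains a missing value
complete_rows = demo.dropna()
print(len(complete_rows))            # 1

# Option 2: fill each gap with that column's mean
filled = demo.fillna(demo.mean())
print(int(filled.isnull().sum().sum()))   # 0
```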
10. Value Counts
mnist_demo.value_counts("label")
Output:
label
0 1
1 1
2 1
4 1
7 1
Name: count, dtype: int64
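Every label in the demo appears once, so the counts are not very informative. The sketch below uses made-up labels with repeats and also shows the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

# Made-up labels with repeats
labels = pd.Series([7, 2, 7, 1, 7, 2], name="label")

counts = labels.value_counts()                 # sorted by count, largest first
shares = labels.value_counts(normalize=True)   # fractions that sum to 1

print(counts.tolist())   # [3, 2, 1]
print(counts.index.tolist())   # [7, 2, 1]
print(shares[7])         # 0.5
```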
11. Grouping Data
california_demo.groupby("HouseAge").mean()
Output:
MedInc AveRooms AveOccup Latitude
HouseAge
12 2.3 4.1 1.8 35.48
18 3.8 6.0 2.2 32.82
21 7.1 7.3 3.1 36.77
41 8.3 6.5 2.5 34.19
52 6.5 5.9 2.0 33.84
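Each HouseAge value is unique here, so every group holds a single row and the means equal the original values. Grouping becomes more useful when rows share a key. The sketch below adds a hypothetical AgeGroup column, with an arbitrary cutoff of 30 years, to illustrate.

```python
import pandas as pd

demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "HouseAge": [41, 21, 52, 12, 18],
})

# Hypothetical bucket: "old" when HouseAge is 30 or more, otherwise "new"
demo["AgeGroup"] = demo["HouseAge"].apply(lambda age: "old" if age >= 30 else "new")

# Mean income per bucket: "new" averages 7.1, 2.3, 3.8 and "old" averages 8.3, 6.5
mean_income = demo.groupby("AgeGroup")["MedInc"].mean()
print(mean_income.round(2).to_dict())   # {'new': 4.4, 'old': 7.4}
```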
12. Statistical Functions
california_demo.count()
Output:
MedInc 5
HouseAge 5
AveRooms 5
AveOccup 5
Latitude 5
dtype: int64
california_demo.mean()
Output:
MedInc 5.6
HouseAge 28.8
AveRooms 5.96
AveOccup 2.32
Latitude 34.62
dtype: float64
california_demo.describe()
Output:
MedInc HouseAge AveRooms AveOccup Latitude
count 5.000000 5.000000 5.000000 5.000000 5.000000
mean 5.600000 28.800000 5.960000 2.320000 34.620000
std 2.473863 16.932218 1.178134 0.506952 1.532596
min 2.300000 12.000000 4.100000 1.800000 32.820000
25% 3.800000 18.000000 5.900000 2.000000 33.840000
50% 6.500000 21.000000 6.000000 2.200000 34.190000
75% 7.100000 41.000000 6.500000 2.500000 35.480000
max 8.300000 52.000000 7.300000 3.100000 36.770000
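The std row reports the sample standard deviation (dividing by n - 1), which is the Pandas default. As a quick check, this sketch recomputes the AveRooms statistics and compares them with NumPy using `ddof=1`.

```python
import numpy as np
import pandas as pd

ave_rooms = pd.Series([6.5, 7.3, 5.9, 4.1, 6.0], name="AveRooms")

print(round(ave_rooms.mean(), 2))   # 5.96
print(round(ave_rooms.std(), 6))    # 1.178134

# NumPy defaults to ddof=0, so ddof=1 is needed to match pandas
sample_std = np.std(ave_rooms.to_numpy(), ddof=1)
population_std = np.std(ave_rooms.to_numpy())   # smaller, divides by n
```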
13. Adding and Dropping Columns
california_demo["Price"] = [300, 250, 200, 150, 180]
california_demo.head()
Output:
MedInc HouseAge AveRooms AveOccup Latitude Price
0 8.3 41 6.5 2.5 34.19 300
1 7.1 21 7.3 3.1 36.77 250
2 6.5 52 5.9 2.0 33.84 200
3 2.3 12 4.1 1.8 35.48 150
4 3.8 18 6.0 2.2 32.82 180
california_demo.drop(columns="Price")
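This returns a new DataFrame without the Price column. The original california_demo is unchanged unless you assign the result back to it, which is why Price still appears in the outputs that follow.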
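A short sketch of this behavior, using a small stand-in frame named `demo` (an illustrative name):

```python
import pandas as pd

demo = pd.DataFrame({"MedInc": [8.3, 7.1], "Price": [300, 250]})

# drop returns a copy and leaves the original untouched
without_price = demo.drop(columns="Price")
print("Price" in demo.columns)            # True
print("Price" in without_price.columns)   # False

# Assign the result back to make the removal permanent
demo = demo.drop(columns="Price")
print(list(demo.columns))                 # ['MedInc']
```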
14. Indexing and Selection
california_demo.iloc[2]
Output:
MedInc 6.50
HouseAge 52.00
AveRooms 5.90
AveOccup 2.00
Latitude 33.84
Price 200.00
Name: 2, dtype: float64
california_demo.iloc[:, 0] # first column
Output:
0 8.3
1 7.1
2 6.5
3 2.3
4 3.8
Name: MedInc, dtype: float64
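`iloc` also accepts row and column positions together. The sketch below selects a block and a single cell, and shows the label-based `loc` equivalent for comparison. `loc` is not covered above and appears here only to contrast the slicing rules.

```python
import pandas as pd

demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "HouseAge": [41, 21, 52, 12, 18],
    "AveRooms": [6.5, 7.3, 5.9, 4.1, 6.0],
})

# Position-based: rows 1 and 2, first two columns (end positions are excluded)
block = demo.iloc[1:3, 0:2]

# Label-based: the end label IS included, so 1:2 covers the same two rows
same_block = demo.loc[1:2, ["MedInc", "HouseAge"]]
print(block.equals(same_block))   # True

# A single cell: row position 2, column position 1
cell = demo.iloc[2, 1]
print(cell)                       # 52
```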
15. Correlation
california_demo.corr()
Output:
MedInc HouseAge AveRooms AveOccup Latitude Price
MedInc 1.000000 0.698291 0.813164 0.657827 0.116315 0.928687
HouseAge 0.698291 1.000000 0.320326 -0.037279 -0.338340 0.468684
AveRooms 0.813164 0.320326 1.000000 0.889064 0.141781 0.761457
AveOccup 0.657827 -0.037279 0.889064 1.000000 0.520624 0.708833
Latitude 0.116315 -0.338340 0.141781 0.520624 1.000000 0.157868
Price 0.928687 0.468684 0.761457 0.708833 0.157868 1.000000
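By default, `corr()` computes Pearson correlation coefficients. To check a single pair, you can correlate two columns directly. The sketch below does this for MedInc and Price and cross-checks the result against NumPy.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "Price": [300, 250, 200, 150, 180],
})

# Pearson correlation between two columns
r_pandas = demo["MedInc"].corr(demo["Price"])

# Same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(demo["MedInc"].to_numpy(), demo["Price"].to_numpy())[0, 1]

print(round(r_pandas, 6))   # 0.928687
```

With only five rows, these coefficients are illustrative and say little about the real California Housing data.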