Getting Started with Pandas in Python

Mostafa Shariare

Published Sep 20, 2025

📖 Introduction

Pandas is an open-source Python library for data manipulation, analysis, and cleaning. It is built around two data structures, the one-dimensional Series and the two-dimensional DataFrame, which make structured (tabular) data straightforward to load, inspect, and transform.

With Pandas, you can:

Import data from a variety of file formats (CSV, Excel, SQL, JSON, etc.)
Perform data wrangling tasks such as filtering, grouping, reshaping, and merging
Handle missing data and perform statistical operations with ease
Explore, visualize, and prepare datasets for machine learning or reporting

In this tutorial, we will walk through the fundamental operations in Pandas step by step. To make concepts more concrete, we will use demo datasets inspired by the California Housing dataset and the MNIST digit recognition dataset. Along the way, we will show both code snippets and example outputs, ensuring you can follow along and replicate the workflow in your own projects.

1. Importing Libraries

import pandas as pd
import numpy as np

2. Loading a Demo Dataset

In practice, we might load the California Housing dataset using scikit-learn. Here, we simulate a smaller demo dataset:

california_demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "HouseAge": [41, 21, 52, 12, 18],
    "AveRooms": [6.5, 7.3, 5.9, 4.1, 6.0],
    "AveOccup": [2.5, 3.1, 2.0, 1.8, 2.2],
    "Latitude": [34.19, 36.77, 33.84, 35.48, 32.82]
})

california_demo.head()

Output:

   MedInc  HouseAge  AveRooms  AveOccup  Latitude
0     8.3        41      6.5       2.5     34.19
1     7.1        21      7.3       3.1     36.77
2     6.5        52      5.9       2.0     33.84
3     2.3        12      4.1       1.8     35.48
4     3.8        18      6.0       2.2     32.82

3. Shape of DataFrame

california_demo.shape

Output:

(5, 5)

This dataset has 5 rows and 5 columns.

4. Reading CSV Files

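Real-world data usually arrives in files, and pd.read_csv("path/to/file.csv") loads a CSV file into a DataFrame. No data file ships with this tutorial, so for demonstration let's build a mini MNIST-style dataset directly: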

mnist_demo = pd.DataFrame({
    "label": [7, 2, 1, 0, 4],
    "pixel1": [0, 0, 0, 0, 0],
    "pixel2": [0, 255, 128, 64, 32],
    "pixel3": [0, 0, 0, 0, 0]
})

mnist_demo.head()

Output:

   label  pixel1  pixel2  pixel3
0      7       0       0       0
1      2       0     255       0
2      1       0     128       0
3      0       0      64       0
4      4       0      32       0
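
To connect this section back to actual CSV parsing, here is a minimal sketch of pd.read_csv. It assumes a small CSV payload held in a string (the csv_text variable is hypothetical and stands in for a file on disk), since read_csv accepts file-like objects as well as paths.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """label,pixel1,pixel2,pixel3
7,0,0,0
2,0,255,0
1,0,128,0
"""

# read_csv accepts a file path or any file-like object.
mnist_from_csv = pd.read_csv(io.StringIO(csv_text))

print(mnist_from_csv.shape)
print(mnist_from_csv.dtypes)
```

With a real file, the io.StringIO wrapper would simply be replaced by the file's path.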

5. Exporting to CSV

california_demo.to_csv("california_demo.csv", index=False)

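This writes the DataFrame to california_demo.csv in the current working directory. Passing index=False leaves the row index (0, 1, 2, ...) out of the file.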
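
The effect of the index argument is easier to see side by side. The sketch below assumes a two-row frame (df is a hypothetical stand-in for the demo data) and relies on to_csv returning the CSV text as a string when no path is given, so nothing is written to disk.

```python
import pandas as pd

# Hypothetical two-row frame used only for this comparison.
df = pd.DataFrame({"MedInc": [8.3, 7.1], "HouseAge": [41, 21]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)     # header starts with an empty field for the index column
print(without_index)  # only the two data columns
```

Dropping the index is usually what you want when the index is just the default row counter, because otherwise it comes back as an extra unnamed column on the next read.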

6. Creating a Random DataFrame

random_df = pd.DataFrame(np.random.rand(4, 3))
random_df

Output (the values are random, so yours will differ on every run):

          0         1         2
0  0.238485  0.876123  0.132942
1  0.653210  0.481772  0.903324
2  0.120984  0.392114  0.725892
3  0.994812  0.211847  0.554008
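
If you need the same random values on every run, one option is a seeded NumPy generator. This is a sketch rather than part of the original workflow: the seed value 42 and the column names a, b, c are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same sequence on every run (seed is arbitrary).
rng = np.random.default_rng(seed=42)
seeded_df = pd.DataFrame(rng.random((4, 3)), columns=["a", "b", "c"])

# A second generator with the same seed reproduces the same frame.
rng_again = np.random.default_rng(seed=42)
same_df = pd.DataFrame(rng_again.random((4, 3)), columns=["a", "b", "c"])

print(seeded_df.equals(same_df))
```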

7. Viewing Rows

california_demo.head(3)   # first 3 rows

Output:

   MedInc  HouseAge  AveRooms  AveOccup  Latitude
0     8.3        41      6.5       2.5     34.19
1     7.1        21      7.3       3.1     36.77
2     6.5        52      5.9       2.0     33.84

california_demo.tail(2)   # last 2 rows

Output:

   MedInc  HouseAge  AveRooms  AveOccup  Latitude
3     2.3        12      4.1       1.8     35.48
4     3.8        18      6.0       2.2     32.82

8. DataFrame Info

california_demo.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   MedInc    5 non-null      float64
 1   HouseAge  5 non-null      int64  
 2   AveRooms  5 non-null      float64
 3   AveOccup  5 non-null      float64
 4   Latitude  5 non-null      float64

9. Missing Values

california_demo.isnull().sum()

Output:

MedInc      0
HouseAge    0
AveRooms    0
AveOccup    0
Latitude    0
dtype: int64
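
The demo dataset has no gaps, so every count is zero. To see the same check on data that does have gaps, here is a small sketch with NaN values inserted on purpose (the demo frame below is hypothetical), followed by two common ways to handle them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
demo = pd.DataFrame({
    "MedInc": [8.3, np.nan, 6.5, 2.3],
    "HouseAge": [41, 21, np.nan, 12],
})

missing_per_column = demo.isnull().sum()   # count of NaN values in each column
dropped = demo.dropna()                    # keep only rows with no missing values
filled = demo.fillna(demo.mean())          # replace NaN with each column's mean

print(missing_per_column)
print(dropped)
print(filled)
```

Whether dropping or filling is appropriate depends on the dataset; mean imputation is shown here only because it is short.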

10. Value Counts

mnist_demo.value_counts("label")

Output:

label
0    1
1    1
2    1
4    1
7    1
Name: count, dtype: int64
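
Every label in the demo appears once, which hides what value_counts is for. The sketch below assumes a hypothetical label column with repeats and also shows the normalize option, which returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical labels with repeated values.
labels = pd.DataFrame({"label": [7, 2, 7, 0, 7, 2]})

counts = labels["label"].value_counts()                 # most frequent first
shares = labels["label"].value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```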

11. Grouping Data

california_demo.groupby("HouseAge").mean()

Output:

          MedInc  AveRooms  AveOccup  Latitude
HouseAge                                        
12          2.3      4.1      1.8     35.48
18          3.8      6.0      2.2     32.82
21          7.1      7.3      3.1     36.77
41          8.3      6.5      2.5     34.19
52          6.5      5.9      2.0     33.84
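
Because every HouseAge value is unique, each group above holds a single row and the means are just the original values. Grouping becomes more informative once rows share a key. As a sketch, the example below bins HouseAge into age bands with pd.cut before grouping; the band edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "HouseAge": [41, 21, 52, 12, 18],
})

# Hypothetical age bands; intervals are (0, 20], (20, 40], (40, 60].
demo["AgeBand"] = pd.cut(
    demo["HouseAge"],
    bins=[0, 20, 40, 60],
    labels=["0-20", "21-40", "41-60"],
)

# Average income per band; observed=True keeps only bands that occur in the data.
band_means = demo.groupby("AgeBand", observed=True)["MedInc"].mean()
print(band_means)
```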

12. Statistical Functions

california_demo.count()

Output:

MedInc      5
HouseAge    5
AveRooms    5
AveOccup    5
Latitude    5
dtype: int64

california_demo.mean()

Output:

MedInc       5.60
HouseAge    28.80
AveRooms     5.96
AveOccup     2.32
Latitude    34.62
dtype: float64

california_demo.describe()

Output:

         MedInc   HouseAge   AveRooms   AveOccup   Latitude
count  5.000000   5.000000   5.000000   5.000000   5.000000
mean   5.600000  28.800000   5.960000   2.320000  34.620000
std    2.473863  16.932218   1.178134   0.506952   1.532596
min    2.300000  12.000000   4.100000   1.800000  32.820000
25%    3.800000  18.000000   5.900000   2.000000  33.840000
50%    6.500000  21.000000   6.000000   2.200000  34.190000
75%    7.100000  41.000000   6.500000   2.500000  35.480000
max    8.300000  52.000000   7.300000   3.100000  36.770000
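
One detail worth knowing about the std row: pandas computes the sample standard deviation (dividing by n - 1), while NumPy's np.std defaults to the population version (dividing by n). The sketch below compares the two on the MedInc values from the demo dataset.

```python
import numpy as np
import pandas as pd

med_inc = pd.Series([8.3, 7.1, 6.5, 2.3, 3.8], name="MedInc")

sample_std = med_inc.std()               # pandas default: ddof=1 (divide by n - 1)
population_std = med_inc.std(ddof=0)     # divide by n instead
numpy_std = np.std(med_inc.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

On a five-row sample the gap between the two is noticeable; on large datasets it becomes negligible.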

13. Adding and Dropping Columns

california_demo["Price"] = [300, 250, 200, 150, 180]
california_demo.head()

Output:

   MedInc  HouseAge  AveRooms  AveOccup  Latitude  Price
0     8.3        41      6.5       2.5     34.19    300
1     7.1        21      7.3       3.1     36.77    250
2     6.5        52      5.9       2.0     33.84    200
3     2.3        12      4.1       1.8     35.48    150
4     3.8        18      6.0       2.2     32.82    180

california_demo.drop(columns="Price")

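This returns a new DataFrame without the Price column. california_demo itself is unchanged unless you assign the result back, which is why Price still appears in the sections that follow.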
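
The distinction between returning a copy and modifying in place is a common stumbling block, so here is a short sketch that makes it visible. The two-row demo frame is a hypothetical stand-in for california_demo.

```python
import pandas as pd

demo = pd.DataFrame({"MedInc": [8.3, 7.1], "Price": [300, 250]})

# drop returns a new DataFrame; the original keeps its columns.
without_price = demo.drop(columns="Price")
original_still_has_price = "Price" in demo.columns

# Assign the result back to make the change stick.
demo = demo.drop(columns="Price")

print(original_still_has_price)
print(list(without_price.columns))
print(list(demo.columns))
```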

14. Indexing and Selection

california_demo.iloc[2]

Output:

MedInc       6.50
HouseAge    52.00
AveRooms     5.90
AveOccup     2.00
Latitude    33.84
Price      200.00
Name: 2, dtype: float64

california_demo.iloc[:, 0]   # first column

Output:

0    8.3
1    7.1
2    6.5
3    2.3
4    3.8
Name: MedInc, dtype: float64
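
iloc selects by integer position. Its label-based counterpart is loc, and boolean masks cover the filtering mentioned in the introduction. The following sketch uses two columns of the demo data; the thresholds are arbitrary.

```python
import pandas as pd

demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "HouseAge": [41, 21, 52, 12, 18],
})

# loc selects by index label and column name.
by_label = demo.loc[2, "HouseAge"]

# A boolean mask keeps only the rows where the condition holds.
high_income = demo[demo["MedInc"] > 6.0]

# loc can combine a row mask with a column selection.
young_houses = demo.loc[demo["HouseAge"] < 20, ["MedInc"]]

print(by_label)
print(high_income)
print(young_houses)
```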

15. Correlation

california_demo.corr().round(2)   # Pearson correlation, rounded for readability

Output:

          MedInc  HouseAge  AveRooms  AveOccup  Latitude  Price
MedInc      1.00      0.70      0.81      0.66      0.12   0.93
HouseAge    0.70      1.00      0.32     -0.04     -0.34   0.47
AveRooms    0.81      0.32      1.00      0.89      0.14   0.76
AveOccup    0.66     -0.04      0.89      1.00      0.52   0.71
Latitude    0.12     -0.34      0.14      0.52      1.00   0.16
Price       0.93      0.47      0.76      0.71      0.16   1.00
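
To see what a single entry of this matrix means, the sketch below computes the Pearson correlation between MedInc and Price by hand and compares it with pandas. It uses the same values as the demo dataset; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "MedInc": [8.3, 7.1, 6.5, 2.3, 3.8],
    "Price": [300, 250, 200, 150, 180],
})

# Deviations of each column from its own mean.
x = demo["MedInc"] - demo["MedInc"].mean()
y = demo["Price"] - demo["Price"].mean()

# Pearson r: sum of cross-products over the square root of the product of sums of squares.
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
pandas_r = demo["MedInc"].corr(demo["Price"])

print(manual_r, pandas_r)
```

Keep in mind that with only five rows, any correlation here is illustrative rather than statistically meaningful.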
