Netflix data analytics using Python

Netflix is one of the best-known entertainment services. It operates in over 190 countries and offers TV shows, movies and games across a variety of genres. In this project I drew some insights on popular Netflix categories using Python and a Jupyter notebook. I will share step-by-step details on data preprocessing, data transformation and analytics using Python. Here we go.

Introduction

The dataset was sourced from Kaggle (link here). It has one file named netflix_titles.csv with 12 columns ('show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description') that describe each title.

Step 1: Import all necessary libraries and load the dataset netflix_titles.csv

import pandas as pd
import numpy as np
import panel as pn
from datetime import datetime 
import hvplot.pandas
import seaborn as sns
import matplotlib.pyplot as plt

# loading the dataset
df = pd.read_csv('netflix_titles.csv', skipinitialspace=True)

# shape of dataset
'''After loading the dataset, I inspected the data volume. I used the shape attribute, which returns the number of rows and columns in the dataframe. In this case, the dataframe had 8807 rows and 12 columns.'''
df.shape
(8807, 12)

# Data information

Next, I peeked at the structure of the data. I used the info() function to print the columns, dtypes, non-null counts and memory usage of the dataframe.

df.info()
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

# Data Preprocessing: Cleaning

I used isna().sum() to count the missing values in each column of this dataset.

df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
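
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame; the name `sample` and its values are illustrative assumptions, not rows from netflix_titles.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
sample = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "rating": ["TV-MA", "PG", None, "R"],
    "title": ["t1", "t2", "t3", "t4"],
})

# Number of missing values per column
missing_count = sample.isna().sum()
# isna().mean() is the fraction of missing rows per column
missing_pct = (sample.isna().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

Run on the full dataframe, the same two lines would show at a glance which columns are mostly complete and which are not.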

We can see that the director, cast, country, date_added, rating and duration columns contain null values. So, I used the fillna() function to fill four of these columns (rating, cast, country and director) with a placeholder string.

df.fillna({'rating':'Unavailable','cast':'Unavailable','country':'Unavailable','director':'Unavailable'},inplace=True)

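For date_added, I filled the null values with the most recent date in the column.

# date_added is stored as text, so parse it before taking the maximum
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
most_recent_date = df['date_added'].max()
df.fillna({'date_added': most_recent_date}, inplace=True)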
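
One thing to watch when filling dates: a column read as text compares alphabetically, so its maximum is not necessarily the latest date. Here is a small sketch on made-up date strings in the "Month day, year" style; the values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical date strings, including one missing value
dates = pd.Series(["September 9, 2019", "January 1, 2021", None])

# max() on text is alphabetical, so "September ..." beats "January ..."
text_max = dates.dropna().max()

# Parsing to datetime first makes max() chronological
parsed = pd.to_datetime(dates, format="%B %d, %Y")
true_max = parsed.max()

# Fill the gap with the chronologically latest date
filled = parsed.fillna(true_max)

print(text_max)
print(true_max)
```

The text maximum and the parsed maximum differ here, which is why parsing before calling max() is the safer order.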

Exploratory analysis and Visualization

Next, we will be analyzing the data to uncover the following insights:

Top 10 directors on Netflix
Top 5 genres on Netflix
Distribution by Content Types
Top 20 countries for TV shows and movies
Type vs Ratings on Netflix
Number of content titles in the last 10 years
Top 10 actors
Top 5 durations based on the number of seasons in a TV show

Analytics

Top 10 directors on Netflix

# count titles per director, skipping the 'Unavailable' placeholder added during cleaning
Director_count=df.loc[df['director']!='Unavailable','director'].value_counts().head(10).reset_index(name='count')
Director_count.index = range(1,len(Director_count)+1)

plt.figure(figsize = (20,10))
plt.bar(Director_count['director'],Director_count['count'],color="green")
# write each count above its bar
for i,count in enumerate(Director_count['count']):
    plt.text(i,count,str(count),ha='center',va='bottom',fontsize=15)
plt.xticks(rotation=85)
plt.xlabel('Director')
plt.ylabel('Values')
plt.title('Top 10 Directors on Netflix',fontsize =25)
plt.show()
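
The label loop can also be written with matplotlib's built-in bar_label, which places one label per bar. This is a sketch under assumptions: it needs matplotlib 3.4 or newer, and the `counts` frame is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts; in the article these come from value_counts()
counts = pd.DataFrame({"director": ["A", "B", "C"], "count": [5, 3, 2]})

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(counts["director"], counts["count"], color="green")

# bar_label writes each bar's height above it and returns the text objects
labels = ax.bar_label(bars, fontsize=12)
ax.set_xlabel("Director")
ax.set_ylabel("Values")

label_texts = [t.get_text() for t in labels]
print(label_texts)
```

This avoids computing the x position and alignment of every label by hand.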

Top 5 genres on Netflix

plt.figure(figsize=(12,6))
# listed_in holds the full genre combination of each title; plot the five most common
sns.countplot(y='listed_in',order=df['listed_in'].value_counts().index[0:5],data=df)
plt.title('Top 5 genres')
plt.show()
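
Because listed_in stores several genres in one comma-separated string, counting the column directly ranks genre combinations rather than single genres. A possible way to count each genre on its own is sketched below; `count_individual_values` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def count_individual_values(frame, column, sep=","):
    """Count every separated entry of `column` on its own (hypothetical helper)."""
    return (
        frame[column]
        .str.split(sep)   # "Dramas, Comedies" -> ["Dramas", " Comedies"]
        .explode()        # one row per entry
        .str.strip()      # drop the space left after each comma
        .value_counts()
    )

# Made-up rows in the style of listed_in
sample = pd.DataFrame(
    {"listed_in": ["Dramas, Comedies", "Dramas", "Comedies, Dramas, Kids"]}
)
genre_counts = count_individual_values(sample, "listed_in")
print(genre_counts)
```

The same idea would apply to any column that packs several values into one string, such as country or cast.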

Distribution by Content Types

plt.figure(figsize=(12,8))
plt.title("Distribution by Content Types", fontsize=21)
type_counts = df['type'].value_counts()
plt.pie(type_counts, explode=(0.025,0.025), labels=type_counts.index, colors=['red','grey'], autopct='%1.1f%%', startangle=180)
plt.show()
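
The percentages drawn on the pie can also be obtained as numbers, which is handy when quoting them in text. A minimal sketch on a made-up `types` series (the values are illustrative, not the real column):

```python
import pandas as pd

# Hypothetical type column
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of counts
share = (types.value_counts(normalize=True) * 100).round(1)
print(share)
```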

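Top 20 countries for TV shows and movies

# skip the 'Unavailable' placeholder so only real countries are ranked
country_count=df.loc[df['country']!='Unavailable','country'].value_counts().head(20).reset_index(name='count')
country_count.index = range(1,len(country_count)+1)
country_count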

	country	count
1	United States	2818
2	India	972
3	United Kingdom	419
4	Japan	245
5	South Korea	199
6	Canada	181
7	Spain	145
8	France	124
9	Mexico	110
10	Egypt	106
11	Turkey	105
12	Nigeria	95
13	Australia	87
14	Taiwan	81
15	Indonesia	79
16	Brazil	77
17	Philippines	75
18	United Kingdom, United States	75
19	United States, Canada	73
20	Germany	67

plt.figure(figsize = (20,10))
plt.bar(country_count['country'],country_count['count'],color = 'lightblue')
for i,count in enumerate(country_count['count']):
    plt.text(i,count+10,str(count),ha='center',fontsize=13)
plt.xticks(rotation=40)
plt.xlabel('Country')
plt.ylabel('Values')
plt.title('Count by Country',fontsize =25)
plt.show()

Type vs Ratings on Netflix

Rating_count=df['rating'].value_counts().head(10).reset_index(name='count')
Rating_count.index = range(1,len(Rating_count)+1)
Rating_count

	rating	count
1	TV-MA	3207
2	TV-14	2160
3	TV-PG	863
4	R	799
5	PG-13	490
6	TV-Y7	334
7	TV-Y	307
8	PG	287
9	TV-G	220
10	NR	80


plt.figure(figsize=(10,8))
sns.countplot(x='rating',order=df['rating'].value_counts().index[0:10],hue='type',data=df)

plt.xticks(rotation=85)
plt.ylabel('Values')
plt.title('Relation between Type and Rating')
plt.show()
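
The numbers behind a grouped count plot like this one can be tabulated with pd.crosstab. The sketch below uses a made-up `sample` frame, so its counts are illustrative only.

```python
import pandas as pd

# Hypothetical rows with the same two columns used in the plot
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "rating": ["TV-MA", "TV-MA", "R", "TV-14", "TV-MA"],
})

# Rows are ratings, columns are content types, cells are title counts
table = pd.crosstab(sample["rating"], sample["type"])
print(table)
```

Each cell corresponds to the height of one bar in the plot.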

Number of content titles in the last 10 years

plt.figure(figsize=(10,8))
# count titles per release year, then keep the 10 most recent years
netflix_release_year = df['release_year'].value_counts().sort_index().tail(10)
netflix_release_year = netflix_release_year.reset_index()
netflix_release_year.columns = ['release_year','title']
sns.barplot(x = 'release_year',y = 'title', data=netflix_release_year, saturation=.3)
plt.ylabel('Number of titles')
plt.title('The Number of Content Titles in the Last 10 Years', fontsize=21)
plt.show()
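
value_counts() sorts by count, so taking the first rows gives the busiest years, which are not necessarily the latest ones. The sketch below contrasts the two selections on a made-up `years` series.

```python
import pandas as pd

# Hypothetical release years
years = pd.Series([2019, 2019, 2019, 2021, 2020, 2020, 2018], name="release_year")
per_year = years.value_counts()

# Two years with the most titles
by_count = per_year.head(2)
# Two most recent years, regardless of how many titles they have
most_recent = per_year.sort_index().tail(2)

print(by_count)
print(most_recent)
```

Sorting by the index before slicing is what ties the selection to the calendar rather than to the counts.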

Top 10 actors

# each cast entry is the full cast list of a title; skip the 'Unavailable' placeholder
Cast_count=df.loc[df['cast']!='Unavailable','cast'].value_counts().head(10).reset_index(name='count')
Cast_count.index = range(1,len(Cast_count)+1)

plt.figure(figsize = (20,10))
plt.bar(Cast_count['cast'],Cast_count['count'],color = 'yellow')
# write each count above its bar
for i,count in enumerate(Cast_count['count']):
    plt.text(i,count,str(count),ha='center',va='bottom',fontsize=10)
plt.xticks(rotation=90)
plt.xlabel('Cast')
plt.ylabel('Values')
plt.title('Top cast on Netflix',fontsize =25)
plt.show()

Top 5 durations based on the number of seasons in a TV show

plt.figure(figsize=(10,8))
netflix_duration = df['duration'].value_counts()
netflix_duration = pd.DataFrame(netflix_duration).reset_index()
netflix_duration.columns = ['duration','title']
sns.barplot(x = 'duration',y = 'title', data=netflix_duration.head(5), palette="cividis_r")
plt.title('Top 5 Durations Based on The Number of Titles', fontsize=21)
plt.show()
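
The duration column mixes two units, minutes for movies and seasons for TV shows, so one chart ranks both together. A possible way to separate them is sketched below on made-up values; the `durations` series and the column names `value` and `unit` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical duration strings in both styles
durations = pd.Series(["90 min", "1 Season", "2 Seasons", "94 min"], name="duration")

# Capture the number and the unit; "Seasons" is matched by its "Season" prefix
parts = durations.str.extract(r"(?P<value>\d+)\s+(?P<unit>min|Season)")
parts["value"] = parts["value"].astype(int)

seasons = parts.loc[parts["unit"] == "Season", "value"]
minutes = parts.loc[parts["unit"] == "min", "value"]

print(seasons.tolist())
print(minutes.tolist())
```

With the units split, TV shows and movies could each get their own ranking.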

Conclusion

Following are my insights from this Netflix data for 2012-2021:

Rajiv Chilaka is the director with the most titles on Netflix.
Dramas and international movies are the most common genres.
There are roughly twice as many movies as TV shows.
The United States produced the most content on Netflix.
TV-MA is the most common rating.
The highest number of titles was released in 2018.
David Attenborough is the most frequently listed cast entry on Netflix.
Among durations, 1 season is the most common for TV shows, followed by 2 seasons and 3 seasons, and then 90 min and 94 min for movies.

Overall, this data analytics project was fun, instructive and insightful. I'll be sharing my code on GitHub soon, and I'll be posting more projects. Thank you for your time!

Dr. Priyanka Jain
