Netflix data analytics using Python
Netflix is one of the best-known entertainment services. It operates in over 190 countries and offers TV shows, movies and games across a variety of genres. In this project I used Python and Jupyter Notebook to draw some insights about popular Netflix categories. I will walk step by step through data preprocessing, data transformation and analytics. Here we go.
Introduction
The dataset was sourced from Kaggle (link here). It consists of a single file, netflix_titles.csv, with 12 columns ('show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description') that describe each title.
Step 1: Import all necessary libraries and load the dataset netflix_titles.csv
import pandas as pd
import numpy as np
import panel as pn
from datetime import datetime
import hvplot.pandas
import seaborn as sns
import matplotlib.pyplot as plt
# load the dataset
df = pd.read_csv('netflix_titles.csv', skipinitialspace=True)
# shape of dataset
'''After loading the dataset, I inspected the data volume. I used the shape attribute, which returns the number of rows and columns in the dataframe. In this case, the dataframe had 8807 rows and 12 columns.'''
df.shape
(8807, 12)
# Data information
Next, I looked at the structure of the data. I used the info() method to print the column names, non-null counts, dtypes and memory usage of the dataframe.
df.info()
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
# Data Preprocessing: Cleaning
I used isna().sum() to count the missing values in each column.
df.isna().sum()
show_id 0
type 0
title 0
director 2634
cast 825
country 831
date_added 10
release_year 0
rating 4
duration 3
listed_in 0
description 0
dtype: int64
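To put the raw counts in perspective, the same check can be expressed as a share of all rows. For example, 2634 missing directors out of 8807 rows is roughly 29.9%. The sketch below is only an illustration: it runs on a small made-up frame rather than netflix_titles.csv, and the names sample, missing_pct and missing_summary are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for netflix_titles.csv
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Dir 1", np.nan, np.nan, "Dir 2"],
    "country": ["India", "Japan", np.nan, "Spain"],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = (sample.isna().mean() * 100).round(1)

# Keep only the columns that actually have gaps, largest share first
missing_summary = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_summary)
```

Replacing sample with df should give the same kind of summary for the full dataset.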
We can see that the director, cast, country, date_added, rating and duration columns contain null values. I used the fillna() method to fill four of them (rating, cast, country and director) with the placeholder string 'Unavailable'.
df.fillna({'rating':'Unavailable','cast':'Unavailable','country':'Unavailable','director':'Unavailable'},inplace=True)
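For date_added, I filled the null values with the column's maximum value. Note that date_added is still a text column at this point, so max() compares the values as strings rather than as dates.
most_recent_date=df['date_added'].max()
df.fillna({'date_added':most_recent_date},inplace=True)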
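Since info() reports date_added as an object (text) column, one possible refinement is to parse it into real dates before looking for the most recent one. This is a sketch under assumptions, not the code I ran: the sample rows are made up, and the format string "%B %d, %Y" assumes values that look like "September 25, 2021".

```python
import pandas as pd

# Hypothetical sample; one value has a leading space and one is missing
sample = pd.DataFrame({
    "date_added": ["September 25, 2019", " August 4, 2017", None, "January 1, 2021"],
})

# On raw strings, max() is alphabetical, so "September ..." beats "January ..."
text_max = sample["date_added"].dropna().max()

# Strip stray whitespace, then parse; values that do not match become NaT
parsed = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# On parsed dates, max() is chronological
most_recent = parsed.max()
sample["date_added"] = parsed.fillna(most_recent)
print(text_max, "vs", most_recent.date())
```

In this sample the string maximum is a 2019 date while the chronological maximum is in 2021, which is why the parsing step can matter.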
Exploratory analysis and Visualization
Next, we will be analyzing the data to uncover the following insights:
Analytics
Top 10 directors on Netflix
# Count titles per director, skipping the 'Unavailable' placeholder added during cleaning
Director_count=df.loc[df['director']!='Unavailable','director'].value_counts().head(10).reset_index(name='count')
Director_count.index = range(1,len(Director_count)+1)
plt.figure(figsize = (20,10))
plt.bar(Director_count['director'],Director_count['count'],color="green")
# Write each count on top of its bar
for i,count in enumerate(Director_count['count']):
    plt.text(i,count,str(count),ha='left',fontsize=15)
plt.xticks(rotation=85)
plt.xlabel('Director')
plt.ylabel('Values')
plt.title('Top Directors on Netflix',fontsize =25)
plt.show()
Top 5 genres on Netflix
plt.figure(figsize=(12,6))
# listed_in holds the genre labels of each title; the plot ranks its five most common values
sns.countplot(y='listed_in',order=df['listed_in'].value_counts().index[0:5],data=df)
plt.title('Top 5 genres')
plt.show()
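One thing to keep in mind is that a single listed_in value can hold several comma-separated genres, so the plot above ranks genre combinations. If individual genres are of interest, the column can be split first. The following is a minimal sketch on made-up rows; sample, genres and genre_counts are hypothetical names.

```python
import pandas as pd

# Hypothetical rows mimicking the comma-separated 'listed_in' column
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "Documentaries",
        "Dramas, Comedies",
    ],
})

# Split each string into a list, give every genre its own row, trim spaces
genres = sample["listed_in"].str.split(",").explode().str.strip()

# Count how many titles carry each individual genre
genre_counts = genres.value_counts()
print(genre_counts.head(5))
```

The resulting counts could then be passed to a bar plot in place of the raw listed_in values.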
Distribution by Content Types
plt.figure(figsize=(12,8))
plt.title("Distribution by Content Types ", fontsize=21)
p = plt.pie(df.type.value_counts(),explode=(0.025,0.025), labels=df.type.value_counts().index, colors=['red','grey'], autopct='%1.1f%%', startangle=180)
plt.show()
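Top 20 countries by number of TV shows and movies
# Skip the 'Unavailable' placeholder added during cleaning so it is not counted as a country
country_count=df.loc[df['country']!='Unavailable','country'].value_counts().head(20).reset_index(name='count')
country_count.index = range(1,len(country_count)+1)
country_count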
country count
1 United States 2818
2 India 972
3 United Kingdom 419
4 Japan 245
5 South Korea 199
6 Canada 181
7 Spain 145
8 France 124
9 Mexico 110
10 Egypt 106
11 Turkey 105
12 Nigeria 95
13 Australia 87
14 Taiwan 81
15 Indonesia 79
16 Brazil 77
17 Philippines 75
18 United Kingdom, United States 75
19 United States, Canada 73
20 Germany 67
plt.figure(figsize = (20,10))
plt.bar(country_count['country'],country_count['count'],color = 'lightblue')
for i,count in enumerate(country_count['count']):
    plt.text(i,count+10,str(count),ha='center',fontsize=13)
plt.xticks(rotation=40)
plt.xlabel('Country')
plt.ylabel('Values')
plt.title('Count by Country',fontsize =25)
plt.show()
Type vs Ratings on Netflix
Rating_count=df['rating'].value_counts().head(10).reset_index(name='count')
Rating_count.index = range(1,len(Rating_count)+1)
Rating_count
rating count
1 TV-MA 3207
2 TV-14 2160
3 TV-PG 863
4 R 799
5 PG-13 490
6 TV-Y7 334
7 TV-Y 307
8 PG 287
9 TV-G 220
10 NR 80
plt.figure(figsize=(10,8))
sns.countplot(x='rating',order=df['rating'].value_counts().index[0:10],hue='type',data=df)
plt.xticks(rotation=85)
plt.ylabel('Values')
plt.title('Relation between Type and Rating')
plt.show()
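The numbers behind a grouped count plot like this can also be read as a table. A minimal sketch using pd.crosstab on made-up rows is shown below; sample and rating_by_type are hypothetical names, and the values are not from the real dataset.

```python
import pandas as pd

# Hypothetical rows with the same two columns the plot uses
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["TV-MA", "R", "TV-MA", "TV-14", "TV-MA"],
})

# Rows are ratings, columns are content types, cells are title counts
rating_by_type = pd.crosstab(sample["rating"], sample["type"])
print(rating_by_type)
```

Combinations that never occur, such as an R-rated TV show in this sample, appear as 0 in the table.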
Number of content titles in the last 10 years
plt.figure(figsize=(10,8))
# Count titles per release year, then keep the 10 most recent years
netflix_release_year = df['release_year'].value_counts().sort_index().tail(10)
netflix_release_year = pd.DataFrame(netflix_release_year).reset_index()
netflix_release_year.columns = ['release_year','title']
sns.barplot(x = 'release_year',y = 'title', data=netflix_release_year, saturation=.3)
plt.title('The Number of Content Titles in the Last 10 Years', fontsize=21)
plt.show()
Top 10 cast
# Each 'cast' value is the full comma-separated cast list of one title, so this counts
# identical cast lists rather than individual actors; the 'Unavailable' placeholder is skipped
Cast_count=df.loc[df['cast']!='Unavailable','cast'].value_counts().head(10).reset_index(name='count')
Cast_count.index = range(1,len(Cast_count)+1)
plt.figure(figsize = (20,10))
plt.bar(Cast_count['cast'],Cast_count['count'],color = 'yellow')
for i,count in enumerate(Cast_count['count']):
    plt.text(i,count,str(count),ha='left',fontsize=10)
plt.xticks(rotation=90)
plt.xlabel('Cast')
plt.ylabel('Values')
plt.title('Top cast on Netflix',fontsize =25)
plt.show()
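To rank individual actors instead of whole cast strings, each cast value can be split on commas and tallied name by name. The sketch below uses collections.Counter on made-up rows; it assumes missing casts were filled with the 'Unavailable' placeholder, and the names sample, actor_counter and top_actors are hypothetical.

```python
from collections import Counter

import pandas as pd

# Hypothetical rows; 'Unavailable' is the placeholder used during cleaning
sample = pd.DataFrame({
    "cast": ["Actor A, Actor B", "Actor B, Actor C", "Unavailable", "Actor B"],
})

actor_counter = Counter()
for cast_list in sample["cast"]:
    # Skip titles whose cast was missing in the source data
    if cast_list == "Unavailable":
        continue
    # Tally every name in the comma-separated cast string
    actor_counter.update(name.strip() for name in cast_list.split(","))

# The ten most frequent actors as (name, count) pairs
top_actors = actor_counter.most_common(10)
print(top_actors)
```

The list of pairs can be turned back into a dataframe with pd.DataFrame(top_actors, columns=['actor', 'count']) if it needs to be plotted.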
Top 5 durations by number of titles
plt.figure(figsize=(10,8))
# Count how many titles share each duration value (seasons for TV shows, minutes for movies)
netflix_duration = df['duration'].value_counts()
netflix_duration = pd.DataFrame(netflix_duration).reset_index()
netflix_duration.columns = ['duration','title']
sns.barplot(x = 'duration',y = 'title', data=netflix_duration.head(5), palette="cividis_r")
plt.title('Top 5 Durations Based on The Number of Titles', fontsize=21)
plt.show()
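Because duration mixes minutes for movies and seasons for TV shows in one text column, it can help to extract the number and look at the two content types separately. This is a rough sketch on made-up rows: the duration_value column and the other names are my own, and rows with a missing duration would come out as NaN.

```python
import pandas as pd

# Hypothetical rows with the two kinds of duration strings
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "1 Season", "2 Seasons", "125 min", "1 Season"],
})

# Pull the leading number out of strings such as "90 min" or "2 Seasons"
sample["duration_value"] = pd.to_numeric(
    sample["duration"].str.extract(r"(\d+)", expand=False)
)

# Number of TV shows for each season count
shows = sample[sample["type"] == "TV Show"]
season_counts = shows["duration_value"].value_counts().sort_index()

# Average movie length in minutes
avg_movie_minutes = sample.loc[sample["type"] == "Movie", "duration_value"].mean()
print(season_counts)
print(avg_movie_minutes)
```

Splitting by type this way keeps season counts and running times from being ranked against each other.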
Conclusion
The following are my insights from this Netflix data for 2012-2021
Overall, this data analytics project was fun, instructive and insightful. I'll be sharing my code on GitHub soon, and I'll be posting more projects as well. Thank you for your time!