Netflix data analytics using Python


Netflix is one of the most famous entertainment services. It operates in over 190 countries and offers a variety of genres, games, TV shows and movies. In this project I drew some insights on popular Netflix categories using Python and Jupyter Notebook. I will share step-by-step details on data preprocessing, data transformation and analytics using Python. Here we go.


Introduction

The dataset was sourced from Kaggle (link here). It consists of a single file, netflix_titles.csv, with 12 columns ('show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description') that describe each title.

Step 1: Import all necessary libraries and load the dataset netflix_titles.csv

import pandas as pd
import numpy as np
import panel as pn
from datetime import datetime 
import hvplot.pandas
import seaborn as sns
import matplotlib.pyplot as plt

# loading the dataset
df = pd.read_csv('netflix_titles.csv', skipinitialspace=True)

# shape of dataset

After loading the dataset, I inspected the data volume. I used the shape attribute, which returns the number of rows and columns in the dataframe. In this case, the dataframe had 8807 rows and 12 columns.

df.shape
(8807, 12)        

# Data information

Next, I peeked at the structure of the data. I used the info() function to print the dtypes, columns, non-null counts and memory usage of the dataframe.

df.info()
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB        

# Data Preprocessing: Cleaning

I used isna().sum() to count the missing values in each column of this dataset.

df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64        
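
To put these counts in proportion, the share of missing values per column can be computed as well. Below is a minimal sketch on a tiny made-up frame (the rows are invented for illustration and are not taken from netflix_titles.csv). It relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import pandas as pd

# Tiny made-up sample; the values are illustrative, not from the real dataset.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["TV-MA", "PG", None, "R"],
})

# isna() marks missing cells as True; mean() of booleans is the missing fraction.
missing_pct = (sample.isna().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Applied to the real dataframe, the same expression would be df.isna().mean() * 100.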

We can see that the director, cast, country, date_added, rating and duration columns contain null values. So, I used the fillna() function to fill four of these columns (rating, cast, country and director) with the placeholder string 'Unavailable'.

df.fillna({'rating':'Unavailable','cast':'Unavailable','country':'Unavailable','director':'Unavailable'},inplace=True)        

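For date_added, I filled the null values with the most recent date. The column is stored as text, so it is converted to datetime first; otherwise max() would compare the strings alphabetically.

# Parse the text dates so that max() returns the latest date
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
most_recent_date = df['date_added'].max()
df.fillna({'date_added': most_recent_date}, inplace=True)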
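
One thing to watch when taking the maximum of a date column stored as text: strings compare alphabetically, not chronologically. The sketch below uses three made-up dates in a "Month day, Year" style (an assumption about the text format) to show how the two results can differ.

```python
import pandas as pd

# Made-up dates for illustration only, written as "Month day, Year" text.
dates = pd.Series(["September 9, 2019", "January 1, 2021", "March 3, 2020"])

# As text, max() picks the alphabetically last string.
text_max = dates.max()

# Parsed as datetimes, max() picks the chronologically latest date.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")
date_max = parsed.max()

print(text_max)         # the "September" string wins alphabetically
print(date_max.date())  # the 2021 date wins chronologically
```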

Exploratory analysis and Visualization

Next, we will be analyzing the data to uncover the following insights:

  1. Top 10 directors on Netflix
  2. Top 5 genres on Netflix
  3. Distribution by Content Types
  4. Top 10 countries for TV shows and movies
  5. Type vs Ratings on Netflix
  6. Number of content titles in the last 10 years
  7. Top 10 actors
  8. Top 5 durations based on the number of seasons in a TV show



Analytics

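Top 10 directors on Netflix

# Count titles per director, skipping the 'Unavailable' placeholder filled in earlier
known_directors = df.loc[df['director'] != 'Unavailable', 'director']
Director_count = known_directors.value_counts().head(10).reset_index(name='count')
Director_count.index = range(1, len(Director_count) + 1)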

plt.figure(figsize = (20,10))
plt.bar(Director_count['director'],Director_count['count'],color="green")
for i,count in enumerate(Director_count['count']):
    plt.text(i,count,str(count),ha='left',fontsize=15)
plt.xticks(rotation=85)
plt.xlabel('Director')
plt.ylabel('Values')
plt.title('Top Directors on Netflix',fontsize =25)
plt.show()

Rajiv Chilaka has the most titles, followed by Raúl Campos and Jan Suter (a joint credit), Marcus Raboy and Suhas Kadav.

Top 5 genres on Netflix

plt.figure(figsize=(12,6))
# Bars are ordered by the 5 most frequent listed_in values
sns.countplot(y='listed_in',order=df['listed_in'].value_counts().index[0:5],data=df)
plt.title('Top 5 genres')
plt.show()

International Movies and Dramas are the most common genre combination, followed by Documentaries and Stand-Up Comedy.

Distribution by Content Types

plt.figure(figsize=(12,8))
plt.title("Distribution by Content Types", fontsize=21)
# Count titles per type once and reuse it for both the wedge sizes and the labels
type_counts = df['type'].value_counts()
plt.pie(type_counts, explode=(0.025,0.025), labels=type_counts.index, colors=['red','grey'], autopct='%1.1f%%', startangle=180)
plt.show()

Netflix offers more movies than TV shows.
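
The percentages shown on the pie chart can also be read as plain numbers. Here is a minimal sketch, using a made-up 'type' column rather than the real data, that relies on value_counts(normalize=True) to return fractions instead of counts.

```python
import pandas as pd

# Made-up 'type' column for illustration only.
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns each category's fraction of the total.
share = (types.value_counts(normalize=True) * 100).round(1)
print(share)
```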

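Top 10 countries for TV shows and movies

# 20 most frequent country entries, skipping the 'Unavailable' placeholder filled in earlier
known_countries = df.loc[df['country'] != 'Unavailable', 'country']
country_count = known_countries.value_counts().head(20).reset_index(name='count')
country_count.index = range(1, len(country_count) + 1)
country_count

	country	count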
1	United States	2818
2	India	972
3	United Kingdom	419
4	Japan	245
5	South Korea	199
6	Canada	181
7	Spain	145
8	France	124
9	Mexico	110
10	Egypt	106
11	Turkey	105
12	Nigeria	95
13	Australia	87
14	Taiwan	81
15	Indonesia	79
16	Brazil	77
17	Philippines	75
18	United Kingdom, United States	75
19	United States, Canada	73
20	Germany	67

plt.figure(figsize = (20,10))
plt.bar(country_count['country'],country_count['count'],color = 'lightblue')
for i,count in enumerate(country_count['count']):
    plt.text(i,count+10,str(count),ha='center',fontsize=13)
plt.xticks(rotation=40)
plt.xlabel('Country')
plt.ylabel('Values')
plt.title('Count by Country',fontsize =25)
plt.show()        

The United States (2818) produces the most content, followed by India (972), the United Kingdom (419), Japan (245) and South Korea (199).
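
The table above also contains combined entries such as "United Kingdom, United States", which value_counts() treats as a separate value. One possible way to count each country on its own is to split the text on commas and give every item its own row. The helper below, count_individual_values, is a hypothetical name introduced for this sketch, and the sample rows are made up.

```python
import pandas as pd

def count_individual_values(frame, column, sep=","):
    """Count each item of a delimited text column separately (hypothetical helper)."""
    items = (
        frame[column]
        .dropna()
        .str.split(sep)   # "A, B" -> ["A", " B"]
        .explode()        # one row per list item
        .str.strip()      # drop the whitespace left around each item
    )
    return items[items != ""].value_counts()

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "country": ["United States", "United Kingdom, United States", "India", None],
})
counts = count_individual_values(sample, "country")
print(counts)
```

The same idea should carry over to other comma-separated columns such as cast and listed_in. Any placeholder strings like 'Unavailable' would need to be filtered out first.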

Type vs Ratings on Netflix

# Ten most common ratings, indexed from 1
Rating_count = df['rating'].value_counts().head(10).reset_index(name='count')
Rating_count.index = range(1, len(Rating_count) + 1)
Rating_count

	rating	count
1	TV-MA	3207
2	TV-14	2160
3	TV-PG	863
4	R	799
5	PG-13	490
6	TV-Y7	334
7	TV-Y	307
8	PG	287
9	TV-G	220
10	NR	80


plt.figure(figsize=(10,8))
sns.countplot(x='rating',order=df['rating'].value_counts().index[0:10],hue='type',data=df)

plt.xticks(rotation=85)
plt.ylabel('Values')
plt.title('Relation between Type and Rating')
plt.show()        

TV-MA is the most common rating, followed by TV-14 and TV-PG.
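
The numbers behind a rating-by-type chart can be tabulated with pd.crosstab. This is a small sketch on made-up rows, not the real data, showing ratings as rows and content types as columns.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "rating": ["R", "TV-MA", "TV-MA", "TV-14", "R"],
})

# Rows are ratings, columns are content types, cells are title counts.
table = pd.crosstab(sample["rating"], sample["type"])
print(table)
```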

Number of content titles in the last 10 years

plt.figure(figsize=(10,8))
# Count titles per release year, then keep the 10 most recent years
netflix_release_year = df['release_year'].value_counts().sort_index().tail(10)
netflix_release_year = pd.DataFrame(netflix_release_year).reset_index()
netflix_release_year.columns = ['release_year','title']
sns.barplot(x = 'release_year',y = 'title', data=netflix_release_year, saturation=.3)
plt.title('The Number of Content Titles in the Last 10 Years', fontsize=21)
plt.show()

The number of titles rose steadily from 2012, peaked in 2018, and declined afterwards.
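
A note on selecting "the last 10 years": value_counts() orders its result by frequency, not by year, so taking the first rows returns the busiest years rather than the most recent ones. The sketch below uses made-up release years to show the difference; sorting by the index before slicing is one way to get the most recent years.

```python
import pandas as pd

# Made-up release years for illustration only.
years = pd.Series([2019, 2021, 2020, 2021, 1995, 1995, 1995, 2020, 2021])

# value_counts() orders by frequency, so an old but busy year can rank first.
by_frequency = years.value_counts().head(3)

# Sorting by year first makes tail() return the most recent years.
per_year = years.value_counts().sort_index()
last_three = per_year.tail(3)

print(by_frequency)
print(last_three)
```

tail() returns the most recent years that actually appear in the data, which is not necessarily a run of consecutive calendar years.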

Top 10 actors

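# Count titles per cast entry, skipping the 'Unavailable' placeholder filled in earlier
known_cast = df.loc[df['cast'] != 'Unavailable', 'cast']
Cast_count = known_cast.value_counts().head(10).reset_index(name='count')
Cast_count.index = range(1, len(Cast_count) + 1)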

plt.figure(figsize = (20,10))
plt.bar(Cast_count['cast'],Cast_count['count'],color = 'yellow')
for i,count in enumerate(Cast_count['count']):
    plt.text(i,count,str(count),ha='left',fontsize=10)
plt.xticks(rotation=90)
plt.xlabel('Cast')
plt.ylabel('Values')
plt.title('Top cast on Netflix',fontsize =25)
plt.show()        
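
David Attenborough is the most frequent entry in the cast column. Note that value_counts() counts each full cast list as one entry, so this ranks cast lists rather than individual actors.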

Top 5 durations based on the number of seasons in a TV show

plt.figure(figsize=(10,8))
netflix_duration = df['duration'].value_counts()
netflix_duration = pd.DataFrame(netflix_duration).reset_index()
netflix_duration.columns = ['duration','title']
sns.barplot(x = 'duration',y = 'title', data=netflix_duration.head(5), palette="cividis_r")
plt.title('Top 5 Durations Based on The Number of Titles', fontsize=21)
plt.show()

TV shows with just 1 season are the most common entry.
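
The duration column mixes two units: minutes for movies and seasons for TV shows. To analyse them separately, the text can be split into a number and a unit. This is a minimal sketch on made-up rows; it assumes every value looks like "90 min" or "2 Seasons" and that rows with a missing duration have been dropped beforehand.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "1 Season", "94 min", "2 Seasons"],
})

# Split "90 min" / "2 Seasons" into a numeric value and a unit.
parts = sample["duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>\w+)")
sample["value"] = parts["value"].astype(int)
sample["unit"] = parts["unit"]

# Movies are measured in minutes, TV shows in seasons.
movies = sample[sample["type"] == "Movie"]
shows = sample[sample["type"] == "TV Show"]

print(movies["value"].mean())          # average movie length in minutes
print(shows["value"].value_counts())   # number of shows per season count
```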

Conclusion

The following are my insights from this Netflix data of 2012-2021:

  1. Rajiv Chilaka is the director with the most titles on Netflix.
  2. Dramas and International Movies form the most common genre combination.
  3. There are about twice as many movies as TV shows.
  4. The United States produced the most content on Netflix.
  5. TV-MA is the most common rating.
  6. The year 2018 has the highest number of titles.
  7. David Attenborough is the most frequent entry in the cast column.
  8. A duration of 1 season is the most common, followed by 2 seasons and 3 seasons, and then 90-minute and 94-minute movies.

Overall, this data analytics project was a lot of fun and a great learning experience. I'll be sharing my code on GitHub and posting more projects soon. Thank you for your time!



More articles by Dr. Priyanka Jain
