An attempt at analyzing Netflix data

Sufyan Khot
4 min read · Feb 12, 2020

The data set that I have attempted to analyze has the following features:

  1. show_id: A unique ID given to each movie/TV show
  2. type: The type is either “Movie” or “TV Show”
  3. title: The title of the show
  4. director: Directors of the movie/TV show (comma separated)
  5. cast: Entire cast of the show/movie (comma separated)
  6. country: The countries where the show/movie was released (comma separated)
  7. date_added: This is the date the movie or TV show was added on Netflix
  8. release_year: The year of release of the movie/TV show
  9. rating: TV rating
  10. duration: Duration in minutes for movies, or the number of seasons for TV shows
  11. listed_in: The possible genres (comma separated)
  12. description

The following image is a snapshot of the data set:

A snapshot of the data set

First, let's find the proportion of movies and TV shows.

Implementing a pie chart to show the proportions of movies and TV shows
Pie chart showing the proportion of movies and TV shows on Netflix
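The proportions behind such a pie chart can be computed directly from the type column. Below is a minimal sketch, assuming pandas; the helper name `type_proportions` and the toy rows are made up for illustration and are not taken from the original code.

```python
import pandas as pd

def type_proportions(df):
    """Return the share of each value in the "type" column as fractions of 1."""
    return df["type"].value_counts(normalize=True)

# Toy rows (hypothetical) standing in for the real data set
toy = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})
proportions = type_proportions(toy)
print(proportions)
```

With matplotlib installed, `proportions.plot.pie(autopct="%1.1f%%")` should draw these shares as a pie chart.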

There are 4 columns which have comma-separated values: director, cast, country and listed_in. To process these columns, a function is implemented which turns each distinct text value into a separate column of 1's and 0's. For example, if a movie has the value "Children & Family Movies, Comedies", the function creates 2 columns, "Children & Family Movies" and "Comedies", both holding a 1 in this movie's row. The function takes only one parameter, which is the column's name.

Function to split each text value as a column

The above function returns a sorted dataframe holding the count of 1's in each new column. This gives the number of occurrences of each value in the entire original column.
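Since the function is not reproduced in the text here, the following is a minimal sketch of what such a helper could look like, assuming pandas. The name `count_split_values`, the extra `df` parameter, and the output column names `value` and `count` are hypothetical choices for this illustration; the author's actual implementation is in the GitHub repository linked at the end.

```python
import pandas as pd

def count_split_values(df, column):
    """Count how many rows mention each comma-separated value in `column`."""
    # One row per individual value, with surrounding whitespace removed
    values = df[column].dropna().str.split(",").explode().str.strip()
    values = values[values != ""]
    # One 0/1 column per distinct value; collapse back to one row per title
    indicators = pd.get_dummies(values, dtype=int).groupby(level=0).max()
    # Sum the 1's in each column and sort from most to least frequent
    counts = indicators.sum().sort_values(ascending=False)
    return counts.rename_axis("value").reset_index(name="count")
```

Under these assumptions, `count_split_values(df, "director").head(5)` would give the five most frequent directors, and the same call with "listed_in" or "country" would give the top genres or countries.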

For example, applying this function to the director column gives the number of occurrences of each director in that column, that is, the number of movies or TV shows by each director hosted on Netflix. Below are the top 5 directors by number of movies or TV shows on Netflix.

Finding the top 5 directors according to the number of TV shows/movies directed

Below are the results:

Top 5 directors

Similarly, plotting the top 10 genres:

Below are the results:

Bar plot of top genres according to movies/TV shows count

Top countries according to number of movies/TV shows released:

Top countries according to number of movies/TV shows released

To plot the distribution of movie durations, I first converted the duration strings (in minutes) to seconds.

We can see that most movie durations fall in the range of 5000 secs to 7500 secs (approximately 1 hr 23 mins to 2 hrs 5 mins).
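The conversion step can be sketched as follows, assuming pandas and assuming movie durations are stored as strings such as "90 min" (the exact string format is an assumption here). The helper name `movie_durations_in_seconds` is hypothetical.

```python
import pandas as pd

def movie_durations_in_seconds(df):
    """Return movie durations in seconds, parsed from strings like "90 min"."""
    movies = df[df["type"] == "Movie"]
    # Pull out the leading number of minutes; rows that do not match become NaN
    minutes = movies["duration"].str.extract(r"(\d+)\s*min", expand=False)
    minutes = pd.to_numeric(minutes, errors="coerce").dropna().astype(int)
    return minutes * 60
```

The resulting series can then be passed to any histogram or density plot to see the distribution.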

Plotting the count of TV shows by number of seasons gives an idea of how many seasons the hosted TV shows have.

Most of the TV shows have only one or two seasons, while some have up to 15 seasons.

Plotting the number of movies/TV shows hosted as per their release year

Time plot of movies according to release year

Plotting the number of movies and TV shows added over the years can give insight into whether more movies or TV shows have been added in recent years.

Graph of number of movies and TV shows added over the years

We can see in the graph that there is an upward trend for both movies and TV shows.
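The counts behind such a graph can be sketched as below, assuming pandas and assuming date_added holds strings like "January 1, 2019" (the date format is an assumption, exposed as a parameter). The helper name `titles_added_per_year` and the index name `year_added` are hypothetical.

```python
import pandas as pd

def titles_added_per_year(df, date_format="%B %d, %Y"):
    """Count titles added per year, with one column per type."""
    # Parse the dates; anything that does not match the format becomes NaT
    added = pd.to_datetime(
        df["date_added"].str.strip(), format=date_format, errors="coerce"
    )
    valid = added.notna()
    years = added[valid].dt.year.rename("year_added")
    # Rows are years, columns are "Movie" / "TV Show", cells are counts
    return df.loc[valid].groupby([years, "type"]).size().unstack(fill_value=0)
```

Calling `.plot()` on the result (with matplotlib installed) should draw one line per type across the years.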

The data set can be further joined with the IMDb data set to give more insights.

GitHub link for code and data: https://github.com/Sufi737/Netflix-data-analysis
