An attempt at analyzing Netflix data
The data set that I have attempted to analyze has the following features:
- show_id: A unique ID given to each movie/TV show
- type: The type is either “Movie” or “TV Show”
- title: The title of the show
- director: Directors of the movie/TV show (comma separated)
- cast: Entire cast of the show/movie (comma separated)
- country: The countries where the show/movie was released (comma separated)
- date_added: This is the date the movie or TV show was added on Netflix
- release_year: The year of release of the movie/TV show
- rating: TV rating
- duration: Duration in minutes for movies, or the number of seasons for TV shows
- listed_in: The possible genres (comma separated)
- description: A short description of the movie/TV show
The following image is a snapshot of the data set:
First, finding the proportion of movies and TV shows:
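As a minimal sketch of this step, the proportion can be computed with `value_counts`. The small inline DataFrame below is a made-up stand-in for the real data, which would normally be loaded from the CSV file.

```python
import pandas as pd

# Made-up stand-in for the Netflix data set (assumption, not the real data).
df = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# normalize=True returns fractions of the total instead of raw counts.
type_share = df["type"].value_counts(normalize=True)
print(type_share)
```

The resulting series can be passed to a pie or bar plot to visualize the split.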
There are 4 columns which have comma-separated values: director, cast, country and listed_in. To process these columns, a function is implemented which splits each text value into separate columns of 1’s and 0’s. For example, if a movie has the value “ Children & Family Movies, Comedies”, it will make 2 columns, “Children & Family Movies” and “Comedies”, both having 1’s in this movie’s row. The function takes only one parameter, which is the column’s name.
The function returns a sorted dataframe which has the count of 1’s in each column. Thus we get the number of occurrences of each value in the entire original column.
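One possible way to write such a function is sketched below. The name `count_values` and the sample rows are hypothetical, and unlike the function described above it takes the dataframe as an explicit argument and returns a sorted Series of counts, so that the example is self-contained.

```python
import pandas as pd

# Made-up sample rows (assumption); note the leading space and the missing value.
df = pd.DataFrame({
    "listed_in": [
        " Children & Family Movies, Comedies",
        "Comedies, Dramas",
        None,
        "Dramas",
        "Comedies",
    ]
})

def count_values(frame, column):
    """Count how often each comma-separated value occurs in `column`."""
    # Split on commas, strip stray whitespace, and re-join with "|" so that
    # str.get_dummies can build one 0/1 indicator column per distinct value.
    cleaned = frame[column].fillna("").str.split(",").apply(
        lambda parts: "|".join(p.strip() for p in parts if p.strip())
    )
    indicators = cleaned.str.get_dummies(sep="|")
    # Summing each indicator column gives the number of rows containing that value.
    return indicators.sum().sort_values(ascending=False)

genre_counts = count_values(df, "listed_in")
print(genre_counts.head(10))
```

Calling `.head(5)` or `.head(10)` on the result gives the top entries for any of the four columns. This sketch assumes that no value contains the `|` character.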
For example, applying this function to the director column gives the number of occurrences of each director in that column, which is the number of movies or TV shows by each director hosted on Netflix. Below are the top 5 directors by the number of their movies or TV shows on Netflix.
Below are the results:
Similarly, plotting the top 10 genres:
Below are the results:
Top countries according to number of movies/TV shows released:
To plot the distribution of movie durations, the duration values (strings, in minutes) were first converted to seconds.
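A rough sketch of that conversion follows. It assumes that movie durations are stored as strings such as "90 min"; the sample rows and the `duration_secs` column name are made up for illustration.

```python
import pandas as pd

# Made-up sample rows; the "90 min" string format is an assumption.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "125 min"],
})

# Keep only movies, since TV show durations are not measured in minutes.
movies = df[df["type"] == "Movie"].copy()

# Extract the leading number from each string and convert minutes to seconds.
minutes = movies["duration"].str.extract(r"(\d+)", expand=False).astype(int)
movies["duration_secs"] = minutes * 60
print(movies)
```

A histogram of `duration_secs` then shows the distribution.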
We can see that most movies have a running time in the range of 5000 to 7500 seconds (approximately 1 hr 23 min to 2 hr 5 min).
Plotting the count of TV shows by number of seasons gives an idea of how many seasons the hosted TV shows have:
Most of the TV shows have only one or two seasons, while some have up to 15 seasons.
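The season counts can be obtained along the following lines. This is only a sketch: it assumes the duration of a TV show is stored as a string such as "2 Seasons", and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; the "2 Seasons" string format is an assumption.
df = pd.DataFrame({
    "type": ["TV Show", "TV Show", "TV Show", "Movie"],
    "duration": ["1 Season", "2 Seasons", "1 Season", "95 min"],
})

shows = df[df["type"] == "TV Show"]

# Extract the number of seasons, then count how many shows have each number.
seasons = shows["duration"].str.extract(r"(\d+)", expand=False).astype(int)
season_counts = seasons.value_counts().sort_index()
print(season_counts)
```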
Plotting the number of movies/TV shows hosted, by release year:
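A minimal sketch of the counts behind such a plot, using made-up sample rows:

```python
import pandas as pd

# Made-up sample rows (assumption, not the real data).
df = pd.DataFrame({"release_year": [2018, 2019, 2019, 2015, 2019]})

# Count titles per release year and order the result chronologically.
release_counts = df["release_year"].value_counts().sort_index()
print(release_counts)
```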
Plotting the number of movies and TV shows added over the years can give insight into whether more movies or more TV shows have been added in recent years.
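One way to get the yearly counts per type is sketched below. It assumes that date_added is stored as text such as "September 9, 2018"; the sample rows and the `year_added` column name are made up for illustration.

```python
import pandas as pd

# Made-up sample rows; the "Month day, year" date format is an assumption.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["September 9, 2018", "January 1, 2019", " March 15, 2019", None],
})

# Parse the dates; values that are missing or do not match become NaT.
df["date_added"] = pd.to_datetime(
    df["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Drop rows without a usable date and derive the year each title was added.
dated = df.dropna(subset=["date_added"]).copy()
dated["year_added"] = dated["date_added"].dt.year

# One row per year, one column per type, filled with the number of titles added.
added_per_year = dated.groupby(["year_added", "type"]).size().unstack(fill_value=0)
print(added_per_year)
```

Plotting `added_per_year` as a line chart gives one line for movies and one for TV shows.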
We can see in the graph that there is an upward trend for both movies and TV shows.
The data set can further be joined with the IMDb data set to give more insights.
Github link for code and data: https://github.com/Sufi737/Netflix-data-analysis