When I first explored the idea of scraping Netflix’s movie catalog, I expected it to be a challenge. I soon realized that Netflix, like Hulu, goes to great lengths to keep its data from being scraped. It is still possible, but I decided to look for an easier route. I found a website called Finder.com that lists all of the movies currently available on Netflix, and its data was relatively easy to obtain using R’s rvest package. It took a bit of cleaning, but I was able to create a data frame containing the title, year, and genre of every movie listed. The accuracy of this analysis therefore depends on the reliability of that source. Still, we can see that the most common genres are “Action & Adventure”, “Comedies” and “Dramas”. This matches my expectations, although I was surprised to see that there are more Bollywood movies than horror films.
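As a rough sketch of the scraping and genre-count steps described above: the post does not show the real page's markup, so the table layout, the column names and the movie rows below are made-up stand-ins, not Finder.com's actual structure. The example parses an inline HTML snippet so it runs without a network connection; for a live page, `read_html()` with the page's URL would take the place of `minimal_html()`.

```r
library(rvest)
library(dplyr)

# Hypothetical stand-in for the downloaded page. The real page's structure
# is an assumption; the titles and genres here are invented for illustration.
page <- minimal_html('
  <table>
    <tr><th>Title</th><th>Year</th><th>Genre</th></tr>
    <tr><td>Movie A</td><td>2015</td><td>Dramas</td></tr>
    <tr><td>Movie B</td><td>2017</td><td>Comedies</td></tr>
    <tr><td>Movie C</td><td>2012</td><td>Dramas</td></tr>
    <tr><td>Movie D</td><td>2018</td><td>Action &amp; Adventure</td></tr>
  </table>')

# Pull the first table out of the page and tidy the column names.
movies <- page %>%
  html_element("table") %>%
  html_table() %>%
  rename(title = Title, year = Year, genre = Genre) %>%
  mutate(title = trimws(title))

# Count movies per genre, most common first.
genre_counts <- movies %>%
  count(genre, sort = TRUE)
```

On a real listing the cleaning step would likely be longer (stray whitespace, footnote markers, multiple genres per row), which is where most of the effort tends to go.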
Next, I wanted to gauge the quality of Netflix films. I personally check IMDb before watching a film and tend to avoid anything rated under 7.0. I had already downloaded the IMDb dataset for a previous project, so it was just a matter of combining the two data sets with an inner join on the title and year columns. Not every Netflix film found a match in the join, but enough did to continue with the analysis given how low the stakes are. I created a new boolean variable using the mutate() function to flag whether a film was rated 7.0 or higher. After that, I summarized the data to count the films in each group.
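The join, flag and summary steps might look roughly like this with dplyr. This is a minimal sketch on toy data: the data frame names (`netflix`, `imdb`), the `rating` and `above_7` column names, and all rows are assumptions for illustration, not the actual data sets.

```r
library(dplyr)

# Toy stand-ins for the scraped Netflix list and the IMDb dataset.
netflix <- tibble(
  title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  year  = c(2015L, 2017L, 2012L, 2018L),
  genre = c("Dramas", "Comedies", "Dramas", "Action & Adventure")
)
imdb <- tibble(
  title  = c("Movie A", "Movie B", "Movie C", "Movie E"),
  year   = c(2015L, 2017L, 2012L, 2010L),
  rating = c(7.4, 6.1, 7.0, 8.2)
)

# Keep only films present in both tables, then flag ratings of 7.0 or higher.
rated <- netflix %>%
  inner_join(imdb, by = c("title", "year")) %>%
  mutate(above_7 = rating >= 7)

# Count films in each group and compute each group's share.
rating_summary <- rated %>%
  count(above_7) %>%
  mutate(share = n / sum(n))

# Films that found no match in the IMDb data.
unmatched <- anti_join(netflix, imdb, by = c("title", "year"))
```

Joining on both title and year helps avoid matching remakes or unrelated films that share a title, at the cost of dropping films whose title or year is recorded differently in the two sources.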
A solid 36% of the movies currently on Netflix (among those matched to IMDb) are rated 7.0 or higher, which seems fair to me. While the majority of films do not pass my personal filter, I’m sure many other viewers are less of a movie snob than I am.