IMDB MOVIE RATING PREDICTOR
Scenario
Examine what factors lead to certain ratings for movies. Given that Netflix does not currently store this type of data, your boss has suggested that you collect ratings and reviews data from IMDB.
- Netflix uses random forests and decision trees to predict what types of movies an individual user may like.
- Using unsupervised learning techniques, Netflix is able to continually update suggestions, listings, and other features of its user interface.
- Netflix, however, hasn't focused on collecting data on the top movies of all time, and would like to add some of them to their offerings based on popularity and other factors.
Prompt:
Predict whether a movie is highly rated.
Collect the data, build a random forest, and examine its feature importances to understand what factors contribute to ratings.
Questions to keep in mind:
- What factors are the most direct predictors of rating?
- You can use rating as your target variable. But it's up to you whether to treat it as continuous, binary, or multiclass.
Risks & Assumptions:
Gross revenue, where unavailable in the scraped data, was set to 0. Some columns (features) judged irrelevant to the analysis were removed, based on my own judgment and domain knowledge.
Web Scraping - IMDB
Using the requests library, the top 250 IMDB movies were scraped from http://www.imdb.com/chart/top. Then, using the online OMDb API (http://www.omdbapi.com/), a function was written to fetch details about each of the 250 top movies. Separately, gross revenue was scraped for each movie using its IMDB ID and stored in a separate dataframe.
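A minimal sketch of this scraping step is shown below. The regex-based ID extraction is an assumption about the chart's markup at the time, and OMDb now requires a free API key (the `api_key` parameter and `YOUR_KEY` placeholder are illustrative, not part of the original code).

```python
import re
import requests
import pandas as pd

# Fetch the IMDB Top 250 chart page and pull out the IMDB IDs (e.g. "tt0111161").
# Extracting the IDs with a regex is an assumption; the page layout may have changed.
html = requests.get("http://www.imdb.com/chart/top").text
imdb_ids = list(dict.fromkeys(re.findall(r"/title/(tt\d+)/", html)))[:250]

# Query the OMDb API for details about each movie.
def get_movie_details(imdb_id, api_key="YOUR_KEY"):
    resp = requests.get("http://www.omdbapi.com/",
                        params={"i": imdb_id, "apikey": api_key})
    return resp.json()

movies = pd.DataFrame([get_movie_details(i) for i in imdb_ids])
```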
Data Cleaning/ Munging
After converting the scraped data into a dataframe, values of "N/A" were replaced with NaN so that null values could be handled consistently. The "Released" column was converted to a datetime type, the "Runtime" column was cleaned by stripping "min" and converting to an integer, the "Year" column was cast to integer, and the "Metascore" column was cast to float. For the "imdbVotes" column, commas were stripped and the values converted to integers. After these changes, the gross revenue dataframe and the IMDB dataframe were combined.
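A sketch of these cleaning steps, assuming the column names returned by OMDb and a separately scraped gross-revenue dataframe here called `gross_df` (the name is an assumption):

```python
import numpy as np
import pandas as pd

# Replace OMDb's "N/A" strings with real missing values.
movies = movies.replace("N/A", np.nan)

# Type conversions described above (column names follow the OMDb response fields).
movies["Released"] = pd.to_datetime(movies["Released"], errors="coerce")
movies["Runtime"] = movies["Runtime"].str.replace(" min", "").astype(int)
movies["Year"] = movies["Year"].astype(int)
movies["Metascore"] = movies["Metascore"].astype(float)
movies["imdbVotes"] = movies["imdbVotes"].str.replace(",", "").astype(int)

# Join the separately scraped gross-revenue dataframe on the IMDB ID.
movies = movies.merge(gross_df, on="imdbID", how="left")
```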
Next, iterating through the columns, the number of NaNs in each column was counted. Metascore had the most NaNs with 81, and Gross (revenue) came in second with 66. The other columns had far fewer: Awards had 4, Poster had 2, and Language, Rated, and Released each had 1. The single NaN in Language belonged to the movie "Sunrise"; after some research, it was confirmed to be an English-language film and filled in as such. Next, "Gangs of Wasseypur" had a NaN for its rating and was changed to "UNRATED". The remaining single NaN, in the "Released" column, belonged to "The Gold Rush"; after finding the release date from outside sources, it was set to '1925-06-26'. Since there were too many NaNs in the "Gross" (revenue) column to fill in by hand, missing values were replaced with zeros (0).
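A sketch of the null audit and the hand-fixes, assuming the "Title" and "Gross" column names (the latter comes from the separate revenue scrape):

```python
# Count the missing values in each column to decide how to handle them.
print(movies.isnull().sum().sort_values(ascending=False))

# Hand-fix the columns with a single missing value (values found through outside research).
movies.loc[movies["Title"] == "Sunrise", "Language"] = "English"
movies.loc[movies["Title"] == "Gangs of Wasseypur", "Rated"] = "UNRATED"
movies.loc[movies["Title"] == "The Gold Rush", "Released"] = pd.Timestamp("1925-06-26")

# Gross has too many gaps to impute sensibly, so treat missing revenue as 0.
movies["Gross"] = movies["Gross"].fillna(0)
```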
Next, several columns were dropped for varying reasons. The Awards column was dropped because of collinearity between ratings and awards won. The Poster column contained only image links and was not relevant to the analysis. Metascore was dropped because of collinearity and its large number of NaN values, which would have complicated the analysis. The imdbID column was dropped because it contained identifiers relevant only to IMDB. The Type column contained a single category, movies, and was dropped since only movies were being analyzed. The Writer column was dropped because movies often list many writers and the field was judged largely irrelevant to the analysis. Lastly, the Response column was dropped because every row contained the same value, "True".
Text Vectorization
In order to treat each category or "word" as a feature and create dummy-variable columns, CountVectorizer from the sklearn library was used.
(It is important to note that, before running the models, one column must be dropped every time a categorical column is dummified, because of the collinearity between the categories.)
First, using the count vectorizer, each genre was given its own column, and each movie received a 1 if it belonged to that genre and a 0 if it did not. Since the Music genre had one of the smallest counts of movies, its column was dropped. This step was repeated for actors, languages, countries, and directors. Just as a column was dropped for genre, the Luxembourg column (country) and the American Sign Language column (language) were dropped. For directors and actors, since movies can have more than one of each, no column needed to be dropped.
After concatenating the newly formed columns with the existing dataframe, the Plot column was dropped, since plots vary greatly between movies and contain whole paragraphs of text. imdbRating was converted to a float and duplicate rows were then dropped.
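A minimal sketch of the vectorization and concatenation steps, shown for the Genre column (splitting on commas is an assumption about how the raw OMDb field was tokenized; the same pattern would be repeated for Actors, Language, Country, and Director):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Turn the comma-separated "Genre" strings into one 0/1 column per genre.
cv = CountVectorizer(tokenizer=lambda s: [g.strip() for g in s.split(",")],
                     lowercase=False)
genre_dummies = pd.DataFrame(cv.fit_transform(movies["Genre"]).toarray(),
                             columns=cv.get_feature_names_out(),
                             index=movies.index)

# Drop one genre column (the sparsely populated "Music") to avoid perfect collinearity.
genre_dummies = genre_dummies.drop(columns=["Music"])

# Concatenate the dummies, drop the free-text Plot column, fix the rating dtype,
# and remove duplicate rows.
movies = pd.concat([movies, genre_dummies], axis=1)
movies = movies.drop(columns=["Plot"])
movies["imdbRating"] = movies["imdbRating"].astype(float)
movies = movies.drop_duplicates()
```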
Data Analysis / EDA
Before running the decision tree models, some exploratory analysis was necessary to become familiar with and better understand the data.
The heatmap shows that imdbRating and imdbVotes have a high positive correlation. The year of the movie and imdbVotes also appear highly correlated. The year of the movie and the IMDB rating, however, have a correlation of almost 0 (neither negative nor positive). Runtime shows very little correlation with any of the other features.
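A heatmap along these lines could be produced as follows; the exact list of numeric columns plotted is an assumption.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the main numeric features discussed above.
numeric_cols = ["imdbRating", "imdbVotes", "Year", "Runtime", "Gross"]
sns.heatmap(movies[numeric_cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between numeric features")
plt.show()
```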
Looking at the top grossing movies in the IMDB top 250, the top five grossing movies are all action related. Two Star Wars movies and two Batman movies gross the most. The top 10 highest grossing movies tend to be animated (Disney) films, superhero movies, or parts of a series (Star Wars, Harry Potter, The Lord of the Rings).
The bar plot of ratings for the IMDB top 250 movies reveals that no movie is rated above 9.5 (the highest being The Shawshank Redemption at 9.3), and within the top 250 the lowest rating is 8.0. The Godfather and The Godfather: Part II hold the second and third highest IMDB ratings. Cult classics like Pulp Fiction, The Shawshank Redemption, The Lord of the Rings, and The Good, the Bad and the Ugly also rate very highly.
Leonardo DiCaprio, Harrison Ford, and Robert De Niro are tied at 7 for the most appearances in the top 250 IMDB movies. Tom Hanks and Clint Eastwood are a close second.
The plot of directors by the number of movies they directed shows that, among the top 250 IMDB movies, most directors with multiple highly rated entries fall in the 2-3 movie range. Surprisingly, there are 5 directors with 5 movies each in the top 250, including famous names like Alfred Hitchcock, Steven Spielberg, Stanley Kubrick, Christopher Nolan, and Martin Scorsese.
The most popular language in the top 250 highest rated movies on IMDB is English. 211 out of 250 movies, or 84% of the top 250, are in English, making it almost 4 times as common as the next most popular language, French. There could be a bias here, since IMDB was founded in the UK, an English-speaking country, and is currently owned by Amazon, a company based in the United States (also English-speaking).
The most popular genre among the top 250 IMDB movies is, by far, drama. It is important to remember, though, that movies can carry multiple genre labels, and many of these films do. Genres with elements of high activity and suspense, such as adventure, thriller, action, and crime, were also common among the top 250 IMDB films.
Decision Trees
In order to run the decision tree analysis and predict IMDB ratings, a cutoff point was first needed. The mean rating was used as the benchmark, with the model predicting whether a movie falls above or below it. The mean of 8.31, while it may seem low within the 8.0 to 9.3 range, is a good benchmark because it reflects the skew of the ratings, which cluster toward the lower end near 8.0 rather than toward 9.3.
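A sketch of the target binarization and train/test split under these assumptions (the exact columns excluded from the feature matrix and the split parameters are illustrative):

```python
from sklearn.model_selection import train_test_split

# Binarize the target: 1 if a movie's rating is above the mean cutoff (~8.31), else 0.
cutoff = movies["imdbRating"].mean()
y = (movies["imdbRating"] > cutoff).astype(int)

# Feature matrix: drop the target and free-text identifiers; any remaining
# non-numeric columns would also need to be dropped or encoded.
X = movies.drop(columns=["imdbRating", "Title"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```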
Running an initial decision tree classifier, imdbVotes appears to be about 4 times more important for predicting IMDB rating than the next highest feature, the year of the movie. The model has a consistent precision, recall, and f1-score of .76. Interestingly, grid searching the decision tree lowered the precision score to .74, while the recall and f1-scores dropped to .75. Bagging the grid-searched decision trees, however, increased the precision score to .87 and the recall and f1-scores to .85. Bagging combined with grid searching therefore improved the scores by roughly 10 percentage points.
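A sketch of this progression with scikit-learn; the hyperparameter grid values are illustrative assumptions, not the ones used originally.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report

# Baseline decision tree.
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, dt.predict(X_test)))

# Grid search over tree depth and leaf size (grid values are illustrative).
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 2, 5]}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_dt.fit(X_train, y_train)
print(classification_report(y_test, grid_dt.predict(X_test)))

# Bag the best grid-searched tree -- the combination that improved the scores the most.
bagged = BaggingClassifier(grid_dt.best_estimator_, n_estimators=100,
                           random_state=42).fit(X_train, y_train)
print(classification_report(y_test, bagged.predict(X_test)))
```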
A random forest model and an extra trees model were also run to see whether they would improve precision, recall, or f1-score. The random forest had a precision of .78, a recall of .73, and an f1-score of .69. The extra trees classifier had a precision of .64, a recall of .65, and an f1-score of .61. Both models scored below the decision tree, with the exception of the random forest's precision, which was higher than that of both the plain and grid-searched decision trees. Furthermore, grid searching the random forest lowered its precision (.66), recall (.67), and f1-score (.62), just as grid searching the decision tree did.
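A sketch of these two ensembles plus the grid-searched forest, continuing with the same split and the same illustrative grid as above:

```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Random forest and extra trees, fit on the same split for comparison.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
et = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
print(classification_report(y_test, et.predict(X_test)))

# Grid-searched random forest, reusing the illustrative grid from the decision tree.
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_rf.fit(X_train, y_train)
print(classification_report(y_test, grid_rf.predict(X_test)))
```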
Feature Importance
When comparing feature importances across the three models (decision tree, random forest, and extra trees), imdbVotes consistently appears to be the best predictor of IMDB rating. That said, its importance is significantly higher in the decision tree model than in the other two. Runtime carries high importance in the random forest and extra trees models, but not in the decision tree. The random forest and extra trees models are similar in which features they rank as most important. The chart above shows the differing feature importances across the models.
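A chart of this kind could be built from the fitted models' `feature_importances_` attributes; sorting by the random forest column and showing the top 10 features is an arbitrary presentation choice.

```python
# Side-by-side feature importances for the three fitted models.
importances = pd.DataFrame({
    "decision_tree": dt.feature_importances_,
    "random_forest": rf.feature_importances_,
    "extra_trees": et.feature_importances_,
}, index=X.columns)

importances.sort_values("random_forest", ascending=False).head(10).plot.barh()
plt.title("Top feature importances by model")
plt.show()
```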
Accuracy of Models
Lastly, comparing the accuracy scores of each model, the grid-searched random forest has the highest score, with the grid-searched bagged decision trees coming in second. Interestingly, the grid-searched decision tree has a lower accuracy score than the plain decision tree; the scores are very similar, so grid searching did little to improve the decision tree's accuracy.
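One way to tabulate these scores, assuming the models fitted in the earlier sketches (held-out test accuracy shown here; the original may have used cross-validation instead):

```python
# Test-set accuracy for each fitted model.
models = {
    "decision tree": dt,
    "gridsearched decision tree": grid_dt,
    "bagged gridsearched decision tree": bagged,
    "random forest": rf,
    "gridsearched random forest": grid_rf,
    "extra trees": et,
}
for name, model in models.items():
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```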
Conclusion:
The factor that contributes most to the prediction of IMDB rating is the number of IMDB votes. Across all the models, the more votes a movie had, the more likely it was to be predicted as highly rated. Year and gross revenue were also good indicators of rating across the models. Runtime, while a strong indicator for the random forest and extra trees, was not a good indicator for the vanilla decision tree model.
Not surprisingly, grid searching the random forest and grid searching the bagged decision trees produced the most accurate scores. The differences were often negligible, ranging from at most about 3% down to as little as .0001%. The grid-searched, grid-searched-and-bagged, and vanilla decision trees had the highest margin of error in their accuracy scores, however. Oddly, extra trees had the lowest accuracy score and also a higher error than both the random forest and the grid-searched random forest.
Ultimately, different models are suited to different types of analysis, and no single model is always the best to use. It is important to test multiple models to find the best performer, and to remember that there are different ways to score a model.