In this project I analyzed movie data from The-Numbers to see whether I could predict a movie’s worldwide gross from a handful of relevant features. My main focus was on using regression to predict worldwide gross accurately. I’ll be making a separate post on beautifulsoup4 and on how I actually pulled the data from the site, which was quite tedious.
Initially, my hypothesis was that movies made in America play in American theaters first, so success during their first few weeks at the American box office would translate into worldwide success. I looked at ~3900 films and filled 500 missing cells with column averages, which gave me a larger dataset to train on. The features I considered were: movie budget, % change in domestic gross from weekend 1 to 2 and from weekend 2 to 3, average gross per theater for weekends 1-3, number of theaters the movie played in for each of weekends 1-3, the Rotten Tomatoes critic and audience ratings, the genre, and the MPAA rating. I turned MPAA rating and genre into dummy variables, which I will show how to do in a future edit to this post.
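As a rough illustration of the preprocessing described above (filling gaps with column averages, computing the weekend-to-weekend % change, and making dummy variables), here is a minimal pandas sketch. The toy rows and every column name (`budget`, `gross_wk1`, `gross_wk2`, `mpaa`, `genre`) are made up for the example, and doing the fill before computing the % change is an assumption rather than a record of what I actually ran.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the scraped data; all column names are hypothetical.
films = pd.DataFrame({
    "budget": [50e6, 120e6, np.nan, 8e6],
    "gross_wk1": [30e6, 80e6, 5e6, 2e6],
    "gross_wk2": [15e6, 48e6, 4e6, np.nan],
    "mpaa": ["PG-13", "PG", "R", "R"],
    "genre": ["Action", "Adventure", "Drama", "Horror"],
})


def preprocess(df):
    out = df.copy()

    # Fill missing numeric cells with the mean of their column.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())

    # % change in gross from weekend 1 to weekend 2.
    out["pct_change_wk1_wk2"] = (
        (out["gross_wk2"] - out["gross_wk1"]) / out["gross_wk1"] * 100
    )

    # One 0/1 indicator column per MPAA rating and per genre.
    return pd.get_dummies(out, columns=["mpaa", "genre"])


prepared = preprocess(films)
print(prepared.columns.tolist())
```

Mean imputation keeps every row available for training, at the cost of shrinking the variance of the filled columns a little.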
Models and Performance
- Elastic Net on predicting WorldWide
- Elastic Net on predicting Domestic
- Random Forest Regressor on predicting WorldWide
- Random Forest Regressor on predicting Domestic
As a point of comparison, I used the same features to predict domestic gross as well, to show that they were much more predictive of domestic gross than of worldwide gross. Of the two models I looked at, the Random Forest regressor was clearly the more accurate. I will update this post with my methodology for parameter tuning as well.
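For readers who want to try this kind of comparison themselves, here is a minimal scikit-learn sketch that fits both model types on the same train/test split and reports R^2. It runs on synthetic data, and the hyperparameters (`alpha`, `l1_ratio`, `n_estimators`) are placeholder values, not the tuned settings from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the film features and the gross target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # Elastic Net is scale-sensitive, so standardize the features first.
    "elastic_net": make_pipeline(
        StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)
    ),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # R^2 on held-out data, so both models are judged on films they never saw.
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Swapping the target column between worldwide and domestic gross while keeping the same split gives the four model/target combinations listed above.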
I looked at the feature importance graph for my features and found that the percent change really didn’t matter at all. This made me realize that my original assumption, that American movies are released in America first, was wrong. I looked up the release dates of many American movies and found that they were often released up to a month earlier in other countries! On the other hand, the gross per theater and the number of theaters the movie played in seemed to be quite predictive, as the feature importance graph will show.
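Until the graph is posted, here is a small sketch of how feature importances can be read off a fitted random forest in scikit-learn. The data is synthetic and the column names are hypothetical; the target is deliberately built so that the % change column carries no signal, which only mimics the situation described above and does not reproduce my actual numbers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Hypothetical feature names on synthetic values.
X = pd.DataFrame({
    "theaters_wk1": rng.normal(size=n),
    "gross_per_theater_wk1": rng.normal(size=n),
    "pct_change_wk1_wk2": rng.normal(size=n),
})
# The target ignores pct_change_wk1_wk2 by construction.
y = (
    2.0 * X["theaters_wk1"]
    + 1.5 * X["gross_per_theater_wk1"]
    + rng.normal(scale=0.3, size=n)
)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be misleading when features are strongly correlated, so `sklearn.inspection.permutation_importance` is worth running as a cross-check before dropping anything.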
After removing the unimportant features, I split the data by MPAA rating to see if that would improve my score. My R^2 scores went up in every category except G, where the sample size was too small (just 96 movies).
- G: R^2=0.513
- PG: R^2=0.793
- PG-13: R^2=0.712
- R: R^2=0.777
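The per-rating split can be sketched as a loop over groups, fitting one model per MPAA rating and scoring each on its own held-out rows. Everything below is illustrative: the data is synthetic, the column names are hypothetical, and the random forest settings are placeholders, so the printed scores will not match the ones listed above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic films with a hypothetical rating column.
data = pd.DataFrame({
    "theaters_wk1": rng.integers(100, 4000, size=n),
    "gross_per_theater_wk1": rng.uniform(500, 15000, size=n),
    "mpaa": rng.choice(["G", "PG", "PG-13", "R"], size=n, p=[0.1, 0.2, 0.35, 0.35]),
})
data["worldwide_gross"] = (
    data["theaters_wk1"] * data["gross_per_theater_wk1"] * rng.uniform(2, 4, size=n)
)

feature_cols = ["theaters_wk1", "gross_per_theater_wk1"]
r2_by_rating = {}

for rating, group in data.groupby("mpaa"):
    # Each rating gets its own train/test split and its own model.
    X_train, X_test, y_train, y_test = train_test_split(
        group[feature_cols], group["worldwide_gross"],
        test_size=0.25, random_state=0,
    )
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    r2_by_rating[rating] = r2_score(y_test, model.predict(X_test))
    print(f"{rating}: n={len(group)}, R^2={r2_by_rating[rating]:.3f}")
```

Printing the group size next to each score makes it easy to spot categories, like G here, where there are too few films for the score to be trusted.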
This was a quick project to try regression modeling on some select features in order to predict worldwide gross. I learned that grouping by relevant features can greatly increase the quality of prediction and that I should check my assumptions before modeling! This post will be updated with code and graphs in the near future.