Determination of Features that Contribute to High Gross Box Office Earnings

Project Motivation:

Moviegoers often face the same question: a movie has a poor rating, so should they see it anyway? While that question may seem trivial, it points to a broader one. This project determines which key factors contribute to a movie grossing more than 100 million dollars at the box office.

To answer this question, Thomas Keady, Sambit Panda, and I implemented several machine learning algorithms on an IMDB movie dataset for our Data Mining (AMS 553.636) final project.

Research Details and Results:

This is a cropped version of our decision tree regression. It was enormous!

The decision tree regression results showed that actors' Facebook likes were highly influential in determining box office success, while screen aspect ratio was the least important feature. Our study would be extremely useful for a low-budget director looking to make their film.
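To make the idea of feature influence concrete, here is a minimal sketch of the variance-reduction score that a regression tree typically uses when choosing a split. It is not our project code: the rows are synthetic toy values rather than the IMDB data, and the column names and the helper `best_split_gain` are hypothetical names chosen for illustration.

```python
from statistics import pvariance


def best_split_gain(xs, ys):
    """Largest variance reduction achievable by one threshold split on xs."""
    n = len(ys)
    parent = pvariance(ys)
    pairs = sorted(zip(xs, ys))
    best = 0.0
    for i in range(1, n):
        # A threshold can only sit between two distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        best = max(best, parent - weighted)
    return best


# Synthetic toy rows (NOT the IMDB dataset). Gross is in millions of dollars
# and was made up so that it tracks likes but not aspect ratio.
toy = {
    "actor_facebook_likes": [100, 500, 2000, 15000, 30000, 45000],
    "aspect_ratio": [1.85, 2.35, 1.85, 2.35, 1.85, 2.35],
}
gross = [5, 12, 20, 110, 150, 210]

gains = {name: best_split_gain(col, gross) for name, col in toy.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
for name in ranked:
    print(f"{name}: variance reduction = {gains[name]:.1f}")
```

A full tree repeats this scoring at every node and adds up each feature's reductions, which is roughly how an importance ranking like ours comes about.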

Most Challenging Part of the Project:

The most difficult part of the project was determining which algorithm to use. We ran through a few different ones before we arrived at decision tree regression.

At first, we classified our movie data into three categories based on MPAA rating: PG, PG-13, and R. To best train our model, we plotted gross against duration for all movies, ran the supervised k-nearest neighbors (KNN) algorithm to find each movie's nearest neighbors, and then applied a Naïve Bayes classifier. We abandoned this approach because the Naïve Bayes classifier mislabeled almost half the points.

Next, we tried unsupervised learning, using the k-means clustering algorithm and a silhouette analysis to find the ideal number of clusters. However, the curse of dimensionality (we had 28 features from the movie dataset) led to visualization difficulties and large Euclidean distances, so this approach did not provide any information we could use for inference. Hence, we shifted to our final approach, decision tree regression.
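The distance problem is easy to reproduce. The sketch below is an illustration under stated assumptions, not our project code: it draws random points uniformly from a unit cube (our real features were not distributed this way) and compares 2 dimensions with 28. The helper `distance_stats` and its parameters are hypothetical names.

```python
import math
import random


def distance_stats(dim, n_points=200, seed=0):
    """Return (mean distance, relative contrast) from one query point to the rest."""
    rng = random.Random(seed)
    # Assumption: points are uniform in the unit cube [0, 1]^dim.
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    mean = sum(dists) / len(dists)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (max(dists) - min(dists)) / min(dists)
    return mean, contrast


mean_2d, contrast_2d = distance_stats(2)
mean_28d, contrast_28d = distance_stats(28)

print(f" 2 dims: mean distance {mean_2d:.2f}, relative contrast {contrast_2d:.2f}")
print(f"28 dims: mean distance {mean_28d:.2f}, relative contrast {contrast_28d:.2f}")
```

In the higher-dimensional case the average distance grows while the gap between the nearest and farthest point shrinks relative to the nearest, so distance-based methods such as k-means have less to separate clusters with.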