Tuesday, December 15, 2015

Will the new Star Wars suck? An analysis of directors and movie involvement

How does a bad movie ever get made?

Considering that Hollywood's massive budgets provide access to the world's finest writers, directors, and actors, how are movies ever bad?

Well, as we all know, they often are. But how could a Star Wars movie, in particular, be anything but great? Many reasons have been proposed for the perceived poor quality of the prequels, with George Lucas typically in the crosshairs. One theory is that Lucas had too much creative control, his ideas couldn't be challenged, and a mediocre result was inevitable. JJ Abrams also has a lot of control over the new Star Wars; Rotten Tomatoes lists his roles as director, producer and screenwriter. Should we be worried?

In this article, we'll use a statistical model to analyze movie directors and their involvement as screenwriters and producers of movies, and we'll use that same model to predict the Rotten Tomatoes Tomatometer™ score for Star Wars: Episode VII - The Force Awakens. Truth be told, there will be more shooting from the hip than an actual Star Wars battle. There will be a lot of estimating, not a lot of focus on uncertainty, and hell, if we hit something, it's probably because we got lucky. But to even take a stab at it, we're going to need some data.

The Data

The data analyzed in this article consists of Rotten Tomatoes filmographies for 20 directors (well-known or behind recent hit movies) and a collective 207 movies rated "PG," "PG-13," or "R." While the filmographies come from Rotten Tomatoes, the movie-level information comes from the Open Movie Database (OMDB). For information on data quality and completeness, as well as the R script that pulled the data, refer to the Appendix.

The Model

Deadspin's Albert Burneko not only has the audacity to ask, "What If The New Star Wars Sucks, Too?", his foul-language cinematic analysis suggests that it may be the most likely outcome. His reasons?
  1. Only two of the first six Star Wars were very good, and thus there's only a 33% chance that the new one will be good as well.
  2. Within the Star Trek movies from director JJ Abrams, only one movie out of two was good, so we have at best a 50% chance that the new Star Wars will be good.
Interpreted under one statistical paradigm, Burneko suggests we can either imagine some process that generates Star Wars movies, or one that generates JJ Abrams movies. Which do you think is more plausible?

Given the level of influence that a director has, the second is more compelling to me. In comparing the two Star Trek movies that Abrams directed, Burneko controls for the genre of movie at a very fine level, so fine that there are only two movies. Actually, Abrams is only credited as director on four movies according to Rotten Tomatoes (we're ignoring TV programs), so in this article we'll use a statistical model that allows information to be shared among directors. While the model is limited by having only four movies from Abrams, we'll use it to learn about Hollywood in general (from 207 movies) and take a shot at the prediction of the Tomatometer rating for the new Star Wars.

The statistical model used seeks the probability of a positive critical rating for a movie, i.e., the expected Tomatometer rating (the actual Tomatometer rating is just the ratio of positive reviews to total reviews). It controls for the MPAA (content) rating of the movie and the year the movie was released. The genre variable was not used due to the many values it assumes, though it's worth revisiting. As with all subjective performance metrics, a higher Tomatometer score will not necessarily guarantee a better movie. (Notably, despite Burneko's claim of an inferior Abrams Star Trek follow-up in Into Darkness, a view this author shares, both movies received high Tomatometer ratings at 95% and 87%.)

At the director level, we can estimate how particular directors perform when they have more control. Since we don't have a lot of data for many of the directors, this type of model (called "hierarchical," "mixed effects," or "random coefficient" models) chooses a middle ground, estimating the effects on average while also letting individual directors deviate from this average. This amounts to information-sharing among directors, which is good if little information is available. Just note that directors without much information won't get a "clean slate"; they're assumed a priori to be somewhat like other directors.
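To make the "middle ground" concrete, here is a minimal base R sketch of partial pooling for simple group means. It is not the model used in this article (that one is a binomial mixed model), and the director labels, ratings, and both variance values are made up for illustration; a real mixed model estimates the variances from the data.

```r
# Hypothetical ratings (percent fresh) for three made-up directors.
ratings <- list(
  few.movies  = c(95, 90),
  some.movies = c(60, 70, 65, 55),
  many.movies = c(40, 45, 50, 35, 42, 48, 38, 44)
)

n          <- sapply(ratings, length)   # movies per director
raw.mean   <- sapply(ratings, mean)     # "no pooling" estimate
grand.mean <- mean(unlist(ratings))     # "complete pooling" estimate

# Assumed variance components (hypothetical values, not estimates).
sigma2 <- 100   # movie-to-movie variance within a director
tau2   <- 50    # director-to-director variance

# The weight on a director's own data grows with the number of movies.
weight  <- tau2 / (tau2 + sigma2 / n)
partial <- weight * raw.mean + (1 - weight) * grand.mean

round(cbind(n, raw.mean, weight, partial), 2)
```

In this toy setup the director with two movies is pulled halfway toward the overall mean, while the director with eight movies keeps most of their own average.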

Assumptions in a statistical model are like fine print on your insurance policy, not very interesting but maybe worth a glance. In order to share information among directors, the model assumes that the director-level effects are random values from a normal distribution. That I picked well-known and currently relevant directors makes this assumption less plausible, though it still provides a device for sharing information among the directors in the data set. Endogeneity may be a problem if, say, the effect of producing a movie is related to the opportunity to produce, and more talented directors get that opportunity more often. That would result in effect estimates that were biased upward in magnitude.

The Results

The big prediction

The model's prediction of the new Star Wars Tomatometer rating is 82% fresh, with a crude 95% confidence interval stretching from 40% to 100% fresh. That's a lot of uncertainty! And you can throw into that mix the knowledge that this Star Wars movie isn't just any movie. Still, a point estimate is a point estimate, and this article holds its ground at 82% fresh. In the sections below, we'll look at how this prediction came to be. To replicate it exactly, you can skip to the Appendix and work with the R code.

Model results overall

There is no evidence of a detrimental effect of director involvement through writing and producing in general. Directionally at least, the estimates are slightly positive. But, as one might hypothesize, there is more evidence that it depends on which director is doing the writing and producing. With that in mind, let's dig into the directors and see how things play out.

When directors just direct

Adjusted for calendar year and movie rating, below are the model scores for 20 directors assuming they only direct (they do not write or produce). While these scores have a complicated relationship to the expected Tomatometer rating, it is enough to note that higher scores translate to a higher likelihood of a critically acclaimed movie.

director score
JJ Abrams 4.33
Sam Mendes 3.74
Robert Zemeckis 3.33
Sam Raimi 3.33
Quentin Tarantino 3.20
Ridley Scott 3.20
George Lucas 3.14
Steven Spielberg 2.95
Francis Lawrence 2.93
Oliver Stone 2.91
James Cameron 2.64
George Miller 2.15
Bryan Singer 2.09
Bill Condon 2.04
Chris Columbus 2.03
Tom McCarthy 2.03
Ivan Reitman 1.88
Michael Bay 1.63
Paul Feig 1.34
Shawn Levy 1.10

It's great to see our new Star Wars director at the top of the list, despite only four movies conspiring to get him there. But since JJ Abrams has either written or produced every movie he has directed, we must admit that this is an extrapolation. Regardless, it's hard to deny that Abrams has been a critically successful director with Mission: Impossible III (70%), Star Trek (95%), Super 8 (82%), and Star Trek Into Darkness (87%).

While this article focuses on Abrams and Star Wars, consider for a moment director Paul Feig, who has directed some very successful movies but is second to last on this list. Of the four movies he has directed, the single example where he only directed was Unaccompanied Minors, with a Tomatometer rating of only 31%. The model combines information where it can, but it also weighs each director's own data heavily. Other analysis strategies would have enabled us to adjust this weighting (a point we'll revisit), but Feig will get redemption in the next section.

When directors write and produce

Combining the director scores of the first table with the additive contributions of writing, producing, and any synergistic effects of the two leads to the following ranking of directors most likely to write and produce a great movie. To keep the article readable, the tables with the intermediate additive contributions have been omitted, but they can be recreated from the R code in the Appendix.

director score
Tom McCarthy 4.97
Paul Feig 4.74
Sam Raimi 4.61
George Miller 4.21
James Cameron 3.81
Bill Condon 3.80
Quentin Tarantino 3.75
JJ Abrams 3.70
Robert Zemeckis 3.44
Francis Lawrence 3.32
Ivan Reitman 3.26
Steven Spielberg 3.26
Sam Mendes 3.20
George Lucas 3.13
Michael Bay 2.73
Shawn Levy 2.63
Bryan Singer 2.60
Oliver Stone 2.10
Chris Columbus 1.83
Ridley Scott 1.25

Abrams's score has dropped a bit, but one shouldn't read too much into it. The likely explanation is that he was not a screenwriter for his two highest rated movies. On the other end of the same spectrum, notice that Paul Feig, one of the lower ranking members of the "directs only" category, is now second from the top. The movies where he additionally wrote or produced all earned critical acclaim.


As promised, the predicted 82% Tomatometer rating for Star Wars was a shot from the hip, but a fun one to take. We also learned that, in general, greater involvement from a director in terms of writing and producing is no cause for fear. It may even be a good thing.

One reason there aren't more formal statistical metrics (p-values, etc.) included is that, despite my experience with the modeling technique used in this article (and previous articles here and here), there are a few tricky theoretical issues that plague it. It's just a lot easier to analyze the point estimates and predictions. A Bayesian modeling approach may better handle the uncertainty present and allow prior information to be incorporated. It seems reasonable to believe that Paul Feig as a director will not alternate between nearly best and worst just because he did or did not additionally write and produce. But that's the cost of "letting the data speak" when there isn't much data!

But there is a lot of data. I only chose 20 directors and their movies, yet there are thousands more that could be pulled in and analyzed. This would also lead to a more representative data set.

One final thing I'd like to reflect on is that, regardless of the quality of the model, we really don't know how good Star Wars: Episode VII - The Force Awakens is going to be. What happened on the set, the chemistry of actors who haven't worked together in decades, the guidance of a relatively untested director, and the pressure to perform all make this movie a singular event. As a statistician, you do the best you can given the data and the constraints. As a fan of the movie franchise, you close your eyes and hope for good luck. May that force be with us all.

Appendix

Data quality and completeness

My experience with the Rotten Tomatoes search API (as of Dec 2015) was that it wasn't capable of searching based on both title and year, a requirement for automated data collection since movies are often remade. The OMDB does offer search based on title and year and thus became the database used in this analysis. Unfortunately, many movies featured on the OMDB do not include Tomatometer scores, even when they exist on rottentomatoes.com. For example, the OMDB results for The Matrix (http://www.omdbapi.com/?t=The Matrix&y=1999&r=json&tomatoes=true) include Tomatometer scores while those for Star Wars: Episode I (http://www.omdbapi.com/?t=Star Wars: Episode I - The Phantom Menace&y=1999&r=json&tomatoes=true) do not. I was unable to discover a pattern in movies that would explain this.
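The example URLs above contain spaces and a colon in the title, which generally need percent-encoding before a request is sent. Below is a small sketch of building such a query string in base R. The base URL and the t, y, r, and tomatoes parameters are copied from the examples above; omdb.url is a hypothetical helper name, and the sketch only builds the string. It does not check whether the service still accepts requests in this form.

```r
# Build an OMDB query URL from a title and year, percent-encoding the title.
# omdb.url is a hypothetical helper; the parameters mirror the URLs in the text.
omdb.url <- function(title, year) {
  paste0("http://www.omdbapi.com/",
         "?t=", utils::URLencode(title, reserved = TRUE),
         "&y=", year,
         "&r=json&tomatoes=true")
}

omdb.url("The Matrix", 1999)
omdb.url("Star Wars: Episode I - The Phantom Menace", 1999)
```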

I decided to write this article with the data that could be collected automatically. An "N/A" Tomatometer score would appear if the movie didn't have a Tomatometer rating on rottentomatoes.com (often true for older movies) and for the unknown cases described in the previous paragraph. Keeping only those movies with Tomatometer scores, 350 movies became 211, and of those, 207 were rated "PG," "PG-13," or "R." These 207 movies comprise the data set used for this analysis, distributed over directors as described in the following table.
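The two filtering steps described above (350 movies to 211, then to 207) can be sketched on a toy data frame. The titles, values, and the tomato.meter column name are made up for illustration; only the "N/A" marker and the three MPAA ratings come from the text.

```r
# Toy stand-in for the collected data; titles and values are made up.
raw.movies <- data.frame(
  title        = c("Movie A", "Movie B", "Movie C", "Movie D"),
  rating       = c("PG-13", "R", "G", "PG"),
  tomato.meter = c("87", "N/A", "95", "31"),
  stringsAsFactors = FALSE
)

# Step 1: keep only movies that have a Tomatometer score.
scored <- raw.movies[raw.movies$tomato.meter != "N/A", ]
scored$tomato.meter <- as.numeric(scored$tomato.meter)

# Step 2: keep only the three MPAA ratings used in the analysis.
analysis.set <- scored[scored$rating %in% c("PG", "PG-13", "R"), ]

nrow(raw.movies)    # 4
nrow(scored)        # 3
nrow(analysis.set)  # 2
```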

director | just directs | directs, writes | directs, produces | directs, writes, produces | total
George Lucas | 0 | 1 | 0 | 2 | 3
Tom McCarthy | 0 | 2 | 0 | 1 | 3
JJ Abrams | 0 | 1 | 2 | 1 | 4
Paul Feig | 1 | 0 | 2 | 1 | 4
Francis Lawrence | 5 | 0 | 0 | 0 | 5
Bill Condon | 4 | 3 | 0 | 0 | 7
Bryan Singer | 0 | 1 | 5 | 1 | 7
Sam Mendes | 4 | 0 | 3 | 0 | 7
George Miller | 3 | 2 | 1 | 2 | 8
James Cameron | 0 | 4 | 0 | 5 | 9
Michael Bay | 3 | 0 | 6 | 0 | 9
Quentin Tarantino | 2 | 5 | 0 | 2 | 9
Sam Raimi | 5 | 2 | 1 | 1 | 9
Shawn Levy | 3 | 0 | 5 | 1 | 9
Robert Zemeckis | 3 | 4 | 5 | 2 | 14
Chris Columbus | 3 | 3 | 7 | 2 | 15
Ivan Reitman | 2 | 0 | 13 | 0 | 15
Oliver Stone | 2 | 7 | 1 | 7 | 17
Ridley Scott | 6 | 0 | 16 | 1 | 23
Steven Spielberg | 11 | 1 | 17 | 1 | 30

Details of the 82% point estimate

The way the model makes sense of Abrams's writing and producing is the 4.33 overall director score plus -1.16 for writing alone, +0.20 for producing alone, and +0.34 for the combination of writing and producing. Taking these numbers as literal truth would be a mistake, as even sharing information among directors, there are just too few JJ Abrams movies to untangle these effects. In any case, those four numbers sum to 3.71, the value in the second table (up to rounding error). To actually get the probability of a positive critical rating (i.e., the expected Tomatometer rating), you'd need to add -1.91 for the calendar year (of 2015) and -0.25 for having a PG-13 rating. Thus, the combined total score for Star Wars is 1.55, which you send through the "S" shaped sigmoid function to get the probability estimate of 1 / (1 + exp(-1.55)), or right around .82.
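The arithmetic in the paragraph above can be checked in a few lines of base R. The numbers are the ones quoted in the text; the variable names are just labels chosen for this sketch.

```r
# Numbers quoted in the text for JJ Abrams and the new Star Wars.
director.score <-  4.33   # director-level score when only directing
writes         <- -1.16   # writing alone
produces       <-  0.20   # producing alone
both.extra     <-  0.34   # extra effect of writing and producing together

writes.and.produces <- director.score + writes + produces + both.extra

year.2015 <- -1.91        # calendar-year adjustment for 2015
pg13      <- -0.25        # PG-13 adjustment

total.score <- writes.and.produces + year.2015 + pg13

# plogis() is base R's logistic (sigmoid) function: 1 / (1 + exp(-x)).
predicted <- plogis(total.score)

writes.and.produces     # 3.71
total.score             # 1.55
round(predicted, 2)     # 0.82
```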

Pulling the data with R

The automated data collection R script is available on GitHub. Director information, available in tabular form on Rotten Tomatoes, was translated to R data frames using rvest.

Building the model in R

This section provides a walk-through of the code on GitHub.

First, load the two libraries we're going to use for our analysis: lme4 for the random effects modeling and splines (a base R package) for flexible time trend modeling. Only lme4 needs to be installed (install.packages("lme4")).
library(lme4)     # glmer() and bootMer()
library(splines)  # ns()
Next, you'll have to pull the data in using your method of choice. It can be downloaded to a csv file from this Google Spreadsheet and read in as an R data frame like this:
director.movies <- read.csv("[download_location]/movies.csv")
There are two data preparation operations in the analysis file. First, we'll analyze only those movies rated "PG", "PG-13", and "R." (Perhaps due to my non-random selection of directors, there were only two "G"-rated movies.) This leaves us with 207 movies from our 20 directors.
director.movies <- director.movies[director.movies$rating %in% c("PG", "PG-13", "R"), ]
Also, for numerical stability, we'll work with a scaled version of the calendar year:
director.movies$scaled.year <- scale(director.movies$year)
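As a quick check of what scale() does here, the following base R sketch (using a handful of made-up release years rather than the real data) shows that it subtracts the mean and divides by the standard deviation. That matters later, because any new year has to be transformed with the same mean and standard deviation as the original data.

```r
# Toy release years; the real data has one row per movie.
year <- c(1977, 1983, 1999, 2009, 2013)

scaled.by.scale <- as.numeric(scale(year))
scaled.by.hand  <- (year - mean(year)) / sd(year)

all.equal(scaled.by.scale, scaled.by.hand)   # TRUE

# A new year is transformed with the ORIGINAL mean and sd.
(2015 - mean(year)) / sd(year)
```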
Our binomial mixed model uses the proportion of "fresh" critic votes as the response. The explanatory variables of interest are flags for whether a director has participated in writing the screenplay, director.writes, and whether a director was considered a producer of the movie, director.produces. Note that the part of the formula in parentheses determines which coefficients are allowed to vary by director. Thus, director.writes, director.produces, their interaction, and a director-level intercept all vary at the director level. They are not free to vary arbitrarily, however, as the effects are assumed to arise from a normal distribution with a mean and a variance to be estimated.
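movie.full <- glmer(cbind(tomato.fresh, tomato.rotten) ~
                    # coefficients inside the parentheses vary by director
                    (1 + director.writes + director.produces +
                     director.writes:director.produces | director) +
                    ns(scaled.year, df = 2) + rating +
                    director.writes + director.produces +
                    # population-level interaction, matching the director-level one
                    director.writes:director.produces,
                    family = binomial, data = director.movies)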
To get a prediction, we'll need to create a data set with a row for the new Star Wars movie. Recall that we standardized calendar year, and need to include the transformed value of 2015 in the data set.
newdata <- data.frame(director = "JJ Abrams",
                      movie = "Star Wars: The Force Awakens",
                      scaled.year = (2015 - mean(director.movies$year)) /
                        sd(director.movies$year),
                      rating = "PG-13", 
                      director.writes = 1,
                      director.produces = 1)

predict(movie.full, newdata, type = "response")
The documentation for predict.glmer states that "there is no option for computing standard errors of predictions because it is difficult to define an efficient method that incorporates uncertainty in the variance parameters; we recommend bootMer for this task." This is how I got the confidence interval stated above; however, it is a very crude interval estimate because of (1) lack of normality of the sampling distribution, (2) convergence issues in some of the bootstrap replicates, and (3) a small number of bootstrap replicates. Still, I think the interval makes its point: there is a lot of uncertainty in how good this movie is going to be.
sw.predict <- function(model) {
  predict(model, newdata, type = "response")
}

# Beware: this takes a long time to run and is not easy to stop
bootstrap.results <- bootMer(movie.full, FUN = sw.predict, nsim = 100,
                             .progress = "txt", seed = 14324)

mean(bootstrap.results$t) - 2 * sd(bootstrap.results$t)
mean(bootstrap.results$t) + 2 * sd(bootstrap.results$t)
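Because the prediction is a probability, a mean plus or minus two standard deviations interval can spill past 0 or 1. The sketch below uses made-up replicate values (not real bootMer output) to compare that interval, a version clamped to [0, 1], and a percentile interval, which is one common alternative. With only 100 replicates, all three remain rough.

```r
set.seed(1)

# Stand-in for bootstrap.results$t: 100 made-up predicted probabilities.
fake.t <- rbeta(100, shape1 = 8, shape2 = 2)

# Normal-approximation interval, as in the text (can escape [0, 1]).
normal.interval <- mean(fake.t) + c(-2, 2) * sd(fake.t)

# Clamp the endpoints to the valid range for a probability.
clamped.interval <- pmin(pmax(normal.interval, 0), 1)

# Percentile interval: stays inside [0, 1] by construction.
percentile.interval <- quantile(fake.t, probs = c(0.025, 0.975))

normal.interval
clamped.interval
percentile.interval
```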

To create the director comparison tables, we'll need to access the random effects "estimates" directly. It is a matter of terminology whether these are actually estimates since the model considers them random variables, but they'll serve the same purpose.
random.coefs <- coef(movie.full)$director[, c("(Intercept)", "director.writes",
                                              "director.produces",
                                              "director.writes:director.produces")]

names(random.coefs)[1] <- "intercept"
names(random.coefs)[4] <- "interaction"
random.coefs$director <- rownames(random.coefs)
rownames(random.coefs) <- NULL

random.coefs <- within(random.coefs, {
  writes.only <- intercept + director.writes
  produces.only <- intercept + director.produces
  writes.and.produces <- intercept + director.writes + director.produces +
    interaction
  writes.and.produces.delta <- director.writes + director.produces + interaction
})

Here are the two tables shown in the article, for illustration. Other comparisons can be made using the additional variables in random.coefs.
random.coefs[order(random.coefs$intercept, decreasing = TRUE),
             c("director", "intercept")]

random.coefs[order(random.coefs$writes.and.produces, decreasing = TRUE),
             c("director", "writes.and.produces")]