Why and How to Model Conditional Variance, with an Application to My Letterboxd Data

One of the main assumptions of linear regression taught in statistics courses is that of “constant variance” or “homoscedasticity.” Having data that do not have constant variance (i.e., are heteroscedastic) is then often treated as a problem—a nuisance that violates our assumptions and, among other things, produces inaccurate p-values. But I have been interested for some time in taking a different approach. Instead of seeing this non-constant variance as a nuisance that we need to adjust for, what if we saw it as part of the actual phenomenon of interest? That is, what if the changing variance is actually telling us something important about what we are studying? In this post, I’m going to describe a situation where predicted variance is meaningful for the inferences we are trying to make, how to model it, and how it differs from ordinary least squares (OLS) regression. All the code and data needed to reproduce this blog post can be found at my GitHub.

The Motivating Example

The use case here is my Letterboxd ratings. Letterboxd is an app that I use to log the films I’ve watched: It keeps track of when I watched a film, how I rated it (0.5 to 5.0 stars), and any text review I gave it on that watch. I started doing this at the beginning of the pandemic, and I have been watching a lot of movies throughout this time of social distancing. Letterboxd has an amazing feature that allows you to export your data as a .csv, which also includes the year in which each film was released.

This leads to an obvious research question: Do I think movies are getting better—or worse—over time? I start by bringing in my data and regressing rating on year:

library(tidyverse)
library(gamlss)

ratings <- read_csv("ratings.csv") %>% 
  janitor::clean_names()

ggplot(ratings, aes(x = year, y = rating)) +
  geom_count(color = "#2D2D2A", alpha = .8) +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE,
    size = 1.5,
    color = "#7D82B8"
  ) +
  theme_light() +
  labs(x = "Year", y = "Rating") +
  scale_x_continuous(breaks = seq(1930, 2020, by = 10)) +
  theme(text = element_text(size = 16))
[Figure eda-1.png: ratings by release year, with points sized by count and an OLS fit line]
summary(lm(rating ~ year, ratings))
## 
## Call:
## lm(formula = rating ~ year, data = ratings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.79500 -0.66621  0.02612  0.78370  1.84810 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.598037   5.179044   3.398 0.000749 ***
## year        -0.007155   0.002590  -2.762 0.006014 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.081 on 391 degrees of freedom
## Multiple R-squared:  0.01914,    Adjusted R-squared:  0.01663 
## F-statistic: 7.629 on 1 and 391 DF,  p-value: 0.006014

We see a significant relationship such that, for every year later a film was released, we expect about a 0.007 decrease in rating (roughly 0.07 stars per decade). But the scatter plot shows we do not have constant variance: As year increases, the spread of scores gets wider. This violates the assumption of homoscedasticity. The general advice I got when learning regression was to use some type of robust standard error estimator so that this heteroscedasticity wouldn’t distort Type I or Type II error rates.
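For reference, that advice looks something like the sketch below (assuming the sandwich and lmtest packages are installed): the coefficients stay the same, and only the standard errors are adjusted.

fit <- lm(rating ~ year, data = ratings)
# heteroscedasticity-consistent (HC3) standard errors for the same OLS fit
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC3"))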

But doing that ignores the real story of the data here: Scores get more extreme as time goes on. I watch many movies as they are being released, so I am much less discriminating when it comes to current movies than older movies. I watch, say, Zack Snyder’s Justice League, which I very much did not like, but I’m not watching whatever the 1960 version of that was. This heteroscedasticity reflects a sampling bias, then: I am much less discriminating about watching a movie that comes out contemporaneously than I am when I go back to watch films from before I was born.

How do we capture this in our model? We cannot with OLS regression. Note that the summary above reports a single “residual standard error.” The model assumes this is constant, but the scatter plot and the way that the films were sampled (i.e., chosen to be watched) tell us it is not.

The Model

To take a step back: What about OLS regression assumes constant variance? My website doesn’t like me trying to use Greek letters—plus I try to stay away from them when I can—so I’ll write the model out in code instead. Assume we have a predictor x, regression coefficients b_0 (the intercept) and b_1 (the slope for the predictor), an outcome y, and a residual e. We generally write the model as:

y_i = b_0 + b_1 * x_i + e_i    where    e ~ N(0, sigma)

The e ~ N(0, sigma) means that the errors are distributed normally with a mean of zero and a standard deviation of sigma. I include the _i as a subscript to note that the y, x, and e variables differ by individual. Note that sigma does not have a subscript: It is the same for every person. This is where the assumption of constant variance comes from. Another way to write this equation is by saying:

y_i ~ N(b_0 + b_1 * x_i, sigma)

Which means that everyone is assumed to come from a normal distribution with the mean that is their predicted value (b_0 + b_1 * x_i) and a standard deviation that we calculate (sigma). The R package that I will be using refers to that predicted value as mu, so y_i ~ N(mu_i, sigma). What we will do is expand the model to add a subscript of i to the sigma, indicating that each respondent has a different standard deviation, as well as mean: y_i ~ N(mu_i, sigma_i).

We already have a regression equation predicting mu—the mean of the response variable. What we can do is make another regression equation predicting sigma—the standard deviation of the response variable. The only snag here is that a standard deviation cannot be negative. What we can do is apply the log link function to our regression equation, a generalized linear model trick that makes sure all the predicted values are positive. (You may be familiar with the log link from generalized linear models like Poisson regression.)

This updated model then becomes:

y_i ~ N(b_0 + b_1 * x_i, exp(b_3 + b_4 * x_i))

The exp makes sure that all of our predicted values are above zero. This shows a model where we are using just one variable x (e.g., a film’s release year) to predict both the mean and the standard deviation. Note that the same variables or completely different variables can be used for these two submodels, which we’ll call the mu and sigma submodels, respectively.
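To see what the log link buys us, here is a quick numeric check with made-up coefficient values: whatever the linear predictor works out to be, exp() maps it to a strictly positive standard deviation.

# hypothetical sigma-submodel coefficients, just for illustration
b_3 <- 0.5
b_4 <- -2
x_new <- c(0, 0.5, 1)
exp(b_3 + b_4 * x_new)  # 1.65 0.61 0.22 -- all positive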

Simulating the Model

I think simulating data is a helpful tool for understanding how models work, so I do that here. First, I simulate 100,000 cases for predictor variables x and z that follow uniform distributions from 0 to 1:

n <- 100000
set.seed(1839)
x <- runif(n)
z <- runif(n)

I then make the mu and sigma submodels. Note that I use exp() around the sigma submodel to reflect the log link function that the R package will be using. I then simulate responses from a normal distribution, given everyone’s mu and sigma. Note that every person has a different predicted mu and sigma, distinguishing it from OLS regression, where each person has a different mu but the sigma is the same for everyone (i.e., constant variance):

mu <- 0.5 + 2 * x
sigma <- exp(1.5 + 3 * z)
y <- rnorm(n, mu, sigma)
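
As a quick side check, here is what plain OLS estimates on these simulated data: a single residual standard deviation that averages over the spread z creates.

# OLS collapses the z-driven heteroscedasticity into one residual SD
sigma(lm(y ~ x + z))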

Now we can use the gamlss package to estimate the model. This package is incredibly flexible in that it allows for many different error distributions, and it lets you specify a submodel for each of the parameters that define these error distributions. For now, we are going to use the normal distribution, or NO() in this package. The first formula we supply is, by default, the mu submodel, and then we specify the sigma.formula afterward. Note that we are predicting each from x and z. After fitting this, we pull out the coefficients from the two different submodels:

m0 <- gamlss(y ~ x + z, sigma.formula = ~ x + z, family = NO())
## GAMLSS-RS iteration 1: Global Deviance = 883354.8 
## GAMLSS-RS iteration 2: Global Deviance = 883350.8 
## GAMLSS-RS iteration 3: Global Deviance = 883350.8
round(coef(m0, "mu"), 1)
## (Intercept)           x           z 
##         0.5         2.0         0.1
round(coef(m0, "sigma"), 1)
## (Intercept)           x           z 
##         1.5         0.0         3.0

Note that we specified above that the intercept for the mu submodel is 0.5 and the coefficient for x is 2.0. We didn’t include z, implicitly saying that its coefficient should be zero. We see those numbers! (We are .1 off for the z coefficient, but that can be expected due to random sampling.) Same goes for the sigma submodel: The intercept we set is 1.5, the coefficient for x is zero, and the coefficient for z is 3. We recover these, showing us that we understand the model: We can simulate fake data and get those parameters back. So let’s apply it to my Letterboxd data.

Back to the Letterboxd Data

So now let’s fit a gamlss model predicting both the mean of rating and its standard deviation from the year in which the film was released:

m1 <- gamlss(rating ~ year, ~ year, family = NO(), data = ratings)  
## GAMLSS-RS iteration 1: Global Deviance = 1162.715 
## GAMLSS-RS iteration 2: Global Deviance = 1162.656 
## GAMLSS-RS iteration 3: Global Deviance = 1162.656
summary(m1)
## ******************************************************************
## Family:  c("NO", "Normal") 
## 
## Call:  gamlss(formula = rating ~ year, sigma.formula = ~year,  
##     family = NO(), data = ratings) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  identity
## Mu Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.529525   4.432572   3.729 0.000221 ***
## year        -0.006620   0.002224  -2.976 0.003101 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  log
## Sigma Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.290e+01  9.239e-01  -13.97   <2e-16 ***
## year         6.485e-03  6.772e-05   95.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  393 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  389 
##                       at cycle:  3 
##  
## Global Deviance:     1162.656 
##             AIC:     1170.656 
##             SBC:     1186.551 
## ******************************************************************

The first set of coefficients is for the mu submodel, or the mean. The second set is for the sigma submodel. We see that year is a significant predictor of each. The older the film release is, the higher the predicted average score, and the lower the predicted variance of that distribution is.
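Because the sigma submodel uses a log link, its coefficients are easiest to read multiplicatively. A quick back-of-the-envelope check on the year estimate:

# each additional year multiplies the predicted standard deviation by exp(0.006485),
# about a 0.65% increase per year; over 80 years that compounds to roughly 68% more spread
exp(0.006485)
exp(0.006485 * 80)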

I think examining predicted values can help us understand this better, so I get the predicted mean and standard deviation for every year between 1934 (the year of the oldest movie I’ve watched) and 2021. Let’s look at the predicted values for 1940, 1970, 2000, and 2020:

plot_data <- data.frame(year = min(ratings$year):max(ratings$year))
# code is a little circuitous because predict.gamlss fails with new columns
pred_mu <- predict(m1, "mu", newdata = plot_data)
pred_sigma <- predict(m1, "sigma", newdata = plot_data, type = "response")
plot_data$mu <- pred_mu
plot_data$sigma <- pred_sigma

plot_data %>% 
  filter(year %in% c(1940, 1970, 2000, 2020)) %>% 
  mutate(across(where(is.double), round, 2))
##   year   mu sigma
## 1 1940 3.69  0.72
## 2 1970 3.49  0.88
## 3 2000 3.29  1.07
## 4 2020 3.16  1.22

For movies coming out in 1970, we’re predicting that they have about a 3.5 score on average, with a standard deviation of 0.88. For 2020, however, we are predicting that they have about a 3.2 average score with a standard deviation of 1.22.
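As a sanity check, we can recover these numbers by hand from the fitted coefficients, remembering to exponentiate the sigma linear predictor to undo the log link:

# predicted mu and sigma for a film released in 2020; these should match the table above
b_mu <- coef(m1, "mu")
b_sigma <- coef(m1, "sigma")
c(mu = unname(b_mu[1] + b_mu[2] * 2020),
  sigma = unname(exp(b_sigma[1] + b_sigma[2] * 2020)))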

This model captures the real story of the data: Scores are more varied as time goes on, due to how I select which movies to watch. The sigma submodel allows us to make inferences about these relationships, which might themselves be the phenomenon of interest to researchers and analysts.

We can also plot the predicted mean as well as a standard deviation above and below this mean. This shows how the spread increases as time goes on:

ggplot(plot_data, aes(x = year, y = mu)) +
  geom_count(
    data = ratings,
    mapping = aes(x = year, y = rating),
    color = "#2D2D2A",
    alpha = .8
  ) +
  geom_line(size = 1.5, color = "#7D82B8") +
  geom_line(aes(y = mu + sigma), linetype = 2, color = "#7D82B8", size = 1.2) +
  geom_line(aes(y = mu - sigma), linetype = 2, color = "#7D82B8", size = 1.2) +
  theme_light() +
  labs(x = "Year", y = "Rating") +
  scale_x_continuous(breaks = seq(1930, 2020, by = 10)) +
  theme(text = element_text(size = 16)) +
  ylim(c(0.5, 5))
[Figure plotting_preds-1.png: predicted mean rating by year, with dashed lines one predicted standard deviation above and below]

Conclusion

We’re often taught to think about modeling and predicting average values. However, a lot of important social phenomena deal with variance. For example, if there is a strict social norm about a topic, we might expect smaller variance in responses than when norms about the topic are weak. Or we might observe that people who highly identify with a group that holds a norm show smaller variance than those who are weakly identified with that group. I urge folks to think about situations where modeling the variance of responses—the spread of the data—might be just as or more important than modeling averages. The gamlss package provides a user-friendly interface to fit and draw inferences from these models.

Exploring the Star Wars "Prequel Renaissance" Using tidymodels and workflowsets

I set out to do two things with this post:

  1. Show how one can use workflowsets in the tidymodels context to compare different pre-processing and model types against one another. Until recently, there wasn’t a unified way to do this in the R ecosystem, so I wanted to give workflowsets a try now that it exists.

  2. Do this while examining an unorthodox case: I want to look at the role of the Star Wars sequel movies (Episodes VII, VIII, and IX) in how the perception of the Star Wars prequel movies (Episodes I, II, and III) has changed.


Introduction

Some have talked about a “prequel renaissance”: the Star Wars prequels, which were not received well at the time, have grown in popularity in recent years. I think most of this is due to my generation aging into adulthood: We are looking back fondly on movies that we grew up with. There’s a cohort of fans who feel nostalgia for them. Nostalgia is a psychological concept I’ve worked a little on in my academic work, with my colleague Matt Baldwin.

When we talk about nostalgia, we think of it as a process of contrast: We think of something in our life now and think how it is different from how things used to be. This can elicit a wide range of emotions, both positive and negative.

My thinking is that the release of the sequels provided some people a salient referent by which to re-judge the prequels. People see movies they do not like now and compare them to how much better things seemed when they were kids. (There’s also, obviously, a problem of the alt-right in popular culture franchises, and “anti-fandom” comes into play as people react in grievance against the increasing diversity in Star Wars stories.)

In any case, my hypothesis is: The release of the sequel trilogy movies led to increased regard for the prequels.

Though it is an imperfect measure, I operationalize this through Rotten Tomatoes user reviews. I wrote a script combining rvest and RSelenium to flip through 500 pages of user reviews for each film. The code to do this can be found in the Appendix below, as well as at my GitHub repository for this post. I expect to see more favorable reviews around the releases of the sequel movies.

I found that this question couldn’t really be answered well in a typical inferential statistics framework. Could I bucket dates and compare reviews using an ANOVA? Fit a time series model? Fit a regression with a discontinuity at the sequel movies’ release dates?

Instead, I chose to fit a few statistical learning models (splines, random forest, k-nearest neighbors), tune them, and compare them using crossvalidation. Rather than p-values or anything else, I am looking for an obvious answer: to see the reviews jump up in favorability when the sequels are released. I use the aforementioned models because they can handle a bit of jaggedness in the data.


Prepping Data

I am assuming some familiarity with the tidymodels framework here—focusing on workflowsets. I would guide folks to Julia Silge’s incredible tutorials to learn more about the tidymodels ecosystem.

After downloading 500 pages’ worth of reviews for each film, I read them in below. I found that I had reviews for all three films from April 9th 2012 onward, so I filter to just these reviews and combine them into one dataset. Since we need the date to be a numeric predictor, I center it at the first day available: April 9th 2012 is coded 0, the next day is coded 1, and so on. I also split these into training and testing data, using the default 75% training/25% testing option in rsample::initial_split().

# prep -------------------------------------------------------------------------
library(tidyverse)
library(tidymodels)
library(workflowsets)
set.seed(1839)

dat <- list.files(pattern = "^ep") %>% 
  map_dfr(~read_csv(.x) %>% mutate(film = str_sub(.x, 1, 3))) %>% 
  transmute(film, date = lubridate::mdy(date), stars) %>% 
  na.omit() %>% # recent days don't parse and are NA
  filter(date >= "2012-04-09") %>% 
  mutate(date = as.numeric(date - lubridate::ymd("2012-04-09"))) %>% 
  initial_split()

dat_train <- training(dat)
dat_test <- testing(dat)

And now, we are set to define the models and data processing steps. This is a very simple case, as I have one outcome (the score, from 0.5 to 5.0 stars), and two predictors (date and film).


Models and Preprocessing

For the splines, I define typical OLS regression as the estimator. For the random forest, I am using the ranger package, and I will tune the number of predictors sampled at each split (a little silly, because here we only have two candidates, but it’s what I would do in a larger dataset, so I’m just being consistent with the practice here) and the minimum number of data points allowed in a terminal node. Lastly, we have a k-nearest neighbors model using the kknn package, tuning the number of neighbors, the distance power (Manhattan being 1, Euclidean being 2), and the type of kernel function for weighting distances.

# set models -------------------------------------------------------------------
lm_mod <- linear_reg() %>% 
  set_engine("lm")

rf_mod <- rand_forest(
  mode = "regression",
  trees = 1000,
  mtry = tune(),
  min_n = tune()
) %>% 
  set_engine("ranger")

nn_mod <- nearest_neighbor(
  mode = "regression", 
  neighbors = tune(),
  dist_power = tune(), 
  weight_func = tune()
) %>% 
  set_engine("kknn")

Then, I turn to pre-processing steps—what are called “recipes” in the tidymodels framework. I start with a base recipe, which simply predicts stars from date and film and then dummy codes the categorical film variable. Then, I make a recipe for the regression model with a natural spline on date. I allow the degrees of freedom to be tuned in crossvalidation, and I add a step allowing date and film to interact with one another. Lastly, I make a z-scored recipe.

# make recipes -----------------------------------------------------------------
rec_base <- recipe(stars ~ date + film, dat_train) %>% 
  step_dummy(film)

rec_ns <- rec_base %>% 
  step_ns(date, deg_free = tune()) %>% 
  step_interact(~ starts_with("date"):starts_with("film"))

rec_zs <- rec_base %>% 
  step_normalize(all_predictors())

And here’s where the cool part comes in: I can now define a “workflow set.” I specify the recipes with preproc and the models with models. One can use cross = TRUE to look at all combinations of recipes and models. Here, I am setting that to FALSE so that the natural spline recipe is used for the linear regression, the base recipe is used for the random forest, and the standardized recipe is used for the nearest neighbors.

# workflowset ------------------------------------------------------------------
(wfs <- workflow_set(
  preproc = list(spline = rec_ns, base = rec_base, zscored = rec_zs),
  models = list(lm_mod, rf_mod, nn_mod),
  cross = FALSE
))
## # A workflow set/tibble: 3 x 4
##   wflow_id                 info                 option    result    
##   <chr>                    <list>               <list>    <list>    
## 1 spline_linear_reg        <tibble[,4] [1 × 4]> <opts[0]> <list [0]>
## 2 base_rand_forest         <tibble[,4] [1 × 4]> <opts[0]> <list [0]>
## 3 zscored_nearest_neighbor <tibble[,4] [1 × 4]> <opts[0]> <list [0]>

And we see we have our three different workflows set up in one unified place. In practice, one could define a large number of workflows and recipe/model combinations to test against one another.
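For example, setting cross = TRUE would pair every recipe with every model, giving nine workflows to tune in the same object (a sketch reusing the recipes and models defined above):

# all 3 recipes x 3 models = 9 workflows
workflow_set(
  preproc = list(spline = rec_ns, base = rec_base, zscored = rec_zs),
  models = list(lm = lm_mod, rf = rf_mod, knn = nn_mod),
  cross = TRUE
)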


Crossvalidation

I then turn to tuning the aforementioned hyperparameters using 10-fold crossvalidation. After I make the folds, I simply pipe my workflow set to workflow_map(), a purrr::map()-esque function for workflow sets. I tell it to run tune::tune_grid() on each workflow, using a random grid of 30 different parameter combinations for each of the three workflows. The tidymodels packages set reasonable default ranges for the hyperparameter space, which is why the code here looks so bare in defining what those grids cover.

# cross validation -------------------------------------------------------------
folds <- vfold_cv(dat_train, v = 10)

cv_res <- wfs %>%
  workflow_map("tune_grid", seed = 1839, grid = 30, resamples = folds)

Compare Models

We can then look at the results. Which workflows and hyperparameter configurations rank highest on RMSE?

# compare models and params ----------------------------------------------------
cv_res %>% 
  rank_results() %>% 
  filter(.metric == "rmse")
## # A tibble: 74 x 9
##    wflow_id    .config     .metric  mean std_err     n preprocessor model   rank
##    <chr>       <chr>       <chr>   <dbl>   <dbl> <int> <chr>        <chr>  <int>
##  1 base_rand_… Preprocess… rmse     1.19 0.00689    10 recipe       rand_…     1
##  2 base_rand_… Preprocess… rmse     1.19 0.00697    10 recipe       rand_…     2
##  3 base_rand_… Preprocess… rmse     1.19 0.00691    10 recipe       rand_…     3
##  4 base_rand_… Preprocess… rmse     1.19 0.00686    10 recipe       rand_…     4
##  5 base_rand_… Preprocess… rmse     1.19 0.00689    10 recipe       rand_…     5
##  6 base_rand_… Preprocess… rmse     1.19 0.00687    10 recipe       rand_…     6
##  7 base_rand_… Preprocess… rmse     1.19 0.00694    10 recipe       rand_…     7
##  8 base_rand_… Preprocess… rmse     1.19 0.00692    10 recipe       rand_…     8
##  9 base_rand_… Preprocess… rmse     1.19 0.00696    10 recipe       rand_…     9
## 10 base_rand_… Preprocess… rmse     1.19 0.00694    10 recipe       rand_…    10
## # … with 64 more rows

All of them are random forests, and all of them have the same average root-mean-square error. I can also easily call a plot to compare all possible combinations of models explored in the grid search:

autoplot(cv_res, rank_metric = "rmse", metric = "rmse")
[Figure comp-1.png: workflow comparison plot, configurations ranked by RMSE]

The random forests tended to do the best, with the spline-plus-interaction linear regression doing about as well. It was k-nearest neighbors that really lagged behind.


Select Best Model

To actually examine my hypothesis, I’ll pull the best hyperparameter values for the random forest, finalize my workflow, fit it on the training data, and then predict on the 25% testing holdout data.

# predict ----------------------------------------------------------------------
(best_results <- cv_res %>% 
  pull_workflow_set_result("base_rand_forest") %>% 
  select_best(metric = "rmse"))
## # A tibble: 1 x 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     2     5 Preprocessor1_Model12
final_fit <- cv_res %>% 
  pull_workflow("base_rand_forest") %>% 
  finalize_workflow(best_results) %>% 
  fit(dat_train)

dat_test <- bind_cols(dat_test, predict(final_fit, dat_test))

Results

So, what does it look like? The three vertical lines below are for the three sequel movies: The Force Awakens, The Last Jedi, and The Rise of Skywalker, respectively. Remember that I’ve scored date so that 0 is April 9th 2012 (the first review I have for all three films).

# plotting ---------------------------------------------------------------------
tfa <- as.numeric(lubridate::ymd("2015-12-18") - lubridate::ymd("2012-04-09"))
tlj <- as.numeric(lubridate::ymd("2017-12-15") - lubridate::ymd("2012-04-09"))
tros <- as.numeric(lubridate::ymd("2019-12-20") - lubridate::ymd("2012-04-09"))

ggplot(dat_test, aes(x = date, y = stars)) +
  facet_wrap(~ film) +
  geom_vline(aes(xintercept = tfa)) +
  geom_vline(aes(xintercept = tlj)) +
  geom_vline(aes(xintercept = tros)) +
  geom_count(alpha = .3) +
  geom_line(aes(x = date, y = .pred), color = "blue") +
  theme(legend.position = "top")
[Figure plotting-1.png: random forest predictions over user review scores by film, with sequel release dates marked]

For Episode I: The Phantom Menace and Episode II: Attack of the Clones, we see a huge spike up at the release of Rian Johnson’s expectation-subverting masterpiece, Episode VIII: The Last Jedi. We see this a bit for Episode III: Revenge of the Sith, as well, but less so—this movie was already probably the best-regarded film from the prequel era.

Some might be wondering: The spline on date interacting with the film dummy variables performed basically the same, so what does that model look like?

# spline -----------------------------------------------------------------------
(best_results <- cv_res %>% 
  pull_workflow_set_result("spline_linear_reg") %>% 
  select_best(metric = "rmse"))
## # A tibble: 1 x 2
##   deg_free .config              
##      <int> <chr>                
## 1        8 Preprocessor12_Model1
final_fit <- cv_res %>% 
  pull_workflow("spline_linear_reg") %>% 
  finalize_workflow(best_results) %>% 
  fit(dat_train)

dat_test$pred_spline <- predict(final_fit, dat_test)$.pred

ggplot(dat_test, aes(x = date, y = stars)) +
  facet_wrap(~ film) +
  geom_vline(aes(xintercept = tfa)) +
  geom_vline(aes(xintercept = tlj)) +
  geom_vline(aes(xintercept = tros)) +
  geom_count(alpha = .3) +
  geom_line(aes(x = date, y = pred_spline), color = "blue") +
  theme(legend.position = "top")
[Figure spline-1.png: spline model predictions over user review scores by film, with sequel release dates marked]

We see a similar trend, albeit smoother. I would probably use this model if I had to do prediction, given the noisiness of the random forest prediction (I would also round predictions to the nearest half star, which would help stabilize things). But that isn’t really what I was trying to do here. I was looking for some obvious evidence that people started judging the prequels better once the sequels were released.
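If I did want to use the spline model for prediction, that rounding step might look something like this:

# round predictions to the nearest half star, the smallest unit a review can take
dat_test$pred_spline_rounded <- round(dat_test$pred_spline * 2) / 2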

And the data here pass the best statistical test of all: the interocular trauma test. The data are obvious in that positive reviews spiked after the release of The Last Jedi. To my knowledge, Joe Berkson coined the name of this test, and it first shows up in 1963 in Edwards, Lindman, and Savage’s “Bayesian Statistical Inference for Psychological Research” in Psychological Review, 70(3):

“The preceding paragraph illustrates a procedure that statisticians of all schools find important but elusive. It has been called the interocular traumatic test; you know what the data mean when the conclusion hits you between the eyes.”

The underlying mechanism is less clear. My first thought was that nostalgia is evoked when people contrast the present with the past. But so-called “anti-fandom” could also be evoked by that contrast, or it could be run-of-the-mill fanboy anger at comparing the movies. I do think contrast plays a big role here, though, as we see the biggest effect after The Last Jedi, which brilliantly defied a lot of expectations about where the saga could go. Let’s hope we finally do see Rian Johnson’s own Star Wars trilogy someday.

Appendix

library(rvest)
library(RSelenium)
library(tidyverse)

# global params ----------------------------------------------------------------
# url <- "https://www.rottentomatoes.com/m/star_wars_episode_i_the_phantom_menace/reviews?type=user"
# url <- "https://www.rottentomatoes.com/m/star_wars_episode_ii_attack_of_the_clones/reviews?type=user"
# url <- "https://www.rottentomatoes.com/m/star_wars_episode_iii_revenge_of_the_sith/reviews?type=user"
pages_to_scrape <- 500

# funs -------------------------------------------------------------------------
get_stars <- function(the_html) {
  stars_text <- the_html %>% 
    html_nodes(".audience-reviews__score") %>% 
    as.character()
  
  stars <- map_dbl(stars_text, function(x) {
    score <- 0
    score <- score + str_count(x, "star-display__filled")
    score <- score + (str_count(x, "star-display__half") / 2)
  })
  
  return(stars)
}

get_dates <- function(the_html) {
  date_text <- the_html %>% 
    html_nodes(".audience-reviews__duration") %>% 
    as.character()
  
  dates <- map_chr(date_text, function(x) {
    str_split(x, ">") %>% 
      `[[`(1) %>% 
      `[`(2) %>% 
      str_remove("</span")
  })
  
  return(dates)
}

get_texts <- function(the_html) {
  text_text <- the_html %>% 
    html_nodes(".js-clamp") %>% 
    as.character() %>% 
    `[`(seq(1, length(.), by = 2))
  
  texts <- map_chr(text_text, function(x) {
    str_split(x, ">") %>% 
      `[[`(1) %>% 
      `[`(2) %>% 
      str_remove("[<].*")
  })
  
  return(texts)
}

get_reviews <- function(the_html) {

  reviews <- tibble(
    date = get_dates(the_html),
    stars = get_stars(the_html),
    text = get_texts(the_html)
  )
  
  return(reviews)
}

# scrape -----------------------------------------------------------------------
res <- tibble()
driver <- rsDriver(browser = "firefox", port = 1839L)
driver$client$navigate(url)

for (i in seq_len(pages_to_scrape)) {
  cat("page", i, "\n")
  
  res <- res %>% 
    bind_rows(get_reviews(read_html(driver$client$getPageSource()[[1]])))
  
  more <- driver$client$findElement(
    "css selector", 
    ".prev-next-paging__button-right"
  )
  
  more$clickElement()
  Sys.sleep(2)
}

driver$client$closeall()

# write out --------------------------------------------------------------------
# res %>% 
#   unique() %>% 
#   write_csv("ep1.csv")

# res %>%
#   unique() %>%
#   write_csv("ep2.csv")

# res %>%
#   unique() %>%
#   write_csv("ep3.csv")