Confidence Interval Coverage in Weighted Surveys: A Simulation Study

My colleague Isaac Lello-Smith and I wrote a paper on how to obtain valid confidence intervals in R for weighted surveys. You can read the .pdf here and check out the code at GitHub.

There are many schools of thought on methods for estimating standard errors from weighted survey data, and many books don’t tell you why a given method may or may not be valid. So, we decided to simulate a situation we often see in our work and test what worked best. A few highlights:

  • Be careful with “weights” arguments in R! Read the documentation carefully. There are many types of weights in statistics, and your standard errors, p-values, and confidence intervals can be wildly wrong if you supply the wrong type of weight (a short sketch follows this list).

  • Use bootstrapping to calculate standard errors. We found that even estimation methods that were made for survey weights underestimated standard errors.

  • Be skeptical of confidence intervals in weighted survey contexts. We found that standard errors were underestimated when simulating real-world imperfections with the data, such as measurement error and target error. 95% confidence intervals were only truly 95% in best-case scenarios with low error.
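A minimal sketch of the first two points in R (this is not the simulation code from the paper; it assumes a data frame dat with an outcome y and a sampling weight wt, and it uses the survey package):

library(survey)

# lm() treats its weights argument as precision weights, not sampling weights,
# so its standard errors can be misleading for weighted survey data
summary(lm(y ~ 1, data = dat, weights = wt))

# declare the sampling design explicitly instead
des <- svydesign(ids = ~1, weights = ~wt, data = dat)
svymean(~y, des)

# bootstrap (replicate-weight) standard errors
boot_des <- as.svrepdesign(des, type = "bootstrap", replicates = 500)
svymean(~y, boot_des)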

Using Word Similarity Graphs to Explore Themes in Text: A Tutorial

One of the first questions people ask about text data is, “What is the text about?” This search for topics or themes involves reducing the complexity of the text down to a handful of meaningful categories. I have found that a lot of common approaches for this are not as useful as advertised. In this post, I’m going to explain and demonstrate how I use word similarity graphs to quickly explore and make sense of topics in text data.


Other Approaches

I have used a number of platforms that advertise “using AI” to examine themes in text data. I am generally underwhelmed by these platforms: the themes rarely make much sense; the underlying algorithm is not explained, so it is hard to figure out why they don’t make sense; and the interfaces are rigid, not letting the user tune hyperparameters or change how the text is vectorized.

I have also experimented with a number of common unsupervised learning techniques. I appreciate the idea of Latent Dirichlet allocation (Blei, Ng, & Jordan, 2003), because it is a probabilistic model that is explicit about its parameters, how it is estimated, the assumptions on which it relies, etc. However, I have rarely seen results that make much sense—other than in the toy examples that text mining books use. Selecting an appropriate k has been difficult, too, as the four metrics programmed into the ldatuning package (Nikita, 2016) tend to disagree with one another, with some frequently recommending the smallest k considered and others the largest.
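To illustrate that check concretely, here is a sketch using ldatuning (it assumes a document-term matrix dtm has already been built, e.g., with tidytext::cast_dtm(), and that the topicmodels package is installed; the range of k and the seed are arbitrary choices):

library(ldatuning)

k_search <- FindTopicsNumber(
  dtm,
  topics = seq(2, 30, by = 2),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 1839)
)

# the four metrics often point toward very different values of k
FindTopicsNumber_plot(k_search)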

I also looked at vectorizing the documents, generally using tf-idf (Silge & Robinson, 2017, Ch. 3), and applying my go-to clustering algorithms, like k-NN or DBSCAN (Ester, Kriegel, Sander, & Xu, 1996). I generally wasn’t satisfied with the results here, either. For example, difficulties might arise due to Zipf’s law (Silge & Robinson, 2017, Ch. 3), as the p-dimensional space we are projecting documents into is sparse, which causes problems for k-NN (e.g., Grcar, Fortuna, Mladenic, & Grobelnik, 2006).

There are undoubtedly use-cases for each of these approaches—and text mining is not my expertise—but my experiences led me to use word similarity graphs for exploring topics.


Word Similarity Graphs

I have found word similarity graphs (or networks) to be useful. There are three primary steps involved:

  1. Calculate the similarities between words. This makes the word—not the document—the unit of analysis. Instead of looking for documents close to one another in p-dimensional space, we will be looking for groups of words that co-occur often. I generally focus on words that appear in at least J% of documents—but less than L% of documents—and I flatten any similarity scores between two words that fall below the Ith percentile to 0.

  2. Format these similarity scores into a symmetric matrix, where the diagonal contains 0s and the off-diagonal cells are similarities between words. Use this as an adjacency matrix to plot a network graph.

  3. Cluster nodes in this graph using a community detection algorithm. I use the Walktrap algorithm, which is based on random walks that take T steps in the network.

I like treating words as the unit of analysis; it makes sense to me to think of topics as being made up of words often used in conjunction with one another. This is also a naturally visual method, and plotting the network helps me understand topics better than columns of weights corresponding to latent topics. However, this approach is exploratory. I generally choose the values of the J, L, I, and T hyperparameters based on subjective decisions—not by minimizing a loss function; nonetheless, it helps me understand what is being talked about in text.

I will show how to use a few simple functions I wrote as I describe this technique in more detail. The full code for each of the functions is found at the end of this post. The data I am using are headlines from news stories and blogs from the last six months that mention “Star Wars” (if the sequel trilogy makes you angry—you’re wrong, but I hope you still read the rest of this post). The data and all code necessary to reproduce this post can be found at my GitHub.


Preparation

The functions rely on the tidyverse being loaded into the current session, and they require the lsa and igraph packages to be installed. Before running any of the similarity or clustering functions, I run:

library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
data(stop_words)
dat <- read_csv("starwars.csv") %>% 
  transmute(
    id = 1:nrow(.), # headline identification number for reference
    text = gsub("[-/]", " ", title),
    text = tolower(gsub("[^A-Za-z ]", "", text))
  ) %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(word != "cnet") # corpus-specific stop word


Calculating Similarity

Many metrics exist to quantify how similar two units are to one another; distance measures can also be inverted to measure similarity (see Cha, 2007; Choi, Cha, & Tappert, 2010; Lesot, Rifqi, & Benhadda, 2009 for reviews). I started with an assumption that words belonging to the same topic will be used in the same document (which can be a story, chapter, song, sentence, headline, and so on), so I decided the foundation of word similarities here should be co-occurrences. If the words “Darth” and “Vader” appear together in 161 headlines (see below), then their similarity score would be 161.

dat %>% 
  group_by(id) %>% 
  summarise(vader = all(c("darth", "vader") %in% word)) %>% 
  with(sum(vader))
## [1] 161

A problem arises, however, in considering how frequently each word is used. The words “Princess” and “Leia” occur 31 times together, but “Princess” is used far less frequently than “Darth” in general (see below). Does that mean the words “Princess” and “Leia” are less similar to one another than “Darth” and “Vader”? Not necessarily.

dat %>% 
  group_by(id) %>% 
  summarise(leia = all(c("princess", "leia") %in% word)) %>% 
  with(sum(leia))
## [1] 31
dat %>% 
  group_by(id) %>% 
  summarise(
    darth = "darth" %in% word, 
    vader = "vader" %in% word, 
    princess = "princess" %in% word, 
    leia = "leia" %in% word
  ) %>% 
  select(-id) %>% 
  colSums()
##    darth    vader princess     leia 
##      236      232       39      116

We can overcome this difference in base rates by normalizing (standardizing) the co-occurrences. I use the cosine similarity for this, which is identical to the Ochiai coefficient in this situation (Zhou & Leydesdorff, 2016). The cosine similarity gets its name from being the cosine of the angle located between two vectors. In our case, each vector is a word, and the length of these vectors is the number of documents. If the word appears in a document, it is scored as “1”; if it does not, it is “0.” For simplicity’s sake, let’s imagine we have two documents: “Darth” appears in the first, but not the second; “Vader” appears in both. Plotted on two-dimensional space, the vectors look like:

data.frame(word = c("darth", "vader"), d1 = 1, d2 = 0:1) %>% 
  ggplot(aes(x = d1, y = d2)) +
  geom_point() +
  coord_cartesian(xlim = 0:1, ylim = 0:1) +
  geom_segment(aes(x = 0, y = 0, xend = d1, yend = d2)) +
  theme_minimal() +
  theme(text = element_text(size = 18))

This is a 45-degree angle. We can verify that the cosine of 45 degrees is the same as the cosine similarity between these two vectors:

cos(45 * pi / 180) # this function takes radians, not degrees
## [1] 0.7071068
lsa::cosine(c(1, 1), c(0, 1))
##           [,1]
## [1,] 0.7071068

The binary (1 or 0) scoring means that words are never projected into negative space—no numbers below 0 are used. This means that negative similarities cannot occur. In the two-dimensional example above, the largest angle possible is 90 degrees, which has a cosine of 0; the smallest angle possible is 0 degrees, which has a cosine of 1. Similarities are thus normalized to a range from 0 (words are never used together) to 1 (words are always used together).
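A quick sketch checking those bounds with the same lsa::cosine() function:

# two words that never appear together: orthogonal vectors, cosine of 0
lsa::cosine(c(1, 0), c(0, 1))
##      [,1]
## [1,]    0

# two words that always appear together: identical vectors, cosine of 1
lsa::cosine(c(1, 1), c(1, 1))
##      [,1]
## [1,]    1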

I wrote a function that takes a tokenized data frame—where one column is named word and another is named id—and returns a symmetric cosine similarity matrix. There are three other arguments. First, what proportion of documents must a word appear in to be considered? This makes sure that words only used in one or two documents are not included. I generally tune this so that it takes the top 85 to 120 words. Second, what proportion of documents is too many to be considered? In the present example, the words “Star” and “Wars” appear in every headline, so they would not tell us differentiating information about topics. I usually set this to be about .80. Third, how large must the similarity be to be included in the word similarity graph? I define this as a percentile. If it is set at .50, for example, then the function will shrink the similarities that are below the median to 0. This is to cut down on spurious or inconsequential relationships in the graph. I generally set this to be somewhere between .65 and .90. There is a lot of debate in the literature about how to filter these graphs (e.g., Christensen, Kenett, Aste, Silvia, & Kwapil, 2018), and I still need to experiment with these different filtering methods to come to a more principled approach than the arbitrary one I currently use.

Using the function shown at the end of this post, I compute the cosine similarity matrix using the following code:

cos_mat <- cosine_matrix(dat, lower = .01, upper = .80, filt = .80)

Since there are 8,570 documents (headlines) in this corpus, only words that appear in more than 85.7 documents and fewer than 6,856 are used in this graph. I only graph the similarities that are in the uppermost quintile (i.e., above the 80th percentile). This leaves 83 words:

dim(cos_mat)
## [1] 83 83


Making the Graph

A background on network theory and analysis is outside the scope of this post—but see Baggio, Scott, and Cooper (2010); Borgatti and Halgin (2011); Borgatti, Mehra, Brass, and Labianca (2009); and Telesford, Simpson, Burdette, Hayasaka, and Laurienti (2011) for introductions. We can build the network from our similarity matrix using igraph’s graph_from_adjacency_matrix() function and then plot it using the ggraph package, which I like because it employs the same grammar as ggplot2. A random seed is set so that the layout of the graph is reproducible.

g <- graph_from_adjacency_matrix(cos_mat, mode = "undirected", weighted = TRUE)

set.seed(1839)
ggraph(g, layout = "nicely") +
  geom_edge_link(aes(alpha = weight), show.legend = FALSE) + 
  geom_node_label(aes(label = name)) +
  theme_void()

Words used in the same headlines appear closer to one another and have darker lines connecting them. This is obvious when looking at entities described by two words: J.J. Abrams, Episode IX, Blu-Ray, Darth Vader, Millennium Falcon. Topics should already be appearing to readers familiar with Star Wars, but we can make this clearer by finding words that cluster together.


Clustering the Graph

“Community detection” in social networks refers to finding groups of nodes (in our case, words are the nodes, but they can represent people, proteins, webpages, and so on) that have edges (the lines between the nodes) connecting to one another more than to other nodes in the network (Fortunato & Hric, 2016).

For example, imagine looking at emails in an organization. Each node is a person, and the number of emails sent between two people is the edge between them. If this organization has a marketing department and an accounting department, we would expect people inside of the same department to email one another more frequently than they email those outside of the department. If this were the case, a community detection algorithm applied to this email network would cluster people within the same department together. The case with words is analogous: We are looking to find “communities” of words used together more often than with other words, creating topics or themes.

The algorithm I use here is called Walktrap (Pons & Latapy, 2006; also this YouTube video). It uses random walks to operationalize distance between nodes. A random walk starts at a node and follows edges from node to node; each “step” is a move along an edge to a neighboring node. The direction in which a step is taken is based on the strength of the edge between two nodes: The higher the strength (weight), the higher the probability that a step is taken to that node.

The Walktrap algorithm is based on the idea that two nodes within the same community will “visit” all other nodes in the graph in similar ways via random walks. For example, we want to know the distance between “battlefront” and “games.” Do these words visit, for example, “lightsaber” in the same way? To get at this, the algorithm finds the probability that a random walk starting at “battlefront” leads to “lightsaber” in T steps. Then, it finds the probability that a random walk starting at “games” leads to “lightsaber” in T steps. The algorithm will subtract these probabilities to get the distance: If these probabilities are similar, the distance between “battlefront” and “games” will be small. It doesn’t just look at walks to “lightsaber,” though: The Walktrap will calculate this distance based on random walks to all other nodes in the graph.
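To make the random-walk idea more concrete, here is a rough sketch (this is not the actual Walktrap distance, which also normalizes by each node’s degree, but it shows the intuition). It indexes a few words that survived the filtering in this corpus:

# row-normalize the weighted adjacency matrix into one-step transition
# probabilities, then take T = 4 steps (T = 4 is an arbitrary choice here)
A <- as_adjacency_matrix(g, attr = "weight", sparse = FALSE)
P <- sweep(A, 1, rowSums(A), "/")
P4 <- P %*% P %*% P %*% P

# if "battlefront" and "games" belong to the same community, the places their
# four-step walks tend to land should look similar
round(P4["battlefront", c("games", "lightsaber", "jedi")], 3)
round(P4["games", c("games", "lightsaber", "jedi")], 3)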

This distance is calculated for each pair of nodes in the graph, and then hierarchical clustering is applied to these distances. I wrote a wrapper—seen at the end of this post—for the igraph package’s Walktrap function to run this algorithm. It returns a list of objects useful for finding topics in word similarity graphs. You can run the function by:

topics <- walktrap_topics(g)

This returns a list of two objects. One is a dendrogram describing the hierarchical clustering process on the distances calculated from the random walks:

par(cex = .6)
plot(topics$dendrogram)


Examining Themes

The second object in the list returned by walktrap_topics() is a data.frame of two columns: word, which contains each word in the graph, and group, which contains the community to which each node belongs. I do a little bit of re-arranging of it to present the topics here:

topics$membership %>% 
  group_by(group) %>% 
  summarise(words = paste(word, collapse = ", "))
## # A tibble: 8 x 2
##   group words                                                              
##   <dbl> <chr>                                                              
## 1     1 abrams, carrie, cast, episode, fans, fisher, footage, hamill, ix, …
## 2     2 awakens, black, darth, fan, figure, film, force, jedi, lightsaber,…
## 3     3 disneyland, edge, falcon, galaxys, john, millennium                
## 4     4 animated, clone, comic, coming, empire, galaxy, marvel, movie, new…
## 5     5 battlefront, ea, game, games, ii, world                            
## 6     6 action, details, disney, disneys, jon, live, mandalorian, movies, …
## 7     7 advent, amazon, bb, free, lego, shipping, store, toys, wing        
## 8     8 blu, han, ray, solo, story

We could label the 8 themes in this corpus:

  1. Episode IX news

  2. Darth Vader: fan film and Black Series action figure

  3. Star Wars theme park

  4. Comics and animated series

  5. Battlefront II video game

  6. Live-action television series (Mandalorian, Cassian Andor)

  7. Toy shopping

  8. Solo: A Star Wars Story Blu-Ray release

We can then assign the vector of group labels as a feature called “cluster” to the nodes of our graph object, g, and plot the same network graph above, but with the themes showing up as different colors:

V(g)$cluster <- arrange(topics$membership, word)$group

set.seed(1839)
ggraph(g, layout = "nicely") +
  geom_edge_link(aes(alpha = weight), show.legend = FALSE) + 
  geom_node_label(
    aes(label = name, color = factor(cluster)), 
    show.legend = FALSE
  ) +
  theme_void()


Conclusion

Word similarity graphs have been useful for me as a first step to exploring topics in text data. One can then follow up on what these networks show by coding documents for given topics based on the inclusion or exclusion of words in the text (e.g., any document that contains the words “live” and “action” refers to the television series). This is a flexible approach, as well. Instead of cosine similarity, one could use a different metric; instead of the Walktrap, one could use a different community detection algorithm (the igraph package alone offers quite a few; see this StackOverflow answer).
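As a rough sketch of that kind of follow-up coding, reusing the tokenized dat from above (the rule and the column name are just illustrations):

# flag any headline containing both "live" and "action" as the TV-series topic
dat %>% 
  group_by(id) %>% 
  summarise(live_action = all(c("live", "action") %in% word)) %>% 
  filter(live_action)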


R Functions

cosine_matrix <- function(tokenized_data, lower = 0, upper = 1, filt = 0) {
  
  if (!all(c("word", "id") %in% names(tokenized_data))) {
    stop("tokenized_data must contain variables named word and id")
  }
  
  if (lower < 0 | lower > 1 | upper < 0 | upper > 1 | filt < 0 | filt > 1) {
    stop("lower, upper, and filt must be 0 <= x <= 1")
  }
  
  docs <- length(unique(tokenized_data$id))
  
  out <- tokenized_data %>%
    count(id, word) %>%
    group_by(word) %>%
    mutate(n_docs = n()) %>% # number of documents each word appears in
    ungroup() %>%
    filter(n_docs < (docs * upper) & n_docs > (docs * lower)) %>%
    select(-n_docs) %>%
    mutate(n = 1) %>% # score word-document pairs as binary: present or not
    spread(word, n, fill = 0) %>% # document-by-word incidence matrix
    select(-id) %>%
    as.matrix() %>%
    lsa::cosine()
  
  # shrink similarities below the filt percentile (and the diagonal) to 0,
  # then drop any words left with no connections
  filt <- quantile(out[lower.tri(out)], filt)
  out[out < filt] <- diag(out) <- 0
  out <- out[rowSums(out) != 0, colSums(out) != 0]
  
  return(out)
}

walktrap_topics <- function(g, ...) {
  # run the Walktrap community detection once and reuse the result
  wt <- igraph::cluster_walktrap(g, ...)
  
  membership <- wt %>% 
      igraph::membership() %>% 
      as.matrix() %>% 
      as.data.frame() %>% 
      rownames_to_column("word") %>% 
      arrange(V1) %>% 
      rename(group = V1)
  
  dendrogram <- stats::as.dendrogram(wt)
  
  return(list(membership = membership, dendrogram = dendrogram))
}


References

Baggio, Scott, & Cooper (2010). Network science: A review focused on tourism. Annals of Tourism Research. Retrieved from: https://arxiv.org/pdf/1002.4766.pdf

Blei, Ng, & Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research. Retrieved from: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

Borgatti & Halgin (2011). On network theory. Organization Science. doi: 10.1287/orsc.1100.0641

Borgatti, Mehra, Brass, & Labianca (2009). Network analysis in the social sciences. Science. doi: 10.1126/science.1165821

Cha (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences. Retrieved from: http://users.uom.gr/~kouiruki/sung.pdf

Choi, Cha, & Tappert (2010). A survey of binary similarity and distance measures. Systemics, Cybernetics and Informatics. Retrieved from: http://www.iiisci.org/journal/CV$/sci/pdfs/GS315JG.pdf

Christensen, Kenett, Aste, Silvia, & Kwapil (2018). Network structure of the Wisconsin Schizotypy Scales-Short Forms: Examining psychometric network filtering approaches. Behavior Research Methods. doi: 10.3758/s13428-018-1032-9

Ester, Kriegel, Sander, & Xu (1996). A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Retrieved from: https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf

Fortunato & Hric (2016). Community detection in networks: A user guide. Physics Reports. Retrieved from: https://arxiv.org/pdf/1608.00163.pdf

Grcar, Fortuna, Mladenic, & Grobelnik (2006). Data sparsity issues in the collaborative filtering framework. International Workshop on Knowledge Discovery on the Web: Advances in Web Mining and Web Usage Analysis. doi: 10.1007/11891321_4

Lesot, Rifqi, & Benhadda (2009). Similarity measures for binary and numerical data: A survey. International Journal of Knowledge Engineering and Soft Data Paradigms. Retrieved from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Nikita (2016). ldatuning: Tuning of the Latent Dirichlet allocation models parameters. R Package version 0.2.0

Pons & Latapy (2006). Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications. Retrieved from: http://jgaa.info/accepted/2006/PonsLatapy2006.10.2.pdf

Silge & Robinson (2017). Text mining with R: A tidy approach. Retrieved from: https://www.tidytextmining.com/

Telesford, Simpson, Burdette, Hayasaka, & Laurienti (2011). The brain as a complex system: Using network science as a tool for understanding the brain. Brain Connectivity. doi: 10.1089/brain.2011.0055

Zhou & Leydesdorff (2016). The normalization of occurrence and co-occurrence matrices in bibliometrics using cosine similarities and Ochiai coefficients. Journal of the Association for Information Science and Technology. Retrieved from: https://arxiv.org/pdf/1503.08944.pdf

How Do We Judge a Probabilistic Model? Or, How Did FiveThirtyEight Do in Forecasting the 2018 Midterms?

Nate Silver and the FiveThirtyEight folks talked quite a bit this election cycle about how their model was probabilistic. What this means is that they don’t just offer a single prediction (e.g., “Democrats will pick up 10 seats”); instead, their prediction is a distribution of possible outcomes (e.g., “We expect the result to be somewhere between Republicans picking up 5 seats to Democrats picking up 35”). One of the most common ways of thinking probabilistically is thinking in the long-term: “If this same election were to happen a large number of times, we would expect the Democrat to win 70% of the time.”

It is tricky to judge one of these models, however, because we only see one realization of the estimated distribution of outcomes. So if something with only a 5% predicted chance of happening actually does happen, was it because (a) the model was wrong, or (b) the model was right and we just observed a very rare event?

We need to judge the success of probabilistic models in a probabilistic way. This means that we need a large sample of predictions and actual results; unfortunately, this can only be done after the fact. However, it can help us think more about what works and whom to trust in the future. Since FiveThirtyEight predicted 366 elections in the 2018 midterms, all using the same underlying model, we have the sample size necessary to judge the model probabilistically.

How do we judge a model probabilistically? The model gives us both (a) predicted winners and (b) predicted probabilities that these predicted winners actually do win. The model can be seen as “accurate” to the extent that the predicted probabilities correspond to observed probabilities of winning. For example, we can say that the model is “probabilistically accurate” if predicted winners with a 65% chance of winning actually win 65% of the time, those with a 75% chance of winning actually win 75% of the time, etc.

I want to step away from asking if the model was black-and-white “right or wrong,” because—strictly speaking—no model is “right” or “correct” in the sense of mapping onto ground truth perfectly. The FiveThirtyEight model—just like any statistical or machine learning model—will make certain assumptions that are incorrect, not measure important variables, be subject to error, etc. We should instead ask if the model is useful. In my opinion, a model that makes accurate predictions while being honest with us about the uncertainty in its predictions is useful.

Was the FiveThirtyEight 2018 midterm model probabilistically accurate? Before we get to the model, I want to show what the results would look like in a perfect situation. We can then compare the FiveThirtyEight model to this “perfect model.”

I am going to simulate 100,000 races. Let’s imagine that we make a model, and this model gives us winners and predicted probabilities for each winner in every single race. What we want to do is model this predicted probability exactly. In my simulation below, the probabilities that actually generate the winner are the same exact probabilities that we pretend we “modeled.” This is called x below. These probabilities range from .501 to .999.

We then simulate the races with those probabilities. The results are called y below. The result is 1 if the predicted winner actually won, while the result is 0 if they lost.

set.seed(1839)
n_races <- 100000
x <- runif(n_races, .501, .999)
y <- rbinom(n_races, 1, x)
sim_dat <- data.frame(x, y)
head(sim_dat)
##           x y
## 1 0.9215105 1
## 2 0.8136280 1
## 3 0.6239117 1
## 4 0.8290906 1
## 5 0.8180035 0
## 6 0.7973508 1

We can now plot the predicted probabilities on the x-axis and whether or not the prediction was correct on the y-axis. Points will only appear at 0 or 1 on the y-axis; we would generally use a logistic regression for this, but that imposes a slight curve on the line. We know that it should be perfectly straight, so I fit both an ordinary least squares line (purple) and a logistic regression line (gold) below.

If the FiveThirtyEight model were completely accurate in its win probability prediction, we would expect the relationship between predicted win probability (x) and observed win probability (y) to look like:
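A sketch of the code that would draw such a plot from the simulated data (the colors and axis labels here are my own choices):

library(ggplot2)

# dotted identity line as the perfect-calibration reference, plus an OLS line
# (purple) and a logistic regression line (gold) fit to sim_dat
ggplot(sim_dat, aes(x = x, y = y)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE, color = "gold3") +
  labs(x = "Predicted win probability", y = "Observed win rate") +
  theme_minimal()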

And this is what the results for the 2018 midterms (all 360 Senate, House, and Governor races that have been decided) actually looked like:

I am using the “deluxe” forecast from the morning of the election. The dotted line represents what would be a perfect fit. Any line with a steeper slope that sits mostly below the dotted line would mean that the model is overestimating its certainty; any line with a shallower slope that sits mostly above the dotted line indicates that the model is underestimating its certainty. Since the line here resides above the dotted line, the FiveThirtyEight model was less certain in its predictions than it should have been. For example, a race with a predicted win probability of 60% actually turned out to be called correctly about 70-75% of the time.

However, there are a few outstanding races. As of publishing, these are:

State  Race Type  Candidate         Party  Incumbent  Win Probability
MS     Senate     Cindy Hyde-Smith  R      False      0.707
GA     House      Rob Woodall       R      True       0.848
NY     House      Anthony Brindisi  D      False      0.604
NY     House      Chris Collins     R      True       0.797
TX     House      Will Hurd         R      True       0.792
UT     House      Ben McAdams       D      False      0.642

Let’s assume that all of these predictions come out to be wrong—that is, let’s assume the worst case scenario for FiveThirtyEight. The results would still look quite good for their model:

We can look at some bucketed analyses instead of drawing a fitted line, too. The average FiveThirtyEight predicted probability of a correct prediction was 92.1%, and the average actual correct prediction rate was 95.8%. However, most of the races were easy to predict, so let’s limit these to races where the model was less than 75% sure, less than 65% sure, and less than 55% sure:

Threshold  Count  Mean Win Probability  Correct Rate
0.75       45     0.627                 0.756
0.65       30     0.591                 0.667
0.55       8      0.522                 0.750

The actual correct rate is higher than the predicted probability in all cases; again, we see that FiveThirtyEight underestimated the certainty they should have had in their predictions.
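Those summaries can be computed along these lines (a sketch; the forecasts data frame and its win_prob and correct columns are hypothetical stand-ins for the actual forecast data):

# for each threshold, keep races where the predicted winner's win probability
# was below it, then compare the mean predicted probability to the correct rate
sapply(c(.75, .65, .55), function(threshold) {
  below <- forecasts[forecasts$win_prob < threshold, ]
  c(threshold    = threshold,
    count        = nrow(below),
    win_prob     = mean(below$win_prob),
    correct_rate = mean(below$correct))
})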

For a full view, here are all of the incorrect FiveThirtyEight predictions. They are sorted from highest win probability to lowest, so the biggest upsets are at the top of the table.

State  Race Type  Candidate        Party  Incumbent  Win Probability
OK     House      Steve Russell    R      True       0.934
SC     House      Katie Arrington  R      False      0.914
NY     House      Daniel Donovan   R      True       0.797
FL     Governor   Andrew Gillum    D      False      0.778
FL     Senate     Bill Nelson      D      True       0.732
KS     House      Paul Davis       D      False      0.641
IA     Governor   Fred Hubbell     D      False      0.636
OH     Governor   Richard Cordray  D      False      0.619
IN     Senate     Joe Donnelly     D      True       0.617
MN     House      Dan Feehan       D      False      0.599
GA     House      Karen C. Handel  R      True       0.594
VA     House      Scott Taylor     R      True       0.594
TX     House      John Culberson   R      True       0.553
NC     House      Dan McCready     D      False      0.550
TX     House      Pete Sessions    R      True       0.537

Is underestimating certainty—that is, overestimating uncertainty—a good thing? Nate Silver tongue-in-cheek bragged on his podcast that the model actually did too well. He was joking, but I think this is an important consideration: If a model is less certain than it should have been, what does that mean? I think most people would agree that it is better for a model to be 10% less certain than to be 10% more certain than it should have been, especially after the surprise of 2016. But if we are judging a model by asking if it is truthful in expressing uncertainty, then we should be critical of a model that makes correct predictions, but does so by saying it is very unsure. I don’t think there is a big enough discrepancy here to pick a fight with the FiveThirtyEight model, but I do think that inflating uncertainty can be a defense mechanism that modelers can use to protect themselves if their predictions are incorrect.

I think it is safe to say that FiveThirtyEight performed well this cycle. This also means that the polls, fundamentals, and expert ratings that underlie the model were accurate as well—at least in the aggregate. In my probabilistic opinion, we should not say that the FiveThirtyEight model was “wrong” about Steve Russell necessarily, because a predicted probability of 93% means that 7% of these will come out contrary to what was expected—and the observed probabilities mapped onto FiveThirtyEight’s predicted probabilities pretty accurately. That is, FiveThirtyEight predictions made with 93% certainty actually came to pass more than 95% of the time.

I think probabilistic forecasts are very useful for political decision makers. They keep us from seeing a big deficit in the polls and writing off the race as unwinnable—we should never assume that the probability of an event is 100%. For example, it was said in the run-up to the midterms that a race needs to be close to even consider donating money to it: “If the race isn’t between 2% or 3%, you’re wasting your money.” There is uncertainty in all estimates—so candidates with a 90% likelihood of winning are going to lose 10% of the time. The trick for analysts and professionals is to make sense of probabilistic forecasts. Were there any warning signs ahead of time that hinted Steve Russell, Katie Arrington, or Daniel Donovan were going to lose? Was there a paucity of polling data or not a lot of money spent in those elections? Were there highly localized aspects to the races that were obscured by the model, which allowed for correlation between similar demographies and geographies? People should certainly invest more in races that are closer to being a toss-up, but thinking probabilistically requires us to invest some resources in longer shots than just a few percentage points, because unlikely events are unlikely, but they still do happen.

All data and code for this post can be found at my GitHub.