Using R and Python Together, Seamlessly: A Case Study Using OpenAI's GPT Models

Well, it looks like the time has finally come for me to join the club and write a large language model (LLM) blog post. I hope to do two things here:

  • Show how easy it is to seamlessly work with both R and Python code simultaneously

  • Use the OpenAI API to see how well it does extracting information from text

In my previous blog post, I discussed scraping film awards data to build a model predicting the Best Picture winner at the Academy Awards. One issue I run into, however, is that some HTML is understandably not written with scraping in mind. When I try to write a script that iterates through 601 movies, for example, the structure and naming of the data are inconsistent. The lack of standardization means writing modular functions for scraping data programmatically is difficult.

A recent Pew Research Center report showed how they used GPT-3.5 Turbo to collect data about podcast guests. My approach here is similar: I scrape what I can, give it to the OpenAI API along with a prompt, and then interpret the result.

I wanted to add two variables to my Oscar model:

  • Is the director of the film also a writer?

  • Is the director of the film also a producer?

The reasoning is that films from directors famous for writing their own material (e.g., Paul Thomas Anderson, Sofia Coppola) might be more or less likely to win Best Picture. Similarly, perhaps serving as a producer as well as director means that the director has achieved some level of previous success that makes them more likely to take home Best Picture.

The difficulty of scraping this from Wikipedia is that the “infobox” (i.e., the light grey box at the top, right-hand side of the entry) does not follow the same structure, formatting, or naming conventions across pages.

Methodology

To get the data I want (a logical value for whether or not the director was also a writer and another logical value for if they were a producer), I took the following steps:

  1. Use the rvest package in R to pull down the “infobox” from the Wikipedia page, doing my best to limit it to the information relevant to the director, writer, and producer

  2. Use the openai Python library to pass this information to GPT-3.5 Turbo or GPT-4

  3. Parse this result in R using the tidyverse to arrange the data nicely and append to my existing dataset for the Oscar model

Now, you might be asking: Why not use Python’s beautifulsoup4 in Step 1? Because I like rvest more and have more experience using it. And why not use R to access the OpenAI API? Because the official way to access it, per their documentation, is with Python. Lastly, why not use pandas in Python to tidy the data afterward? Because I find the tidyverse in R a much easier way to clean data.

The great news: Posit’s RStudio IDE can handle both R and Python (among many other languages). The use of the reticulate R package also means we can import Python functions directly into an R session (and vice versa with rpy2). These are all just tools at the end of the day, so why not use the ones I’m comfortable, quickest, and most experienced with?
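
As a quick illustration of what reticulate does (a minimal sketch, unrelated to the Oscar pipeline; numpy here just stands in for whatever Python module you happen to need):

> library(reticulate)
> np <- import("numpy")
> np$mean(c(1, 5, 9))
[1] 5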

The Functions

I started with two files: funs.R and funs.py, which stored the functions I used.

funs.R is for pulling the data from the Wikipedia infobox, given the title and year of a film. I use this to search Wikipedia, get the URL of the first search result, and then scrape the infobox from that page:

#' Get the information box of a Wikipedia page
#'
#' Takes the title and year of a film, searches for it, gets the top result,
#' and pulls the information box at the top right of the page.
#'
#' @param title Title of the film
#' @param year Year the film was released
get_wikitext <- function(title, year) {
  tryCatch({
    tmp_tbl <- paste0(
      "https://en.wikipedia.org/w/index.php?search=",
      str_replace_all(title, " ", "+"),
      "+",
      year,
      "+film"
    ) %>% 
      rvest::read_html() %>% 
      rvest::html_nodes(".mw-search-result-ns-0:nth-child(1) a") %>% 
      rvest::html_attr("href") %>% 
      paste0("https://en.wikipedia.org", .) %>% 
      rvest::read_html() %>% 
      rvest::html_node(".vevent") %>%
      rvest::html_table() %>% 
      janitor::clean_names()
    
    # just relevant rows
    lgls <- grepl("Direct", tmp_tbl[[1]]) |
      grepl("Screen", tmp_tbl[[1]]) |
      grepl("Written", tmp_tbl[[1]]) |
      grepl("Produce", tmp_tbl[[1]])
    
    tmp_tbl <- tmp_tbl[lgls, ]
    
    # clean up random css
    # I have no idea how this works
    # I just got it online
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\}")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\}")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")
    
    # print text
    apply(tmp_tbl, 1, \(x) paste0(x[[1]], ": ", x[[2]])) %>% 
      paste(collapse = ", ") %>% 
      str_replace_all("\n", " ")
  },
    error = \(x) NA
  )
}

An example output:

> get_wikitext("all that jazz", 1979)
[1] "Directed by: Bob Fosse, Written by: Robert Alan AurthurBob Fosse, Produced by: Robert Alan Aurthur"

Not perfect, but should be close enough. Sometimes it is closer, with different formatting:

> get_wikitext("la la land", 2016)
[1] "Directed by: Damien Chazelle, Written by: Damien Chazelle, Produced by:  Fred Berger Jordan Horowitz Gary Gilbert Marc Platt"

The result is then passed to the function defined in funs.py. That script is:

from openai import OpenAI
import ast

client = OpenAI(api_key='API_KEY_GOES_HERE')

def get_results(client, wikitext):
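  '''Ask the chat model whether the film's director also wrote and/or produced it,
  given infobox text scraped from Wikipedia, and parse the reply into a dict with
  'writer' and 'producer' keys.'''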
  chat_completion = client.chat.completions.create(
      messages=[
          {
              'role': 'user',
              'content': '''
              Below is a list that includes people involved with making a 
              movie. Each part corresponds to a different role that one might
              have in making the movie (such as director, writer, or producer).
              Could you tell me two things about the director? First,
              did the director also write the script/screenplay/story for the
              movie? And second, did the director also serve as a producer for
              the movie? Note that, in this list, names may not be separated by 
              spaces even when they should be. That is, names may run together 
              at times. You do not need to provide any explanation. Please reply 
              with a valid Python dictionary, where: 'writer' is followed by 
              True if the director also wrote the film and False if they did 
              not, and 'producer' is followed by True if they also produced the 
              film and False if they did not. If you cannot determine, you can 
              follow it with NA instead of True or False. The information is:
                
              ''' + wikitext
          }
      ],
      model='gpt-3.5-turbo'
  )
  
  # tidy result to make readable dict
  out = chat_completion.choices[0].message.content
  out = out.replace('\n', '')
  out = out.replace(' ', '')
  out = out.replace('true', 'True')
  out = out.replace('false', 'False')
  
  return(ast.literal_eval(out))

(I don’t have as good documentation here because I’m not as familiar with writing Python functions.)

Bringing It Together

I then used an R script to bring these functions together in the same session. We start off by loading the R packages, sourcing the R script, activating the Python virtual environment (the path is relative to my file structure in my drive), and sourcing the Python script. I read in the data from a Google Sheet of mine and do one step of cleaning, as the read_sheet() function was bringing the film title variable in as a list of lists instead of a character vector.

library(tidyverse)
library(reticulate)
source("funs.R")
use_virtualenv("../../")
source_python("funs.py")

dat <- googlesheets4::read_sheet("SHEET_ID_GOES_HERE") %>% 
  mutate(film = as.character(film))

I then initialize two new variables in the data: writer and producer. These will get populated with TRUE if the director also served as a writer or producer, respectively, and FALSE otherwise.

res <- dat %>% 
  select(year, film) %>% 
  mutate(writer = NA, producer = NA)

I iterate through each row using a for loop (I know this isn’t a very tidyverse way of doing things, as map_*() statements are usually preferred, but I felt it was the easiest way to make sense of the code and catch errors).

for (r in 1:nrow(res)) {
  cat(r, "\n")
  
  tmp_wikitext <- get_wikitext(res$film[r], res$year[r])
  
  # skip if get_wikitext fails
  if (is.na(tmp_wikitext)) next
  if (length(tmp_wikitext) == 0) next
  
  # give the text to openai
  tmp_chat <- tryCatch(
    get_results(client, tmp_wikitext),
    error = \(x) NA
  )
  
  # if openai returned a dict of 2
  if (length(tmp_chat) == 2) {
    res$writer[r] <- tmp_chat$writer
    res$producer[r] <- tmp_chat$producer
  }
}

I use cat() to track progress, and I use the function from funs.R to pull down the text I want GPT-3.5 to extract information from. You’ll note that that function has a tryCatch() in it, because I didn’t want everything to stop at an error; upon an error, it just returns an NA. I also found that it would sometimes read a page successfully but return an empty result. If either of those happens, I say next to skip to the next row. This means I’m not wasting OpenAI tokens feeding it blanks.

Then I use a Python function inside of an R session! I use get_results(), which was defined in funs.py, to take the text from Wikipedia and give it to OpenAI. If there was an error, I again use tryCatch() to give me an NA instead of shutting the whole thing down. If there wasn’t an error, I add the values to the res data that I initialized above. Notably, reticulate knows that the returned Python dictionary should be brought into R as a named list of logicals.
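
To see that conversion in isolation, here is a minimal sketch (not part of the pipeline itself):

> reticulate::py_eval("{'writer': True, 'producer': False}")
$writer
[1] TRUE

$producer
[1] FALSE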

What we can see from this script is you can seamlessly use R and Python in one session, depending on the tools you have and what you’re comfortable with. A clickbait topic in data science for the last ten years or so has been “R or Python?” when really the answer is both: They play quite nicely with one another, thanks to the hard work of programmers who have developed packages like reticulate and Posit’s focus on languages beyond R.

Performance

Now that we’ve seen how one can use R and Python in harmony to access the OpenAI API, how well did GPT do? I compare both 3.5 Turbo and 4. The only change I had to make to funs.py to use GPT-4 was replacing 'gpt-3.5-turbo' with 'gpt-4'.

For each of the models, I did that for loop above three times, as the GPT models aren’t reproducible: They can give different answers each time you give them the same prompt (one of my beefs with this methodology). I only gave it rows that were still NA after each iteration to save on tokens. Especially with GPT-3.5, this gave me more data to work with.

NAs

Using GPT-3.5, I was able to get a valid result for 447 of the 601 films. This was 444 for GPT-4. The three films that GPT-3.5 coded but GPT-4 did not were Pulp Fiction (1994), Chariots of Fire (1981), and Smilin’ Through (1933).

One note is that, before 1934, Academy Awards spanned multiple years. However, I code them all with the same year for ease of analysis. But that means I may be giving the Wikipedia search wrong information, so it isn’t a failure at the OpenAI stage but at the Wikipedia scraping stage.

If we remove the films before 1934, GPT-3.5 coded 77.2% of the films, while GPT-4 coded 76.8%. This may not be GPT’s fault, however, as getting the text from Wikipedia might still have been where the pipeline produced an NA.

Accuracy

But how often was each model giving us the correct answer? I hand-coded a random sample of 100 movies and counted how often each model was correct. The four rows in the table below represent different combinations of being correct/incorrect.

Writer Correct   Producer Correct   n (GPT-3.5)   n (GPT-4)
FALSE            FALSE                        4           0
FALSE            TRUE                        10           2
TRUE             FALSE                       38           1
TRUE             TRUE                        48          97

The last row shows complete accuracy, where the coding for both writer and producer was correct. The results are decidedly in favor of GPT-4: It was fully correct 97% of the time, whereas GPT-3.5 Turbo was fully correct only 48% of the time. It was the coding of producer that sank GPT-3.5: It was correct on writer 86% of the time, but only 58% of the time on producer. I feel confident using the GPT-4 data for my Oscar model; I told myself a priori I’d be good with anything >90% accurate (an arbitrary threshold, admittedly).

So, not really a surprise that the newer model performed better. But I am somewhat surprised that GPT-3.5 Turbo couldn’t extract information even when I was giving it very specific instructions and a mostly clean piece of text to examine. Maybe I just do not know how to talk to the model correctly? I brought this up with a group of colleagues; one said, “No idea but this is why I expect prompt engineering to be a major like next year,” and they may very well be correct.

Conclusion

  1. You can use R and Python together smoothly

  2. You can use the OpenAI API to efficiently do content coding for your research and models

  3. ALWAYS KEEP A HUMAN IN THE LOOP to check for accuracy and fairness

Modeling the Oscar for Best Picture (and Some Insights About XGBoost)

The Academy Awards are a week away, and I’m sharing my machine-learning-based predictions for Best Picture as well as some insights I took away from the process (particularly XGBoost’s sparsity-aware split finding). Oppenheimer is a heavy favorite at 97% likely to win—but major surprises are not uncommon, as we’ll see.

I pulled data from three sources. First, industry awards. Most unions and guilds for filmmakers—producers, directors, actors, cinematographers, editors, production designers—have their own awards. Second, critical awards. I collected as wide as possible, from the Golden Globes to the Georgia Film Critics Association. More or less: If an organization had a Wikipedia page showing a historical list of nominees and/or winners, I scraped it. Third, miscellaneous information like Metacritic score and keywords taken from synopses to learn if it was adapted from a book, what genre it is, the topics it covers, and so on. Combining all of these was a pain, especially for films that have bonkers names like BİRDMAN or (The Unexpected Virtue of Ignorance).

The source data generally align with what FiveThirtyEight used to do, except I cast a far wider net in collecting awards. Another difference is that FiveThirtyEight chose a closed-form solution for weighting the importance of awards and then rated films in terms of the “points” they accrued (out of the potential pool of points) throughout the season; I chose to build a machine learning model instead, which was tricky.

To make the merging of data feasible (e.g., different tables had different spellings of the film or different years associated with the film), I only looked at the movies that received a nomination for Best Picture, making for a tiny dataset of 591 rows across the first 95 ceremonies. The wildly small N presents a challenge for building a machine learning model, as do sparsity and missing data.

Sparsity and Missing Data

There are a ton of zeroes in the data, creating sparsity. Every variable (save for the Metacritic score) is binary. Nomination variables (i.e., was the film nominated for the award?) may have multiple films for a given year with a 1, but winning variables (i.e., did the film win the award?) only have a single 1 each year.

There is also the challenge of missing data. Not every award in the model goes back to the late 1920s, meaning that each film has an NA for any award that did not yet exist when it was released. For example, I only included Metacritic scores for contemporaneous releases, and the site launched in 2001, while the Screen Actors Guild started their awards in 1995.

My first thought was an ensemble model: Segment each group of awards, based on their start date, into different models. Get predicted probabilities from these, and combine them, weighted by the inverse of their out-of-sample error. After experimenting a bit, I came to the conclusion so many of us do when building models: Use XGBoost. With so little data to use for tuning, I simply stuck with the model defaults for hyper-parameters.

Beyond its reputation for being accurate out of the box, XGBoost handles missing data natively. The docs simply state: “XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training.” This is discussed in deeper detail in the “sparsity-aware split finding” section of the paper introducing XGBoost. The full algorithm is shown in that paper, but the general idea is that an optimal default direction at each split in a tree is learned from the data, and missing values follow that default.
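
As a toy illustration (a hedged sketch, not my actual Oscar model), the R interface accepts NAs in the feature matrix directly and routes them down the learned default branch at prediction time:

library(xgboost)

# tiny matrix with missing values; no imputation step needed
x <- matrix(c(1, NA, 0, 1,
              NA, 0, 1, 0),
            ncol = 2)
y <- c(1, 0, 1, 0)

dtrain <- xgb.DMatrix(data = x, label = y, missing = NA)
fit <- xgb.train(
  params = list(objective = "binary:logistic"),
  data = dtrain,
  nrounds = 10
)

predict(fit, xgb.DMatrix(x, missing = NA))  # rows with NAs follow the learned default direction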

Backtesting

To assess performance, I backtested on the last thirty years of Academy Awards (scikit-learn would call this leave-one-group-out cross-validation, with ceremony year as the group). I removed a given year from the dataset, fit the model, and then made predictions on the held-out year. The last hiccup is that the model does not know that if Movie A from Year X wins Best Picture, it means Movies B - E from Year X cannot. It also does not know that one of the films from Year X must win. My cheat around this is to re-scale each year’s predicted probabilities so they sum to one.
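
The re-scaling itself is a one-liner with dplyr. A hedged sketch (the tibble and its prob_raw values are made up for illustration):

library(tidyverse)

preds <- tibble(
  year = c(1995, 1995, 1995, 2016, 2016),
  film = c("apollo 13", "braveheart", "babe", "la la land", "moonlight"),
  prob_raw = c(0.60, 0.35, 0.05, 0.80, 0.55)
)

preds %>% 
  group_by(year) %>% 
  mutate(prob = prob_raw / sum(prob_raw)) %>%  # each year's nominees now sum to 1
  ungroup()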

The predictions for the last thirty years:

Year Predicted Winner Modeled Win Probability Won Best Picture? Actual Winner
1993 schindler’s list 0.996 1 schindler’s list
1994 forrest gump 0.990 1 forrest gump
1995 apollo 13 0.987 0 braveheart
1996 the english patient 0.923 1 the english patient
1997 titanic 0.980 1 titanic
1998 saving private ryan 0.938 0 shakespeare in love
1999 american beauty 0.995 1 american beauty
2000 gladiator 0.586 1 gladiator
2001 a beautiful mind 0.554 1 a beautiful mind
2002 chicago 0.963 1 chicago
2003 the lord of the rings: the return of the king 0.986 1 the lord of the rings: the return of the king
2004 the aviator 0.713 0 million dollar baby
2005 brokeback mountain 0.681 0 crash
2006 the departed 0.680 1 the departed
2007 no country for old men 0.997 1 no country for old men
2008 slumdog millionaire 0.886 1 slumdog millionaire
2009 the hurt locker 0.988 1 the hurt locker
2010 the king’s speech 0.730 1 the king’s speech
2011 the artist 0.909 1 the artist
2012 argo 0.984 1 argo
2013 12 years a slave 0.551 1 12 years a slave
2014 birdman 0.929 1 birdman
2015 spotlight 0.502 1 spotlight
2016 la la land 0.984 0 moonlight
2017 the shape of water 0.783 1 the shape of water
2018 roma 0.928 0 green book
2019 parasite 0.576 1 parasite
2020 nomadland 0.878 1 nomadland
2021 the power of the dog 0.981 0 coda
2022 everything everywhere all at once 0.959 1 everything everywhere all at once

Of the last 30 years, 23 predicted winners actually won, while 7 lost—making for an accuracy of about 77%. Not terrible. (And, paradoxically, many of the misses are predictable ones to those familiar with Best Picture history.) However, the mean predicted probability of winning from these 30 cases is about 85%, which means the model is maybe 8 points over-confident. We do see recent years being more prone to upsets—is that due to a larger pool of nominees? Or something else, like a change in the Academy’s makeup or voting procedures? At any rate, some ideas I am going to play with before next year are weighting more proximate years higher (as rules, voting body, voting trends, etc., change over time), finding additional awards, and pulling in other metadata on films. It might just be, though, that the Academy likes to swerve away from everyone else sometimes in a way that is not readily predictable from outside data sources. (Hence the fun of watching and speculating and modeling in the first place.)

This Year

I wanted to include a chart showing probabilities over time, but the story has largely remained the same. The major inflection point was the Directors Guild of America (DGA) Awards.

Of the data we had on the day the nominees were announced (January 23rd), the predictions were:

Film Predicted Probability
Killers of the Flower Moon 0.549
The Zone of Interest 0.160
Oppenheimer 0.147
American Fiction 0.061
Barbie 0.039
Poor Things 0.023
The Holdovers 0.012
Past Lives 0.005
Anatomy of a Fall 0.005
Maestro 0.001

I was shocked to see Oppenheimer lagging in third and to see The Zone of Interest so high. The reason is that, while backtesting, I saw that the variable importance for winning the DGA award for Outstanding Directing - Feature Film was the highest by about a factor of ten. Since XGBoost handles missing values nicely, we can rely on the sparsity-aware split finding to get a little more information from these data. If we know the nominees of an award but not yet the winner, we can still infer something: Any film that was nominated is left NA, while any film that was not nominated is set to zero. That allows us to partially use this DGA variable (and the other awards where we knew the nominees on January 23rd, but not the winners). When we do that, the predicted probabilities as of the announcement of the Best Picture nominees were:

Film Predicted Probability
Killers of the Flower Moon 0.380
Poor Things 0.313
Oppenheimer 0.160
The Zone of Interest 0.116
American Fiction 0.012
Barbie 0.007
Past Lives 0.007
Maestro 0.003
Anatomy of a Fall 0.002
The Holdovers 0.001

The Zone of Interest falls in favor of Poor Things, since the former was not nominated for the DGA award while the latter was. I was still puzzled, but I knew that the model wouldn’t start being certain until we knew the DGA award. Those top three films were nominated for many of the same awards. Then Christopher Nolan won the DGA award for Oppenheimer, and the film hasn’t been below a 95% chance for winning Best Picture since.
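
If you want to mimic that encoding, here is a hedged sketch of one way to do it (the nom_dga and win_dga column names are hypothetical, not the ones in my dataset):

library(tidyverse)

# toy data: the DGA winner is not yet known at nomination time
toy <- tibble(
  film    = c("oppenheimer", "poor things", "the zone of interest"),
  nom_dga = c(1, 1, 0),
  win_dga = NA_real_
)

toy %>% 
  mutate(
    win_dga = case_when(
      !is.na(win_dga) ~ win_dga,   # winner already known: keep the 0/1
      nom_dga == 1    ~ NA_real_,  # nominated, winner unknown: leave NA for XGBoost
      TRUE            ~ 0          # not nominated: cannot win, so set to 0
    )
  )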

Final Predictions

The probabilities as they stand today, a week before the ceremony, have Oppenheimer as the presumptive winner at a 97% chance of winning.

Film Predicted Probability
Oppenheimer 0.973
Poor Things 0.010
Killers of the Flower Moon 0.005
The Zone of Interest 0.004
Anatomy of a Fall 0.003
American Fiction 0.002
Past Lives 0.001
Barbie 0.001
The Holdovers 0.001
Maestro 0.000

There are a few awards being announced tonight (the Satellite Awards and the awards from the cinematographers guild and the editors guild), but they should not impact the model much. So, we are in for a year of a predictable winner—or another shocking year where a CODA or a Moonlight takes home film’s biggest award. (If you’ve read this far and enjoyed Cillian Murphy in Oppenheimer… go check out his leading performance in Sunshine, directed by Danny Boyle and written by Alex Garland.)

Supervised Topic Modeling for Short Texts: My Workflow and A Worked Example

Many organizations have a substantial amount of human-generated text from which they are not extracting a proportional amount of insight. For example, open-ended questions are found in most surveys—but are rarely given the same amount of attention (if any attention at all) as the easier-to-analyze quantitative data. I have tested out many supposedly “AI-powered” or “NLP-driven” tools for analyzing text in my career, and I haven’t found anything to be useful at finding topics or modeling sentiment when fed real data. I wrote on my reservations about common topic modeling methods over four years ago, where I showed how I perform exploratory analysis on text data based on word co-occurrences. That was an unsupervised approach: No a priori topics are given to a model to learn from. It looks at patterns of how frequently words are used together to infer topics.

I lay out my approach for supervised topic modeling in short texts (e.g., open-response survey data) here. My philosophy is one where there is no free lunch: If you want data that make sense for your specific organization and use case, you’re gonna actually have to read and consider the text you get. This isn’t using statistical learning to do the job for you—it is using these models as a tool to work with. You can’t solve this purely from the command line. There will need to be many humans in the loop. This process can be a grind, but what you’ll come out of it with is a scalable, bespoke model. I don’t think I’m saying anything particularly revolutionary in this post, but hopefully the walk-through and code might help you develop a similar workflow.

The working example I’m using here is Letterboxd reviews for Wes Anderson’s newest film, Asteroid City.

Step 1: Thematic Content Analysis

Create a coding corpus that is a random sample (maybe 2000 cases, depending on how long each piece of text is, how much bandwidth you have for this project, etc.) from your entire corpus of text (e.g., reviews, open-ended responses, help tickets). Perform a thematic content analysis in partnership with colleagues. I am not a qualitative methodologist by any means, but I start by reading the entirety of the random sample. Then I read through the coding corpus again, taking notes about key themes. I look at these themes, and I think about how they may be grouped together. I talk with my colleagues to see what they think. We discuss if certain themes can be combined or if they need to be kept separate. We talk about the maximum number of themes we think we should be looking for. We finalize themes, discuss what they mean in detail, and name them. From these readings and conversations, I write a standardized coding manual that others can use to read responses and code according to the themes.

In my example, I wrote an R script (see Appendix A or my GitHub) to scrape Letterboxd for reviews. For each possible rating (0.5 stars to 5.0 stars, in increments of half stars), I pulled six pages. Each page had twelve reviews. Ten possible ratings, six pages each, and twelve reviews per page multiply out to 720 reviews. In the process of coding, I dropped any review that was not written in English—a limitation of mine here due to my regrettable decision to study Latin in college (although an apropos course for a Wes Anderson blog post).

I decided to stick to just one theme for the purposes of this exercise: Did the review discuss the visual style of the film? Anderson is known for his distinctive visual style, which at this point is unfortunately a bit of a cultural meme. There was a whole Instagram/TikTok trend of posting videos of oneself in the style of Anderson; there’s those cynical, intellectually vapid, artless, AI-generated simulacra imagining [insert well-known intellectual property here] in the Andersonian style; and there’s even a Hertz (the rent-a-car company) commercial that references the meme, which in turn references the style (Baudrillard’s head would be spinning, folks).

This visual style, by my eye, was solidified by the time of The Grand Budapest Hotel. Symmetrical framing, meticulously organized sets, the use of miniatures, a pastel color palette contrasted with saturated primary colors, distinctive costumes, straight lines, lateral tracking shots, whip pans, and so on. However, I did not consider aspects of Anderson’s style that are unrelated to the visual sense. This is where defining the themes with a clear line matters—and often there will be ambiguities, but one must do their best, because the process we’re doing is fundamentally a simplification of the rich diversity of the text. Thus, I did not consider the following to be in Anderson’s visual style: stories involving precocious children, fraught familial relations, uncommon friendships, dry humor, monotonous dialogue, soundtracks usually involving The Kinks, a fascination with stage productions, nesting doll narratives, or a decidedly twee yet bittersweet tone.

Step 2: Independent Coders

This is where you will need organizational buy-in and the help of your colleagues. Recruit fellow researchers (or subject experts) who have domain expertise. For example, if the text data are all reviews for a specific app, make sure you recruit colleagues that have knowledge about that specific app. I generate standardized Google Sheets where each row is a text response, each column is a theme, and I send them—along with the coding manual—to the coders. They are instructed to put in a “1” wherever a text touches on that theme as is described in the coding manual. I have each piece of text coded by two independent coders. After everyone is done, I resolve disagreements by a third-party vote (which may or may not be myself). Combine the various Google Sheets (I use the googlesheets4 R package and the tidyverse suite of packages for merging these data), and congratulations! You have supervised text data that are ready for training.
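
For the merging step, a minimal sketch (the sheet IDs and column names here are hypothetical):

library(googlesheets4)
library(tidyverse)

# each coder's sheet has the text plus a 0/1 column per theme
coder_1 <- read_sheet("CODER_1_SHEET_ID") %>% rename(code_1 = visual_style)
coder_2 <- read_sheet("CODER_2_SHEET_ID") %>% rename(code_2 = visual_style)

coded <- coder_1 %>% 
  left_join(coder_2, by = "text") %>% 
  mutate(needs_tiebreak = code_1 != code_2)  # rows to send to the third-party vote

count(coded, needs_tiebreak)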

For Asteroid City reviews, I just did this on a Saturday afternoon myself. Any piece of text that touched on the visual style was flagged with a “1,” while the rest were marked as “0.” The text I pulled was abbreviated for longer reviews; however, I don’t mind this, as I feel it stays true to the analogy of short open-ended survey responses that way.

Step 3: Train the Models

This is where I stand on the shoulders of Emil Hvitfeldt and Julia Silge by way of their book, Supervised Machine Learning for Text Analysis in R. I define an entire machine learning workflow (i.e., data processing, cross-validation, model selection, and holdout performance), and then I package this workflow into convenient wrapper functions. This allows me to use that wrapper function on each column, so that every theme gets its own model. I write the cross-validation results and final models out as RDS files so that I can access them later.

The wrapper functions are in Appendix B below and at my GitHub. It’s, uh, long. (I go against my own imperative to write modular code here, but if there’s a less verbose way of defining this workflow, please genuinely let me know.) I am making three basic functions: one to create the train/test split, another to perform cross-validation, and a third to pull the best model.

The prep_data function takes your data, the name of the outcome and text variables, and how large of a holdout set you want.

You can change exactly what pre-processing steps and algorithms you want to use from what I have. But I have found that the processing of the data usually is more influential than a specific model. So in the wrapper function do_cv, I am trying out all combinations of the following:

  • Stop words: None, SMART, or ISO (different lists of stop words)
  • N-grams: Words only or words and bigrams
  • Stemming: Yes or no
  • Word frequency filtering: Words used 1 or more times in the corpus, 2 or more times, or 5 or more
  • Algorithm: Elastic net or random forest

This leaves 3 * 2 * 2 * 3 * 2 = 72 combinations of pre-processors and algorithm combinations. The models were tuned according to grid search (the size of which is an argument to the do_cv function) using k-fold validation (where k is also an argument). You could add other pre-processing steps (e.g., word embeddings) or other algorithms (e.g., XGBoost, neural networks) to the body of this function.
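
For example, a hedged sketch of what a tunable XGBoost spec might look like; it would then get its own entry in the models list passed to workflow_set() inside do_cv():

library(tidymodels)

# defined in the same style as spec_elasticnet and spec_randforest in Appendix B
spec_xgboost <- boost_tree(
  mode = "classification",
  trees = 500,
  mtry = tune(),
  min_n = tune(),
  learn_rate = tune()
) %>% 
  set_engine("xgboost")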

When that’s done, the best_model function will pull everything you need to know about the model that performed best in cross-validation, and it will give you performance metrics on the holdout set.

When all is said and done, the actual running of the code then looks deceptively simple:

library(tidyverse)
source("funs.R")

dat <- read_csv("ratings_coded.csv") %>% 
  filter(visual_style != 9) # remove non-english

set.seed(1839)
dat <- prep_data(dat, "visual_style", "text", .20)
cv_res <- do_cv(training(dat), 4, 10)
write_rds(cv_res, "visual_style-res.rds")
mod <- best_model(dat, cv_res, "roc_auc")

Here, funs.R is a file that includes the functions in Appendix B. This has a 20% holdout set, 4 folds, and a grid search of size 10. It then writes out those results to an RDS object. Lastly, we pull the best model according to the area under the ROC curve. If you had multiple coded themes, you could write a for loop here over the name of each column (e.g., prep_data(dat, i, "text", .20)).
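
Continuing from the script above, that loop might look something like this (a sketch; the extra theme names are hypothetical):

themes <- c("visual_style", "dual_plot", "performances")  # hypothetical coded columns

for (i in themes) {
  dat_i <- prep_data(dat, i, "text", .20)
  cv_i <- do_cv(training(dat_i), 4, 10)
  write_rds(cv_i, paste0(i, "-res.rds"))
  mod_i <- best_model(dat_i, cv_i, "roc_auc")
  write_rds(mod_i, paste0(i, "-mod.rds"))
}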

Step 4: Use the Models

First, let’s check out what some of these objects look like. Let’s take a look at the metrics for the first model that went into the cross-validation results set.

cv_res <- read_rds("visual_style-res.rds")
cv_res$cv_res$result[[1]]$.metrics
## [[1]]
## # A tibble: 60 × 6
##        penalty mixture .metric     .estimator .estimate .config              
##          <dbl>   <dbl> <chr>       <chr>          <dbl> <chr>                
##  1 0.000000787  0.0698 accuracy    binary         0.774 Preprocessor1_Model01
##  2 0.000000787  0.0698 sensitivity binary         0.923 Preprocessor1_Model01
##  3 0.000000787  0.0698 specificity binary         0     Preprocessor1_Model01
##  4 0.000000787  0.0698 precision   binary         0.828 Preprocessor1_Model01
##  5 0.000000787  0.0698 f_meas      binary         0.873 Preprocessor1_Model01
##  6 0.000000787  0.0698 roc_auc     binary         0.777 Preprocessor1_Model01
##  7 0.000374     0.166  accuracy    binary         0.774 Preprocessor1_Model02
##  8 0.000374     0.166  sensitivity binary         0.923 Preprocessor1_Model02
##  9 0.000374     0.166  specificity binary         0     Preprocessor1_Model02
## 10 0.000374     0.166  precision   binary         0.828 Preprocessor1_Model02
## # ℹ 50 more rows
## 
## [[2]]
## # A tibble: 60 × 6
##        penalty mixture .metric     .estimator .estimate .config              
##          <dbl>   <dbl> <chr>       <chr>          <dbl> <chr>                
##  1 0.000000787  0.0698 accuracy    binary         0.767 Preprocessor1_Model01
##  2 0.000000787  0.0698 sensitivity binary         0.913 Preprocessor1_Model01
##  3 0.000000787  0.0698 specificity binary         0.286 Preprocessor1_Model01
##  4 0.000000787  0.0698 precision   binary         0.808 Preprocessor1_Model01
##  5 0.000000787  0.0698 f_meas      binary         0.857 Preprocessor1_Model01
##  6 0.000000787  0.0698 roc_auc     binary         0.714 Preprocessor1_Model01
##  7 0.000374     0.166  accuracy    binary         0.767 Preprocessor1_Model02
##  8 0.000374     0.166  sensitivity binary         0.913 Preprocessor1_Model02
##  9 0.000374     0.166  specificity binary         0.286 Preprocessor1_Model02
## 10 0.000374     0.166  precision   binary         0.808 Preprocessor1_Model02
## # ℹ 50 more rows
## 
## [[3]]
## # A tibble: 60 × 6
##        penalty mixture .metric     .estimator .estimate .config              
##          <dbl>   <dbl> <chr>       <chr>          <dbl> <chr>                
##  1 0.000000787  0.0698 accuracy    binary         0.667 Preprocessor1_Model01
##  2 0.000000787  0.0698 sensitivity binary         1     Preprocessor1_Model01
##  3 0.000000787  0.0698 specificity binary         0.167 Preprocessor1_Model01
##  4 0.000000787  0.0698 precision   binary         0.643 Preprocessor1_Model01
##  5 0.000000787  0.0698 f_meas      binary         0.783 Preprocessor1_Model01
##  6 0.000000787  0.0698 roc_auc     binary         0.824 Preprocessor1_Model01
##  7 0.000374     0.166  accuracy    binary         0.667 Preprocessor1_Model02
##  8 0.000374     0.166  sensitivity binary         1     Preprocessor1_Model02
##  9 0.000374     0.166  specificity binary         0.167 Preprocessor1_Model02
## 10 0.000374     0.166  precision   binary         0.643 Preprocessor1_Model02
## # ℹ 50 more rows
## 
## [[4]]
## # A tibble: 60 × 6
##        penalty mixture .metric     .estimator .estimate .config              
##          <dbl>   <dbl> <chr>       <chr>          <dbl> <chr>                
##  1 0.000000787  0.0698 accuracy    binary         0.733 Preprocessor1_Model01
##  2 0.000000787  0.0698 sensitivity binary         0.913 Preprocessor1_Model01
##  3 0.000000787  0.0698 specificity binary         0.143 Preprocessor1_Model01
##  4 0.000000787  0.0698 precision   binary         0.778 Preprocessor1_Model01
##  5 0.000000787  0.0698 f_meas      binary         0.84  Preprocessor1_Model01
##  6 0.000000787  0.0698 roc_auc     binary         0.745 Preprocessor1_Model01
##  7 0.000374     0.166  accuracy    binary         0.7   Preprocessor1_Model02
##  8 0.000374     0.166  sensitivity binary         0.870 Preprocessor1_Model02
##  9 0.000374     0.166  specificity binary         0.143 Preprocessor1_Model02
## 10 0.000374     0.166  precision   binary         0.769 Preprocessor1_Model02
## # ℹ 50 more rows

What we see here are the six metrics for each of the ten tuning candidates from the grid search. We get four tibbles here—one for each fold. The .config column tells us that this is pre-processor 1. What does that refer to?

cv_res$wfs$wflow_id[[1]]
## [1] "word_nostop_nostem_f2_elasticnet"

This is looking at words only (no bigrams), no stop word removal, no stemming, a minimum use of 2 in the corpus, and the elastic net.

But what we want to know is which model is best. We used best_model for that above.

mod$best_id
## [1] "word_smart_stemmed_f2_randforest"
mod$best_params
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1    20    34 Preprocessor1_Model09

What performed best in cross-validation was only looking at words, using the SMART dictionary for stop words, stemming, only considering words used at least twice in the corpus, and using the random forest while considering 20 word features at each split with a minimum node size of 34.

We can also look at the holdout metrics:

mod$ho_metrics
## # A tibble: 1 × 7
##   accuracy precision sensitivity specificity est_pct_class est_pct_prob act_pct
##      <dbl>     <dbl>       <dbl>       <dbl>         <dbl>        <dbl>   <dbl>
## 1    0.765     0.762       0.994       0.110        0.0327        0.191   0.259

You’ll be familiar with the first four, but the last three are custom ones I add. Often, I am using these types of models to estimate, “How many people are talking about X theme?” So, I don’t care as much about individual-level predictions as I do about the aggregate. I get at this by averaging the estimated probabilities (est_pct_prob) and using that as my estimate. So here, I would say that about 19% of reviews are discussing Anderson’s visual style. The actual percentage (act_pct) is a bit higher, at about 26%. But not bad for a Saturday afternoon.

What about variable importance? What words are helping us figure out if the review touches on Wes’s visual style or not?

print(mod$var_imp, n = 10)
## # A tibble: 368 × 2
##    Variable Importance
##    <fct>         <dbl>
##  1 color         1.56 
##  2 visual        1.35 
##  3 cool          1.02 
##  4 audienc       0.994
##  5 shot          0.940
##  6 plot          0.751
##  7 design        0.575
##  8 pretti        0.554
##  9 set           0.503
## 10 landscap      0.492
## # ℹ 358 more rows

And yeah, this makes sense. The last thing we can do is plug in new reviews that have gone up since I made the model and generate predictions for them. The first and the third talk about visual style, while the others do not.

new_reviews <- tibble(
  text = c("Kodak film and miniatures make for a cool movie",
           "To me, Wes never misses",
           paste(
             "Visuals and characters were stunning as always but the dual plot",
             "was hard to follow and the whole thing felt a bit hollow"
            ),
           "my head hurts idk how i feel about this yet",
           "Me after asking AI to write me a wes anderson script")
)

# predict on these
predict(mod$fit, new_reviews, type = "prob")
## # A tibble: 5 × 2
##   .pred_0 .pred_1
##     <dbl>   <dbl>
## 1   0.675  0.325 
## 2   0.955  0.0448
## 3   0.545  0.455 
## 4   0.968  0.0317
## 5   0.870  0.130

Unfortunately, we don’t get hits on the two that actually are about his style. In truth, 40% of these new reviews (two of five) discuss his style, while averaging the .pred_1 column only gives us about 20%. But we can’t be surprised by that big of a miss from just five new reviews.

Step 5: Scale and Maintain

You’ve got the best model(s) in hand. What now? You scale it! In Step 1, we only looked at a random sample of maybe 2000 pieces of text. But you may have much, much more than that; you might be continuously collecting new text data that adhere to the coding manual. Use the predict() function on those cases. Then what I do is average up the predicted probabilities to get an estimate of how many people are talking about a given theme. But you could also use the predicted probabilities to decide which users to email about a specific theme, for example. Since your coding scheme came from a specific prompt or data source, make sure that you are applying it only to text that comes from this same generating mechanism (e.g., survey question, feedback prompt, ticketing system).
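
With the worked example above, scoring the full scraped corpus would look something like this (a sketch; it reuses mod from Step 4 and the ratings_uncoded.csv file written out in Appendix A):

library(tidyverse)

all_reviews <- read_csv("ratings_uncoded.csv")         # the full, uncoded corpus
preds <- predict(mod$fit, all_reviews, type = "prob")  # only the text column is used

mean(preds$.pred_1)  # estimated share of reviews discussing the visual style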

What is absolutely crucial here is keeping humans in the loop. As new data come in, figure out a regular cadence where you generate a (smaller than the original) random sample of data, code it, and then see how your predictions are holding up to new data. (After checking performance, these newly coded cases can then be added to the training set and the model can be updated to include them.) Also, read through these data to see if the coding manual needs to be updated, themes are no longer relevant, or new themes need to be added. This is, again, a grind of a process. But it is a way of scaling up the unique expertise of the humans in your organization to data that generally doesn’t get much attention. And, in my experience, keeping humans in the loop is not only good for model performance, but reading real responses from real people helps you as a researcher understand your data and your users (or respondents or whoever is generating the data) better.

Appendix A

library(rvest)
library(tidyverse)

# 12 reviews per page
# 10 possible ratings
# 6 pages each
# n = 720 reviews total

base_url1 <- "https://letterboxd.com/film/asteroid-city/reviews/rated/"
base_url2 <- "/by/activity/page/"
ratings <- seq(.5, 5, .5)
pages <- 6

# people often say to use purrr::map or an apply statement instead of a for
# loop. but if it fails, I want to know where, and I want to keep the data
# we already collected. so I'm initializing an empty data frame and then
# slotting the data in one page at a time
dat <- tibble(rating = vector("double"), text = vector("character"))

for (r in ratings) {
  pg <- 1
  
  while (pg <= pages) {
    url <- paste0(base_url1, r, base_url2, pg)
    
    txt <- url %>% 
      read_html() %>% 
      html_nodes(".collapsible-text") %>% 
      html_text2() %>% 
      map_chr(str_replace_all, fixed("\n"), " ")
    
    dat <- bind_rows(dat, tibble(rating = r, text = txt))
    
    cat("finished page", pg, "of", r, "stars\n")
    
    pg <- pg + 1
    Sys.sleep(runif(1, 0, 10))
  }
}

write_csv(dat, "ratings_uncoded.csv")

Appendix B

# funs -------------------------------------------------------------------------
library(textrecipes)
library(vip)
library(stopwords)
library(tidymodels)
library(workflowsets)
library(tidyverse)

# prepare data for modeling, do train/test split
# prop is the proportion passed straight to rsample::initial_split()
# stratify on the outcome variable
prep_data <- function(dat, y, txt, prop) {
  dat %>%
    transmute(text = .data[[txt]], y = factor(.data[[y]])) %>%
    initial_split(prop, strata = y)
}

# do cross-validation with the same, pre-defined, hard-coded engines and recipes
# n_folds is number of folds; grid_size is the size of the grid
do_cv <- function(dat_train, n_folds, grid_size) {
  
  # define recipes -------------------------------------------------------------
  ## base ----------------------------------------------------------------------
  rec_base <- recipe(y ~ text, dat_train)
  
  ## tokenize ------------------------------------------------------------------
  rec_word_nostop <- rec_base %>%
    step_tokenize(
      text,
      token = "words"
    )
  
  rec_word_smart <- rec_base %>%
    step_tokenize(
      text,
      token = "words",
      options = list(stopwords = stopwords(source = "smart"))
    )
  
  rec_word_iso <- rec_base %>%
    step_tokenize(
      text,
      token = "words",
      options = list(stopwords = stopwords(source = "stopwords-iso"))
    )
  
  rec_both_nostop <- rec_base %>%
    step_tokenize(
      text,
      token = "skip_ngrams",
      options = list(
        n = 2,
        k = 0
      )
    )
  
  rec_both_smart <- rec_base %>%
    step_tokenize(
      text,
      token = "skip_ngrams",
      options = list(
        stopwords = stopwords(source = "smart"),
        n = 2,
        k = 0
      )
    )
  
  rec_both_iso <- rec_base %>%
    step_tokenize(
      text,
      token = "skip_ngrams",
      options = list(
        stopwords = stopwords(source = "stopwords-iso"),
        n = 2,
        k = 0
      )
    )
  
  ## stem ----------------------------------------------------------------------
  rec_word_nostop_stemmed <- rec_word_nostop %>%
    step_stem(text)
  
  rec_word_smart_stemmed <- rec_word_smart %>%
    step_stem(text)
  
  rec_word_iso_stemmed <- rec_word_iso %>%
    step_stem(text)
  
  rec_both_nostop_stemmed <- rec_both_nostop %>%
    step_stem(text)
  
  rec_both_smart_stemmed <- rec_both_smart %>%
    step_stem(text)
  
  rec_both_iso_stemmed <- rec_both_iso %>%
    step_stem(text)
  
  ## filter, weight ------------------------------------------------------------
  rec_word_nostop_f2 <- rec_word_nostop %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_smart_f2 <- rec_word_smart %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_iso_f2 <- rec_word_iso %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_nostop_f2 <- rec_both_nostop %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_smart_f2 <- rec_both_smart %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_iso_f2 <- rec_both_iso %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_nostop_stemmed_f2 <- rec_word_nostop_stemmed %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_smart_stemmed_f2 <- rec_word_smart_stemmed %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_iso_stemmed_f2 <- rec_word_iso_stemmed %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_nostop_stemmed_f2 <- rec_both_nostop_stemmed %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_smart_stemmed_f2 <- rec_both_smart_stemmed %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_iso_stemmed_f2 <- rec_both_iso_stemmed %>%
    step_tokenfilter(text, min_times = 2, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_nostop_f5 <- rec_word_nostop %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_smart_f5 <- rec_word_smart %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_iso_f5 <- rec_word_iso %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_nostop_f5 <- rec_both_nostop %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_smart_f5 <- rec_both_smart %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_iso_f5 <- rec_both_iso %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_nostop_stemmed_f5 <- rec_word_nostop_stemmed %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_smart_stemmed_f5 <- rec_word_smart_stemmed %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_iso_stemmed_f5 <- rec_word_iso_stemmed %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_nostop_stemmed_f5 <- rec_both_nostop_stemmed %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_smart_stemmed_f5 <- rec_both_smart_stemmed %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_iso_stemmed_f5 <- rec_both_iso_stemmed %>%
    step_tokenfilter(text, min_times = 5, max_tokens = 5000) %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_nostop <- rec_word_nostop %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_smart <- rec_word_smart %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_iso <- rec_word_iso %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_nostop <- rec_both_nostop %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_smart <- rec_both_smart %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_iso <- rec_both_iso %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_nostop_stemmed <- rec_word_nostop_stemmed %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_smart_stemmed <- rec_word_smart_stemmed %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_word_iso_stemmed <- rec_word_iso_stemmed %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_nostop_stemmed <- rec_both_nostop_stemmed %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_smart_stemmed <- rec_both_smart_stemmed %>%
    step_tf(text, weight_scheme = "binary")
  
  rec_both_iso_stemmed <- rec_both_iso_stemmed %>%
    step_tf(text, weight_scheme = "binary")
  
  ## define specs --------------------------------------------------------------
  spec_elasticnet <- logistic_reg(
    mode = "classification",
    engine = "glmnet",
    penalty = tune(),
    mixture = tune()
  )
  
  spec_randforest <- rand_forest(
    mode = "classification",
    mtry = tune(),
    min_n = tune(),
    trees = 500
  ) %>%
    set_engine(engine = "ranger", importance = "impurity")
  
  # make workflowset -----------------------------------------------------------
  wfs <- workflow_set(
    preproc = list(
      word_nostop_nostem_f2 = rec_word_nostop_f2,
      word_smart_nostem_f2 = rec_word_smart_f2,
      word_iso_nostem_f2 = rec_word_iso_f2,
      both_nostop_nostem_f2 = rec_both_nostop_f2,
      both_smart_nostem_f2 = rec_both_smart_f2,
      both_iso_nostem_f2 = rec_both_iso_f2,
      word_nostop_stemmed_f2 = rec_word_nostop_stemmed_f2,
      word_smart_stemmed_f2 = rec_word_smart_stemmed_f2,
      word_iso_stemmed_f2 = rec_word_iso_stemmed_f2,
      both_nostop_stemmed_f2 = rec_both_nostop_stemmed_f2,
      both_smart_stemmed_f2 = rec_both_smart_stemmed_f2,
      both_iso_stemmed_f2 = rec_both_iso_stemmed_f2,
      word_nostop_nostem_f5 = rec_word_nostop_f5,
      word_smart_nostem_f5 = rec_word_smart_f5,
      word_iso_nostem_f5 = rec_word_iso_f5,
      both_nostop_nostem_f5 = rec_both_nostop_f5,
      both_smart_nostem_f5 = rec_both_smart_f5,
      both_iso_nostem_f5 = rec_both_iso_f5,
      word_nostop_stemmed_f5 = rec_word_nostop_stemmed_f5,
      word_smart_stemmed_f5 = rec_word_smart_stemmed_f5,
      word_iso_stemmed_f5 = rec_word_iso_stemmed_f5,
      both_nostop_stemmed_f5 = rec_both_nostop_stemmed_f5,
      both_smart_stemmed_f5 = rec_both_smart_stemmed_f5,
      both_iso_stemmed_f5 = rec_both_iso_stemmed_f5,
      word_nostop_nostem_f0 = rec_word_nostop,
      word_smart_nostem_f0 = rec_word_smart,
      word_iso_nostem_f0 = rec_word_iso,
      both_nostop_nostem_f0 = rec_both_nostop,
      both_smart_nostem_f0 = rec_both_smart,
      both_iso_nostem_f0 = rec_both_iso,
      word_nostop_stemmed_f0 = rec_word_nostop_stemmed,
      word_smart_stemmed_f0 = rec_word_smart_stemmed,
      word_iso_stemmed_f0 = rec_word_iso_stemmed,
      both_nostop_stemmed_f0 = rec_both_nostop_stemmed,
      both_smart_stemmed_f0 = rec_both_smart_stemmed,
      both_iso_stemmed_f0 = rec_both_iso_stemmed
    ),
    models = list(
      elasticnet = spec_elasticnet,
      randforest = spec_randforest
    ),
    cross = TRUE
  )
  
  # do cross-validation
  folds <- vfold_cv(dat_train, v = n_folds)
  
  # get result of cross-validation
  cv_res <- wfs %>%
    workflow_map(
      "tune_grid",
      grid = grid_size,
      resamples = folds,
      metrics = metric_set(
        accuracy,
        sensitivity,
        specificity,
        precision,
        f_meas,
        roc_auc
      ),
      verbose = TRUE
    )
  
  # return the entire workflow set and the results of cross-validation
  return(list(wfs = wfs, cv_res = cv_res))
}

# take results from cross-validation, get a final model and held-out metrics
best_model <- function(dat, cv_out, metric = "roc_auc") {
  
  # get the id of the best model, as according to the specified metric
  best_id <- cv_out$cv_res %>%
    rank_results(rank_metric = metric) %>%
    filter(.metric == metric & rank == 1) %>%
    pull(wflow_id)
  
  # get the name of the model
  # if you name the workflow sets differently, this step will change
  # it's based on the hardcoded names I gave them in defining the workflow set
  m <- str_split(best_id, "_")[[1]][[5]]
  
  # get best parameters, do final fit
  best_params <- cv_out$cv_res %>%
    extract_workflow_set_result(best_id) %>%
    select_best(metric = metric)
  
  final_fit <- cv_out$cv_res %>%
    extract_workflow(best_id) %>%
    finalize_workflow(best_params) %>%
    fit(training(dat))
  
  # run it on the holdout data
  dat_test <- testing(dat)
  
  # if glmnet, feed it the right penalty
  # I'm pretty sure this is necessary,
  # because glmnet fits many lambda as it's more efficient
  if (m == "elasticnet") {
    dat_test <- bind_cols(
      dat_test,
      predict(final_fit, dat_test, penalty = best_params$penalty),
      predict(final_fit, dat_test, type = "prob", penalty = best_params$penalty)
    )
  } else {
    dat_test <- bind_cols(
      dat_test,
      predict(final_fit, dat_test),
      predict(final_fit, dat_test, type = "prob")
    )
  }
  
  # define metrics to output
  ms <- metric_set(accuracy, sensitivity, specificity, precision)
  
  # return metrics
  metrics_res <- ms(dat_test, truth = y, estimate = .pred_class) %>%
    select(-.estimator) %>%
    spread(.metric, .estimate) %>%
    mutate(
      est_pct_class = mean(dat_test$.pred_class == 1),
      est_pct_prob = mean(dat_test$.pred_1),
      act_pct = mean(dat_test$y == 1)
    )
  
  # get variable importance, which depends on the model
  if (m == "randforest") {
    var_imp <- final_fit %>%
      extract_fit_parsnip() %>%
      vi() %>%
      mutate(
        Variable = str_remove_all(Variable, "tf_text_"),
        Variable = factor(Variable, Variable)
      )
  } else if (m == "elasticnet") {
    var_imp <- final_fit %>%
      tidy() %>%
      filter(
        penalty == best_params$penalty &
          term != "(Intercept)" &
          estimate > 0
      ) %>%
      arrange(desc(abs(estimate))) %>%
      select(-penalty) %>%
      mutate(
        term = str_remove_all(term, "tf_text_"),
        term = factor(term, term)
      )
  } else {
    # if you add more models, this is where you'd change the code to get
    # variable importance for that specific class of model output
    var_imp <- NA
  }
  
  return(
    list(
      best_id = best_id,
      best_params = best_params,
      var_imp = var_imp,
      fit = final_fit,
      ho_metrics = metrics_res
    )
  )
}