Well, it looks like the time has finally come for me to join the club
and write a large language model (LLM) blog post. I hope to do two
things here:
In my previous blog
post, I discussed scraping film awards data to build a model
predicting the Best Picture winner at the Academy Awards. One issue I
run into, however, is that some HTML is understandably not written with
scraping in mind. When I try to write a script that iterates through 601
movies, for example, the structure and naming of the data are
inconsistent. The lack of standardization means writing modular
functions for scraping data programmatically is difficult.
A recent
Pew Research Center report showed how they used GPT-3.5 Turbo to
collect data about podcast guests. My approach here is similar: I scrape
what I can, give it to the OpenAI API along with a prompt, and then
interpret the result.
I wanted to add two variables to my Oscar model: whether the director
was also a writer on the film, and whether the director was also a
producer. The reasoning is that directors who are famous for writing
their own material (e.g., Paul Thomas Anderson, Sofia Coppola) may be
more or less likely to see their films win Best Picture. Similarly,
perhaps being a producer as well as the director means that the director
has achieved some level of previous success that makes their film more
likely to take home Best Picture.
The difficulty of scraping this from Wikipedia is that the “infobox”
(i.e., the light grey box at the top, right-hand side of the entry) does
not follow the same structure, formatting, or naming conventions across
pages.
Methodology
To get the data I want (a logical value for whether or not the
director was also a writer and another logical value for if they were a
producer), I took the following steps:
1. Use the rvest package in R to pull down the “infobox” from the
   Wikipedia page, doing my best to limit it to the information relevant
   to the director, writer, and producer
2. Use the openai Python library to pass this information to GPT-3.5
   Turbo or GPT-4
3. Parse the result in R using the tidyverse to arrange the data nicely
   and append it to my existing dataset for the Oscar model
Now, you could be asking: Why not use Python’s beautifulsoup4 in Step 1?
Because I like rvest more and have more experience using it. And why not
use R to access the OpenAI API? Because the official way in their
documentation to access it is by using Python. Lastly, why not use
pandas in Python to tidy the data afterward? Because I think the
tidyverse in R is a much easier way to clean data.
The great news: Posit’s RStudio IDE can handle both R and Python
(among many other languages). The reticulate R package also means we can
import Python functions directly into an R session (and vice versa with
rpy2). These are all just tools at the end of the day, so why not use
the ones I’m most comfortable, quickest, and most experienced with?
The Functions
I started with two files, funs.R and funs.py, which store the functions
I used. funs.R is for pulling the data from the Wikipedia infobox, given
the title and year of a film. I use this to search Wikipedia, get the
URL of the first result from the search results, and then scrape the
infobox from that page:
#' Get the information box of a Wikipedia page
#'
#' Takes the title and year of a film, searches for it, gets the top result,
#' and pulls the information box at the top right of the page.
#'
#' @param title Title of the film
#' @param year Year the film was released
get_wikitext <- function(title, year) {
  tryCatch({
    tmp_tbl <- paste0(
      "https://en.wikipedia.org/w/index.php?search=",
      str_replace_all(title, " ", "+"),
      "+",
      year,
      "+film"
    ) %>%
      rvest::read_html() %>%
      rvest::html_nodes(".mw-search-result-ns-0:nth-child(1) a") %>%
      rvest::html_attr("href") %>%
      paste0("https://en.wikipedia.org", .) %>%
      rvest::read_html() %>%
      rvest::html_node(".vevent") %>%
      rvest::html_table() %>%
      janitor::clean_names()

    # just relevant rows
    lgls <- grepl("Direct", tmp_tbl[[1]]) |
      grepl("Screen", tmp_tbl[[1]]) |
      grepl("Written", tmp_tbl[[1]]) |
      grepl("Produce", tmp_tbl[[1]])
    tmp_tbl <- tmp_tbl[lgls, ]

    # clean up random css
    # I have no idea how this works
    # I just got it online
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")

    # print text
    apply(tmp_tbl, 1, \(x) paste0(x[[1]], ": ", x[[2]])) %>%
      paste(collapse = ", ") %>%
      str_replace_all("\n", " ")
  },
  error = \(x) NA
  )
}
An example output:
> get_wikitext("all that jazz", 1979)
[1] "Directed by: Bob Fosse, Written by: Robert Alan AurthurBob Fosse, Produced by: Robert Alan Aurthur"
Not perfect, but should be close enough. Sometimes it is closer, with
different formatting:
> get_wikitext("la la land", 2016)
[1] "Directed by: Damien Chazelle, Written by: Damien Chazelle, Produced by: Fred Berger Jordan Horowitz Gary Gilbert Marc Platt"
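One detail of the search step is worth a sketch. get_wikitext() builds
the search URL by swapping spaces for "+", which leaves characters such
as "&" unescaped. The Python snippet below is a minimal illustration of
building the same URL with standard-library percent-encoding. The
function name build_search_url is hypothetical and is not part of
funs.R or funs.py; it only shows the idea.

```python
from urllib.parse import quote_plus


def build_search_url(title, year):
    """Build a Wikipedia search URL for a film title and release year."""
    # quote_plus turns spaces into "+" and percent-encodes characters
    # such as "&" that would otherwise be read as part of the URL syntax.
    query = quote_plus(f"{title} {year} film")
    return f"https://en.wikipedia.org/w/index.php?search={query}"


print(build_search_url("all that jazz", 1979))
# https://en.wikipedia.org/w/index.php?search=all+that+jazz+1979+film
print(build_search_url("Pride & Prejudice", 2005))
# https://en.wikipedia.org/w/index.php?search=Pride+%26+Prejudice+2005+film
```

For plain titles the result matches what the R code produces; the two
only differ when a title contains punctuation that has a special
meaning in a URL.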
The text that get_wikitext() returns is then passed to the function
defined in funs.py. That script is:
from openai import OpenAI
import ast

client = OpenAI(api_key='API_KEY_GOES_HERE')


def get_results(client, wikitext):
    # send the prompt plus the scraped infobox text to the model
    chat_completion = client.chat.completions.create(
        messages=[
            {
                'role': 'user',
                'content': '''
                Below is a list that includes people involved with making a
                movie. Each part corresponds to a different role that one might
                have in making the movie (such as director, writer, or producer).
                Could you tell me two things about the director? First,
                did the director also write the script/screenplay/story for the
                movie? And second, did the director also serve as a producer for
                the movie? Note that, in this list, names may not be separated by
                spaces even when they should be. That is, names may run together
                at times. You do not need to provide any explanation. Please reply
                with a valid Python dictionary, where: 'writer' is followed by
                True if the director also wrote the film and False if they did
                not, and 'producer' is followed by True if they also produced the
                film and False if they did not. If you cannot determine, you can
                follow it with NA instead of True or False. The information is:
                ''' + wikitext
            }
        ],
        model='gpt-3.5-turbo'
    )

    # tidy result to make readable dict
    out = chat_completion.choices[0].message.content
    out = out.replace('\n', '')
    out = out.replace(' ', '')
    out = out.replace('true', 'True')
    out = out.replace('false', 'False')

    # turn the text of the dict into an actual Python dict
    return ast.literal_eval(out)
(The documentation is thinner here because I’m less familiar with
writing Python functions.)
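The prompt allows the model to answer NA, which ast.literal_eval()
cannot read on its own, so such a reply raises an error. Below is a
minimal sketch of a more forgiving parser. It is an illustration under
my own assumptions, not the code used for the post: the function name
parse_reply is hypothetical, and it assumes the reply contains one flat
dictionary with the keys 'writer' and 'producer'.

```python
import ast
import re


def parse_reply(reply):
    """Turn the model's text reply into a dict of True/False/None values."""
    # Keep only the first {...} span in case the model adds extra words.
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    text = match.group(0)

    # Normalise JSON-style literals and the NA placeholder to Python literals.
    text = re.sub(r"\btrue\b", "True", text)
    text = re.sub(r"\bfalse\b", "False", text)
    text = re.sub(r"\b(NA|null)\b", "None", text)

    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return None

    # Accept only the shape the prompt asked for.
    if not isinstance(parsed, dict) or set(parsed) != {"writer", "producer"}:
        return None

    # Anything that is not a real boolean is treated as "could not determine".
    return {k: v if isinstance(v, bool) else None for k, v in parsed.items()}


print(parse_reply("{'writer': True, 'producer': False}"))
print(parse_reply('Sure! {"writer": true, "producer": NA}'))
print(parse_reply("I cannot tell."))
```

A None value would still need to be mapped to NA on the R side before
it is written into a column; that step is not shown here.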
Bringing It Together
I used an R script to use these functions in the same session. We
start off by loading the R packages, sourcing the R script, activating
the Python virtual environment (the path is relative to my file
structure in my drive), and sourcing the Python script. I read in the
data from a Google Sheet of mine and do one step of cleaning, as the
read_sheet() function was bringing the film title variable in as a list
of lists instead of a character vector.
library(tidyverse)
library(reticulate)
source("funs.R")
use_virtualenv("../../")
source_python("funs.py")
dat <- googlesheets4::read_sheet("SHEET_ID_GOES_HERE") %>%
  mutate(film = as.character(film))
I then initialize two new variables in the data: writer and producer.
These will get populated with TRUE if the director also served as a
writer or producer, respectively, and FALSE otherwise.
res <- dat %>%
  select(year, film) %>%
  mutate(writer = NA, producer = NA)
I iterate through each row using a for loop (I know this isn’t a very
tidyverse way of doing things, as map_*() statements are usually
preferred, but I felt it was easiest for making sense of the code and
catching errors).
for (r in seq_len(nrow(res))) {
  cat(r, "\n")
  tmp_wikitext <- get_wikitext(res$film[r], res$year[r])

  # skip if get_wikitext fails or returns a blank string
  # (check the length first: is.na() on a zero-length value breaks if())
  if (length(tmp_wikitext) == 0) next
  if (is.na(tmp_wikitext) || !nzchar(tmp_wikitext)) next

  # give the text to openai
  tmp_chat <- tryCatch(
    get_results(client, tmp_wikitext),
    error = \(x) NA
  )

  # if openai returned a dict of 2
  if (length(tmp_chat) == 2) {
    res$writer[r] <- tmp_chat$writer
    res$producer[r] <- tmp_chat$producer
  }
}
I use cat() to track progress. I use the function from funs.R to pull
down the text I want GPT-3.5 to extract information from. You’ll note
that that function had a tryCatch() in it, because I didn’t want
everything to stop at an error. Upon an error, it’ll just return an NA.
I also found that sometimes it would read a different page successfully
but then just return a blank character string. So if either of those is
true, I say next to skip to the next row. This means I’m not wasting
OpenAI tokens feeding it blanks.
Then I use a Python function inside of an R session! I use
get_results(), which was defined in funs.py, to take the text from
Wikipedia and give it to OpenAI. If there was an error, I again use
tryCatch() to give me an NA instead of shutting the whole thing down. If
there wasn’t an error, I add the values to the res data that I
initialized above. Notably, the reticulate package knows that a Python
dictionary should be brought in as a named list, here one holding two
logical values.
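The same "return a placeholder instead of stopping" idea can also be
written on the Python side. The snippet below is a small sketch of that
pattern only. Both call_or_none and flaky are hypothetical names made up
for this illustration; flaky stands in for get_results() so the example
runs without an API key.

```python
def call_or_none(fn, *args, **kwargs):
    """Call fn and return its result, or None if it raises."""
    try:
        return fn(*args, **kwargs)
    except Exception as err:
        # Report the problem instead of stopping the whole run.
        print(f"skipped: {type(err).__name__}: {err}")
        return None


def flaky(text):
    # Stand-in for get_results(); fails on empty input.
    if not text:
        raise ValueError("empty input")
    return {"writer": True, "producer": False}


print(call_or_none(flaky, "Directed by: Bob Fosse"))
print(call_or_none(flaky, ""))
```

Whether the fallback lives in R (as in the loop above) or in Python is
mostly a matter of taste; the point is that one bad row should not end
a long scraping run.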
What we can see from this script is you can seamlessly use R and
Python in one session, depending on the tools you have and what you’re
comfortable with. A clickbait topic in data science for the last ten
years or so has been “R or Python?” when really the answer is both: They
play quite nicely with one another, thanks to the hard work of
programmers who have developed packages like reticulate and Posit’s
focus on languages beyond R.
Conclusion
- You can use R and Python together smoothly
- You can use the OpenAI API to efficiently do content coding for your
  research and models
- ALWAYS KEEP A HUMAN IN THE LOOP to check for accuracy and fairness
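One simple way to keep a human in the loop is to hand-check a random
sample of the coded rows and see how often the model and the reviewer
agree. The sketch below shows that idea under my own assumptions: the
function names sample_for_review and agreement_rate are hypothetical,
and the labels are made up for illustration, not results from this
project.

```python
import random


def sample_for_review(rows, n, seed=0):
    """Pick n coded rows at random for a human to check by hand."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return rng.sample(rows, min(n, len(rows)))


def agreement_rate(model_labels, human_labels):
    """Share of checked rows where the model and the human agree."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not model_labels:
        return None
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)


# Made-up labels purely for illustration; not results from the post.
coded = [{"film": f"film_{i}", "writer": i % 2 == 0} for i in range(20)]
to_check = sample_for_review(coded, 5)
model = [row["writer"] for row in to_check]
human = list(model)
human[0] = not human[0]  # pretend the reviewer disagreed once
print(agreement_rate(model, human))  # 0.8
```

If agreement on the sample is low, that is a signal to revisit the
prompt or the scraped text before trusting the rest of the column.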