R and Python Together: Refactoring and Prompt Engineering a Previous Case Study, Using the Perplexity API

I wrote a post last year looking at how to employ tools in LangChain to have GPT-3.5 Turbo access information on the web, outside of its training data.

The purpose of the present post is to revisit that work and improve the poor performance I saw there through refactoring and prompt engineering.

Background

The motivating example is again using large language models (LLMs) to help me calculate features for my Oscar model. Specifically: How many films did the director of a Best Picture nominee direct before the nominated film?
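Below is a minimal sketch of asking that kind of filmography question through the Perplexity API, which exposes an OpenAI-compatible chat completions endpoint. The model name, prompt wording, and the director/film used here are illustrative assumptions, not the exact ones from the post.

```python
# Sketch: query an online-search-capable model via the Perplexity API
# (OpenAI-compatible endpoint). Model name is illustrative; check the
# current Perplexity documentation for available models.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PERPLEXITY_API_KEY"],
    base_url="https://api.perplexity.ai",
)

prompt = (
    "How many feature films did Greta Gerwig direct before Barbie (2023)? "
    "Answer with a single integer."
)

response = client.chat.completions.create(
    model="sonar",  # assumed model name for illustration
    messages=[
        {"role": "system", "content": "You are a precise film historian."},
        {"role": "user", "content": prompt},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
```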

Read More

Rethinking How I Do Supervised Topic Modeling, Using ModernBERT and GPT-4o mini

I wrote a post in July 2023 describing my process for building a supervised text classification pipeline. In short, the process starts with reading the text, writing a thematic content coding guide, and having humans label the text. I then define a variety of ways to pre-process the text (e.g., word vs. word-and-bigram tokenizing, stemming vs. not, stop words vs. not, filtering on how many times a word must appear in the corpus) in a workflowset, and run each of these pre-processors through a set of standard models: elastic net, XGBoost, random forest, and so on. Each class of text gets its own model, so if there were five topics in the text, I would run the pipeline five times. Importantly, this is not natural language processing (NLP); it is a bag-of-words approach.
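For readers who think in Python rather than R, here is a rough scikit-learn analogue of that kind of pipeline. The original was built with tidymodels workflowsets; the pre-processing choices and parameter grid below are illustrative, not the post's actual settings.

```python
# Rough Python analogue of the bag-of-words pipeline described above;
# illustrative parameters only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vectorize", CountVectorizer()),
    ("model", LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)),
])

# Cross the pre-processing choices (unigrams vs. unigrams + bigrams,
# minimum term frequency) with model hyperparameters, mirroring a workflowset.
param_grid = {
    "vectorize__ngram_range": [(1, 1), (1, 2)],
    "vectorize__min_df": [2, 5],
    "model__C": [0.1, 1.0],
    "model__l1_ratio": [0.0, 0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
# search.fit(reviews, labels)  # one binary label per document; repeat per topic
```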

The idea was to leverage the domain knowledge of the experts on my team through content coding, and then scale it up using a machine learning pipeline. In the post, I bemoaned how most of the “NLP” or “AI-driven” tools I had tested did not perform very well. The tools I had in mind were all web-based, point-and-click applications I had tried since about 2018, and they were usually unsupervised.

We are in a wildly different environment for analyzing text than we were even a few years ago. I am revisiting that post to explore alternate routes to classifying text, using the same data I used then: 720 Letterboxd reviews of Wes Anderson’s film Asteroid City. There is only one code: Did the review discuss Wes Anderson’s unique visual style (1) or not (0)? I hand-labeled all of the reviews in one afternoon to give myself a supervised dataset to play with.
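As a taste of one of the newer routes, here is a minimal sketch of zero-shot labeling a single review with GPT-4o mini. The prompt wording is an assumption for illustration, not the coding guide from the post.

```python
# Sketch: label one review with GPT-4o mini via the OpenAI chat completions API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def label_review(review_text: str) -> int:
    """Return 1 if the review discusses Wes Anderson's visual style, else 0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You label Letterboxd reviews of Asteroid City. Reply with 1 "
                    "if the review discusses Wes Anderson's visual style, "
                    "otherwise reply with 0. Reply with the digit only."
                ),
            },
            {"role": "user", "content": review_text},
        ],
    )
    return int(response.choices[0].message.content.strip())
```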

Read More

Predicting Best Picture at the 2025 Academy Awards

January 26th, 2025

The nominations are finally here, and I have narrowed the list from 20 down to 10. There was one big surprise: I’m Still Here received a nomination for Best Picture. It was not on my list of 20 possible nominees; only one of the six sources I used to build that list had mentioned it back in late November. I added it and removed the 11 films on my list that were not nominated. In addition to narrowing the field, the model now also incorporates down-ballot Academy Award nominations.

The picture we see is a three-film race among Emilia Pérez, The Brutalist, and Anora. These are the only three movies with a forecasted chance of winning above 1%, and each is above 20%. However, no clear leader crosses the 50% threshold.

Read More