I wrote a post last year looking at how to employ tools in LangChain to have GPT-3.5 Turbo access information on the web, outside of its training data.
The purpose of the present post is to revisit that post and improve on the poor performance I saw there through refactoring and prompt engineering.
Background
The motivating example is again using large language models (LLMs) to help me calculate features for my Oscar model. Specifically: How many films did the director of a Best Picture nominee direct before the nominated film?
Read More
I wrote a post in July 2023 describing my process for building a supervised text classification pipeline. In short, the process first involves reading the text, writing a thematic content coding guide, and having humans label text. Then, I define a variety of ways to pre-process text (e.g., word vs. word-and-bigram tokenizing, stemming vs. not, stop words vs. not, filtering on the number of times a word had to appear in the corpus) in a workflowset. Next, I run these different pre-processors through different standard models: elastic net, XGBoost, random forest, etc. Each class of text has its own model, so I would run this pipeline five times if there were five topics in the text. Importantly, this is not natural language processing (NLP), as it is a bag-of-words approach.
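The crossing of pre-processors and models described above can be sketched in a few lines. This is a minimal, standard-library Python illustration, not the code from the original post: the option names, the tiny stop-word list, and the minimum-count thresholds are all assumptions made for the example, stemming is left out to keep it short, and the models appear only as labels rather than fitted estimators.

```python
from collections import Counter
from itertools import product

# Hypothetical option grid mirroring the choices described above.
TOKENIZERS = ["word", "word_and_bigram"]
REMOVE_STOP_WORDS = [True, False]
MIN_COUNTS = [1, 2]  # assumed thresholds, for illustration only
MODELS = ["elastic_net", "xgboost", "random_forest"]  # labels only

STOP_WORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative list


def tokenize(text, tokenizer, remove_stop_words):
    """Split text into lowercase words, optionally adding bigrams."""
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    tokens = list(words)
    if tokenizer == "word_and_bigram":
        tokens += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return tokens


def bag_of_words(corpus, tokenizer, remove_stop_words, min_count):
    """Build a document-term count matrix.

    Tokens are kept only if they appear at least min_count times
    across the whole corpus.
    """
    docs = [Counter(tokenize(t, tokenizer, remove_stop_words)) for t in corpus]
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    vocab = sorted(t for t, n in totals.items() if n >= min_count)
    return vocab, [[doc[t] for t in vocab] for doc in docs]


# Every pre-processor crossed with every model: one candidate workflow each.
workflows = list(product(TOKENIZERS, REMOVE_STOP_WORDS, MIN_COUNTS, MODELS))

if __name__ == "__main__":
    reviews = ["The colors are symmetrical", "the colors pop"]
    vocab, matrix = bag_of_words(reviews, "word_and_bigram", False, 2)
    print(len(workflows), vocab, matrix)
```

Each tuple in `workflows` stands for one pre-processor and model pairing to be tuned and compared. With several topics, the same grid would be run once per topic.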
The idea was to leverage the domain knowledge of the experts on my team through content coding, and then scale it up using a machine learning pipeline. In the post, I bemoaned how most of the “NLP” or “AI-driven” tools I had tested did not do very well. The tools I was thinking of were all web-based, point-and-click applications that I had tried out since about 2018, and they usually were unsupervised.
We are in a wildly different environment now when it comes to analyzing text than we were even a few years ago. I am revisiting that post to explore alternative routes to classifying text. I will use the same data as I did in that post: 720 Letterboxd reviews of Wes Anderson’s film Asteroid City. There is only one code: Did the review discuss Wes Anderson’s unique visual style (1) or not (0)? I hand-labeled all of these in one afternoon to give me a supervised dataset to play with.
Read More
January 12th, 2025
The Directors Guild of America (DGA) award for Outstanding Directing - Feature Film is the most important predictor of who will win Best Picture at the Academy Awards. And it’s the most important by a lot. Really, this whole exercise I’m doing is about learning: How can I predict when the Academy and the DGA will disagree? The winner of the DGA’s top award has gone on to win Best Picture 75% of the time (n = 57).
The DGA award wasn’t announced this past week… but the nominees were. An additional 21% (n = 16) of Best Picture winners were nominated for the DGA’s Outstanding Directing award without winning it.
That means there have been only three times a Best Picture winner was not nominated by the DGA: Hamlet (1948), Driving Miss Daisy (1989), and CODA (2021).
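As a quick check on the arithmetic above, the three counts should add up to the total number of Best Picture winners considered. A minimal sketch, assuming the three groups (won the DGA award, nominated only, not nominated) are exhaustive and do not overlap; the total of 76 is derived from the counts here rather than stated in the post, and the group labels are shorthand for this example:

```python
# Counts taken from the text above; the labels are shorthand for this example.
counts = {
    "won_dga": 57,
    "nominated_only": 16,
    "not_nominated": 3,  # Hamlet, Driving Miss Daisy, CODA
}

# Assumption: the three groups are exhaustive and non-overlapping.
total = sum(counts.values())

# Share of Best Picture winners in each group, rounded to whole percentages.
shares = {group: round(100 * n / total) for group, n in counts.items()}

print(total, shares)
```

Under that assumption, the remaining three films make up roughly 4% of winners, which is why a DGA nomination carries so much weight as a predictor.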
Read More