Introduction
Last time, I did some basic frequency analyses of Kanye West lyrics. I've been using the tidytext package at work a lot recently, and I thought I would apply some of the package's sentiment dictionaries to the Kanye West corpus I have already made (discussed here). The corpus is not big enough to do the analyses by song or by a very specific emotion (such as joy), so I will stick to tracking positive and negative sentiment of Kanye's lyrical content over the course of his career.
Method
For each song, I removed duplicate words, so that, for example, the song “Amazing” doesn't have an undue influence on the analyses, given that he says “amazing”—a positive word—about 50 times. I allowed duplicate words from the same album, but not from the same song.
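As a quick sketch of that per-song deduplication (toy data; the data frame and column names here are made up for illustration):

```r
library(dplyr)

# toy data: one word per row, tagged with the song it came from
lyrics <- data.frame(
  song = c("Amazing", "Amazing", "Amazing", "POWER"),
  word = c("amazing", "amazing", "so", "amazing"),
  stringsAsFactors = FALSE
)

lyrics %>%
  group_by(song) %>%            # duplicates are judged within each song...
  filter(!duplicated(word)) %>% # ...so the repeated "amazing" is dropped within "Amazing" only
  ungroup()
```

Note that “amazing” survives once in “Amazing” and once in “POWER”: duplicates are only removed within a song, not across the album.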
The tidytext package includes three different sentiment analysis dictionaries. One of the dictionaries (AFINN) assigns each word a score from -5 (very negative) to +5 (very positive). Using this dictionary, I simply took the average score for the album:
library(dplyr)    # pipes and data-manipulation verbs
library(tidyr)    # spread()
library(tidytext) # unnest_tokens() and get_sentiments()

afinn <- kanye %>% # starting with full data set
  unnest_tokens(word, lyrics) %>% # what was a long string of text now becomes one word per row
  inner_join(get_sentiments("afinn"), by="word") %>% # joining sentiments with these words
  group_by(song) %>% # grouping dataset by song
  filter(!duplicated(word)) %>% # dropping duplicates by song (filter(), unlike subset(), respects the grouping)
  ungroup() %>% # ungrouping real quick...
  group_by(album) %>% # and now grouping by album
  mutate(sentiment=mean(score)) %>% # getting mean by album
  slice(1) %>% # taking one row per album
  ungroup() %>% # ungrouping
  subset(select=c(album,sentiment)) # only including the album and sentiment columns
The other two dictionaries tag each word as negative or positive. For these, I tallied up how many positive and negative words were being used per album and subtracted the negative count from the positive count:
bing <- kanye %>% # taking data set
  unnest_tokens(word, lyrics) %>% # making long list of lyrics into one word per row
  inner_join(get_sentiments("bing"), by="word") %>% # joining words with sentiment
  group_by(song) %>% # grouping by song
  filter(!duplicated(word)) %>% # getting rid of duplicated words within each song (filter() respects the grouping)
  ungroup() %>% # ungrouping
  count(album, sentiment) %>% # counting up negative and positive words per album
  spread(sentiment, n, fill=0) %>% # putting negative and positive into different columns
  mutate(sentiment=positive-negative) # subtracting negative from positive
# all of this is the same as above, but with a different dictionary
nrc <- kanye %>%
  unnest_tokens(word, lyrics) %>%
  inner_join(get_sentiments("nrc"), by="word") %>%
  group_by(song) %>%
  filter(!duplicated(word)) %>%
  ungroup() %>%
  count(album, sentiment) %>%
  spread(sentiment, n, fill=0) %>%
  mutate(sentiment=positive-negative)
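To make the count/spread step concrete, here is what it does on fabricated tallies (the album names and numbers are invented for illustration):

```r
library(dplyr)
library(tidyr)

# fabricated per-album tallies of tagged words
counts <- data.frame(
  album = c("A", "A", "B"),
  sentiment = c("negative", "positive", "positive"),
  n = c(3, 10, 5)
)

counts %>%
  spread(sentiment, n, fill=0) %>%        # one column per sentiment tag
  mutate(sentiment = positive - negative) # net positivity per album
```

Album B uses no negative words, so fill=0 keeps its negative count at 0 instead of NA; its net score is 5, while album A's is 7. (In newer versions of tidyr, pivot_wider() replaces spread().)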
I then merged all these data frames together, standardized the sentiment scores, and averaged the three z-scores together to get an overall sentiment rating for each album:
colnames(afinn)[2] <- "sentiment_afinn" # renaming column will make it easier upon joining
bing <- ungroup(subset(bing, select=c(album,sentiment))) # getting rid of unnecessary columns
colnames(bing)[2] <- "sentiment_bing"
nrc <- ungroup(subset(nrc, select=c(album,sentiment)))
colnames(nrc)[2] <- "sentiment_nrc"
suppressMessages(library(plyr)) # getting plyr temporarily because I like join_all()
album_sent <- join_all(list(afinn, bing, nrc), by="album") # joining all three datasets by album
suppressMessages(detach("package:plyr")) # getting rid of plyr, because it gets in the way of dplyr
# creating composite sentiment score:
album_sent$sent <- (scale(album_sent$sentiment_afinn) +
                    scale(album_sent$sentiment_bing) +
                    scale(album_sent$sentiment_nrc))/3
album_sent <- album_sent[-7,c(1,5)] # subsetting data to not include watch throne and only include composite
# reordering the albums in chronological order
album_sent$album <- factor(album_sent$album, levels=levels(album_sent$album)[c(2,4,3,1,5,7,8,6)])
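The z-scoring deserves a quick illustration, since the three dictionaries sit on very different scales (a mean of -5-to-+5 scores versus raw count differences). scale() centers a vector at 0 and divides by its standard deviation, making the scores comparable before averaging (the numbers below are made up):

```r
x <- c(2, 4, 6)  # made-up sentiment scores on some arbitrary scale
z <- scale(x)    # subtract mean(x), then divide by sd(x)
as.vector(z)     # -1, 0, 1: now on the same footing as any other standardized score
```

One quirk worth knowing: scale() returns a one-column matrix, so album_sent$sent above is technically a matrix column. ggplot handles that fine, but wrapping the composite in as.vector() would give an ordinary numeric column.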
Results
And then I plotted the sentiment scores by album, in chronological order. Higher scores represent more positive lyrics:
library(ggplot2)  # plotting
library(ggthemes) # theme_fivethirtyeight()

ggplot(album_sent, aes(x=album, y=sent))+
  geom_point()+
  geom_line(group=1, linetype="longdash", size=.8, alpha=.5)+
  labs(x="Album", y="Sentiment")+
  theme_fivethirtyeight()+
  scale_x_discrete(labels=c("The College Dropout", "Late Registration", "Graduation",
                            "808s & Heartbreak", "MBDTF", "Yeezus", "The Life of Pablo"))+
  theme(axis.text.x=element_text(angle=45, hjust=1),
        text=element_text(size=12))
Looks like a drop in happiness after his mother passed as well as after starting to date Kim Kardashian…
This plot jibes with what we know: 808s is the “sad” album. Graduation—with all the talk about achievement, being strong, making hit records, and having a big ego—comes out as the album with the most positive words. This plot seems to lend some validity to the method of analyzing sentiments using all three dictionaries from the tidytext package.
One issue with this “bag-of-words” approach, however, is that it is prone to error at finer levels of analysis, such as the song. “Street Lights,” perhaps one of Kanye's saddest songs, came out with one of the most positive scores. Why? The analysis read words like “fair” as positive, failing to notice that a negation word (e.g., “not”) always preceded it. One could get around this with n-grams or natural language processing.
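As a sketch of the n-gram idea, tidytext can tokenize into bigrams instead of single words, which makes negations visible (the word1/word2 column names here are just my choice):

```r
library(dplyr)
library(tidyr)
library(tidytext)

lyric <- data.frame(text = "the world is not fair", stringsAsFactors = FALSE)

lyric %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # "the world", "world is", "is not", "not fair"
  separate(bigram, c("word1", "word2"), sep = " ") %>%     # split each bigram into two columns
  filter(word1 == "not") # flags "not fair": "fair" should be negated, not scored as positive
```

From there, one could flip (or simply drop) the sentiment score of any word preceded by “not” before aggregating by album.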
Nevertheless, there's the trajectory of Kanye's lyrical positivity over time!