In Chapter 13, we transformed our corpus into a DFM with counts of each word in each document. But not all words are created equal; some words are much more psychologically interesting than others. The simplest way to count relevant words while ignoring others is by using a dictionary.
This chapter introduces the basics of dictionary-based methodology. Chapter 15 and Chapter 16 will build on this chapter, exploring more advanced ways to use token counting for measurement.
A dictionary is a list of words (or other tokens) associated with a given psychological or other construct. For example, a dictionary for depression might include words like “sleepy” and “down.” We can use the dictionary to count construct-related words in each text—texts that use more construct-related words are then assumed to be more construct-related overall.
Let’s give a more concrete example: Recall that in the Hippocorpus data, the memType
variable indicates whether the participant was told to tell a story that happened to them recently (“recalled”), a story that they had already told a few months earlier (“retold”), or an entirely fictional story (“imagined”).
Sap et al. (2022) hypothesized that true autobiographical stories would include more surprising events than imagined stories. To test this hypothesis, we could use a dictionary of surprise-related words. Where could we find such a dictionary? Perhaps we could try making one up?
surprise_dict <- dictionary(
list(
surprise = c("surprise", "wow", "suddenly", "bang")
)
)
surprise_dict
#> Dictionary object with 1 key entry.
#> - [surprise]:
#> - surprise, wow, suddenly, bang
Generating a comprehensive emotion dictionary from scratch is not easy. Luckily, other researchers have done the work for us: The NRC Word-Emotion Association Lexicon (S. M. Mohammad & Turney, 2013; S. Mohammad & Turney, 2010), included in the quanteda.sentiment
package, has a list of 534 surprise words.
surprise_dict <- quanteda.sentiment::data_dictionary_NRC["surprise"]
surprise_dict
#> Dictionary object with 1 key entry.
#> Polarities: pos = "positive"; neg = "negative"
#> - [surprise]:
#> - abandonment, abduction, abrupt, accident, accidental, accidentally, accolade, advance, affront, aghast, alarm, alarming, alertness, alerts, allure, amaze, amazingly, ambush, angel, anomaly [ ... and 514 more ]
The NRC Word-Emotion Association Lexicon is a crowdsourced dictionary; S. M. Mohammad & Turney (2013) generated it by presenting individual words to thousands of online participants and asking them to rate how much each word is “associated with the emotion surprise.” The final dictionary includes all the words that were consistently reported to be at least moderately associated with surprise.
In Chapter 11, we emphasized the importance of reading through your data before conducting any analyses. The same is true for dictionaries: Before using any dictionary-based methods, always look through your dictionary and ask yourself two questions:
1. How was the dictionary constructed?
2. How is each word actually used in context?
Let’s expand on each of these questions.
The surprise dictionary we are using was generated by asking participants how much each word was “associated with the emotion surprise” (S. M. Mohammad & Turney, 2013). A word can be “associated with” surprise because it reflects surprise (e.g. “suddenly”), but it can also be “associated with” surprise because it reflects the exact opposite of surprise. Indeed, if we look through the dictionary, we find words like “leisure” and “lovely”.
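For example, we can pull out a random handful of entries to inspect. The exact call that produced the sample below is not shown; something like the following works, though your sample will differ unless you happen to use the same random seed.
# draw 20 random entries from the NRC surprise dictionary for inspection
sample(as.list(surprise_dict)$surprise, 20)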
#> [1] "outburst" "godsend" "alarming" "intense" "lawsuit"
#> [6] "leisure" "scrimmage" "curiosity" "reappear" "placard"
#> [11] "diversion" "receiving" "thirst" "lovely" "frenetic"
#> [16] "perfection" "playground" "fearfully" "guess" "unfulfilled"
This means that we are not, in fact, measuring how surprising each story is. At best, we are measuring how much each story deals with surprise (or lack thereof) one way or another.
As you look through your dictionary, make sure you are aware of the process used to construct the dictionary. If it was generated by asking participants about individual words, how was the question formulated? How might that question have been interpreted by the participants?
The participants generating our dictionary were asked about one word at a time. People presented with words out of context often fail to consider how those words are actually used in natural discourse. For example, imagine that you are an online participant, and you are asked about your associations with the word “guess”. Seeing “guess” by itself might sound like an imperative, calling to mind a situation in which someone is asking you to guess something about which you are unsure—perhaps a game show. Since this sort of situation generally results in a surprise when the truth is revealed, you report that “guess” is associated with surprise. In fact, though, “guess” is much more frequently used in the phrase “I guess”, which signifies reluctance and has very little to do with surprise. We can check how “guess” is used in our corpus with Quanteda’s kwic() function, which gives a dataframe of Key Words In Context (KWIC).
hippocorpus_tokens |>
kwic("guess") |>
mutate(text = paste(pre, keyword, post)) |>
pull(text)
#> [1] "his 30th birthday and I guess that's why he decided to"
#> [2] "healthier after a month I guess it was the stress of"
#> [3] "already made cake So i guess it wasn't that bad"
#> [4] "wrong Was she serious I guess so When I finished packing"
#> [5] "up our unit And I guess that's it I never saw"
#> [6] "I'm not sure yet I guess I will see how the"
#> [7] "FINALLY got admitted D I guess all those crazy contractions worked"
#> [8] "we made it safely I guess even the car got tired"
With the possible exception of #6, none of these examples give the impression of an impending surprise. Nevertheless, “guess” does appear in the NRC surprise dictionary.
As you look through your dictionary, think about how each word might really be used in context. Are there ways to use the word that do not have to do with your construct?
At this point, you might be pretty skeptical about using the NRC surprise dictionary to measure surprise. Even so, let’s try it out. To count how many times surprise words appear in each of our texts, we use the dfm_lookup()
function.
hippocorpus_surprise <- hippocorpus_dfm |>
dfm_lookup(surprise_dict)
hippocorpus_surprise
#> Document-feature matrix of: 6,854 documents, 1 feature (5.09% sparse) and 6 docvars.
#> features
#> docs surprise
#> 32RIADZISTQWI5XIVG5BN0VMYFRS4U 2
#> 3018Q3ZVOJCZJFDMPSFXATCQ4DARA2 0
#> 3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61 4
#> 3018Q3ZVOJCZJFDMPSFXATCQG04RAI 3
#> 3MTMREQS4W44RBU8OMP3XSK8NMJAWZ 4
#> 3018Q3ZVOJCZJFDMPSFXATCQG06AR3 6
#> [ reached max_ndoc ... 6,848 more documents ]
Recall that we wanted to test whether true autobiographical stories include more surprise than imagined stories. Now that we have counted the number of surprise words in each document, how do we test our hypothesis?
A good first step is to reattach the word counts to our original corpus. As we do this, we convert both to dataframes.
hippocorpus_surprise_df <- hippocorpus_surprise |>
convert("data.frame") |> # convert to dataframe
right_join(
hippocorpus_corp |>
convert("data.frame") # convert to dataframe
)
It makes sense to control for the total number of words in each text, since longer texts have more opportunities to use surprise words1. To count the total number of tokens in each text, we can use the ntoken()
function on our DFM and add the result directly to the new dataframe.
hippocorpus_surprise_df <- hippocorpus_surprise_df |>
mutate(wc = ntoken(hippocorpus_dfm))
We are now ready for modeling! When your dependent variable is a count of words, we recommend using negative binomial regression, available in R with the MASS package2. For extra sensitivity to the variable rates at which word frequencies grow with text length (see Baayen, 2001), we include wc both as a predictor and as an offset, offset(log(wc)), in the regression (an offset is simply a predictor whose coefficient is fixed at 1). We use log() because negative binomial regression links the predictors to the outcome variable through a log link. This means that including offset(log(wc)) is equivalent to modeling the ratio of surprise words to total words (for a more detailed explanation of this dynamic, see the discussion here).
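To see the equivalence, move the offset to the other side of the model equation (a sketch, collapsing the memType dummy codes into a single term for readability):

$$
\log \mathbb{E}[\text{surprise}] = \beta_0 + \beta_1\,\text{memType} + \beta_2\,\text{wc} + \log(\text{wc})
\quad\Longleftrightarrow\quad
\log\frac{\mathbb{E}[\text{surprise}]}{\text{wc}} = \beta_0 + \beta_1\,\text{memType} + \beta_2\,\text{wc}
$$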
surprise_mod <- MASS::glm.nb(surprise ~ memType + wc + offset(log(wc)),
data = hippocorpus_surprise_df)
summary(surprise_mod)
#>
#> Call:
#> MASS::glm.nb(formula = surprise ~ memType + wc + offset(log(wc)),
#> data = hippocorpus_surprise_df, init.theta = 6.070929358,
#> link = log)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.9065113 0.0258623 -151.050 < 2e-16 ***
#> memTyperecalled -0.0324360 0.0176595 -1.837 0.06625 .
#> memTyperetold -0.0614152 0.0219399 -2.799 0.00512 **
#> wc -0.0008833 0.0000876 -10.082 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Negative Binomial(6.0709) family taken to be 1)
#>
#> Null deviance: 7490.2 on 6853 degrees of freedom
#> Residual deviance: 7370.5 on 6850 degrees of freedom
#> AIC: 30997
#>
#> Number of Fisher Scoring iterations: 1
#>
#>
#> Theta: 6.071
#> Std. Err.: 0.270
#>
#> 2 x log-likelihood: -30987.333
Looking at the p-values for the coefficients, we see that there was no significant difference between recalled and imagined stories (p = 0.066). There was, however, a significant difference between retold and imagined stories, such that retold stories used fewer surprise words (p = 0.005).
An example of using raw word counts in research: Simchon et al. (2023) collected Twitter activity over a three-month period from over 2.7 million users. Using a dictionary, they then counted the number of passive auxiliary verbs (e.g. “they were analyzed”; “my homework will be completed”) in each user’s activity. They found that users with more followers (indicating higher social status) used far fewer passive auxiliary verbs, controlling for total word count.
How can we improve our measurement of surprise? As we saw above, one problem with the dictionary approach is that a word might be associated with a construct because it reflects the opposite of that construct. One solution to this problem is to measure the ratio between the target dictionary and its opposite. In sentiment analysis, this approach is called polarity. Polarity is most commonly used to analyze the overall valence of a text by comparing positive words (e.g. “happy”, “great”) with negative words (e.g. “disappointed”, “terrible”). In principle, though, we can use it to compare any pair of opposites.
What is the opposite of surprise? Plutchik (1962) argues that the opposite of surprise is anticipation. Luckily, the NRC Word-Emotion Association Lexicon also includes a dictionary of anticipation-associated words. Using this dictionary, we can measure how much a text is associated with surprise as opposed to anticipation.
Quanteda’s built-in function for polarity is textstat_polarity()
. To use this function, we first have to set the “positive” and “negative” polarities of the dictionary, and then call textstat_polarity()
on our DFM. By default, this outputs the log ratio of positive to negative counts for each document:
library(quanteda.sentiment)
#>
#> Attaching package: 'quanteda.sentiment'
#> The following object is masked from 'package:quanteda':
#>
#> data_dictionary_LSD2015
# subset dictionary
surprise_anticipation_dict <- data_dictionary_NRC[c("surprise", "anticipation")]
# set surprise and anticipation as polarity
polarity(surprise_anticipation_dict) <- list(pos = "surprise", neg = "anticipation")
# get polarity
hippocorpus_surprise_polarity <-
textstat_polarity(hippocorpus_dfm, surprise_anticipation_dict) |>
rename(surprise_vs_anticipation = sentiment)
While textstat_polarity()
can sometimes be useful for visualizations or downstream analyses, it is not helpful for modeling polarity as an outcome variable.
To test whether true autobiographical stories include more surprise relative to anticipation than imagined stories, we first count the surprise and anticipation words in each document, and rejoin the results to the full dataset.
# count surprise/anticipation words
hippocorpus_surprise_anticipation <- hippocorpus_dfm |>
dfm_lookup(surprise_anticipation_dict)
# convert to dataframe and join to full data
hippocorpus_surprise_anticipation_df <-
hippocorpus_surprise_anticipation |>
convert("data.frame") |>
right_join(
hippocorpus_corp |>
convert("data.frame"), # convert to dataframe
by = "doc_id"
) |>
mutate(wc = ntoken(hippocorpus_dfm))
Since we are still modeling word counts as the outcome, we again use negative binomial regression. Rather than using the total word count as the offset, however, we can use the combined number of surprise and anticipation words. Because of the log link (along with the endlessly useful properties of logarithms), entering this sum as a log offset (offset(log(surprise + anticipation))) is equivalent to modeling the proportion of surprise words out of all surprise and anticipation words.
# remove documents with no surprise or anticipation words to avoid log(0) in the offset
hippocorpus_surprise_anticipation_df <-
hippocorpus_surprise_anticipation_df |>
filter(surprise + anticipation > 0)
set.seed(2024)
surprise_anticipation_mod <- MASS::glm.nb(
surprise ~ memType + wc + offset(log(surprise + anticipation)),
data = hippocorpus_surprise_anticipation_df,
# increase iterations to ensure model converges
control = glm.control(maxit = 10000)
)
summary(surprise_anticipation_mod)
#>
#> Call:
#> MASS::glm.nb(formula = surprise ~ memType + wc + offset(log(surprise +
#> anticipation)), data = hippocorpus_surprise_anticipation_df,
#> control = glm.control(maxit = 10000), init.theta = 2.949221746e+17,
#> link = log)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.107e+00 1.990e-02 -55.659 <2e-16 ***
#> memTyperecalled -1.128e-02 1.356e-02 -0.831 0.406
#> memTyperetold -1.966e-02 1.697e-02 -1.158 0.247
#> wc -5.675e-05 6.462e-05 -0.878 0.380
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Negative Binomial(2.949222e+17) family taken to be 1)
#>
#> Null deviance: 4884.6 on 6843 degrees of freedom
#> Residual deviance: 4882.1 on 6840 degrees of freedom
#> AIC: 10
#>
#> Number of Fisher Scoring iterations: 1
#>
#>
#> Theta: 2.949222e+17
#> Std. Err.: 6.158994e+14
#>
#> 2 x log-likelihood: 0
There is no significant difference between true and imagined stories in the ratio of surprise to anticipation words.
So far we have covered raw word counts, which use one list of words to represent a construct, and we have covered polarities, which use two lists of words to represent a construct and its opposite. The third and final dictionary-based method takes a more nuanced approach than either of these: with lexical norms, words can represent the construct or its opposite to continuously varying degrees, expressed as numbers on a scale. In quanteda.sentiment, this scale is called “valence”, though elsewhere it may be called “lexical affinity” or “lexical association”.
The same group that created the NRC Word-Emotion Association Lexicon also created a parallel dictionary with continuous scores: the NRC Hashtag Emotion Lexicon (S. M. Mohammad & Kiritchenko, 2015). Whereas the NRC Word-Emotion Association Lexicon was crowdsourced, the NRC Hashtag Emotion Lexicon was generated algorithmically from a corpus of Twitter posts that contained hashtags like “#anger” and “#surprise”. The dictionary includes the words that were most predictive of each hashtag, with scores indicating the strength of their statistical association with the category (higher scores indicate more representative words). We can access the NRC Hashtag surprise dictionary from GitHub:
path <- "https://raw.githubusercontent.com/bwang482/emotionannotate/master/lexicons/NRC-Hashtag-Emotion-Lexicon-v0.2.txt"
hashtag <- read_tsv(path, col_names = c("emotion", "token", "score"))
#> Rows: 32389 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): emotion, token
#> dbl (1): score
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 3
#> emotion token score
#> <chr> <chr> <dbl>
#> 1 surprise yada 1.49
#> 2 surprise #preoccupied 1.49
#> 3 surprise jaden 1.49
#> 4 surprise #easilyamused 1.49
#> 5 surprise #needtofocus 1.49
#> 6 surprise #amazement 1.49
# Create dictionary
surprise_dict_hashtag <- dictionary(
list(surprise = hashtag$token[hashtag$emotion == "surprise"])
)
# Set dictionary valence
valence(surprise_dict_hashtag) <- list(
surprise = hashtag$score[hashtag$emotion == "surprise"]
)
To measure surprise in the Hippocorpus data, we find the surprise score of each token and compute the average score across the tokens of each document. With quanteda.sentiment, we can do this by calling the textstat_valence() function on our DFM. Since a score of zero in the NRC Hashtag Emotion Lexicon represents zero surprise, we add normalization = "all" so that non-dictionary words are counted as zeros in the average.
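Concretely (on our reading of normalization = "all"), each document’s score is a weighted average over all of its tokens, with non-dictionary tokens contributing zero:

$$
\text{surprise}_d = \frac{1}{N_d} \sum_{w \in \text{dictionary}} n_{dw}\, v_w
$$

where $n_{dw}$ is the count of dictionary word $w$ in document $d$, $v_w$ is its valence score, and $N_d$ is the total number of tokens in document $d$.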
# compute valence
hippocorpus_valence <- textstat_valence(
hippocorpus_dfm, # data
surprise_dict_hashtag, # dictionary
normalization = "all"
)
# rejoin to original data
hippocorpus_valence <- hippocorpus_valence |>
rename(surprise = sentiment) |>
right_join(
hippocorpus_corp |>
convert("data.frame") # convert to dataframe
)
Norm scores, unlike raw word counts and polarities, can be reasonably modeled using standard linear regression. Furthermore, because the score is an average rather than a sum or count, there is no need to control for total word count. Let’s test one more time whether true autobiographical stories include more surprise-related language than imagined stories:
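A fit along these lines produces the summary shown below (the formula matches the Call in the output; the object name is ours):
# fit a linear model predicting each document's surprise score from story type
surprise_valence_mod <- lm(surprise ~ memType, data = hippocorpus_valence)
summary(surprise_valence_mod)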
#>
#> Call:
#> lm(formula = surprise ~ memType, data = hippocorpus_valence)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.085708 -0.015726 -0.000448 0.015093 0.104459
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.1402018 0.0004433 316.300 < 2e-16 ***
#> memTyperecalled 0.0029688 0.0006256 4.746 2.12e-06 ***
#> memTyperetold 0.0021648 0.0007791 2.779 0.00548 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.02327 on 6851 degrees of freedom
#> Multiple R-squared: 0.003406, Adjusted R-squared: 0.003116
#> F-statistic: 11.71 on 2 and 6851 DF, p-value: 8.388e-06
We found a significant difference between recalled and imagined stories (p < .001), such that recalled stories have more surprise-related language! This supports Sap et al.’s hypothesis that true autobiographical stories would include more surprising events than imagined stories. The new model also indicated a significant difference between retold and imagined stories, such that retold stories used more surprise-related language—the opposite direction relative to our original finding with the crowdsourced dictionary (p = 0.005).
So far we have seen the NRC Word-Emotion Association Lexicon, which used a crowdsourcing approach to generate the dictionary, and the NRC Hashtag Emotion Lexicon, which used a corpus-based approach, relying on hashtags for labeling. Crowdsourcing and algorithmic corpus-based generation are far from the only ways to generate a dictionary. Here we review various types of dictionaries and where to find them.
Besides the surprise dictionary, the NRC Word-Emotion Association Lexicon includes dictionaries for anger, fear, anticipation, trust, sadness, joy, and disgust. The same group has also produced other crowdsourced emotion dictionaries:
Psychologists have used crowdsourcing questionnaires to create dictionaries (especially norms) for decades. As such, crowdsourced dictionaries exist for many psychologically interesting constructs:
quanteda.sentiment. The expanded norms are available as a zip file here.

Words are used in many contexts, sometimes with many possible meanings. To take these into account, some groups rely on experts to generate their dictionaries. By far the most prominent collection of expert-generated dictionaries is LIWC (pronounced “Luke”), which includes word lists for grammatical patterns, emotional content, cognitive processes, and more. With its rigorous approach, LIWC has dominated the field of dictionary-based analysis in psychology for decades. The most recent version of LIWC (Boyd et al., 2022) was generated by a team of experts who went through numerous stages of brainstorming, voting, and reliability analysis before arriving at the final word lists.
Human raters are much better at judging full texts than individual words. Corpus-based dictionaries take advantage of this by extracting their word lists from corpora of full texts that have been rated by humans. We have already seen the NRC Hashtag Emotion Lexicon (S. M. Mohammad & Kiritchenko, 2015), which used Twitter hashtags to gather a corpus of Tweets labeled with emotions by their original authors. A more classic example of corpus-based dictionary generation is Rao et al. (2014), who used a corpus of 1,246 news headlines, each rated manually for anger, disgust, fear, joy, sadness, and surprise on a scale from 0 to 100 (Strapparava & Mihalcea, 2007). By correlating these ratings with word frequencies (see Chapter 15), they extracted the words most representative of high ratings in each category (a minimal sketch of this idea appears after this paragraph). Araque et al. (2018) used a similar technique to create DepecheMood, which includes ratings for each word on eight emotional dimensions: afraid, amused, angry, annoyed, don’t care, happy, inspired, and sad. This base dictionary was updated with additional resources by Badaro et al. (2018) to create EmoWordNet, which can be accessed through the Internet Archive.
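To make the general idea concrete, here is a minimal sketch of correlation-based extraction. The objects headline_dfm (a DFM of rated headlines) and surprise_ratings (a numeric vector of human surprise ratings) are hypothetical, and this is not the exact procedure used by Rao et al. (2014).
# correlate each word's document frequency with the human surprise ratings
# (headline_dfm and surprise_ratings are hypothetical illustration objects)
word_cors <- apply(as.matrix(headline_dfm), 2, function(freq) cor(freq, surprise_ratings))
# keep the words most strongly associated with high surprise ratings
surprise_words_corpus <- names(sort(word_cors, decreasing = TRUE))[1:100]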
Many statistical techniques have been used to extract dictionaries from labeled corpora, some of which will be covered briefly in Chapter 15 and Chapter 18 of this book. For a recent review of methods, see Bandhakavi et al. (2021).
Thesaurus Mining: Strapparava & Valitutti (2004) started with a short list of strongly affect-related words (e.g. “anger”, “doubt”, “cry”), and used WordNet, a database of conceptual relations between words, to find close synonyms of the original words on the list. The result was WordNet Affect. Strapparava & Mihalcea (2007) used WordNet Affect to generate short lists of words associated with anger, disgust, fear, joy, sadness, and surprise, downloadable from here.
Decontextualized Embeddings: In Chapter 18, we will cover a family of methods for measuring the similarities between words based on how frequently they appear together in text: decontextualized embeddings. These methods can be used on their own for measuring psychological constructs, but they can also be used as a tool for building dictionaries. For example, Buechel et al. (2020) started with a small seed lexicon and used word embeddings (Section 18.3) to find other words that are likely to appear in texts of the same topic. The result—including dictionaries for valence, arousal, dominance, joy, anger, sadness, fear, and disgust—is available for download online.
Combined Methods: Vegt et al. (2021) used a combination of expert input, thesaurus data from WordNet, word embeddings (Section 18.3), and crowdsourcing from online participants to generate norms for numerous constructs associated with grievance-fueled violence (e.g. desperation, fixation, frustration, hate, weapons). The final product is available here.
We use total word count here for the sake of the example, but total word count may not always be the appropriate measure of text length. For example, you may want to measure the amount of surprise relative to other emotional content. In this case, it would be more appropriate to control for the total number of emotion-related words, as opposed to the total word count. Similarly, if you were measuring the number of first person singular pronouns, you may want to control for the total number of pronouns rather than the total word count.↩︎
We use a simple count of words as the dependent variable here, but keep in mind that it may be more appropriate to apply a transformation such as Simple Good-Turing frequency estimation (Section 16.6).↩︎