16  Transforming Word Counts

So far, we have used statistical methods that work with raw counts, like negative binomial regression in Section 14.3.1, and likelihood ratio testing in Section 15.2 (with the minor exception of Section 14.5, which used averaged scores). Raw token counts are difficult to work with: They are not normally distributed, they do not usually change linearly with predictors, and they are overly sensitive to quirks of linguistic style and text length.

Researchers and engineers have proposed various ways to fix these problems by transforming the raw counts or by transforming the text before performing the counting. In this chapter, we introduce a few of the most common transformations. These transformations can be applied to any analysis of word counts, including dictionary-based analysis (Chapter 14) and open vocabulary methods (Chapter 15), to bypass certain problems or bring out certain features of the data. As always, each technique has its own advantages and disadvantages.

16.1 Text Normalization

The simplest way to transform word counts is by transforming the text itself. We have already seen some simple examples of this in Section 13.2: removing punctuation, symbols, or URLs before tokenization. Such transformations are often called text normalization, since they get rid of the quirks in each text and ensure that everything follows a standard format.

16.1.1 Occurrence Thresholds

Besides removing or standardizing certain types of tokens, like URLs, researchers commonly enforce an occurrence threshold, removing any token that occurs fewer than a certain number of times in the data. Occurrence thresholds can be calculated on the full dataset (term frequency) or across documents (document frequency; e.g. removing tokens used in fewer than 1% of documents). Using a document frequency threshold is often beneficial, since sometimes a single document will use a token many times—either because it happens to be discussing a specific topic or because of a quirk of the author’s language style—and drive up the overall frequency in a misleading way.

Occurrence thresholds can be applied to DFMs in Quanteda using the dfm_trim() function, with either a min_termfreq, a min_docfreq, or both. Maximum frequency thresholds can also be imposed with max_termfreq and max_docfreq.

# remove tokens used by fewer than 1% of documents
hippocorpus_dfm_threshold <- hippocorpus_dfm |> 
  dfm_trim(min_docfreq = 0.01, docfreq_type = "prop")
Advantages of Occurrence Thresholds
  • Cleaner Results: Occurrence thresholds are an easy way to remove quirks of individuals’ writing styles or very rare terms that complicate analysis without adding much information.
Disadvantages of Occurrence Thresholds
  • Arbitrary: Determining what threshold to use can be difficult, and runs the risk of excluding important information from the analysis.

16.1.2 Removing Stop Words

In natural language processing, a common step in text normalization is to remove “stop words,” everyday words like “the” and “of” that do not contribute much to the meaning of the text. Indeed, Quanteda offers a built-in list of stop words:

stopwords() |> head()
#> [1] "i"      "me"     "my"     "myself" "we"     "our"

Although removing stop words can be useful for analyzing the topics of texts, it is generally a bad idea when you are interested in the psychology of texts. This is because the form of people’s writing—including their use of stop words—is often predictive of their personality. For example, neurotic people tend to use more first-person singular pronouns (Mehl et al., 2006), and articles like “the” and “a” are highly predictive of being male, being older, and being high in openness (Schwartz et al., 2013).

The relationships between language and personality also extend to more subtle patterns. For example, extraverts tend to use longer words (Mehl et al., 2006), those high in openness tend to use more quotations (Sumner et al., 2011), and those high in neuroticism tend to use more acronyms (Holtgraves, 2011). So if you are looking for psychological differences, be gentle with the text normalization—you never know what strange predictors you might find.

Advantages of Removing Stop Words
  • Intuitive Appeal: Removing stop words focuses an analysis on content, rather than form. When people think of differences between texts, they generally think of differences in content.
Disadvantages of Removing Stop Words
  • Removes Important Information: While words like “the” and “a” may seem insignificant, they often carry important psychological information.

16.2 Binary (Boolean) Tokenization

In some cases, it makes sense to stop counting at one—each text either uses a given token or it does not. While this might seem like needlessly throwing away information, binary tokenization fixes a core problem with the bag of words assumption (BOW). Recall from Section 13.3 that BOW imagines that each author or topic has its characteristic bag of words, and speaking or writing is just a matter of pulling those words out of the bag one at a time at random. A central problem with this picture is that words are not pulled out one at a time at random—the word I am writing now is intimately tied to the words immediately before it. It may be very unlikely overall that I will write “parthenon,” but if I write it once, it is very likely that I will write it again in the same paragraph. This is because I am probably writing about the Parthenon.

The non-independence of words in text means that the difference between zero occurrences of “parthenon” and one occurrence is much more meaningful than the difference between one and two. If a particular token sometimes occurs lots of times in text, statistical procedures like regression may be led to focus on that variance rather than on the more interesting first occurrence. Binary tokenization is the simplest way to avoid this problem.

In Quanteda, a DFM can be converted to binary tokenization with dfm_weight(scheme = "boolean").

hippocorpus_dfm_binary <- hippocorpus_dfm |> 
  dfm_weight(scheme = "boolean")

print(hippocorpus_dfm_binary, max_ndoc = 6, max_nfeat = 6)
#> Document-feature matrix of: 6,854 documents, 27,667 features (99.50% sparse) and 6 docvars.
#>                                 features
#> docs                             concerts are my most favorite thing
#>   32RIADZISTQWI5XIVG5BN0VMYFRS4U        1   1  1    1        1     1
#>   3018Q3ZVOJCZJFDMPSFXATCQ4DARA2        0   0  1    0        0     0
#>   3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61        0   0  1    1        0     1
#>   3018Q3ZVOJCZJFDMPSFXATCQG04RAI        0   1  1    0        0     0
#>   3MTMREQS4W44RBU8OMP3XSK8NMJAWZ        0   0  1    0        0     0
#>   3018Q3ZVOJCZJFDMPSFXATCQG06AR3        0   0  1    0        0     0
#> [ reached max_ndoc ... 6,848 more documents, reached max_nfeat ... 27,661 more features ]

To use binary tokenization in dictionary-based analysis, you can also perform the dictionary look-up first (Chapter 14) and then convert it to binary with mutate(surprise_binary = as.integer(surprise > 0)).
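
A minimal sketch of that workflow, using dplyr and assuming the surprise dictionary object from Chapter 14 is called surprise_dict (the name is hypothetical):

# dictionary look-up first, then binarize the resulting count
hippocorpus_surprise_binary <- hippocorpus_dfm |> 
  dfm_lookup(surprise_dict) |>  # surprise_dict stands in for the Chapter 14 dictionary
  convert("data.frame") |> 
  mutate(surprise_binary = as.integer(surprise > 0))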

Keep in mind: Once you convert your DFM to binary tokenization, you are no longer working with a count variable. This means that the negative binomial regression models we covered in Chapter 14 are no longer appropriate. Instead, if you want to model the binary tokenization as a dependent variable, you will have to use a binary model like logistic regression.
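
For example, a logistic regression testing whether story type predicts any use of surprise words might look like this (a sketch building on the data frame above, and assuming the memType document variable is stored in the DFM’s docvars):

# attach the story type docvar, then fit a logistic regression
hippocorpus_surprise_binary <- hippocorpus_surprise_binary |> 
  mutate(memType = docvars(hippocorpus_dfm, "memType"))

surprise_binary_mod <- glm(surprise_binary ~ memType, 
                           family = binomial, 
                           data = hippocorpus_surprise_binary)
summary(surprise_binary_mod)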

Advantages of Binary Tokenization
  • Removes Non-Independence of Observations: Raw word counts can be misleading (to both humans and statistical models) because the observations are not independent; the more a text uses a word, the more likely it is to use that word again. Binary tokenization avoids this problem by only counting one event per text.
Disadvantages of Binary Tokenization
  • Devalues Common Tokens: With binary tokenization, you might miss differences in common tokens like “the.” Since almost every text uses “the” at least once, you won’t be able to detect that one group uses “the” more often than another.
  • Difficult to Control for Text Length: The longer a text is, the more likely any given word is to appear in it. But when we stop counting after the first word, this relationship becomes especially difficult to characterize. When working with shorter texts, beware of mistaking differences in text length for differences in the probability of a word appearing!

16.3 Relative Tokenization

The most common transformation that researchers apply to word counts is dividing them by the total number of words (or other tokens) in the text. The resulting ratio is referred to as a relative frequency. This strategy has intuitive appeal (e.g. what percentage of the words are surprise-related?), and the value of intuitive appeal should not be discounted (see Section 4.1). But working with ratios or percentages can cause problems with statistical analysis. This is because dividing by the total number of words is no guarantee that the total number of words will be properly controlled for—these are two separate variables. Regression analyses on relative frequencies are likely to give false positives when predictors are correlated with text length (Kronmal, 1993). This is the same reason we added total word count as both a log offset and a regular predictor in Section 14.3.1.

Before you start worrying about how to control for text length, make sure to stop and ponder: Do you want to control for text length? Usually we assume that longer documents are longer because the author is more verbose. But what if longer texts are longer because they cover multiple topics, and one of those topics is what we are interested in? In this case, controlling for text length—especially using relative tokenization—will make it look like the longer texts have less of that topic, when in fact they just have more of other topics.

If you decide to use relative tokenization, the process is simple. For dictionary-based word counts, divide your count by the total word count to get a relative frequency. If you want to convert a full DFM to relative tokenization, you can use dfm_weight(scheme = "prop").

hippocorpus_dfm_relative <- hippocorpus_dfm |> 
  dfm_weight(scheme = "prop")

print(hippocorpus_dfm_relative, max_ndoc = 6, max_nfeat = 4)
#> Document-feature matrix of: 6,854 documents, 27,667 features (99.50% sparse) and 6 docvars.
#>                                 features
#> docs                                concerts         are         my        most
#>   32RIADZISTQWI5XIVG5BN0VMYFRS4U 0.004926108 0.004926108 0.02463054 0.004926108
#>   3018Q3ZVOJCZJFDMPSFXATCQ4DARA2 0           0           0.02732240 0          
#>   3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61 0           0           0.03759398 0.011278195
#>   3018Q3ZVOJCZJFDMPSFXATCQG04RAI 0           0.006060606 0.02424242 0          
#>   3MTMREQS4W44RBU8OMP3XSK8NMJAWZ 0           0           0.03086420 0          
#>   3018Q3ZVOJCZJFDMPSFXATCQG06AR3 0           0           0.01257862 0          
#> [ reached max_ndoc ... 6,848 more documents, reached max_nfeat ... 27,663 more features ]
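
For dictionary-based counts, the equivalent is to divide each count by the document’s total number of tokens. A minimal sketch, again assuming the hypothetical surprise_dict object from Chapter 14:

hippocorpus_surprise_rel <- hippocorpus_dfm |> 
  dfm_lookup(surprise_dict) |> 
  convert("data.frame") |> 
  mutate(surprise_rel = surprise / ntoken(hippocorpus_dfm))  # divide by total tokens per document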

An example of relative tokenization in research: Golder & Macy (2011) collected messages from all Twitter user accounts (~2.4 million) created between February 2008 and April 2009, and measured positive and negative affect as proportion of in-dictionary (from LIWC) words to total word count. This calculation was done by hour of the day, day of the week, and month of the year, revealing fluctuations in mood in line with circadian rhythms and seasonal changes.

Advantages of Relative Tokenization
  • Intuitive Appeal: Relative tokenization makes sense if longer documents are longer because of verbosity, or if the construct of interest does not fluctuate over the course of a longer text (e.g. personality).
Disadvantages of Relative Tokenization
  • Discounts Longer Texts: If texts are long because they cover multiple topics (or multiple emotions), relative tokenization will dilute true occurrences of the construct of interest.
  • Does Not Control for Text Length: People often assume that using a percentage will control for the denominator in statistical analyses. This is wrong, which might make ratios like relative tokenization more trouble than they’re worth.
  • Not Normally Distributed: Dividing count variables by text length does not make them normally distributed. This can cause problems for certain statistical methods. Using the Anscombe transform can partially remedy these problems.

16.4 The Anscombe Transform

Word counts (whether divided by text length or not) are not normally distributed. In Chapter 14, we avoided this problem by using negative binomial regression. An alternative way to deal with this problem is to try to transform the counts to a normal distribution. Remember that complicated-looking formula from Schwartz et al. (2013) in Section 4.1? The upper part of that formula is relative tokenization. The lower part is called the Anscombe transform, and it transforms a Poisson distribution (i.e. a well-behaved count variable) into an approximately normal distribution. The transformed variable can then be used in linear regression. In R, the Anscombe transform can be written as 2*sqrt(count + 3/8).
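
For example, applied to a raw dictionary-based count stored in a data frame column (a minimal sketch; surprise_counts and its surprise column are hypothetical stand-ins for the objects from Chapter 14):

# Anscombe transform: 2 * sqrt(count + 3/8)
surprise_counts <- surprise_counts |> 
  mutate(surprise_anscombe = 2 * sqrt(surprise + 3/8))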

The Anscombe transform can be useful for analyzing very large numbers of word count variables (as Schwartz et al. (2013) did) without having to run negative binomial regression each time. But if you are only analyzing a few variables, we recommend against it. This is because word counts do not follow a Poisson distribution, and therefore will not be properly normally distributed even after the transform. The Poisson process assumes that words occur independently of each other, and with equal probability throughout a text. In other words, it makes the BOW assumption. As we’ve covered already, this assumption is problematic. In this case, the fact that words are not pulled randomly out of a bag makes word count distributions overdispersed, meaning that the variance of the distribution is higher than expected in a Poisson process. Negative binomial regression may be more complicated and take longer to compute, but it is designed to handle overdispersion.
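
You can get a rough sense of this overdispersion directly from the DFM: in a Poisson process, the variance of a token’s counts across documents should be close to its mean. A quick sketch, assuming “the” survived tokenization as a feature:

the_counts <- convert(hippocorpus_dfm[, "the"], "data.frame")$the
mean(the_counts)  # a Poisson process would imply variance roughly equal to the mean
var(the_counts)   # in natural text, the variance is typically much larger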

Advantages of the Anscombe Transform
  • Computational Efficiency
  • Addresses Non-Normality of Word Frequencies
Disadvantages of the Anscombe Transform
  • Wrongly Assumes That Word Counts Follow a Poisson Distribution

16.5 TF-IDF

Term frequency-inverse document frequency (TF-IDF) is one of the great triumphs of NLP in the last century. It was first introduced by Sparck Jones (1972) and is still widely used half a century later, especially in search engines.

The idea of TF-IDF is to measure how topical a token is in a document. In other words, how representative is that token of the particular features of that document as opposed to other documents? Let’s start with term frequency (TF). TF is just relative frequency: a word count divided by the total number of words in the document. The problem with relative frequency is that it emphasizes common words that don’t tell us much about the meaning of the document—if a document has a high frequency of “the,” should we conclude that “the” is very important to the meaning of the document? Of course not. To fix this, we multiply TF by inverse document frequency (IDF). Document frequency is the proportion of documents that have at least one instance of our token (i.e. the average binary tokenization across documents). IDF is the log of the inverse of the document frequency. IDF answers the question: How unusual is it for this token to appear at all in a document? So “the,” which appears in almost every document, will have a very low IDF—it is not unusual at all. Even though “the” appears a lot in our document (TF is high), its TF-IDF score will be low, reflecting the fact that “the” doesn’t tell us much about the content of this particular document.

\[ TF \cdot IDF = \text{relative frequency} \times \log{\frac{\text{total documents}}{\text{documents with token}}} \]
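
For example, suppose a token makes up 2% of the words in a document but appears in only 5 of 1,000 documents in the corpus. Its TF-IDF score would be computed like this (Quanteda’s dfm_tfidf() uses base-10 logarithms by default):

tf  <- 0.02             # relative frequency within the document
idf <- log10(1000 / 5)  # log of the inverse document frequency
tf * idf
#> [1] 0.0460206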

You might wonder: If we want to compare how common the token is in this document to how common it is in other documents, shouldn’t we use the inverse token frequency (i.e. the average relative frequency of the token across documents in the corpus)? Why do we use the inverse document frequency? The answer: TF-IDF does not make the BOW assumption. If all documents were just bags of words, the frequency within documents would be distributed similarly to the frequency across documents (e.g. since the bigram “verbal reasoning” occurs in very few tweets, you would expect it to be nearly impossible for it to occur twice in the same tweet). In reality, though, it wouldn’t be surprising to see “verbal reasoning” even three times in the same tweet if that tweet were discussing verbal reasoning. Multiplying relative term frequency by inverse document frequency provides a measure of exactly how wrong BOW is in this case. In other words, how topical is the token?

In Quanteda, we can convert a DFM to TF-IDF with the dfm_tfidf() function:

hippocorpus_dfm_tfidf <- hippocorpus_dfm |> 
  dfm_tfidf(scheme_tf = "prop")

print(hippocorpus_dfm_tfidf, max_ndoc = 6, max_nfeat = 4)
#> Document-feature matrix of: 6,854 documents, 27,667 features (99.50% sparse) and 6 docvars.
#>                                 features
#> docs                               concerts         are           my
#>   32RIADZISTQWI5XIVG5BN0VMYFRS4U 0.01296465 0.002577006 0.0005541024
#>   3018Q3ZVOJCZJFDMPSFXATCQ4DARA2 0          0           0.0006146600
#>   3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61 0          0           0.0008457352
#>   3018Q3ZVOJCZJFDMPSFXATCQG04RAI 0          0.003170499 0.0005453711
#>   3MTMREQS4W44RBU8OMP3XSK8NMJAWZ 0          0           0.0006943381
#>   3018Q3ZVOJCZJFDMPSFXATCQG06AR3 0          0           0.0002829755
#>                                 features
#> docs                                    most
#>   32RIADZISTQWI5XIVG5BN0VMYFRS4U 0.003125847
#>   3018Q3ZVOJCZJFDMPSFXATCQ4DARA2 0          
#>   3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61 0.007156545
#>   3018Q3ZVOJCZJFDMPSFXATCQG04RAI 0          
#>   3MTMREQS4W44RBU8OMP3XSK8NMJAWZ 0          
#>   3018Q3ZVOJCZJFDMPSFXATCQG06AR3 0          
#> [ reached max_ndoc ... 6,848 more documents, reached max_nfeat ... 27,663 more features ]

Overall, TF-IDF combines the advantages of many simpler word count transformations without many of their downsides. For example, removing stop words focuses the analysis on text content, but may remove important information. TF-IDF discounts uninformative tokens like stop words without removing them outright. Likewise, we saw that binary tokenization solves part of the problem with the BOW assumption, but throws out a lot of information in the process. Once again, TF-IDF turns this problem into a feature (quantifying how topical a token is) without throwing out any information.

Despite all of its benefits, TF-IDF is not widely used in psychology research. This is for two reasons. First, it is difficult to interpret intuitively, and researchers prize interpretability. Second, it focuses the analysis on the topic of texts rather than the subtleties of language style that are often associated with psychological differences (see Section 16.1.2). Nevertheless, if your construct is of a less subtle nature—more akin to the “topic” of the text—consider using TF-IDF.

Advantages of TF-IDF
  • Computational Efficiency
  • Discounts Uninformative Tokens Without Losing Information: TF-IDF combines the advantages of stop word removal without removing tokens from the analysis.
  • Does Not Rely on BOW: TF-IDF leverages the discrepancy between term frequency and document frequency to discover patterns of meaning.
  • Proven to Work: TF-IDF has a long history of use in search engines and automated recommendations. It works surprisingly well for such a simple calculation.
Disadvantages of TF-IDF
  • May Discount Psychologically Relevant Information: TF-IDF operates on an assumption that more frequent tokens (across documents) are less relevant to the construct in question. For semantics this is largely true, but for latent psychological constructs it may not be.

16.6 Smoothing

In Section 16.3, we mentioned that dividing by text length does not adequately control for text length. This is true of ratios in general, but for word frequencies in particular there is extra cause for concern.

Consider this short text written by an imaginary participant, Allison:

I was sitting at the table, and suddenly I understood.

The text has 10 words in total. One of these, “suddenly,” is a surprise-related word. Given only this text, you might estimate the probability of surprise words in Allison’s language at 1/10, the relative frequency. You would be wrong. To see why, imagine that we also wanted to measure the probability of anger-related words. There are no anger-related words in this text. Is the probability of anger-related words then 0/10 = 0? Of course not. If we read more of Allison’s texts, we might very well encounter an anger-related word. Relative frequency therefore underestimates the probability of unobserved words. Conversely, it must overestimate the probability of observed words. So to leave room for the probability of unobserved words, we must admit that the true probability of surprise-related words is likely to be a little less than 1/10.

The first solution to this kind of problem was offered by Laplace (1816), who proposed to simply add one to all the frequency counts before computing the relative frequency. This is called Laplace smoothing, or sometimes simply add-one smoothing. To perform Laplace smoothing in Quanteda, use the dfm_smooth() function to add one, and then call dfm_weight() as before.

hippocorpus_dfm_laplace <- hippocorpus_dfm |> 
  dfm_smooth(smoothing = 1) |>  # add one to every count (Laplace smoothing)
  dfm_weight(scheme = "prop")   # then convert to relative frequencies
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 1.4 GiB
print(hippocorpus_dfm_laplace, max_ndoc = 6, max_nfeat = 4)
#> Document-feature matrix of: 6,854 documents, 27,667 features (0.00% sparse) and 6 docvars.
#>                                 features
#> docs                                 concerts          are           my
#>   32RIADZISTQWI5XIVG5BN0VMYFRS4U 7.176175e-05 7.176175e-05 0.0002152853
#>   3018Q3ZVOJCZJFDMPSFXATCQ4DARA2 3.590664e-05 3.590664e-05 0.0002154399
#>   3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61 3.579995e-05 3.579995e-05 0.0003937994
#>   3018Q3ZVOJCZJFDMPSFXATCQG04RAI 3.592986e-05 7.185973e-05 0.0001796493
#>   3MTMREQS4W44RBU8OMP3XSK8NMJAWZ 3.593374e-05 3.593374e-05 0.0002156024
#>   3018Q3ZVOJCZJFDMPSFXATCQG06AR3 3.573343e-05 3.573343e-05 0.0001786671
#>                                 features
#> docs                                     most
#>   32RIADZISTQWI5XIVG5BN0VMYFRS4U 7.176175e-05
#>   3018Q3ZVOJCZJFDMPSFXATCQ4DARA2 3.590664e-05
#>   3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61 1.431998e-04
#>   3018Q3ZVOJCZJFDMPSFXATCQG04RAI 3.592986e-05
#>   3MTMREQS4W44RBU8OMP3XSK8NMJAWZ 3.593374e-05
#>   3018Q3ZVOJCZJFDMPSFXATCQG06AR3 3.573343e-05
#> [ reached max_ndoc ... 6,848 more documents, reached max_nfeat ... 27,663 more features ]
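
To see what this does, return to Allison’s example. With add-one smoothing, every one of the V token types in the vocabulary gets one extra count, so each relative frequency becomes (count + 1) / (total words + V). Assuming a hypothetical vocabulary of 1,000 token types:

(1 + 1) / (10 + 1000)  # "suddenly", observed once: about 0.002 instead of 0.1
(0 + 1) / (10 + 1000)  # an anger word, never observed: about 0.001 instead of 0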

Laplace smoothing is better than nothing, but it is designed for a situation in which the number of possible token types is small and known. In the case of natural language, the number of possible tokens is extremely large and entirely unknown (Baayen, 2001). You can partially make up for this by adding a very small smoothing number instead of 1 (e.g. 1 divided by the number of token types in the corpus: smoothing = 1/ncol(hippocorpus_dfm)), but if you are willing to invest a bit more computation time, there are better ways.

During World War II, the German navy encrypted and decrypted its communications using the Enigma machine, which scrambled and unscrambled messages according to an initial input setting. This input setting was updated each day, and part of the process required the operator of the machine to select a three-letter sequence from a large book “at random” (Good, 2000). Alan Turing, who was in charge of the decryption effort on the part of the British, quickly understood that the German operators’ choices from this book were not entirely random, and that patterns in their choices could be exploited to narrow down the search. The problem became how to estimate the probability of each three-letter sequence based on a relatively small sample of previously decoded input settings—a sample much smaller than the number of three-letter sequences in the book (which, like the number of possible tokens in English text, was extremely large). Turing solved this problem—essentially the same problem posed by word frequencies in text—by developing a method that used frequencies of frequencies in the sample to estimate the total probability of yet-unseen sequences and correct the observed frequencies based on this estimate. The algorithm was later refined and published by Turing’s assistant I. J. Good (1953), and has since seen many variations. Today, the most popular of these variations is the Simple Good-Turing algorithm (Gale & Sampson, 1995). Unfortunately, Simple Good-Turing smoothing is not yet implemented in Quanteda, but that is expected to change in the coming months. As soon as it does, we will add it to the textbook.

While smoothing may seem unnecessarily complex, it can be an important safeguard against false positives when analyzing word counts or frequencies. This is because the bias that smoothing accounts for (due to unobserved tokens) is stronger for shorter texts, and grows in a non-linear pattern (Baayen, 2001). This bias can become a confounding factor in regression models, especially when a predictor variable is strongly correlated with text length.

Advantages of Smoothing
  • Better Estimates of Token Probabilities
  • Corrects Text-Length-Dependent Bias: This bias can cause false positive results in regression analyses.
Disadvantages of Smoothing
  • Computationally Inefficient: Laplace smoothing is much more efficient than Simple Good-Turing, but it does not perform as well.

16.7 Machine Learning Approaches

This chapter is dedicated to methods that can be carried out after computing word counts (Chapter 13) and before analysis (Chapter 14; Chapter 15). These methods can be thought of as transformations that we apply to word counts to make them better at measuring what we want them to measure. The last of these transformations is the most elaborate one: supervised machine learning.

For supervised machine learning, you need a training dataset of text that is already labeled with your construct of interest. We already saw a dataset like this for the emotion of surprise in Section 15.4: the Crowdflower Emotion in Text dataset. Many other good examples can be found in Chapter 8. The labels could be generated by participants who read the texts, or by the authors of the texts themselves (e.g. in the form of questionnaires that measure their personality).

Once you have your training dataset, compute word counts for its texts (Chapter 13). These could be dictionary-based counts from a variety of different dictionaries, or they could be individual token counts in the form of a DFM. You can even transform these counts in some of the ways described in this chapter (or combine more than one transformation). These counts will become the predictor variables for your machine learning model.

This book is not the place for an in-depth tutorial on machine learning1, but we will give a brief example. Recall that in Section 15.4 we used the Crowdflower Emotion in Text dataset to find the words most indicative of surprise in tweets (based on PMI). There we used these words as a dictionary, but we could get more robust results by using them as predictor variables to train a regularized machine learning model, in this case ridge regression, although there are many other options.

# New Hippocorpus DFM 
hippocorpus_dfm_ngrams <- hippocorpus_corp |> 
  tokens(remove_punct = TRUE) |> 
  tokens_ngrams(n = c(1L, 2L)) |> # 1-grams and 2-grams
  dfm() |> 
  dfm_weight(scheme = "prop") # relative tokenization

# Select only high PMI tokens from Crowdflower
hippocorpus_dfm_surprise <- hippocorpus_dfm_ngrams |> 
  dfm_select(tweet_surprise_dict)

# Crowdflower DFM with only tokens that appear in Hippocorpus DFM
crowdflower_dfm <- crowdflower_dfm |> 
  dfm_weight(scheme = "prop") |>  # relative tokenization
  dfm_select(featnames(hippocorpus_dfm_surprise)) 

# Rejoin to Crowdflower DFM labeled data
crowdflower_train <- crowdflower |> 
  mutate(
    doc_id = as.character(tweet_id),
    label_surprise = as.integer(sentiment == "surprise")
    ) |> 
  select(doc_id, label_surprise) |> 
  left_join(
    convert(crowdflower_dfm, "data.frame"),
    by = "doc_id"
  ) |> 
  select(-doc_id)

# Balanced training dataset
# set.seed(2)
# crowdflower_train <- crowdflower_train |> 
#   group_by(label_surprise) |> 
#   slice_sample(n = sum(crowdflower_train$label_surprise==1))

# Ridge Regression (10-fold cross-validation)
library(caret)

tg <- expand.grid(
  alpha = 0, 
  lambda = c(2 ^ seq(-1, -6, length = 20))
  )

set.seed(2)
surprise_ridge <- train(
  label_surprise ~ ., 
  data = crowdflower_train, 
  method = "glmnet",
  tuneGrid = tg,
  trControl = trainControl("cv", number = 10)
  )

# Prepare Hippocorpus data to run model
hippocorpus_features <- hippocorpus_dfm_surprise |> 
  convert("data.frame") |> 
  left_join(
    convert(hippocorpus_corp, "data.frame"),
    by = "doc_id"
  )

# Run model on Hippocorpus for surprise estimation
surprise_pred <- predict(surprise_ridge, newdata = hippocorpus_features)
hippocorpus_features <- hippocorpus_features |> 
  mutate(surprise_pred = surprise_pred)

Now that we have a new machine-learning-powered estimate of surprise for the Hippocorpus data, we can retest our hypothesis that true autobiographical stories include more surprise than imagined stories.

surprise_mod_ml <- lm(surprise_pred ~ memType, 
                      data = hippocorpus_features)
summary(surprise_mod_ml)
#> 
#> Call:
#> lm(formula = surprise_pred ~ memType, data = hippocorpus_features)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.004668 -0.004625 -0.004251  0.002972  0.066572 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      5.735e-02  1.458e-04 393.268   <2e-16 ***
#> memTyperecalled -4.174e-04  2.058e-04  -2.028   0.0426 *  
#> memTyperetold   -4.367e-05  2.563e-04  -0.170   0.8647    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.007655 on 6851 degrees of freedom
#> Multiple R-squared:  0.0006732,  Adjusted R-squared:  0.0003814 
#> F-statistic: 2.308 on 2 and 6851 DF,  p-value: 0.09958

We find a statistically significant difference between recalled (autobiographical) and imagined stories (p = 0.043). Note, however, that the coefficient is negative: recalled stories score slightly lower on the surprise estimate than imagined stories, the opposite of our hypothesized direction.

Despite the exciting result, we should be careful with this newfangled approach. As with dictionary-based methods, beware of problems with generalization—there is no guarantee that surprise in Tweets will look similar to surprise in autobiographical accounts. Likewise, keep in mind all of the regular challenges of machine learning. Notice for example that the intercept of this model is extremely low (0.057; surprise is measured between 0 and 1). This reflects the fact that the Crowdflower dataset is not balanced; there are very few Tweets labeled with surprise relative to the size of the dataset.
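
A quick way to check this imbalance is to look at the proportion of surprise-labeled tweets in the training data (the commented-out code above shows one way to down-sample to a balanced training set):

# proportion of tweets labeled as surprise in the Crowdflower training data
mean(crowdflower_train$label_surprise)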

An example of machine learning approaches in research: Zamani et al. (2018) extracted n-grams from Facebook status updates. They then computed TF-IDF scores, binary tokenization, and LDA topics (Chapter 21), subjected all of these values to a dimensionality reduction process to reduce overfitting, and used the resulting features to train a ridge regression model to predict questionnaire-based measures of trustfulness. This regression model could then be used on novel texts to estimate the trustworthiness of their authors.

Advantages of Machine Learning Approaches
  • Accuracy
  • Regularization: Regularized machine learning algorithms (like the ridge regression above) shrink extreme coefficients, mitigating overfitting and the influence of outliers. This can sometimes help models generalize across datasets too.
  • Avoid Statistical Troubles: The output of machine learning models is often continuous and more or less normally distributed. This means standard linear regression is usually sufficient for hypothesis testing.
Disadvantages of Machine Learning Approaches
  • Require a Relevant Training Dataset
  • Difficult to Interpret
  • May Fail to Generalize Across Datasets

Baayen, R. H. (2001). Word frequency distributions. Springer Netherlands. https://link.springer.com/book/10.1007/978-94-010-0844-0
Gale, W. A., & Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3), 217–237. https://doi.org/10.1080/09296179508590051
Giuntini, F. T., Cazzolato, M. T., Reis, M. de J. D. dos, Campbell, A. T., Traina, A. J. M., & Ueyama, J. (2020). A review on recognizing depression in social networks: Challenges and opportunities. Journal of Ambient Intelligence and Humanized Computing, 11(11), 4713–4729.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4), 237–264. https://doi.org/10.2307/2333344
Good, I. J. (2000). Turing’s anticipation of empirical bayes in connection with the cryptanalysis of the naval enigma. Journal of Statistical Computation and Simulation, 66(2), 101–111. https://doi.org/10.1080/00949650008812016
Holtgraves, T. (2011). Text messaging, personality, and the social context. Journal of Research in Personality, 45(1), 92–99. https://doi.org/10.1016/j.jrp.2010.11.015
Kronmal, R. A. (1993). Spurious correlation and the fallacy of the ratio standard revisited. Journal of the Royal Statistical Society Series A, 156(3), 379–392. https://www.jstor.org/stable/2983064
Laplace, P. S. de. (1816). Essai philosophique sur les probabilités. Courcier. http://archive.org/details/bub_gb_vVzdR0tuoWAC
Mehl, M. R., Gosling, S. D., & Pennebaker, J. W. (2006). Personality in its natural habitat: Manifestations and implicit folk theories of personality in daily life. Journal of Personality and Social Psychology, 90(5), 862–877. https://api.semanticscholar.org/CorpusID:2932332
Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M. E. P., & Ungar, L. H. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PLOS ONE, 8(9), 1–16. https://doi.org/10.1371/journal.pone.0073791
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21. https://doi.org/10.1108/eb026526
Sumner, C., Byers, A., & Shearing, M. (2011). Determining personality traits & privacy concerns from facebook activity. Black Hat Brief, 11.
Zamani, M., Buffone, A., & Schwartz, H. A. (2018). Predicting human trustfulness from Facebook language. In K. Loveys, K. Niederhoffer, E. Prud’hommeaux, R. Resnik, & P. Resnik (Eds.), Proceedings of the fifth workshop on computational linguistics and clinical psychology: From keyboard to clinic (pp. 174–181). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-0619

  1. For a good overview of different methods, see Giuntini et al. (2020)↩︎