15  Open Vocabulary Word Counting

In Chapter 14, we used a preexisting list of surprise-related words to test whether true autobiographical stories are more surprise-related than imagined stories. Sometimes, though, it can be informative to take a theory-free approach to the difference between two groups. We might ask: How are the words used in true autobiographical stories different from the words used in imagined stories? We call the investigation of questions like this open vocabulary analysis, a term coined by Schwartz et al. (2013).

Both open vocabulary analyses and dictionary-based methods use word counts, but they come from opposite directions: Dictionary-based methods start with a construct and use it to examine the difference between groups (or levels of a continuous variable). The open vocabulary approach, on the other hand, starts with the difference between groups (or levels of a continuous variable) and generates a list of words that can then be identified with one or more constructs after the fact.

Open vocabulary analyses are useful for three types of applications:

15.1 Frequency Ratios

The most intuitive way to compare texts from two groups is one we already explored in the context of data visualization in Chapter 5: frequency ratios. To get frequency ratios from a Quanteda DFM, we can start with the textstat_frequency() function from the quanteda.textstats package, with the groups parameter set to the categorical variable of interest. Let’s compare recalled (true) stories with imagined ones in the Hippocorpus data.

library(quanteda.textstats)

imagined_vs_recalled <- hippocorpus_dfm |> 
  textstat_frequency(groups = memType)

head(imagined_vs_recalled)
#>   feature frequency rank docfreq    group
#> 1       i     32080    1    2686 imagined
#> 2     the     25826    2    2742 imagined
#> 3      to     24104    3    2746 imagined
#> 4     and     20847    4    2707 imagined
#> 5       a     16945    5    2714 imagined
#> 6     was     15405    6    2618 imagined
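
To make the columns of this output concrete, here is a minimal sketch of the same call on a toy corpus small enough to count by hand. The four documents and the object names (toy_corpus, toy_dfm, toy_freq) are hypothetical and not part of the Hippocorpus data; only the memType grouping variable is borrowed from the example above.

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical toy corpus: two "imagined" and two "recalled" documents
toy_corpus <- corpus(
  c(d1 = "the dog ran", d2 = "the dog sat",
    d3 = "a cat ran",   d4 = "the cat ran and ran"),
  docvars = data.frame(
    memType = c("imagined", "imagined", "recalled", "recalled")
  )
)

toy_dfm <- toy_corpus |> tokens() |> dfm()

# Same call as above: one row per feature per group
toy_freq <- textstat_frequency(toy_dfm, groups = memType)

# Counting by hand, "ran" occurs once in the imagined documents (d1 only)
# and three times in the recalled documents (once in d3, twice in d4),
# where it is spread over two documents.
subset(toy_freq, feature == "ran")
```
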

In the resulting dataframe, each row represents one feature within one group. frequency is the number of times the feature appears in the group, rank is the ordering from highest to lowest frequency within each group, and docfreq is the number of documents in the group in which the feature appears at least once. To compare imagined stories to recalled ones, we can calculate frequency ratios.

library(tidyverse) # filter(), pivot_wider(), mutate(), str_replace_all(), ggplot()

imagined_vs_recalled <- imagined_vs_recalled |> 
  filter(group %in% c("imagined", "recalled")) |> 
  pivot_wider(id_cols = "feature", 
              names_from = "group", 
              values_from = "frequency",
              names_prefix = "count_") |> 
  mutate(freq_imagined = count_imagined/sum(count_imagined, na.rm = TRUE),
         freq_recalled = count_recalled/sum(count_recalled, na.rm = TRUE),
         imagined_freq_ratio = freq_imagined/freq_recalled)

head(imagined_vs_recalled)
#> # A tibble: 6 × 6
#>   feature count_imagined count_recalled freq_imagined freq_recalled
#>   <chr>            <dbl>          <dbl>         <dbl>         <dbl>
#> 1 i                32080          32906        0.0475        0.0426
#> 2 the              25826          31517        0.0383        0.0408
#> 3 to               24104          27379        0.0357        0.0354
#> 4 and              20847          26447        0.0309        0.0342
#> 5 a                16945          19850        0.0251        0.0257
#> 6 was              15405          19484        0.0228        0.0252
#> # ℹ 1 more variable: imagined_freq_ratio <dbl>
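
The arithmetic behind these columns can be checked on a small scale. The following is a rough sketch in base R, using made-up token vectors rather than the Hippocorpus data; all object names here are hypothetical. Each count is divided by the total number of tokens in its group, and the two relative frequencies are then divided by one another.

```r
# Hypothetical toy tokens for each group (not from Hippocorpus)
imagined_tokens <- c("i", "was", "happy", "i", "was", "there")
recalled_tokens <- c("i", "was", "tired", "we", "was", "there", "we", "i")

# Count every word in each group over a shared vocabulary
vocab <- sort(union(imagined_tokens, recalled_tokens))
count_imagined <- as.vector(table(factor(imagined_tokens, levels = vocab)))
count_recalled <- as.vector(table(factor(recalled_tokens, levels = vocab)))

toy <- data.frame(feature = vocab, count_imagined, count_recalled)

# Relative frequency = count / total tokens in that group
toy$freq_imagined <- toy$count_imagined / sum(toy$count_imagined)
toy$freq_recalled <- toy$count_recalled / sum(toy$count_recalled)

# Ratio > 1: relatively more common in imagined; < 1: more common in recalled
toy$imagined_freq_ratio <- toy$freq_imagined / toy$freq_recalled
toy
```

Here "i" makes up 2 of 6 imagined tokens and 2 of 8 recalled tokens, so its ratio is (2/6)/(2/8), or about 1.33, even though the raw counts are equal. Words missing from one group come out as 0 or Inf in this sketch. In the pipeline above, such a word would presumably have no row for that group, so pivot_wider() would leave an NA (hence na.rm = TRUE in the sums) and the ratio would be NA instead.
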

We can now draw a rotated F/F plot, as in Chapter 5.

library(ggiraph, verbose = FALSE)
library(ggrepel)

set.seed(2023)
p <- imagined_vs_recalled |> 
  mutate(
    # mean relative frequency across the two groups (plotted on the y axis)
    common = (freq_imagined + freq_recalled)/2,
    # replace single quotes with backticks (for html)
    feature = str_replace_all(feature, "'", "`")) |> 
  ggplot(aes(imagined_freq_ratio, common, 
             label = feature,
             color = imagined_freq_ratio,
             tooltip = feature, 
             data_id = feature
             )) +
    geom_point_interactive() +
    geom_text_repel_interactive(size = 2) +
    scale_y_continuous(
      trans = "log2", breaks = ~.x,
      minor_breaks = ~2^(seq(0, log2(.x[2]))),
      labels = c("Rare", "Common")
      ) +   
    scale_x_continuous(
      trans = "log10", limits = c(1/10,10),
      breaks = c(1/10, 1, 10),
      labels = c("10x More Common\nin Recalled Stories",
                 "Equal Proportion",
                 "10x More Common\nin Imagined Stories")
      ) +
    scale_color_gradientn(
      colors = c("#023903", 
                 "#318232",
                 "#E2E2E2", 
                 "#9B59A7",
                 "#492050"), 
      trans = "log2", # log scale for ratios
      guide = "none"
      ) +
    labs(
      title = "Words in Imagined and Recalled Stories",
      x = "",
      y = "Total Frequency",
      color = ""
    ) +
    # fixed coordinates since x and y use the same units
    coord_fixed(ratio = 1/8) + 
    theme_minimal()

girafe_options(
  girafe(ggobj = p),
  opts_tooltip(css = "font-family:sans-serif;font-size:1em;color:Black;")
  )
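
The x axis and the color scale above are both log-transformed. A small sketch in base R, with illustrative values only, shows why this suits ratios: on a log scale, a word ten times more common in imagined stories sits exactly as far from the center as a word ten times more common in recalled stories.

```r
# Reciprocal pairs of ratios around 1 (illustrative values)
ratios <- c(1/10, 1/2, 1, 2, 10)

# On the raw scale, distances from "equal proportion" are lopsided
raw_distance <- ratios - 1

# On a log scale, reciprocal ratios are mirror images around 0
log_distance <- log10(ratios)

data.frame(ratios, raw_distance, log_distance)
```

On the raw scale, the ratio 1/10 is only 0.9 away from 1, while the ratio 10 is 9 away. After the log transform they land at -1 and 1.
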