In Chapter 14, we used a preexisting list of surprise-related words to test whether true autobiographical stories are more surprise-related than imagined stories. Sometimes, though, it can be informative to take a theory-free approach to the difference between two groups. We might ask: How are the words used in true autobiographical stories different from the words used in imagined stories? We call the investigation of questions like this open vocabulary analysis, a term coined by Schwartz et al. (2013).
Both open vocabulary analyses and dictionary-based methods use word counts, but they come from opposite directions: Dictionary-based methods start with a construct and use it to examine the difference between groups (or levels of a continuous variable). The open vocabulary approach, on the other hand, starts with the difference between groups (or levels of a continuous variable) and generates a list of words that can then be identified with one or more constructs after the fact.
Open vocabulary analyses are useful for three types of applications:
1. Open vocabulary analyses are sometimes a good second step in EDA, after looking at your data directly (see Chapter 11). If you are planning an analysis of token counts (e.g. using dictionary-based methods), open vocabulary analyses are a good way to look for overall patterns in the way that tokens are distributed across your groups, or to look for individual tokens that are particularly representative of one group or another.
2. When training a machine learning model on token counts (Section 16.7), you will need to decide which variables (AKA features) to include in the model. You can use an open vocabulary approach to find the tokens that carry the most information about your outcome variable.
3. Open vocabulary analyses can be a final product! In some cases, researchers want to characterize the difference between the language use of two groups without being constrained by particular dictionaries. Schwartz et al. (2013) provide many elegant examples of this approach.
15.1 Frequency Ratios
The most intuitive way to compare texts from two groups is one we already explored in the context of data visualization in Chapter 5: frequency ratios. To get frequency ratios from a Quanteda DFM, we can use the textstat_frequency() function from the quanteda.textstats package, with the groups parameter set to the categorical variable of interest. Let's compare recalled (true) stories with imagined ones in the Hippocorpus data.
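A minimal sketch of that call, assuming the stories have already been tokenized into a DFM named hippocorpus_dfm with a document variable memType that labels each story's group (both names are illustrative, not from the original code):

library(quanteda.textstats)

# count each feature within each group, using the memType docvar
# (hippocorpus_dfm and memType are assumed names for this sketch)
story_freqs <- textstat_frequency(
  hippocorpus_dfm,
  groups = docvars(hippocorpus_dfm, "memType")
)

head(story_freqs)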
#> feature frequency rank docfreq group
#> 1 i 32080 1 2686 imagined
#> 2 the 25826 2 2742 imagined
#> 3 to 24104 3 2746 imagined
#> 4 and 20847 4 2707 imagined
#> 5 a 16945 5 2714 imagined
#> 6 was 15405 6 2618 imagined
In the resulting dataframe, each row represents one feature within each category. frequency is the number of times the feature appears in the group, rank is the ordering from highest to lowest frequency within each group, and docfreq is the number of documents in the group in which the feature appears at least once. To compare imagined stories to recalled ones, we can calculate frequency ratios.
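One way to get from the long textstat_frequency() output to a wide table of ratios is to pivot the counts into one column per group, convert them to within-group proportions, and divide. This sketch assumes the grouped frequency table from above is stored in story_freqs (an illustrative name) and keeps only the imagined and recalled groups:

library(dplyr)
library(tidyr)

imagined_vs_recalled <- story_freqs |>
  filter(group %in% c("imagined", "recalled")) |>
  select(feature, group, frequency) |>
  pivot_wider(
    names_from = group, values_from = frequency,
    names_prefix = "count_", values_fill = 0
  ) |>
  mutate(
    # counts as proportions of each group's total tokens
    freq_imagined = count_imagined / sum(count_imagined),
    freq_recalled = count_recalled / sum(count_recalled),
    # ratio > 1 means the word is more common in imagined stories
    imagined_freq_ratio = freq_imagined / freq_recalled
  )

Note that values_fill = 0 gives a zero count to words that appear in only one group, which makes the ratio 0 or infinite for those words; in practice you may want to add a small smoothing constant before dividing.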
#> # A tibble: 6 × 6
#> feature count_imagined count_recalled freq_imagined freq_recalled
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 i 32080 32906 0.0475 0.0426
#> 2 the 25826 31517 0.0383 0.0408
#> 3 to 24104 27379 0.0357 0.0354
#> 4 and 20847 26447 0.0309 0.0342
#> 5 a 16945 19850 0.0251 0.0257
#> 6 was 15405 19484 0.0228 0.0252
#> # ℹ 1 more variable: imagined_freq_ratio <dbl>
We can now draw a rotated F/F plot, as in Chapter 5.
library(ggiraph, verbose = FALSE)
library(ggrepel)

set.seed(2023)

p <- imagined_vs_recalled |>
  mutate(
    # calculate total frequency
    common = (freq_imagined + freq_recalled)/2,
    # remove single quotes (for html)
    feature = str_replace_all(feature, "'", "`")
  ) |>
  ggplot(aes(imagined_freq_ratio, common,
             label = feature, color = imagined_freq_ratio,
             tooltip = feature, data_id = feature)) +
  geom_point_interactive() +
  geom_text_repel_interactive(size = 2) +
  scale_y_continuous(
    trans = "log2",
    breaks = ~.x,
    minor_breaks = ~2^(seq(0, log2(.x[2]))),
    labels = c("Rare", "Common")
  ) +
  scale_x_continuous(
    trans = "log10",
    limits = c(1/10, 10),
    breaks = c(1/10, 1, 10),
    labels = c("10x More Common\nin Recalled Stories",
               "Equal Proportion",
               "10x More Common\nin Imagined Stories")
  ) +
  scale_color_gradientn(
    colors = c("#023903", "#318232", "#E2E2E2", "#9B59A7", "#492050"),
    trans = "log2",  # log scale for ratios
    guide = "none"
  ) +
  labs(
    title = "Words in Imagined and Recalled Stories",
    x = "", y = "Total Frequency", color = ""
  ) +
  # fixed coordinates since x and y use the same units
  coord_fixed(ratio = 1/8) +
  theme_minimal()

girafe_options(
  girafe(ggobj = p),
  opts_tooltip(css = "font-family:sans-serif;font-size:1em;color:Black;")
)