5  Visualizing Distributions of Words

Natural language presents a special challenge for data visualization because it is so complicated. The best visualizations simplify the story by focusing on one or two words, as we saw with the line graph of “I” and “we” by age in Section 4.1. Another option is to group words together into an overarching category, as we saw with the visualizations of “positive emotions” in Chapter 3. Such simplifications make for better stories, but sometimes concessions must be made to the complexity of language: viewers want to see all the words at once.

Normally, showing information about thousands of categories at the same time would be terribly confusing. But words are different, because everyone is an expert in natural language. The average native English speaker knows tens of thousands of words, along with nuanced associations between them (Hellman, 2011). This means that, with the right presentation, viewers can quickly take in many words and notice stories on their own.

Because visualizing words can be so foreign to those not familiar with natural language processing, we offer a brief tutorial on two methods: frequency/frequency plots and word clouds. Though we focus on words here, both of these methods can be applied to other types of tokens (see Chapter 13). Likewise, we focus here on frequency ratios as an intuitive metric for group comparisons, but the same methods can be applied to more advanced metrics for comparing groups, as we will see in Section 15.2.

In this tutorial, we will visualize data from Buechel et al. (2018, see Section 4.2.1 and Section 4.2.4), in which participants rated their distress after reading various news stories, and described their thoughts in their own words. The distress ratings were then binned into two groups, allowing us to compare the content of “distressed” texts to that of “non-distressed” texts.

library(tidyverse) # provides read_csv(), select(), and ggplot()

distressed_texts <- read_csv("https://raw.githubusercontent.com/wwbp/empathic_reactions/master/data/responses/data/messages.csv", show_col_types = FALSE) |> 
  select(essay, distress, distress_bin)

head(distressed_texts)
#> # A tibble: 6 × 3
#>   essay                                                    distress distress_bin
#>   <chr>                                                       <dbl>        <dbl>
#> 1 it is really diheartening to read about these immigrant…     4.38            1
#> 2 the phone lines from the suicide prevention line surged…     4.88            1
#> 3 no matter what your heritage, you should be able to ser…     3.5             0
#> 4 it is frightening to learn about all these shark attack…     5.25            1
#> 5 the eldest generation of russians aren't being treated …     4.62            1
#> 6 middle east is fucked up, I've honestly never heard of …     3.12            0

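After some preprocessing, we begin with a dataframe in which each word has a row, with three variables: distressed_count (the number of times the word appears in distressed texts), nondistressed_count (the number of times it appears in non-distressed texts), and distressed_freq_ratio (the ratio of the word’s frequency in distressed texts to its frequency in non-distressed texts, so that values above 1 mark words that are relatively more frequent in distressed texts).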

The dataset has been filtered to include only “stop words” (covered in Chapter 16), that is, words that we wouldn’t expect to be associated with the topic of the text.

head(distressed_texts_binary)
#> # A tibble: 6 × 4
#>   word  distressed_count nondistressed_count distressed_freq_ratio
#>   <chr>            <int>               <int>                 <dbl>
#> 1 the               3297                2985                 1.10 
#> 2 to                2556                2415                 1.06 
#> 3 and               2125                1856                 1.14 
#> 4 of                1592                1416                 1.12 
#> 5 i                 1587                1874                 0.847
#> 6 a                 1547                1603                 0.965
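
The preprocessing itself is not shown here, so the following is only a minimal sketch of how a table of this shape could be built. It runs on a made-up toy tibble (toy_texts is hypothetical, not the study data), uses a crude regular-expression tokenizer, and assumes that the ratio is computed from relative frequencies (each count divided by the total number of tokens in its group); the actual pipeline may differ on any of these points, and the stop-word filtering step is omitted.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for distressed_texts
toy_texts <- tibble(
  essay = c("i am so sad about the news",
            "the news is sad and the world is cruel",
            "i think the story is fine",
            "i read the story and i moved on"),
  distress_bin = c(1, 1, 0, 0)
)

toy_binary <- toy_texts |>
  # crude tokenization: lowercase, then split on anything
  # that is not a letter or an apostrophe
  mutate(word = strsplit(tolower(essay), "[^a-z']+")) |>
  unnest(word) |>
  filter(word != "") |>
  # one row per word per group, then one column per group
  count(word, distress_bin) |>
  pivot_wider(names_from = distress_bin, values_from = n,
              names_prefix = "bin_", values_fill = 0L) |>
  rename(distressed_count = bin_1, nondistressed_count = bin_0) |>
  # assumed definition: ratio of relative frequencies
  mutate(
    distressed_freq_ratio =
      (distressed_count / sum(distressed_count)) /
      (nondistressed_count / sum(nondistressed_count))
  ) |>
  # keep words seen in both groups so the ratio is finite and nonzero
  filter(distressed_count > 0, nondistressed_count > 0) |>
  arrange(desc(distressed_count))

toy_binary
```

Computing the ratio before dropping words keeps the group totals honest: each denominator is the full token count of its group, not just the tokens that survive the filter.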

5.1 Frequency/Frequency Plots

A scatterplot is the most obvious choice for visualizing the relationship between two variables. For text data, this approach is commonly associated with the scattertext Python library (Kessler, 2017), but the same effect is easily accomplished in ggplot2.

Since we are comparing frequency in one group to frequency in another, we can put each frequency variable on an axis. We will call this a frequency/frequency plot, or F/F plot. To emphasize words that are more frequent in one group than in the other, we represent the ratio between the two frequencies with a diverging color scale.

library(ggrepel)

set.seed(2023)
distressed_texts_binary |> 
  ggplot(aes(nondistressed_count, distressed_count, 
             label = word,
             color = distressed_freq_ratio)) +
    geom_point() +
    geom_text_repel(max.overlaps = 20) +
    scale_x_continuous(trans = "log10", n.breaks = 5) +
    scale_y_continuous(trans = "log10", n.breaks = 6) +
    scale_color_gradient2(
      low = "blue4", 
      mid = "#E2E2E2", 
      high = "red4", 
      trans = "log2", # log scale for ratios
      limits = c(.25, 4), 
      breaks = c(.25, 1, 4),
      labels = c("Characteristically\nNon-Distressed",
                 "Equal Proportion",
                 "Characteristically\nDistressed")
      ) +
    labs(
      title = "Stop Words in Distressed and Non-Distressed Texts",
      x = "Occurrences in Non-Distressed Texts",
      y = "Occurrences in Distressed Texts",
      color = ""
      ) +
    coord_fixed() +
    theme_minimal()

This plot has the advantage of showing not just which words are characteristic of one group or the other, but also which words are common in both groups.
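
The trans = "log2" argument in the color scale above is worth a brief illustration. A ratio of 4 and a ratio of 1/4 are equally lopsided, but on the raw scale they sit at very different distances from 1. The small sketch below (with made-up ratios) shows how the log transformation makes them symmetric, which is why the limits c(.25, 4) put “Equal Proportion” exactly in the middle of the color gradient.

```r
# Ratios that are equally lopsided in opposite directions
ratios <- c(0.25, 0.5, 1, 2, 4)

# On the raw scale, distances from 1 are asymmetric
raw_distance <- abs(ratios - 1)
raw_distance
#> [1] 0.75 0.50 0.00 1.00 3.00

# On the log2 scale, reciprocal ratios are mirror images around 0
log_ratios <- log2(ratios)
log_ratios
#> [1] -2 -1  0  1  2
```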

To allow viewers to explore these patterns in greater detail, we can make the plot interactive using the ggiraph package. Hover over the points to show the words they represent!

library(ggiraph, verbose = FALSE)
library(ggrepel)

set.seed(2023)
p <- distressed_texts_binary |> 
  ggplot(aes(nondistressed_count, distressed_count, 
             label = word,
             color = distressed_freq_ratio,
             tooltip = word, 
             data_id = word # aesthetics for interactivity
             )) +
    geom_point_interactive() +
    geom_text_repel_interactive() +
    scale_x_continuous(trans = "log10", n.breaks = 5) +
    scale_y_continuous(trans = "log10", n.breaks = 6) +
    scale_color_gradient2(
      low = "blue4", 
      mid = "#E2E2E2", 
      high = "red4", 
      trans = "log2", # log scale for ratios
      limits = c(.25, 4), 
      breaks = c(.25, 1, 4),
      labels = c("Characteristically\nNon-Distressed",
                 "Equal Proportion",
                 "Characteristically\nDistressed")
    ) +
    labs(
      title = "Stop Words in Distressed and Non-Distressed Texts",
      x = "Occurrences in Non-Distressed Texts",
      y = "Occurrences in Distressed Texts",
      color = ""
    ) +
    # fixed coordinates since x and y use the same units
    coord_fixed() + 
    theme_minimal()

girafe_options(
  girafe(ggobj = p),
  opts_tooltip(css = "font-family:sans-serif;font-size:1em;color:Black;")
  ) 

5.1.1 Rotated Frequency/Frequency Plots

Simple F/F plots have a disadvantage: when people see a scatterplot, they think, “Aha! A correlation!” Any two samples of text in the same language will have highly correlated word frequencies, so the correlation is a boring story, and it distracts from the more interesting stories about words that are especially characteristic of one group or the other. We can remove this distraction by “rotating” the axes. Mathematically, we achieve this by plotting the average of the two frequencies, (nondistressed_count + distressed_count)/2, on the y axis and the ratio between the two frequencies on the x axis. The result is a much more intuitive plot with a clear binary comparison. Remember, sometimes you have to do something complicated to make something simple (Section 4.1).

library(ggiraph, verbose = FALSE)
library(ggrepel)

set.seed(2023)
p1 <- distressed_texts_binary |> 
  mutate(common = (nondistressed_count + distressed_count)/2) |> 
  ggplot(aes(distressed_freq_ratio, common, 
             label = word,
             color = distressed_freq_ratio,
             tooltip = word, data_id = word # aesthetics for interactivity
             )) +
    geom_point_interactive() +
    geom_text_repel_interactive() +
    scale_y_continuous(trans = "log2", breaks = ~.x,
                       minor_breaks = ~2^(seq(0,log2(.x[2]))),
                       labels = c("Rare", "Common")) +   
    scale_x_continuous(trans = "log2", limits = c(1/6,6),
                       breaks = c(.25, 1, 4),
                       labels = c("Characteristically\nNon-Distressed",
                                  "Equal Proportion",
                                  "Characteristically\nDistressed")) +
    scale_color_gradient2(low = "blue4", 
                          mid = "#E2E2E2", 
                          high = "red4", 
                          trans = "log2", # log scale for ratios
                          guide = "none") +
    labs(title = "Stop Words in Distressed and Non-Distressed Texts",
         x = "",
         y = "Average Frequency",
         color = "") +
    theme_minimal()

girafe_options(
  girafe(ggobj = p1),
  opts_tooltip(css = "font-family:sans-serif;font-size:1em;color:Black;")
  )
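
To see why “rotating” is an apt description, the sketch below rotates three points from the original F/F plot (the counts for “the”, “to”, and “i” from the table above) by 45 degrees in log space. This is an illustration, not the code used for the plot: a literal rotation yields the log of the frequency ratio on one axis and the log of the geometric mean on the other, each up to a constant factor of the square root of 2. The plot above uses the arithmetic mean instead, which serves the same purpose of ordering words from rare to common.

```r
# Counts for "the", "to", and "i" from the table above
x <- c(2985, 2415, 1874)  # nondistressed_count
y <- c(3297, 2556, 1587)  # distressed_count

# Coordinates of each word on the log-log F/F plot
u <- log2(x)
v <- log2(y)

# Clockwise rotation by 45 degrees (matrix() fills column by column)
theta <- pi / 4
rot <- matrix(c(cos(theta), -sin(theta),
                sin(theta),  cos(theta)), nrow = 2)
rotated <- rot %*% rbind(u, v)

# Second rotated coordinate: log ratio, scaled by 1/sqrt(2)
all.equal(rotated[2, ] * sqrt(2), log2(y / x))

# First rotated coordinate: log geometric mean, scaled by sqrt(2)
all.equal(rotated[1, ] / sqrt(2), log2(sqrt(x * y)))
```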

Because we love these rotated F/F plots so much, we couldn’t help showing off one more example, this time with data from the Corpus of Contemporary American English (Davies, 2009):

# get frequency data
httr::GET("https://www.wordfrequency.info/files/genres_sample.xls",
          httr::write_disk(tf <- tempfile(fileext = ".xls")))
word_freqs <- readxl::read_excel(tf) |> 
  select(lemma, ACADEMIC, SPOKEN)
p2 <- word_freqs |> 
  # drop lemmas absent from either genre so the ratio is finite and nonzero
  filter(ACADEMIC != 0, SPOKEN != 0) |> 
  # generate tooltip text
  mutate(rep = if_else(ACADEMIC/SPOKEN > 1, 
                       "more common in academic texts",
                       "more common in spoken texts"),
         mult = if_else(ACADEMIC/SPOKEN > 1, 
                        as.character(round(ACADEMIC/SPOKEN, 2)),
                        as.character(round(SPOKEN/ACADEMIC, 2))),
         tooltip = paste0("<b>",lemma, "</b>", "<br/>", 
                          mult, "x ", rep)) |> 
  ggplot(aes(ACADEMIC/SPOKEN, (ACADEMIC + SPOKEN)/2, 
             label = lemma,
             color = ACADEMIC/SPOKEN,
             tooltip = tooltip, 
             data_id = lemma # aesthetics for interactivity
             )) +
    geom_point_interactive() +
    scale_x_continuous(trans = "log2", 
                       breaks = c(1/100, 1, 100),
                       labels = c("Characteristically\nSpoken",
                                  "Equal Proportion",
                                  "Characteristically\nAcademic")) +
    scale_y_continuous(trans = "log2", 
                       breaks = ~.x, 
                       minor_breaks = ~2^(seq(0, log2(.x[2]))),
                       labels = c("Rare", "Common")) +
    scale_color_gradientn(limits = c(1/740, 740),
                          colors = c("#023903", 
                                     "#318232",
                                     "#E2E2E2", 
                                     "#9B59A7",
                                     "#492050"), 
                          trans = "log2", # log scale for ratios
                          guide = "none") +
    labs(title = "Academic vs. Spoken English",
         x = "", y = "",
         color = "") +
    theme_minimal()

girafe_options(
  girafe(ggobj = p2),
  opts_tooltip(css = "font-family:sans-serif;font-size:1em;color:Black;")
  )