12  Introduction to Quanteda

For all analyses that involve splitting text into smaller units (e.g. words), this book will use the Quanteda family of packages. While it lacks some of the conceptual elegance of the tidyverse, Quanteda is indispensable because of its scope and efficiency: Quanteda functions are faster to write, faster to run, and applicable to a wider variety of uses than any other text analysis framework in R. Beyond the base quanteda package, Quanteda offers a plethora of specialized extensions.

Quanteda is also well documented. See the Quanteda tutorials webpage for details on file formats and methods not covered in this book.

The home base of any Quanteda analysis is the corpus. A corpus is a static container holding a library of text documents and associated properties of those documents, called docvars. Later, before we apply complex transformations to our texts (such as splitting them into words), we will first store them as a corpus so that document identifiers and docvars stay attached to each text.
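To make the idea of documents plus docvars concrete, here is a minimal sketch built from a made-up named character vector. The texts, the document names, and the object names `mini_texts` and `mini_corp` are illustrative assumptions, not part of the Hippocorpus data.

```r
library(quanteda)

# Made-up texts for illustration; the vector names become document identifiers
mini_texts <- c(
  doc_a = "I went to a concert last month.",
  doc_b = "My niece was born in the spring."
)

# Build the corpus, then attach one docvar value per document
mini_corp <- corpus(mini_texts)
docvars(mini_corp, "memType") <- c("recalled", "imagined")

ndoc(mini_corp)      # number of documents
docnames(mini_corp)  # document identifiers
docvars(mini_corp)   # data frame of document-level variables
```

The texts and their properties travel together in one object, which is what makes the corpus a convenient starting point for later steps.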

You can create corpora from a variety of data formats, but we will begin with a dataframe containing our text variable, story.

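library(tidyverse)  # provides read_csv() and select()
library(quanteda)   # provides corpus()
# Read the Hippocorpus stories and keep only the columns used in this chapter
hippocorpus_df <- read_csv("data/hippocorpus-u20220112/hcV3-stories.csv") |> 
  select(AssignmentId, story, memType, summary, WorkerId, 
         annotatorGender, openness, timeSinceEvent)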
#> Rows: 6854 Columns: 23
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (12): AssignmentId, WorkerId, annotatorGender, annotatorRace, mainEvent,...
#> dbl (11): WorkTimeInSeconds, annotatorAge, distracted, draining, frequency, ...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hippocorpus_df)
#> # A tibble: 6 × 8
#>   AssignmentId           story memType summary WorkerId annotatorGender openness
#>   <chr>                  <chr> <chr>   <chr>   <chr>    <chr>              <dbl>
#> 1 32RIADZISTQWI5XIVG5BN… Conc… imagin… My boy… E9TY34YY man                 0   
#> 2 3018Q3ZVOJCZJFDMPSFXA… The … recall… My boy… 237K2NI1 woman               1   
#> 3 3IRIK4HM3B6UQBC0HI8Q5… It s… imagin… My sis… FK5QTANB woman               0.5 
#> 4 3018Q3ZVOJCZJFDMPSFXA… Five… recall… My sis… UYOSBBRS woman               1   
#> 5 3MTMREQS4W44RBU8OMP3X… Abou… imagin… It is … 34BFLNJV man                 0.25
#> 6 3018Q3ZVOJCZJFDMPSFXA… Burn… recall… It is … L427B0E0 woman               1   
#> # ℹ 1 more variable: timeSinceEvent <dbl>

To turn this data frame into a corpus, pass it to the corpus() constructor and specify the unique identifier (docid_field) and the text variable (text_field) by name. All other variables will automatically become docvars.

hippocorpus_corp <- corpus(hippocorpus_df, 
                           docid_field = "AssignmentId", 
                           text_field = "story")
hippocorpus_corp
#> Corpus consisting of 6,854 documents and 6 docvars.
#> 32RIADZISTQWI5XIVG5BN0VMYFRS4U :
#> "Concerts are my most favorite thing, and my boyfriend knew i..."
#> 
#> 3018Q3ZVOJCZJFDMPSFXATCQ4DARA2 :
#> "The day started perfectly, with a great drive up to Denver f..."
#> 
#> 3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61 :
#> "It seems just like yesterday but today makes five months ago..."
#> 
#> 3018Q3ZVOJCZJFDMPSFXATCQG04RAI :
#> "Five months ago, my niece and nephew were born.  They are my..."
#> 
#> 3MTMREQS4W44RBU8OMP3XSK8NMJAWZ :
#> "About a month ago I went to burning man. I was having a hard..."
#> 
#> 3018Q3ZVOJCZJFDMPSFXATCQG06AR3 :
#> "Burning Man metamorphoses was perfect. I am definitely still..."
#> 
#> [ reached max_ndoc ... 6,848 more documents ]
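The same pattern can be tried on a small scale before running it on the full dataset. The following is a sketch under stated assumptions: `toy_df` is a hypothetical three-row stand-in for `hippocorpus_df`, with made-up identifiers and shortened texts. It shows that the non-text columns become docvars, that docvars can be used to filter documents, and that they remain attached after the texts are split into words with tokens().

```r
library(quanteda)

# Hypothetical stand-in for hippocorpus_df with the same kind of columns
toy_df <- data.frame(
  AssignmentId = c("id1", "id2", "id3"),
  story = c("Concerts are my favorite thing.",
            "The day started perfectly.",
            "Five months ago my niece was born."),
  memType = c("imagined", "recalled", "recalled"),
  openness = c(0, 1, 0.5)
)

toy_corp <- corpus(toy_df,
                   docid_field = "AssignmentId",
                   text_field = "story")

# Columns other than the id and text fields are kept as docvars
names(docvars(toy_corp))

# Docvars can be used to filter documents
recalled_corp <- corpus_subset(toy_corp, memType == "recalled")
ndoc(recalled_corp)

# Docvars carry through when the corpus is split into words
toy_toks <- tokens(recalled_corp)
docvars(toy_toks, "openness")
```

Applied to `hippocorpus_corp`, the same calls would work with the six docvars reported in the printout above.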