Quantifying Psychological Properties of Text
The average native English speaker has a functionally useful vocabulary of over 20,000 words (Hellman, 2011). For psychologists, 20,000 words means 20,000 potential variables to analyze. But that’s not all: words can be arranged in an effectively unlimited number of orders, and the same word can mean drastically different things depending on its context. How do we capture all of this complexity in simple measurements that we can use for analysis?
In this unit, we introduce methods for quantifying psychological properties of text. We begin with fundamental skills for language processing in R, and move gradually to more advanced methods. In the final sections, we cover recent developments in the use of large language models for psychological research.
Throughout the unit, we will provide example analyses using the Hippocorpus dataset, which contains the data from Sap et al. (2020) and Sap et al. (2022). The dataset consists primarily of texts written by online participants, who were instructed to write either true stories that happened to them recently or fictional stories about a comparable topic. Participants then retold the true stories several months later. The dataset also includes a variety of psychological and demographic characteristics of the participants, including gender and openness to experience. The dataset itself is not included in the book’s files, but you can easily download it from the website and follow along.